using rank propagation and probabilistic counting for link ...webmining.spd.louisville.edu ›...
TRANSCRIPT
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Using rank propagation and Probabilisticcounting for Link-Based Spam Detection
Luca Becchetti1 Carlos Castillo1Debora Donato1Stefano Leonardi1 and Ricardo Baeza-Yates2
1 Universita di Roma ldquoLa Sapienzardquo ndash Rome Italy2 Yahoo Research ndash Barcelona Spain and Santiago Chile
August 20th 2006
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation
2 Spam pages characterization
3 Truncated PageRank
4 Counting supporters
5 Experiments
6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information
+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn
+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation
2 Spam pages characterization
3 Truncated PageRank
4 Counting supporters
5 Experiments
6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information
+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn
+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information
+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn
+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information
+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn
+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn
+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
What is on the Web
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now
Graphic wwwmilliondollarhomepagecom
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (keywords + links)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Web spam (mostly keywords)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fake search engine
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem ldquonormalrdquo pages that are spam
Some content is introduced
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Problem borderline pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Definitions
Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance
for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]
any attempt to deceive a search enginersquos relevancy algorithm
or simply
anything that would not be doneif search engines did not exist
[Perkins 2001]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Link farms
Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Linkminusbased spam
[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages
ldquoin a number of these distributions outlier values areassociated with web spamrdquo
Research goal
Statistical analysis of link-based spam
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Idea count ldquosupportersrdquo at different distances
Number of different nodes at a given distance
UK 18 mill nodes EUINT 860000 nodes
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
00
01
02
03
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance149 clicks 100 clicks
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0minus10Top 40minus50Top 60minus70
Areas below the curves are equal if we are in the samestrongly-connected component
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths centrality betweenness clustering coefficient
(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank
Breadth-first and depth-first searchCount of neighbors
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General functional ranking
Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form
W =infinsum
t=0
damping(t)
NPt
There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice
damping(t) = (1minus α)αt
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links
damping(t) =
0 t le T
Cαt t gt T
V No extra reading of the graph after PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for
9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation
1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Truncated PageRank vs PageRank
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
1
10minus8
10minus6
10minus4
10minus9
10minus8
10minus7
10minus6
10minus5
10minus4
10minus3
Normal PageRank
Tru
ncat
ed P
ageR
ank
T=
4
Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
ldquoORrdquo operation
100010
Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for
4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for
13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
General algorithm
Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k
6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for
10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability ε
by the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)
Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node varies
This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Our estimator
Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have
P[Xi = 1] = 1minus (1minus ε)neighbors(node)
Estimator neighbors(node) = log(1minusε)
(1minus ones(node)
k
)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Adaptive estimator
if we knew neighbors(node) and chose ε = 1neighbors(node) we
would get
ones(node) (
1minus 1
e
)k 063k
Adaptive estimation
Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or less
less than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Convergence
5 10 15 200
10
20
30
40
50
60
70
80
90
100
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Error rate
1 2 3 4 5 6 7 80
01
02
03
04
05
512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi
576 btimesi1152 btimesi
512 btimesi768 btimesi
832 btimesi
960 btimesi
1152 btimesi
1216 btimesi1344 btimesi 1408 btimesi
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Test collection
UK collection
185 million pages downloaded from the UK domain
5344 hosts manually classified (6 of the hosts)
Classified entire hosts
V A few hosts are mixed spam and non-spam pages
X More coverage sample covers 32 of the pages
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Automatic classifier
We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4
We measured
Precision = of spam hosts classified as spam
of hosts classified as spam
Recall = of spam hosts classified as spam
of spam hosts
and the two types of errors in spam classification
False positive rate = of normal hosts classified as spam
of normal hosts
False negative rate = of spam hosts classified as normal
of spam hosts
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Single-technique classifier
Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank
Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank
Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=5)
Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg
TrustRank 082 050 21 50
Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55
Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Comparison of single-technique classifier (M=30)
Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg
TrustRank 080 049 23 51
Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57
Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Combined classifier
Spam class False FalsePruning Rules Precision Recall Pos Neg
M=5 49 087 080 20 20M=30 31 088 076 18 24
No pruning 189 085 079 26 21
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Summary of classifiers
(a) Precision and recall of spam detection
(b) Error rates of the spam classifiers
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Conclusions
V Link-based statistics to detect 80 of spam
X No magic bullet in link analysis
X Precision still low compared to e-mail spam filters
V Measure both home page and max PageRank page
V Host-based counts are important
Next step combine link analysis and content analysis
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Thank you
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Baeza-Yates R Boldi P and Castillo C (2006)
Generalizing PageRank Damping functions for link-basedranking algorithms
In Proceedings of SIGIR Seattle Washington USA ACMPress
Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)
Using rank propagation and probabilistic counting forlink-based spam detection
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press
Benczur A A Csalogany K Sarlos T and Uher M(2005)
Spamrank fully automatic link spam detection
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Fetterly D Manasse M and Najork M (2004)
Spam damn spam and statistics Using statistical analysis tolocate spam web pages
In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France
Flajolet P and Martin N G (1985)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences 31(2)182ndash209
Gibson D Kumar R and Tomkins A (2005)
Discovering large dense subgraphs in massive graphs
In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Gyongyi Z and Garcia-Molina H (2005)
Web spam taxonomy
In First International Workshop on Adversarial InformationRetrieval on the Web
Gyongyi Z Molina H G and Pedersen J (2004)
Combating web spam with trustrank
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann
Newman M E Strogatz S H and Watts D J (2001)
Random graphs with arbitrary degree distributions and theirapplications
Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-
Using rankpropagation and
Probabilisticcounting forLink-Based
Spam Detection
L BecchettiC CastilloD Donato
S Leonardi andR Baeza-Yates
Motivation
Spam pagescharacterization
TruncatedPageRank
Countingsupporters
Experiments
Conclusions
Ntoulas A Najork M Manasse M and Fetterly D (2006)
Detecting spam web pages through content analysis
In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland
Palmer C R Gibbons P B and Faloutsos C (2002)
ANF a fast and scalable tool for data mining in massivegraphs
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press
Perkins A (2001)
The classification of search engine spam
Available online athttpwwwsilverdisccoukarticlesspam-classification
- Motivation
- Spam pages characterization
- Truncated PageRank
- Counting supporters
- Experiments
- Conclusions
-