Clustering Spam MIT Spam Conference 2008 Phil Tom

Clustering Spam

MIT Spam Conference 2008

Phil Tom

Simple Clustering Algorithm

Expand clusters to include similar messages:

1. Identical originating IP addresses.

2. Identical subject lines.

3. Identical message bodies.

for each cluster in clusters
    expand cluster
for each message in unclustered messages
    create a new cluster
    add message to cluster
    expand cluster

Clustering pseudocode
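The pseudocode can be sketched in memory as follows. This is a minimal illustration, not the author's implementation: the function name `cluster_messages` and the `(ip, subject, body)` tuple layout are assumptions, the sketch starts with no existing clusters, and it ignores the minimum body-size and subject-length filters used in the SQL.

```python
from collections import defaultdict

def cluster_messages(messages):
    """messages: dict of message id -> (ip, subject, body).

    Returns a dict of message id -> cluster id. Hypothetical helper;
    messages sharing an IP, subject, or body end up in the same cluster.
    """
    # Index message ids by each attribute so expansion is a lookup, not a scan.
    index = [defaultdict(set) for _ in range(3)]
    for mid, attrs in messages.items():
        for i, value in enumerate(attrs):
            index[i][value].add(mid)

    assignment = {}
    next_cluster = 1
    for mid in messages:
        if mid in assignment:
            continue
        # Create a new cluster, add the message, then expand the cluster.
        cid = next_cluster
        next_cluster += 1
        assignment[mid] = cid
        frontier = [mid]
        while frontier:
            current = frontier.pop()
            for i, value in enumerate(messages[current]):
                for other in index[i][value]:
                    if other not in assignment:
                        assignment[other] = cid
                        frontier.append(other)
    return assignment
```

Because each cluster is expanded until nothing more can be added before the next seed message is chosen, two clusters never need to be merged afterwards in this sketch.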

Dimensional Model

update sdbf_message
set cluster_id = ?
where (cluster_id <> ? or cluster_id is null)
  and sender_ip_id in (select sender_ip_id
                       from sdbf_message
                       where cluster_id = ?)

Expand Cluster By IP

update sdbf_message m
set cluster_id = ?
from sdbd_body b
where (m.cluster_id <> ? or m.cluster_id is null)
  and m.body_id in (select body_id
                    from sdbf_message
                    where cluster_id = ?)
  and m.body_id = b.body_id
  and b.size_in_bytes > 25

Expand Cluster By Body

update sdbf_message m
set cluster_id = ?
from sdbd_subject s
where (m.cluster_id <> ? or m.cluster_id is null)
  and m.subject_id in (select subject_id
                       from sdbf_message
                       where cluster_id = ?)
  and m.subject_id = s.subject_id
  and (s.word_count > 1 or length(s.subject) > 10)

Expand Cluster By Subject
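The three updates can be run repeatedly until a cluster stops growing. The sketch below uses Python's standard `sqlite3` module so it can be tried without a database server. The column layout in `SCHEMA` is an assumption inferred from the queries, the helper name `expand_cluster` is hypothetical, and the `update ... from` statements are rewritten with subqueries so they also run on SQLite versions that lack that syntax.

```python
import sqlite3

# Assumed minimal schema: only the columns the three updates refer to.
SCHEMA = """
create table sdbd_body (body_id integer primary key, size_in_bytes integer);
create table sdbd_subject (subject_id integer primary key,
                           subject text, word_count integer);
create table sdbf_message (message_id integer primary key,
                           sender_ip_id integer, subject_id integer,
                           body_id integer, cluster_id integer);
"""

EXPAND_BY_IP = """
update sdbf_message set cluster_id = :c
where (cluster_id <> :c or cluster_id is null)
  and sender_ip_id in (select sender_ip_id from sdbf_message
                       where cluster_id = :c)
"""

EXPAND_BY_BODY = """
update sdbf_message set cluster_id = :c
where (cluster_id <> :c or cluster_id is null)
  and body_id in (select m.body_id from sdbf_message m
                  join sdbd_body b on b.body_id = m.body_id
                  where m.cluster_id = :c and b.size_in_bytes > 25)
"""

EXPAND_BY_SUBJECT = """
update sdbf_message set cluster_id = :c
where (cluster_id <> :c or cluster_id is null)
  and subject_id in (select m.subject_id from sdbf_message m
                     join sdbd_subject s on s.subject_id = m.subject_id
                     where m.cluster_id = :c
                       and (s.word_count > 1 or length(s.subject) > 10))
"""

def expand_cluster(conn, cluster_id):
    """Apply the three updates until a full pass changes no rows."""
    changed = True
    while changed:
        changed = False
        for sql in (EXPAND_BY_IP, EXPAND_BY_BODY, EXPAND_BY_SUBJECT):
            cur = conn.execute(sql, {"c": cluster_id})
            if cur.rowcount > 0:
                changed = True
```

The size and length filters keep very short bodies and one-word subjects such as "hi" from pulling unrelated messages into a cluster.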

Test Data Set

• Dec 22, 2007 - Dec 29, 2007

• Single “Received:” header tag only

• No multi-part messages

• 1.7 million messages

• Roughly 20%

Cluster Results

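 Min Cluster Size | Max Cluster Size | Clusters | Messages | % of Messages
------------------+------------------+----------+----------+---------------
                1 |               10 |    26610 |    64510 |          3.7%
               11 |              100 |     3221 |    79218 |          4.6%
              101 |             1000 |      156 |    39413 |          2.3%
             1001 |            10000 |       26 |    72786 |          4.2%
            10001 |           100000 |        2 |    37945 |          2.2%
           100001 |                  |        1 |  1436206 |         83.0%
 Totals           |                  |    30016 |  1730078 |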
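A table of this shape can be produced from a list of cluster sizes by grouping them into decade buckets. The following is a sketch under that assumption; the function name `size_buckets` is hypothetical and the slides do not say how the original table was generated.

```python
from collections import Counter

def size_buckets(cluster_sizes):
    """Group cluster sizes into buckets 1-10, 11-100, 101-1000, and so on.

    Returns rows of (min size, max size, clusters, messages, percent of messages).
    """
    total = sum(cluster_sizes)
    clusters = Counter()
    messages = Counter()
    for size in cluster_sizes:
        # Find the smallest power of ten that is >= size.
        upper = 10
        while size > upper:
            upper *= 10
        clusters[upper] += 1
        messages[upper] += size
    rows = []
    for upper in sorted(clusters):
        lower = upper // 10 + 1 if upper > 10 else 1
        percent = round(100.0 * messages[upper] / total, 1)
        rows.append((lower, upper, clusters[upper], messages[upper], percent))
    return rows
```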

[Chart: Messages per Cluster Size (not including the big cluster). X axis: Cluster Size. Y axis: Sum of Messages.]

Top Clusters by IPs

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
          1 |  1436206 |   99836 | 330852 | 325660 |     8940 |       177
         62 |    26623 |     451 |  25992 |   1313 |       57 |         2
         59 |    11322 |      19 |     15 |    962 |        4 |         1
         68 |     1065 |       2 |   1065 |    609 |       12 |         4
         69 |     4476 |      59 |     85 |    514 |       17 |         1
      10477 |     5521 |       5 |      9 |    283 |        4 |         1
        953 |      722 |     149 |    333 |    275 |       16 |         1
        175 |      307 |       2 |    306 |    208 |      179 |        26
        379 |      240 |       7 |      9 |    184 |        4 |         1
      18219 |     5581 |      15 |   5212 |    153 |      119 |        26
       3924 |     2934 |      20 |   2934 |    150 |        1 |         1
        144 |      377 |      22 |    377 |    125 |        3 |         1
        242 |      307 |       4 |      3 |    124 |        5 |         1
        134 |     3399 |      48 |    169 |    114 |       17 |         1
        209 |      156 |       4 |    155 |    105 |       96 |        19
        198 |     1117 |     174 |   1100 |    101 |        4 |         1

The Big One

 messages | subject | bodies |  ips   | networks | countries
----------+---------+--------+--------+----------+-----------
  1436206 |   99836 | 330852 | 325660 |     8940 |       177

 messages | subjects | bodies |  ips  | networks | country_name
----------+----------+--------+-------+----------+----------------
   254948 |    30854 |  62772 | 27464 |     1453 | United States
    75969 |     5110 |  27366 | 27446 |      170 | Germany
   114328 |     6558 |  39312 | 26758 |      147 | Spain
    78378 |     4705 |  29291 | 25263 |       48 | Turkey
    91527 |     4624 |  29926 | 20930 |      209 | United Kingdom
    51708 |     3194 |  19983 | 16842 |       42 | Peru
    52652 |     2848 |  19644 | 15533 |      148 | Colombia
    39475 |     3059 |  13344 | 10129 |      152 | Chile
    34827 |     5063 |  12790 |  9664 |       12 | Brazil
    40144 |     4381 |  13368 |  9372 |      126 | Italy

Cluster 1 summary

Top 10 countries by IP count

Clustering the Big One

• Create clusters on subject and body

 messages | cluster_id |  ips   | subjects | bodies | content
----------+------------+--------+----------+--------+-----------------------
   740447 |      34641 | 131024 |       34 |    136 | fake watches
   111122 |      34643 |  79419 |      330 |  59166 | penis enlargement
    76521 |      34642 |  59112 |       27 |  55129 | online casino
    55421 |      34644 |  44772 |       55 |  25023 | fake name brand goods
    27789 |      34653 |   7190 |       81 |  16225 | viagra
    26815 |      34646 |  11099 |       20 |  19680 | valium
    25679 |      34656 |   5990 |    14846 |  25644 | online pharmacy
    12953 |      34649 |   3391 |       45 |      5 | stock investment
    12924 |      34645 |   4149 |        3 |      5 | porn
    12919 |      34648 |   3483 |        9 |  12332 | software
    10071 |      34650 |   9240 |       17 |   9273 | russian dating

1099737 messages 284493 unique IPs

Clustering the Big One (cont)

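             | rolex | gambling | enlargement | knockoffs | porn | valium | software | stocks | dating | viagra
-------------+-------+----------+-------------+-----------+------+--------+----------+--------+--------+--------
 gambling    | 11820 |          |             |           |      |        |          |        |        |
 enlargement | 14869 |    20514 |             |           |      |        |          |        |        |
 knockoffs   |  9316 |    13173 |       14885 |           |      |        |          |        |        |
 porn        |  1779 |      873 |         925 |       705 |      |        |          |        |        |
 valium      |   245 |       67 |          94 |        57 |   14 |        |          |        |        |
 software    |   308 |       10 |          14 |         7 |    2 |      9 |          |        |        |
 stocks      |   719 |      783 |         895 |       641 |   63 |      3 |        0 |        |        |
 dating      |  2182 |     3058 |        3412 |      2106 |  189 |     14 |        0 |    175 |        |
 viagra      |    96 |       13 |           8 |         6 |    1 |     92 |        4 |      2 |      1 |
 pharmacy    |   123 |       30 |          35 |        17 |    1 |     89 |        4 |      2 |      8 |     52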

Number of overlapping IPs between clusters
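An overlap matrix like this one amounts to a pairwise set intersection over the sender IPs of each cluster. A minimal sketch, assuming the IPs per cluster are already loaded into Python sets (the function name `ip_overlap` and the input layout are hypothetical):

```python
from itertools import combinations

def ip_overlap(cluster_ips):
    """cluster_ips: dict of cluster label -> set of sender IPs.

    Returns a dict mapping each unordered pair (a, b), with a < b,
    to the number of IPs that appear in both clusters.
    """
    return {
        (a, b): len(cluster_ips[a] & cluster_ips[b])
        for a, b in combinations(sorted(cluster_ips), 2)
    }
```

Only one triangle of the matrix is computed because the overlap is symmetric.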

Am I Bot or Not?

 cluster_id | messages | subjects | bodies | ips  | networks | countries
------------+----------+----------+--------+------+----------+-----------
         62 |    26623 |      451 |  25992 | 1313 |       57 |         2

• Subject content widely varied

• Many blocks of consecutive IPs

• Some blocks are entire or most of a /24

 messages | subjects | bodies | ips  | networks | country_name
----------+----------+--------+------+----------+---------------
     1246 |       87 |   1246 |    5 |        3 | Canada
    25377 |      443 |  24746 | 1308 |       54 | United States
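The observations about consecutive IP blocks and /24 coverage can be checked mechanically. The sketch below uses the standard `ipaddress` module. The function names are hypothetical, it handles IPv4 only, and it is one possible way to test for these patterns rather than the method used in the talk.

```python
import ipaddress
from collections import defaultdict

def consecutive_runs(ips):
    """Return (first, last, length) for each run of consecutive IPv4 addresses."""
    values = sorted({int(ipaddress.IPv4Address(ip)) for ip in ips})
    runs = []
    start = prev = None
    for v in values:
        if prev is not None and v == prev + 1:
            prev = v          # still inside the current run
            continue
        if start is not None:
            runs.append((start, prev))
        start = prev = v      # begin a new run
    if start is not None:
        runs.append((start, prev))
    return [
        (str(ipaddress.IPv4Address(a)), str(ipaddress.IPv4Address(b)), b - a + 1)
        for a, b in runs
    ]

def slash24_coverage(ips):
    """Return {network: count of distinct addresses seen} for each /24."""
    seen = defaultdict(set)
    for ip in ips:
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        seen[str(net)].add(ip)
    return {net: len(addrs) for net, addrs in seen.items()}
```

A /24 holds 256 addresses, so a count close to 256 for one network indicates that most or all of the block is sending mail in the cluster.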

Failure is Success

Delivery Notification cluster:

 cluster_id | messages | subject | bodies | ips | networks | countries
------------+----------+---------+--------+-----+----------+-----------
         68 |     1065 |       2 |   1065 | 609 |       12 |         4

Subject Detail

 messages | subject
----------+------------------
      613 | Delivery failure
      452 | failure delivery

• Delivery notification from legitimate mail servers

• Not clustered with spam or sources of spam

Chinese Spam

All Chinese messages

 messages | ips  | networks | clusters | country_name
----------+------+----------+----------+---------------
    92235 | 5179 |      197 |      922 | China
      139 |    2 |        1 |        2 | Thailand
       78 |   12 |        3 |        4 | United States
        5 |    4 |        1 |        2 | Germany

Top 10 Chinese Clusters

 cluster_id | messages | subject | bodies | ips | networks | countries
------------+----------+---------+--------+-----+----------+-----------
         59 |    11322 |      19 |     15 | 962 |        4 |         1
       3534 |     9987 |    1803 |      8 |  19 |        3 |         1
         12 |     8054 |       9 |      8 |  26 |        1 |         1
      10477 |     5521 |       5 |      9 | 283 |        4 |         1
         69 |     4476 |      59 |     85 | 514 |       17 |         1
        134 |     3399 |      48 |    169 | 114 |       17 |         1
        121 |     2347 |      10 |     10 |   1 |        1 |         1
        456 |     2187 |      21 |     73 |  41 |        6 |         1
         56 |     2047 |      29 |     45 |  61 |       14 |         1
       4621 |     1944 |       3 |      4 |   5 |        1 |         1

Small Clusters

• Varied subjects and bodies.

• Manual clustering of “online pharmacy” spam

Coalesced clusters:

 messages | ips  | subjects | bodies | clusters
----------+------+----------+--------+----------
    30333 | 9685 |    19453 |  30298 |     3651

Example subjects:

Buy sugar pills online cheap!!!!11one

Buy sugar pills online cheap!!!1cos(0)

Buy sugar pills online cheap!111pi^0
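Subjects like these differ only in a short random suffix, so an approximate string comparison could group them where exact matching cannot. The sketch below is one possible approach and not the manual procedure used in the talk: the function name `coalesce_subjects`, the greedy grouping strategy, and the 0.75 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def coalesce_subjects(subjects, threshold=0.75):
    """Greedily group subjects whose similarity to a group's first member
    is at least `threshold` (difflib ratio, between 0.0 and 1.0)."""
    groups = []
    for subject in subjects:
        for group in groups:
            if SequenceMatcher(None, group[0], subject).ratio() >= threshold:
                group.append(subject)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([subject])
    return groups
```

Each subject is compared with one representative per group, so the cost grows with the number of groups; on a data set of this size the threshold and the comparison strategy would both need tuning.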

What’s Next?

• Improve the similarity metrics

• Cluster a population or random sample

• Add time to the analysis