![Page 1: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/1.jpg)
Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN
Spam Clustering using
Wave Oriented K Means
![Page 2: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/2.jpg)
You’ll be hearing quite a lot about…
• Spam signatures
  – Previous approaches
  – Spam Features
• Clustering
  – K-Means
  – K-Medoids
  – Stream clustering
• Constraints
![Page 5: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/5.jpg)
And we’ll connect the dots
![Page 6: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/6.jpg)
But the essence is…
"A nation that forgets its past is doomed to repeat it."
Winston Churchill
![Page 7: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/7.jpg)
And finally some result charts
![Page 8: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/8.jpg)
Spam signatures
• Strong relation with dentistry
• Necessary Evil ?
• Last resort
![Page 9: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/9.jpg)
Spam signatures (2)
• Most annoying problem is that they are labor intensive
• An extension of filtering email by hand
• More automation is badly needed to make signatures work
![Page 10: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/10.jpg)
Spam features
• The ki of the spam business
• Its DNA
• Everything and yet nothing
• Anything that has a constant value in a given spam wave
![Page 11: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/11.jpg)
Email Layout
• We noticed that although spammers change almost everything in an email to conceal that it is spam, they tend to preserve a certain layout.
• We encoded the layout of a message in a string of tokens such as 141L2211.
• This later evolved into a message summary such as BWWWLWWNWWE
• To this day, message layout is the most effective feature
• We also use variations of this feature for the MIME parts, for the paragraph contents and so on.
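A minimal sketch of how a layout summary of this kind could be computed. The slides do not define the token alphabet, so the mapping below (B = beginning, W = line of words, L = line with a link, N = blank line, E = end) and the name `layout_summary` are assumptions made only for illustration.

```python
def layout_summary(body: str) -> str:
    """Encode each line of a message body as one token (hypothetical alphabet)."""
    tokens = ["B"]  # B: beginning of message
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            tokens.append("N")  # N: blank line
        elif "http://" in stripped or "https://" in stripped:
            tokens.append("L")  # L: line containing a link
        else:
            tokens.append("W")  # W: line of ordinary words
    tokens.append("E")  # E: end of message
    return "".join(tokens)


# Two messages with different wording but the same shape get the same summary.
first = "Hi\nBuy now\nhttp://a.example\n\nBye"
second = "Hello\nGreat deal\nhttp://b.example\n\nCheers"
print(layout_summary(first), layout_summary(second))
```

The point of the sketch is only that the summary ignores the words themselves, which is what lets it stay constant across a wave.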
![Page 12: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/12.jpg)
Other Spam Features - headers
• Subject length, the number of separators, the maximum length of any word
• The number of Received fields (turned out we were drunk and high when we chose this one)
• Whether it had a name in the From field
• A quite nice example is the stripped date format
  – Take the Date field
  – Strip it of all alphanumeric characters
  – Store what’s left
  – “ , :: - ()” or “, :: +” or “, :: + ”
• Any more suggestions?
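The stripped date format can be sketched in a few lines. This assumes "alphanumeric" means ASCII letters and digits; the function name `stripped_date_format` is hypothetical.

```python
import re


def stripped_date_format(date_header: str) -> str:
    """Remove every ASCII letter and digit; keep punctuation and spaces."""
    return re.sub(r"[A-Za-z0-9]", "", date_header)


# Different dates written by the same mailer leave the same residue.
a = stripped_date_format("Tue, 15 Aug 2006 12:30:45 +0200")
b = stripped_date_format("Fri, 03 Nov 2006 08:01:59 +0100")
print(repr(a), a == b)
```

What survives is the mailer's punctuation and spacing habits, which stay the same while the actual date changes from message to message.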
![Page 13: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/13.jpg)
Other Spam Features – body
• Its length; the number of lines; whether it has long paragraphs or not; the number of consecutive blank lines
  – Basically any part of the email layout that we felt was more important than the average
• The number of links/email addresses/phone numbers
• Bayes poison
• Attachments
• Etc.
![Page 14: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/14.jpg)
Combining features (1)
• One stick is easy to break
• The Roman fasces symbolized power and authority
• The symbol of strength through unity from the Roman Empire to the U.S.
• The most obvious problem: our sticks are different.
  – Strings, integers, bools
  – I’ll stress this later
fasces lictoriae (bundles of the lictors)
![Page 15: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/15.jpg)
Combining features (2)
• If it’s an A and at the same time a B then it’s spam
• The idea of combining features never died out
• Started with its relaxed form: adding scores
  – If it has “Viagra” in it, increase its spam score by 10%.
• Evolution came naturally
National Guard Bureau insignia
![Page 16: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/16.jpg)
Why cluster spam?
• A “well doh” kind of slide
• To extract the patterns we want
  – How do we combine spam traits to get a reliable spam pattern?
  – And which are the traits that matter most?
• Agglomerative clustering is just one of many options
  – Neural Networks
  – ARTMap worked beautifully on separating ham from spam
![Page 17: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/17.jpg)
So why agglomerative?
• Because the problem stated before is wrong
• We don’t just want spam patterns.
  – We want patterns for that spam wave alone
• Most neural nets make a binary decision. We want a plurality of classes.
• Still there are other options, like SVMs.
  – They don’t handle string clustering well
  – We want something that accepts just about any feature as long as you can compute a distance
![Page 18: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/18.jpg)
K-means and K-medoids
• So we chose the simplest of methods: the widely popular K-Means
  – In a given feature space each item to be classified is a point.
  – The distance between the points indicates the resemblance of the original items.
  – From a given set of instances to be clustered, it creates k classes based on their similarity
• For spaces where the mean of two points cannot be computed, there is a variant of k-means: k-medoids.
  – This actually solves the different stick problem
  – As usual, by solving a problem we introduce a whole range of others.
• Combining them
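A minimal sketch of why k-medoids copes with mixed "sticks": only a distance is needed, and the cluster center is an actual member rather than a mean. The per-feature distances and their equal weighting below are assumptions for illustration, not the distance used in the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def feature_distance(x, y) -> float:
    """Distance in [0, 1] for one feature; the type decides the rule (assumed)."""
    if isinstance(x, bool):  # checked first because bool is a subclass of int
        return float(x != y)
    if isinstance(x, int):
        return abs(x - y) / max(abs(x), abs(y), 1)
    return edit_distance(x, y) / max(len(x), len(y), 1)


def message_distance(m1, m2) -> float:
    """Average of the per-feature distances (equal weights are an assumption)."""
    return sum(feature_distance(a, b) for a, b in zip(m1, m2)) / len(m1)


def medoid(cluster):
    """The member with the smallest total distance to all other members."""
    return min(cluster, key=lambda c: sum(message_distance(c, o) for o in cluster))


# Each message: (layout summary, number of links, has a name in From)
msg_a = ("BWWLE", 3, True)
msg_b = ("BWWLE", 4, True)
msg_c = ("BWNLE", 3, True)
print(medoid([msg_a, msg_b, msg_c]))
```

No mean of two layout strings is ever computed; the medoid is simply the most central real message.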
![Page 19: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/19.jpg)
An Example
• Is it a line or a square?
• What about string features?
![Page 20: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/20.jpg)
Our old model
• Focus mainly on correctly defining some powerful spam features
• We totally neglected the clustering part
  – So we used the good old-fashioned k-means and k-medoids.
  – And they have serious drawbacks
  – A fixed number of classes.
  – Work only with an offline corpus
• The results were... unpredictable.
• Luck played a major role.
![Page 21: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/21.jpg)
WOKM – Wave oriented K-Means
• By using the simple k-means we could only cluster individual sets of emails
• We now needed to cluster the whole incoming stream of spam
• We also want to store a history of the clusters we extract
  – And use that information to detect spam on the user side.
  – And also to help us better classify in the future
• Remember Churchill?
![Page 22: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/22.jpg)
WOKM – How does it work ?
• Takes snapshots of the incoming spam stream
• Takes in only what is new
• Trains on those messages
• Store the clusters for future reference
![Page 23: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/23.jpg)
The spam corpus
• All the changes originate here
  – All messages have an associated distance
  – The distance from them to the closest stored cluster in the cluster history
• New clusters must be closer than old ones
• Constrained K-Means
  – Wagstaff & Cardie, 2001
  – “must fit” or “must not fit”
  – A history constraint
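The associated distance can be sketched as follows: for every incoming message, compute the distance to the closest stored cluster center, then treat messages beyond a threshold as new. The function name, the threshold and the toy numeric distance are assumptions; any distance function could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_novelty(
    messages: Sequence[T],
    history_centers: Sequence[T],
    distance: Callable[[T, T], float],
    threshold: float,
) -> Tuple[List[Tuple[T, float]], List[Tuple[T, float]]]:
    """Pair each message with its distance to the closest stored center.

    Returns (known, novel): messages within `threshold` of the history,
    and messages that are farther away and therefore worth training on.
    """
    known, novel = [], []
    for msg in messages:
        # With an empty history every message is infinitely far, hence novel.
        nearest = min((distance(msg, c) for c in history_centers),
                      default=float("inf"))
        (known if nearest <= threshold else novel).append((msg, nearest))
    return known, novel


# Toy usage with numbers standing in for messages.
known, novel = split_by_novelty(
    [1.0, 1.2, 9.0], [1.1, 5.0], lambda a, b: abs(a - b), threshold=0.5
)
print(known, novel)
```

The stored distance doubles as a bound during training: under the history constraint, a new cluster would only take a message if it is closer than that bound.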
![Page 24: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/24.jpg)
The training phase
• While a solution has not been found:
  – Unassign all the given examples
  – Assign all examples
    • Create a given number of clusters
    • Assign what you can
    • Create some more and repeat the process
  – Recompute centers
  – Merge adjacent (similar) clusters
    • Counters the cluster inflation brought by the assign phase
  – Test solution
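The loop above could be sketched roughly as below. This is one reading of the slide, not the authors' implementation: the names `assign` and `train_wave`, the parameters `k_step` and `merge_radius`, seeding new clusters from unassigned points, and "centers unchanged" as the solution test are all assumptions. `bounds[i]` stands for the distance from point `i` to its closest cluster in the history.

```python
import random


def assign(points, bounds, centers, distance):
    """Map point index -> index of its nearest center, if the bound allows it."""
    assignment = {}
    for i, p in enumerate(points):
        if not centers:
            break
        j = min(range(len(centers)), key=lambda c: distance(p, centers[c]))
        # Assumed history constraint: the new center must beat stored clusters.
        if distance(p, centers[j]) < bounds[i]:
            assignment[i] = j
    return assignment


def train_wave(points, bounds, distance, medoid,
               k_step=2, merge_radius=1.0, max_rounds=20, seed=0):
    rng = random.Random(seed)
    centers = []
    for _ in range(max_rounds):
        # Unassign everything, then assign against the current centers.
        assignment = assign(points, bounds, centers, distance)
        while True:
            # Points with bound 0 can never be assigned, so they are not seeds.
            free = [i for i in range(len(points))
                    if i not in assignment and bounds[i] > 0]
            if not free:
                break
            # Create some more clusters, seeded from unassigned points.
            seeds = rng.sample(free, min(k_step, len(free)))
            centers = centers + [points[i] for i in seeds]
            assignment = assign(points, bounds, centers, distance)
        # Recompute centers as medoids; clusters left empty are dropped.
        recomputed = []
        for j in range(len(centers)):
            members = [points[i] for i, c in assignment.items() if c == j]
            if members:
                recomputed.append(medoid(members))
        # Merge similar clusters: keep a center only if no kept center is near.
        merged = []
        for c in recomputed:
            if all(distance(c, kept) > merge_radius for kept in merged):
                merged.append(c)
        # Test solution: stop once a full round leaves the centers unchanged.
        if merged == centers:
            break
        centers = merged
    return centers, assign(points, bounds, centers, distance)


# Toy usage: numbers stand in for messages, no history (infinite bounds).
dist = lambda a, b: abs(a - b)
med = lambda ms: min(ms, key=lambda m: sum(dist(m, o) for o in ms))
pts = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
centers, assignment = train_wave(pts, [float("inf")] * len(pts), dist, med)
print(sorted(centers))
```

Note that the number of clusters is never given up front: the assign phase inflates it until every eligible point fits somewhere, and the merge step deflates it again.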
![Page 25: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/25.jpg)
What’s worth remembering
• Accepts just about any kind of feature
  – Booleans, integers and strings.
• K-means is limited because you have to know the number of classes a priori.
  – WOKM determines the optimum number of classes automatically
• New messages will not be assigned to clusters that are not considered close enough
• Has a fast novelty detection phase, so it can train itself only with new spam.
• Can use the triangle inequality to speed things up.
• (Future work) Allows us to keep track of the changes spammers make in the design of their products.
  – By watching clusters that are close to each other
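The slides do not say how the triangle inequality is applied. One standard way, sketched here as an assumption, is to skip a candidate center c_j whenever d(c_best, c_j) >= 2 * d(x, c_best), because the triangle inequality then guarantees d(x, c_j) >= d(x, c_best). This only holds when the distance is a true metric; Euclidean distance on toy points is used below.

```python
import math


def nearest_center(point, centers, center_dists, distance=math.dist):
    """Return (index of the nearest center, number of distance evaluations)."""
    best, best_d, evals = 0, distance(point, centers[0]), 1
    for j in range(1, len(centers)):
        # d(c_best, c_j) <= d(c_best, x) + d(x, c_j), so if the centers are at
        # least 2 * best_d apart, c_j cannot be closer than the current best.
        if center_dists[best][j] >= 2 * best_d:
            continue
        d = distance(point, centers[j])
        evals += 1
        if d < best_d:
            best, best_d = j, d
    return best, evals


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
# Center-to-center distances are computed once and reused for every point.
center_dists = [[math.dist(a, b) for b in centers] for a in centers]
print(nearest_center((1.0, 1.0), centers, center_dists))
print(nearest_center((9.0, 1.0), centers, center_dists))
```

The saving matters most when the per-message distance is expensive, as with string features.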
![Page 26: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/26.jpg)
Results
• Perhaps the most exciting results – the cross language spam clusters
![Page 27: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/27.jpg)
Results(2)
• Then in Spanish
• We were surprised to find that this is not an isolated case. YouTube, Microsoft and Facebook fraud attempts were also found in multiple languages
![Page 28: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/28.jpg)
Results(3)
• Then again in French (different though)
![Page 29: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/29.jpg)
And finally the promised charts
![Page 30: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/30.jpg)
And finally the promised charts (2)
![Page 31: Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN Spam Clustering using Wave Oriented K Means](https://reader036.vdocument.in/reader036/viewer/2022081512/5697c01f1a28abf838cd1e5d/html5/thumbnails/31.jpg)
Thank you !
?