spam detection technique

Upload: koushik-mandal

Post on 03-Apr-2018


  • 7/28/2019 Spam Detection Technique


    Koushik Mandal

    Jadavpur University

    4/13/2013


    What is spam?

    Spam is irrelevant or unsolicited content that reaches us through email or through search engine results.

    Email spam: junk email, i.e. unsolicited commercial email sent to push advertisements and offers.

    Web spam: a page created for the sole purpose of attracting search engine referrals (to this page or to some other target page).


    Problem of Spam

    Users don't want spam
      • Lost productivity
      • Offensive, embarrassing content
      • Legitimate messages get lost in the sea of spam

    Spam isn't going away
      • People buy from spammers
      • Legislation has not been effective
      • The SMTP protocol is inadequate: it allows spammers to forge message information

    Spam is difficult to detect
      • Spammers learn how to get past filters
      • Legitimate messages WILL be lost (filters produce false positives)


    Spam Categories


    Spam Detection: Email Spam

    Automated spam filtering is an instance of the document classification problem.

    First document set: every document has a predefined class (spam or legitimate). This is the training set.

    Second document set: no class labels. It is used for testing.
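    The two document sets can be sketched as follows. This is only an illustrative setup: the example messages, the `tokenize` helper and the `word_counts_per_class` function are hypothetical and do not come from the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Training set: every document carries a predefined class label.
training_set = [
    ("win a free prize now", "spam"),
    ("cheap offer click now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("please review the project report", "legitimate"),
]

# Test set: documents without class labels, still to be classified.
test_set = ["free offer now", "monday project meeting"]


def word_counts_per_class(labelled_docs):
    """Return {class label: Counter of word frequencies} for the training set."""
    counts = {}
    for text, label in labelled_docs:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


counts = word_counts_per_class(training_set)
```

    The per-class word counts are the raw statistics a classifier is trained on; the unlabelled test documents are only tokenized and scored.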


    Problem of Document Classification


    Naïve Bayesian Approach

    Based on Bayes' theorem and the theorem of total probability.

    The probability that an email is spam, given that it contains certain words, is equal to the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
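    A small worked example of this statement, with made-up numbers: assume the words appear in 20% of spam emails and in 1% of legitimate emails, and that 40% of all email is spam. These three figures and the function name `posterior_spam` are purely illustrative.

```python
def posterior_spam(p_words_given_spam, p_words_given_legit, p_spam):
    """Return P(spam | words) using Bayes' theorem."""
    p_legit = 1.0 - p_spam
    # Theorem of total probability: P(words) summed over both classes.
    p_words = p_words_given_spam * p_spam + p_words_given_legit * p_legit
    return p_words_given_spam * p_spam / p_words


# Hypothetical inputs, chosen only for illustration.
p = posterior_spam(p_words_given_spam=0.20, p_words_given_legit=0.01, p_spam=0.40)
print(round(p, 4))  # 0.9302
```

    Here the denominator is 0.20 × 0.40 + 0.01 × 0.60 = 0.086, so the posterior is 0.08 / 0.086, roughly 0.93.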


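    The naïve Bayesian classifier is based on Bayes' theorem and the theorem of total probability. For an email instance with word vector X = (x_1, x_2, x_3, ..., x_N), the probability that it belongs to class C_i is

    \[ P(C_i \mid X) = \frac{P(C_i)\, P(X \mid C_i)}{\sum_{j} P(C_j)\, P(X \mid C_j)} \]

    where i, j ∈ {spam, legitimate}. In practice, the probabilities P(X | C_i) are impossible to estimate without simplifying assumptions, because X can take too many possible values. The naïve Bayesian classifier therefore assumes that the words x_1, ..., x_N are conditionally independent given the class, so that P(X | C_i) becomes the product of the per-word probabilities P(x_k | C_i).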
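    A minimal sketch of such a classifier follows. It uses the usual simplifying assumption that words are conditionally independent given the class. The add-one (Laplace) smoothing, the toy messages and all function names are assumptions made for this example and are not taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_docs):
    """Collect per-class document counts, word counts and the vocabulary."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labelled_docs:
        words = tokenize(text)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def posterior(text, model):
    """Return {class: P(class | words)} under the independence assumption."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log P(C_i)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # log P(x_k | C_i) with add-one smoothing (an assumption here)
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Divide by the total-probability denominator (done stably in log space).
    peak = max(log_scores.values())
    unnormalised = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(unnormalised.values())
    return {c: v / norm for c, v in unnormalised.items()}


# Hypothetical toy training data.
toy_training_set = [
    ("win a free prize now", "spam"),
    ("cheap offer click now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("please review the project report", "legitimate"),
]
model = train(toy_training_set)
probs = posterior("free offer now", model)
```

    Working in log space avoids numerical underflow when a message contains many words, since the product of many small probabilities quickly approaches zero.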


    Spam Detection: Web Spam

    Types of Spamming Techniques

    Term spamming: manipulating the text of web pages in order to appear relevant to queries.

    Link spamming: creating link structures that boost PageRank or hubs-and-authorities scores.


    Link Spam

    Three kinds of web pages from a spammer's point of view:
      • Inaccessible pages
      • Accessible pages, e.g., web log comment pages, where the spammer can post links to his own pages
      • Own pages, completely controlled by the spammer; they may span multiple domain names


    Detecting Spam

    Term spamming
      • Analyze text using statistical methods, e.g., naïve Bayes classifiers
      • Similar to email spam filtering
      • Also useful: detecting approximate duplicate pages

    Link spamming
      • Open research area
      • One approach: TrustRank
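    The slides do not say how approximate duplicates are detected. One common technique, shown here only as an assumed example, compares the sets of word shingles of two pages with the Jaccard similarity. The page texts and the shingle size `k=3` are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical page texts: page_b is a near copy of page_a, page_c is unrelated.
page_a = "buy cheap watches online best cheap watches store"
page_b = "buy cheap watches online best cheap watches shop"
page_c = "lecture notes on probability theory and statistics"

sim_ab = jaccard(shingles(page_a), shingles(page_b))  # 5 shared of 7 shingles
sim_ac = jaccard(shingles(page_a), shingles(page_c))  # no shared shingles
```

    Pages whose similarity exceeds some chosen cut-off would be treated as approximate duplicates; the cut-off itself has to be tuned on real data.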


    TrustRank

    Basic principle: approximate isolation. It is rare for a good page to point to a bad (spam) page.

      • Sample a set of seed pages from the web and identify the trusted (good) pages among them
      • Set the trust of each trusted page to 1
      • Propagate trust through links
      • Each page gets a trust value between 0 and 1
      • Use a threshold value and mark all pages below the trust threshold as spam
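    The steps above can be sketched as a seed-biased PageRank iteration. This is one possible formulation, not necessarily the exact one the slides have in mind: the damping factor, the iteration count, the normalisation of the seed trust so that it sums to 1, the threshold and the toy link graph are all assumptions.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: {page: [pages it links to]}; seeds: set of trusted pages.
    Returns {page: trust value between 0 and 1}.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Seed pages share an initial trust of 1; all other pages start at 0.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        # A fixed fraction of trust is always re-injected at the seeds.
        new_trust = {p: (1 - damping) * seed_share[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # trust of pages without out-links is not passed on
            share = damping * trust[page] / len(targets)
            for target in targets:
                new_trust[target] += share
        trust = new_trust
    return trust


# Hypothetical web graph: good pages never link to spam pages.
toy_links = {
    "good1": ["good2", "good3"],
    "good2": ["good3"],
    "good3": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(toy_links, seeds={"good1"})

THRESHOLD = 0.05  # illustrative cut-off
flagged = {page for page, value in trust.items() if value < THRESHOLD}
```

    In this toy graph no trusted page links to `spam1` or `spam2`, so they receive no trust and fall below the threshold, even though `spam1` links to a good page.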


    Anti-Trust Approach

    Broadly based on the same approximate isolation principle.

    This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves.

    Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages.

    A page can be classified as a spam page if its Anti-Trust Rank value exceeds a chosen threshold value.
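    A rough sketch of this reverse propagation: reverse every link, then run the same kind of seed-biased iteration starting from known spam pages. The damping factor, iteration count, threshold, toy graph and function names are assumptions chosen for illustration.

```python
def reverse_links(links):
    """Return the graph with every link direction reversed."""
    reversed_links = {}
    for page, targets in links.items():
        reversed_links.setdefault(page, [])
        for target in targets:
            reversed_links.setdefault(target, []).append(page)
    return reversed_links


def propagate(links, seeds, damping=0.85, iterations=50):
    """Seed-biased score propagation along the given link direction."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    score = dict(seed_share)
    for _ in range(iterations):
        new_score = {p: (1 - damping) * seed_share[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # nothing to pass on without out-links
            share = damping * score[page] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return score


# Hypothetical web graph: farm pages point at the spam target.
toy_links = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["spam_target"],
    "farm2": ["spam_target", "good1"],
    "spam_target": ["farm1"],
}

# Anti-Trust flows backwards: from a spam page to the pages that link to it.
anti_trust = propagate(reverse_links(toy_links), seeds={"spam_target"})

THRESHOLD = 0.05  # illustrative cut-off
flagged = {page for page, value in anti_trust.items() if value > THRESHOLD}
```

    Note that `farm2` linking to `good1` does not taint `good1`: Anti-Trust only reaches pages that point towards spam, not pages that spam points to.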


    Q & A


    Thank you
