web spam detection with anti-trust rank

Web Spam Detection with Anti-Trust Rank

Vijay Krishnan

Rashmi Raj

Computer Science Department

Stanford University

The World Wide Web

The Web

•Huge

•Distributed content creation, linking (no coordination)

•Structured databases, unstructured text, semi-structured data.

•Content includes truth, lies, obsolete information, contradictions, …

PageRank

• Intuition: “a page is important if important pages link to it.”

• In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web.

(A few fixups needed.)

PageRank

• Web graph encoded by matrix M

– N×N matrix (N = number of web pages)

– M_ij = 1/|O(j)| iff there is a link from j to i

– M_ij = 0 otherwise

– O(j) = set of pages that node j links to

• Define matrix A as follows

– A_ij = βM_ij + (1-β)/N, where 0 < β < 1

– 1-β is the “tax” discussed in prior lecture

• Page rank r is the principal eigenvector of A

– Ar = r
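The definition above can be sketched as a power iteration. This is a minimal illustration, not a production implementation: the function name `pagerank`, the default β = 0.85, the stopping tolerance, and the three-page toy graph are all assumptions made for the example, and every page is assumed to have at least one out-link.

```python
def pagerank(out_links, beta=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for r = A r, with A_ij = beta * M_ij + (1 - beta) / N.

    out_links maps each page to the list of pages it links to.
    Assumes every page appears as a key and has at least one out-link.
    """
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(max_iter):
        # Because r sums to 1, the (1 - beta)/N term adds the same
        # amount to every page.
        nxt = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            # Page j passes beta * r_j * M_ij = beta * r_j / |O(j)| to each i in O(j).
            share = beta * r[j] / len(out_links[j])
            for i in out_links[j]:
                nxt[i] += share
        delta = sum(abs(nxt[p] - r[p]) for p in pages)
        r = nxt
        if delta < tol:
            break
    return r


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, page c collects links from both a and b, so it ends up with the highest rank, followed by a and then b.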

Many Random Walkers Model

• Imagine a large number M of independent, identical random walkers (M ≫ N)

• At any point in time, let M(p) be the number of random walkers at page p

• The page rank of p is the fraction of random walkers that are expected to be at page p i.e., E[M(p)]/M.
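The random walkers model can be illustrated with a small simulation. This is a sketch under stated assumptions: each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (the "tax"), and the names `simulate_walkers`, `num_walkers`, and `steps`, the fixed random seed, and the toy graph are all hypothetical. A single snapshot only estimates E[M(p)]/M.

```python
import random
from collections import Counter


def simulate_walkers(out_links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Estimate page rank as the fraction of walkers at each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(out_links)
    # Every walker starts at a uniformly random page.
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w in range(num_walkers):
            if rng.random() < beta:
                # Follow one of the current page's out-links at random.
                positions[w] = rng.choice(out_links[positions[w]])
            else:
                # Teleport to a uniformly random page.
                positions[w] = rng.choice(pages)
    counts = Counter(positions)
    return {p: counts[p] / num_walkers for p in pages}


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(graph)
```

With more walkers the fractions settle closer to the eigenvector solution; the estimate is noisy for pages whose ranks are close together.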

Economic Considerations

• Search has become the default gateway to the web

• Very high premium to appear on the first page of search results

– e.g., e-commerce sites

– advertising-driven sites

What is Web Spam?

• Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with page’s real value

• Spam = web pages that are the result of spamming

• This is a very broad definition

– SEO industry might disagree!

– SEO = search engine optimization

• Approximately 10-15% of web pages are spam

Types of Spamming Techniques

• Term spamming

– Manipulating the text of web pages in order to appear relevant to queries

• Link spamming

– Creating link structures that boost page rank or hubs and authorities scores

Link Spam

• Three kinds of web pages from a spammer’s point of view

– Inaccessible pages

– Accessible pages

  • e.g., web log comments pages

  • spammer can post links to his pages

– Own pages

  • Completely controlled by spammer

  • May span multiple domain names

Link Spam Detection

• Open research area

• One approach: TrustRank

Trust Rank

• Basic principle: approximate isolation

– It is rare for a “good” page to point to a “bad” (spam) page

• Sample a set of “seed pages” from the web.

• Set trust of each trusted page to 1

• Propagate trust through links

• Each page gets a trust value between 0 and 1

• Use a threshold value and mark all pages below the trust threshold as spam
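The steps above can be sketched as a biased PageRank whose teleport vector is concentrated on the trusted seeds. This is one common way to realize trust propagation and is an assumption here, not a formula taken from these slides: in this variant the seed trust is normalized to sum to 1 rather than set to 1 per seed, and the names `trust_rank` and `below_threshold`, the parameter values, and the toy graph are hypothetical.

```python
def trust_rank(out_links, trusted_seeds, beta=0.85, iterations=50):
    """Propagate trust forward along links from a set of trusted seed pages."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # Teleport vector d: uniform over the trusted seeds, zero elsewhere.
    seed_share = 1.0 / len(trusted_seeds)
    d = {p: (seed_share if p in trusted_seeds else 0.0) for p in pages}
    t = dict(d)
    for _ in range(iterations):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for j, targets in out_links.items():
            if targets:
                # Page j splits its trust evenly among the pages it links to.
                share = beta * t[j] / len(targets)
                for i in targets:
                    nxt[i] += share
        t = nxt
    return t


def below_threshold(scores, threshold):
    """Pages whose trust falls below the threshold are marked as spam."""
    return {p for p, s in scores.items() if s < threshold}


# Hypothetical toy graph: g* are good pages, s* are spam pages.
# Spam links to good pages, but no good page links to spam.
graph = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
trust = trust_rank(graph, trusted_seeds={"g1"})
suspected_spam = below_threshold(trust, threshold=0.01)
```

Because no good page links to s1 or s2, no trust ever reaches them, and both fall below the threshold. On real data the threshold is a tuning choice.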

Anti-Trust Approach

• Broadly based on the same “approximate isolation principle”

• This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves.

• Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages.

• A page can be classified as a spam page if it has Anti-Trust Rank value more than a chosen threshold value.
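A minimal sketch of this idea, assuming Anti-Trust is computed as a biased PageRank on the reversed web graph with the teleport vector concentrated on the spam seeds. The function names `reverse_graph`, `anti_trust_rank`, and `above_threshold`, the parameter values, and the toy graph are hypothetical and chosen only for illustration.

```python
def reverse_graph(out_links):
    """Map each page to the list of pages that link to it."""
    in_links = {p: [] for p in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, []).append(src)
    return in_links


def anti_trust_rank(out_links, spam_seeds, beta=0.85, iterations=50):
    """Propagate Anti-Trust backwards along incoming links from spam seeds."""
    in_links = reverse_graph(out_links)
    pages = list(in_links)
    # Teleport vector d: uniform over the spam seeds, zero elsewhere.
    seed_share = 1.0 / len(spam_seeds)
    d = {p: (seed_share if p in spam_seeds else 0.0) for p in pages}
    score = dict(d)
    for _ in range(iterations):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for j in pages:
            sources = in_links[j]
            if sources:
                # Page j passes its Anti-Trust to the pages that link to it.
                share = beta * score[j] / len(sources)
                for i in sources:
                    nxt[i] += share
        score = nxt
    return score


def above_threshold(scores, threshold):
    """Pages whose Anti-Trust exceeds the threshold are classified as spam."""
    return {p for p, s in scores.items() if s > threshold}


# Hypothetical toy graph: g* are good pages, s* are spam pages.
graph = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
    "s3": ["s1"],
}
atr = anti_trust_rank(graph, spam_seeds={"s2"})
suspected_spam = above_threshold(atr, threshold=0.05)
```

Starting from the single seed s2, Anti-Trust reaches s1 (which links to s2) and then s3 (which links to s1), while the good pages, which never link to spam, keep a score of zero.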

Seed Set Selection

• Seed spam set chosen from pages with high page rank.

• Nearly 100% of URLs containing certain terms such as {viagra, gambling, hardporn} as substrings are spam. Use these for evaluation.

• Also some seed pages were chosen by an Oracle (Human Expert).
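The selection steps above might be sketched as follows. This is only an illustration: the helper names `looks_like_spam_url` and `pick_spam_seeds`, the `is_spam` callback standing in for the oracle, and the example URLs and scores are all hypothetical.

```python
SPAM_TERMS = ("viagra", "gambling", "hardporn")


def looks_like_spam_url(url):
    """Heuristic label: the URL contains one of the spam terms as a substring."""
    lowered = url.lower()
    return any(term in lowered for term in SPAM_TERMS)


def pick_spam_seeds(pagerank_scores, is_spam, k):
    """Return the k highest-PageRank pages that the oracle marks as spam.

    is_spam is any callable acting as the oracle (for example a human
    judgment lookup); it is an assumption of this sketch.
    """
    ranked = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)
    return [url for url in ranked if is_spam(url)][:k]


# Hypothetical PageRank scores for four made-up URLs.
scores = {
    "http://example.com/": 0.4,
    "http://cheap-viagra.example/": 0.3,
    "http://news.example/": 0.2,
    "http://gambling-tips.example/": 0.1,
}
seeds = pick_spam_seeds(scores, looks_like_spam_url, k=1)
```

Here the URL heuristic doubles as the oracle purely to keep the example short; the same function accepts any other labeling source.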

Results

• Overall percentage of “spam” pages = 0.28%.

• Average page rank of “spam” / average page rank = 2.6.

• % of “spam” pages in:

– top 1000 Anti-Trust rank pages = 25.3%

– bottom 1000 Trust rank pages = 0.68%

• Ratio of average page ranks of spam pages returned by ATR vs. TR is roughly 6.
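To put these percentages in perspective, the short calculation below converts them into page counts and compares them with the overall rate. It uses only the figures quoted above; the variable names are hypothetical.

```python
overall_spam_rate = 0.0028    # 0.28% of all pages are labeled "spam"
atr_top_1000_rate = 0.253     # 25.3% of the top 1000 Anti-Trust rank pages
tr_bottom_1000_rate = 0.0068  # 0.68% of the bottom 1000 Trust rank pages
k = 1000

# Approximate number of "spam" pages in each list of 1000.
atr_spam_found = round(atr_top_1000_rate * k)   # 253
tr_spam_found = round(tr_bottom_1000_rate * k)  # 7

# How much denser in "spam" each list is than the collection as a whole.
atr_lift = atr_top_1000_rate / overall_spam_rate   # roughly 90
tr_lift = tr_bottom_1000_rate / overall_spam_rate  # roughly 2.4
```

On these figures, the top 1000 Anti-Trust rank pages are about 90 times denser in labeled spam than the collection overall, while the bottom 1000 Trust rank pages are only about 2.4 times denser.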

Results

Number of spam pages under different scenarios

[Chart: vertical axis on a log scale from 1 to 100000; horizontal axis positions 1 to 5; series: TrustRank, AntiTrust (Seed=40), ATR (Seed=80), ATR (Seed=120).]

References

• The PageRank citation ranking: Bringing order to the web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University, 1998.

• Combating Web Spam with TrustRank. Zoltán Gyöngyi, Hector Garcia-Molina and Jan Pedersen. In VLDB 2004.

• Topic-sensitive PageRank. Taher Haveliwala. In WWW 2002.

• The WebGraph dataset. Online at: http://webgraph-data.dsi.unimi.it/
