Web Spam Detection with Anti-Trust Rank
Vijay Krishnan
Rashmi Raj
Computer Science Department
Stanford University
The World Wide Web
•Huge
•Distributed content creation, linking (no coordination)
•Structured databases, unstructured text, semi-structured data.
•Content includes truth, lies, obsolete information, contradictions, …
PageRank
• Intuition: “a page is important if important pages link to it.”
• In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web.
(A few fixups needed.)
PageRank
• Web graph encoded by matrix M
– N×N matrix (N = number of web pages)
– M_ij = 1/|O(j)| iff there is a link from j to i
– M_ij = 0 otherwise
– O(j) = set of pages that page j links to
• Define matrix A as follows
– A_ij = βM_ij + (1-β)/N, where 0<β<1
– 1-β is the “tax” discussed in the prior lecture
• Page rank r is the first (principal) eigenvector of A
– Ar = r
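The fixed point Ar = r can be approximated by power iteration. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the default β = 0.85, the tolerance, and the three-page graph are all illustrative assumptions, and every page is assumed to have at least one out-link.

```python
def pagerank(out_links, beta=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for r = A r, with A_ij = beta * M_ij + (1 - beta) / N.

    out_links maps each page j to the list O(j) of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(max_iter):
        # The "tax" (1 - beta) is shared evenly among all N pages.
        new_r = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            # Page j passes beta * r_j, split equally over its out-links.
            share = beta * r[j] / len(out_links[j])
            for i in out_links[j]:
                new_r[i] += share
        delta = sum(abs(new_r[p] - r[p]) for p in pages)
        r = new_r
        if delta < tol:
            break
    return r


# Hypothetical three-page graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Because r sums to 1, adding (1-β)/N to every entry is the same as multiplying by the dense matrix A, so the sketch only ever touches the actual links.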
Many Random Walkers Model
• Imagine a large number M of independent, identical random walkers (M ≫ N)
• At any point in time, let M(p) be the number of random walkers at page p
• The page rank of p is the fraction of random walkers that are expected to be at page p i.e., E[M(p)]/M.
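The random-walker picture can be checked with a small simulation. This is a rough sketch under stated assumptions: each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (the "tax" from the PageRank slide, which this slide does not spell out). The names `simulate_walkers` and `num_walkers`, the step count, the seed, and the example graph are hypothetical.

```python
import random


def simulate_walkers(out_links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Estimate E[M(p)]/M by moving independent random walkers.

    Assumptions: every page has at least one out-link, and with
    probability 1 - beta a walker teleports to a uniformly random page.
    """
    rng = random.Random(seed)
    pages = sorted(out_links)
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w in range(num_walkers):
            if rng.random() < beta:
                # Follow one of the current page's out-links at random.
                positions[w] = rng.choice(out_links[positions[w]])
            else:
                # Teleport to any page.
                positions[w] = rng.choice(pages)
    counts = {p: 0 for p in pages}
    for p in positions:
        counts[p] += 1
    # Fraction of walkers observed at each page after the last step.
    return {p: counts[p] / num_walkers for p in pages}


# Hypothetical three-page graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(graph)
```

With more walkers and more steps, the observed fractions should move closer to the eigenvector solution; the simulation is a sanity check on the model rather than a practical way to compute PageRank.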
Economic Considerations
• Search has become the default gateway to the web
• Very high premium to appear on the first page of search results
– e.g., e-commerce sites
– advertising-driven sites
What is Web Spam?
• Spamming = any deliberate action taken solely to boost a web page’s position in search engine results, incommensurate with the page’s real value
• Spam = web pages that are the result of spamming
• This is a very broad definition
– SEO industry might disagree!
– SEO = search engine optimization
• Approximately 10-15% of web pages are spam
Types of Spamming Techniques
• Term spamming
– Manipulating the text of web pages in order to appear relevant to queries
• Link spamming
– Creating link structures that boost page rank or hubs and authorities scores
Link Spam
• Three kinds of web pages from a spammer’s point of view
– Inaccessible pages
– Accessible pages
  • e.g., web log comments pages
  • spammer can post links to his pages
– Own pages
  • Completely controlled by spammer
  • May span multiple domain names
Link Spam Detection
• Open research area
• One approach: TrustRank
Trust Rank
• Basic principle: approximate isolation
– It is rare for a “good” page to point to a “bad” (spam) page
• Sample a set of “seed pages” from the web.
• Set trust of each trusted page to 1
• Propagate trust through links
• Each page gets a trust value between 0 and 1
• Use a threshold value and mark all pages below the trust threshold as spam
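One common way to realize "propagate trust through links" is a biased PageRank whose teleport step always returns to the trusted seeds. The sketch below assumes that formulation; the slide does not fix the propagation rule. The function name `trust_rank`, the toy graph, and the threshold 0.01 are hypothetical. Scores here sum to at most 1 rather than pinning each seed at exactly 1.

```python
def trust_rank(out_links, trusted_seeds, beta=0.85, iterations=50):
    """Biased PageRank: the teleport mass goes only to trusted seed pages.

    out_links maps each page to the list of pages it links to.
    Pages without out-links simply do not pass their trust on.
    """
    pages = sorted(out_links)
    # Teleport distribution d: uniform over the trusted seeds, 0 elsewhere.
    d = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
         for p in pages}
    t = dict(d)
    for _ in range(iterations):
        new_t = {p: (1.0 - beta) * d[p] for p in pages}
        for j in pages:
            if out_links[j]:
                # Trust flows forward along j's out-links, split equally.
                share = beta * t[j] / len(out_links[j])
                for i in out_links[j]:
                    new_t[i] += share
        t = new_t
    return t


# Hypothetical graph: good pages (g*) never link to spam pages (s*).
graph = {
    "g1": ["g2", "g3"],
    "g2": ["g1"],
    "g3": ["g1", "g2"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
trust = trust_rank(graph, trusted_seeds={"g1"})
threshold = 0.01  # illustrative value, not taken from the slides
flagged_as_spam = {p for p, score in trust.items() if score < threshold}
```

In this toy graph no trusted page links into the spam cluster, so s1 and s2 never receive trust and fall below any positive threshold.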
Anti-Trust Approach
• Broadly based on the same “approximate isolation principle”
• This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves.
• Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages.
• A page can be classified as a spam page if it has Anti-Trust Rank value more than a chosen threshold value.
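Reverse propagation can be sketched as a biased PageRank run on the transposed graph, with teleport restricted to the spam seeds. This is an assumption about how the propagation might be realized, not a statement of the authors' exact procedure. The function name `anti_trust_rank`, the toy graph, and the threshold 0.01 are hypothetical.

```python
def anti_trust_rank(out_links, spam_seeds, beta=0.85, iterations=50):
    """Propagate anti-trust backwards along links from a spam seed set.

    out_links maps each page to the list of pages it links to.
    Anti-trust at page p is split equally among the pages linking to p.
    """
    pages = sorted(out_links)
    # Transpose the graph: in_links[p] = pages that link to p.
    in_links = {p: [] for p in pages}
    for j in pages:
        for i in out_links[j]:
            in_links[i].append(j)
    # Teleport distribution d: uniform over the spam seeds, 0 elsewhere.
    d = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0)
         for p in pages}
    a = dict(d)
    for _ in range(iterations):
        new_a = {p: (1.0 - beta) * d[p] for p in pages}
        for p in pages:
            if in_links[p]:
                share = beta * a[p] / len(in_links[p])
                for q in in_links[p]:
                    new_a[q] += share
        a = new_a
    return a


# Hypothetical graph: s2 and s3 point to the seed spam page s1.
graph = {
    "g1": ["g2", "g3"],
    "g2": ["g1"],
    "g3": ["g1", "g2"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
    "s3": ["s1"],
}
anti_trust = anti_trust_rank(graph, spam_seeds={"s1"})
threshold = 0.01  # illustrative value, not taken from the slides
flagged_as_spam = {p for p, score in anti_trust.items() if score > threshold}
```

The link from s1 to g1 does not contaminate g1, because anti-trust only travels from a page to the pages that point at it. This matches the approximate isolation argument: good pages rarely point to spam.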
Seed Set selection
• Seed spam set chosen from pages with high page rank.
• Nearly 100% of URLs containing certain terms such as {viagra, gambling, hardporn} as substrings are spam. Use these for evaluation.
• Some seed pages were also chosen by an oracle (human expert).
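The URL-substring heuristic used for evaluation can be sketched as follows. The three terms come from the slide; the lowercase matching, the helper names `looks_like_spam` and `spam_fraction`, and the example URLs are assumptions made for illustration.

```python
# Terms listed on the slide; matching is done on the lowercased URL
# (the case-insensitive comparison is an assumption).
SPAM_TERMS = ("viagra", "gambling", "hardporn")


def looks_like_spam(url):
    """Return True if the URL contains any of the spam terms as a substring."""
    lowered = url.lower()
    return any(term in lowered for term in SPAM_TERMS)


def spam_fraction(urls):
    """Fraction of URLs in a ranked list that the heuristic labels as spam."""
    return sum(looks_like_spam(u) for u in urls) / len(urls)


# Hypothetical list of top-ranked URLs returned by some detector.
top_ranked = [
    "http://cheap-viagra.example/",
    "http://news.example/",
    "http://best-gambling.example/win",
    "http://blog.example/post",
]
fraction = spam_fraction(top_ranked)
```

The heuristic labels only pages whose URLs contain these terms, so it gives a lower bound on the number of spam pages in a list rather than a full count.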
Results
• Overall percentage of “spam” pages = 0.28%.
• Average page rank of “spam” / average page rank = 2.6.
• % of “spam” pages in:
– top 1000 Anti-Trust rank pages = 25.3%
– bottom 1000 Trust rank pages = 0.68%
• Ratio of average page ranks of spam pages returned by ATR vs. TR is roughly 6.
Results
Number of spam pages under different scenarios
[Bar chart: number of spam pages (log-scale axis, 1 to 100000) for x-axis categories 1 to 5; bar labels by series:]
Method                 1     2     3      4      5
TrustRank              1     1     7     68    684
AntiTrust (Seed=40)    4    39   253   1231   1721
ATR (Seed=80)         10   100   937   4724   6569
ATR (Seed=120)        10    99   905   6263  11885
References
• The PageRank citation ranking: Bringing order to the web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University, 1998.
• Combating Web Spam with Trust Rank. Zoltan Gyongyi, Hector Garcia-Molina and Jan Pedersen. In VLDB 2004.
• Topic-sensitive PageRank. Taher Haveliwala. In WWW 2002.
• The WebGraph dataset. Online at: http://webgraph-data.dsi.unimi.it/