Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula…
WSDM 2010
Reporter: Jing Chiu
Email: [email protected]
112/04/21 1Data Mining & Machine Learning Lab
Outline
• Introduction
  ▫ Duplicate URLs
  ▫ Problem Definition
• Related Works
• Algorithms
  ▫ URL Preprocessing
  ▫ Rule Generation
• Evaluation
• Conclusions
Introduction
• Duplicate URLs
• Problem Definition
Duplicate URLs
• Making URLs search engine friendly
  ▫ http://en.wikipedia.org/wiki/Casino_Royale
  ▫ http://en.wikipedia.org/?title=Casino_Royale
• Session-id or cookie information present in URLs
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
• Irrelevant or superfluous components in URLs
  ▫ http://www.amazon.com/Lord-Rings/dp/B000634DCW
  ▫ http://www.amazon.com/dp/B000634DCW
• Webmasters construct URL representations with custom delimiters
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2
Problem Definition
• Given a set of duplicate clusters and their corresponding URLs
  ▫ Learn rules from the URL strings that can identify duplicates
  ▫ Use the learned rules to normalize unseen duplicate URLs into a unique normalized URL
• Applications such as crawlers can apply these generalized rules to a given URL to generate a normalized URL
Related Works
• Do not crawl in the DUST: different URLs with similar text
  ▫ Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
  ▫ Conference: International Conference on World Wide Web, 2007
  ▫ DUST algorithm
    ▫ Discovers substring substitution rules that transform URLs of similar content into one canonical URL
    ▫ Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs
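The substring-substitution idea behind DUST can be illustrated with a minimal Python sketch. The rule list and the `apply_substitutions` helper are hypothetical and hand-written for illustration; DUST itself learns its rules from logs with a confidence measure, which is not reproduced here.

```python
# Hypothetical substitution rules: (source substring, canonical substring).
# Real DUST rules are mined from crawl or web server logs; these are
# hard-coded assumptions used only to show how a rule is applied.
RULES = [
    ("/?title=", "/wiki/"),
]


def apply_substitutions(url, rules=RULES):
    """Rewrite the first occurrence of each rule's source substring."""
    for src, dst in rules:
        if src in url:
            url = url.replace(src, dst, 1)
    return url


# Both spellings of the same page end up as the same canonical string.
a = apply_substitutions("http://en.wikipedia.org/?title=Casino_Royale")
b = apply_substitutions("http://en.wikipedia.org/wiki/Casino_Royale")
```

With this single assumed rule, both Wikipedia URLs from the Duplicate URLs slide map to the same string, so a crawler would only need to fetch one of them.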
Related Works (cont.)
• De-duping URLs via rewrite rules
  ▫ Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
  ▫ Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  ▫ Considers a broader set of rule types that subsume the DUST rules
    ▫ DUST rules
    ▫ Session-id rules
    ▫ Irrelevant path components
    ▫ Complicated rewrites
  ▫ The algorithm learns rules from a cluster of URLs with similar page content; such a cluster is referred to as a duplicate cluster or a dup cluster
Algorithms
• URL Preprocessing
  ▫ Basic Tokenization
  ▫ Deep Tokenization
• Rule Generation
  ▫ Pair-wise Rule Generation
  ▫ Rule Generalization
URL Preprocessing
• Basic Tokenization
  ▫ Uses the standard delimiters specified in RFC 1738
  ▫ Extracted tokens:
    ▫ Protocol
    ▫ Hostname
    ▫ Path components
    ▫ Query-args
• Deep Tokenization
  ▫ Uses an unsupervised technique to learn custom URL encodings used by webmasters
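Basic tokenization can be approximated with Python's standard `urllib.parse` module. This is only a sketch: the `basic_tokenize` name and the dictionary layout are assumptions made for illustration, and deep tokenization (which is learned, not rule-based) is not covered.

```python
from urllib.parse import urlsplit, parse_qsl


def basic_tokenize(url):
    """Split a URL on its standard delimiters into the four token groups
    named on the slide: protocol, hostname, path components, query args."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.netloc,
        # Drop the empty strings produced by leading or doubled slashes.
        "path": [p for p in parts.path.split("/") if p],
        # keep_blank_values keeps arguments such as "?flag=".
        "query": dict(parse_qsl(parts.query, keep_blank_values=True)),
    }


tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
```

For the Stanford example, the session id ends up as the separate query argument `sid`, which is what later allows a rule to single it out.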
URL Preprocessing (cont.)
Rule Generation
• Definitions
  ▫ URL
  ▫ Rule
• Example
  ▫ u1: http://360.yahoo.com/friends-lttU7d6kIuGq
    u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
  ▫ u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
    u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
  ▫ Rule
    Context (C): c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
    Transformation (T): t(k(3.3,1.1)) = lttU7d6kIuGq
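The example rule can be reproduced with a short Python sketch. It assumes that the context is simply every key/value pair of the source URL and that the transformation lists the target's values for the keys on which the two URLs differ; the `pairwise_rule` helper and the use of plain strings as keys are illustrative choices, not the paper's data structures.

```python
def pairwise_rule(source, target):
    """Build a (context, transformation) pair from two duplicate URLs.

    context        : all key/value pairs of the source URL
    transformation : target values for keys where source and target differ
                     (None stands in for a key missing from the target)
    """
    context = dict(source)
    transformation = {
        key: target.get(key)
        for key in set(source) | set(target)
        if source.get(key) != target.get(key)
    }
    return context, transformation


u1 = {"k(1,3)": "http", "k(2,2)": "360.yahoo.com", "k(3.1,1.3)": "friends",
      "k(3.2,1.2)": "-", "k(3.3,1.1)": "lttU7d6kIuGq"}
u2 = {"k(1,3)": "http", "k(2,2)": "360.yahoo.com", "k(3.1,1.3)": "friends",
      "k(3.2,1.2)": "-", "k(3.3,1.1)": "nMfcaJRPUSMQ"}

# u2 is the source and u1 the target, as in the slide's example rule.
context, transformation = pairwise_rule(source=u2, target=u1)
```

Here the context equals u2 and the transformation contains only `k(3.3,1.1)`, matching the Context and Transformation shown in the example.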
Rule Generation (cont.)
• Pair-wise Rule Generation
  ▫ Target Selection
  ▫ Source Selection
• Rule Generalization
  ▫ Pair 1:
    http://www.imdb.com/title/tt0810900/photogallery
    http://www.imdb.com/title/tt0810900/mediaindex
  ▫ Pair 2:
    http://www.imdb.com/title/tt0053198/photogallery
    http://www.imdb.com/title/tt0053198/mediaindex
  ▫ Rule 1:
    c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
  ▫ Rule 2:
    c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
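Rule 1 and Rule 2 differ only in the value of c(k(4.2,2.1)), which suggests replacing that value with the wildcard ∗ allowed in a context. The slides do not spell out the generalization procedure, so the Python sketch below is a naive assumption: it merges two rules that share a transformation by wildcarding every context value on which they disagree. The `generalize` helper and the `"*"` marker are hypothetical.

```python
WILDCARD = "*"  # stands in for the ∗ value a context may take


def generalize(rule_a, rule_b):
    """Merge two (context, transformation) rules into one.

    Returns None when the rules transform different things or constrain
    different keys; otherwise disagreeing context values become wildcards.
    """
    ctx_a, t_a = rule_a
    ctx_b, t_b = rule_b
    if t_a != t_b or ctx_a.keys() != ctx_b.keys():
        return None
    context = {k: (v if ctx_b[k] == v else WILDCARD) for k, v in ctx_a.items()}
    return context, dict(t_a)


def imdb_rule(title_id):
    """Rules 1 and 2 from the slide, parameterized by the title number."""
    context = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
               "k(4.1,2.2)": "tt", "k(4.2,2.1)": title_id,
               "k(5,1)": "photogallery"}
    return context, {"k(5,1)": "mediaindex"}


general_context, general_transformation = generalize(
    imdb_rule("0810900"), imdb_rule("0053198")
)
```

Under this assumption the merged rule keeps every shared context value, wildcards the title number, and still rewrites `photogallery` to `mediaindex`, so it would apply to titles that were never seen during learning.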
Evaluation
• Dataset
• Number of rules after each step
Evaluation (cont.)
• Small dataset
Evaluation (cont.)
• Small dataset
Evaluation (cont.)
• Large dataset
Evaluation (cont.)
• Large dataset
Conclusion
• Presented a set of scalable and robust techniques for de-duplication of URLs
  ▫ Basic and deep tokenization
  ▫ Rule generation and generalization
• Easily adaptable to the MapReduce paradigm
• Effectiveness evaluated on both small and large datasets
Thanks for your attention
• Questions?
Algorithm 1
Algorithm 2
Algorithm 3
Algorithm 4
Algorithm 5
Definitions of URL
• URL: A URL u is defined as a function
  ▫ u : K → V ∪ {⊥}
  ▫ K: keys
    ▫ k(x.i, y.j)
    ▫ x, y represent the position index from the start and end of the URL
    ▫ i, j represent the deep token index
  ▫ V: values
  ▫ A key not present in the URL maps to ⊥
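The key scheme can be made concrete with a small Python sketch. It assumes the URL has already been tokenized into a list of basic tokens, each given as a list of its deep tokens, and that j counts deep tokens from the end in the same way y counts positions from the end (which is what the 360.yahoo.com example suggests). The `url_to_keys` and `lookup` helpers are hypothetical names.

```python
def url_to_keys(tokens):
    """Map a tokenized URL to {key name: value}.

    tokens: list of basic tokens, each a list of deep tokens.
    A token with a single deep token gets the short key form k(x,y).
    """
    n = len(tokens)
    keys = {}
    for x, deep in enumerate(tokens, start=1):
        y = n - x + 1                      # position counted from the end
        m = len(deep)
        for i, value in enumerate(deep, start=1):
            j = m - i + 1                  # deep index counted from the end
            name = f"k({x},{y})" if m == 1 else f"k({x}.{i},{y}.{j})"
            keys[name] = value
    return keys


def lookup(u, key):
    """Evaluate u(key); None plays the role of ⊥ for absent keys."""
    return u.get(key)


u1 = url_to_keys([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
```

Indexing from both ends lets a rule refer to "the last path component" or "the first path component" regardless of how long the URL is.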
Definitions of Rule
• Rule: A rule r is defined as a function
  ▫ r : C → T
  ▫ C: context
    ▫ C : K → V ∪ {∗}
  ▫ T: transformation
    ▫ T : K → V ∪ {⊥, K'}
    ▫ K' = K ∪ ValueConversions
    ▫ ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
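A rule of this shape can be applied to a URL as in the following Python sketch. The encoding is an assumption made for illustration: `"*"` stands for ∗, `None` stands for ⊥ (delete the key), a plain string is a literal value from V, and a tuple such as `("Lowercase", key)` stands for a value conversion of another key. `Copy` is a hypothetical name for a plain key reference, and Encode/Decode are omitted.

```python
CONVERSIONS = {
    "Copy": lambda s: s,        # plain reference to another key (assumed name)
    "Lowercase": str.lower,
    "Uppercase": str.upper,
}


def matches(context, url):
    """A URL matches when every context key has the required value or '*'."""
    return all(v == "*" or url.get(k) == v for k, v in context.items())


def apply_rule(rule, url):
    """Return the transformed key/value map, or a copy if the rule does not fire."""
    context, transformation = rule
    out = dict(url)
    if not matches(context, url):
        return out
    for key, action in transformation.items():
        if action is None:                  # ⊥: drop the key
            out.pop(key, None)
        elif isinstance(action, tuple):     # value conversion of another key
            name, src = action
            out[key] = CONVERSIONS[name](url[src])
        else:                               # literal replacement value
            out[key] = action
    return out


rule = ({"k(3,3)": "title", "k(4.2,2.1)": "*", "k(5,1)": "photogallery"},
        {"k(5,1)": "mediaindex"})
url = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
       "k(4.1,2.2)": "tt", "k(4.2,2.1)": "0053198", "k(5,1)": "photogallery"}
normalized = apply_rule(rule, url)

lower_rule = ({"k(1,5)": "http"}, {"k(2,4)": ("Lowercase", "k(2,4)")})
lowered = apply_rule(lower_rule, {"k(1,5)": "http", "k(2,4)": "WWW.IMDB.com"})
```

Returning an unchanged copy when the context does not match keeps the function safe to run over every rule in a rule set.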
Rule Coverage
MapReduce