learning url patterns for webpage de-duplication

Learning URL Patterns for Webpage De-duplication. Authors: Hema Swetha Koppula… WSDM 2010. Reporter: Jing Chiu. Email: [email protected]. Data Mining & Machine Learning Lab


Page 1: Learning URL Patterns for Webpage De-duplication

Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula…
WSDM 2010
Reporter: Jing Chiu
Email: [email protected]

112/04/21 1 Data Mining & Machine Learning Lab

Page 2: Learning URL Patterns for Webpage De-duplication

Outline

• Introduction
  ▫ Duplicate URLs
  ▫ Problem Definition
• Related Works
• Algorithms
  ▫ URL Preprocessing
  ▫ Rule Generation
• Evaluation
• Conclusions

Page 3: Learning URL Patterns for Webpage De-duplication

Introduction

• Duplicate URLs
• Problem Definition

Page 4: Learning URL Patterns for Webpage De-duplication

Duplicate URLs

• Making URLs search-engine friendly
  ▫ http://en.wikipedia.org/wiki/Casino_Royale
  ▫ http://en.wikipedia.org/?title=Casino_Royale
• Session-id or cookie information present in URLs
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
• Irrelevant or superfluous components in URLs
  ▫ http://www.amazon.com/Lord-Rings/dp/B000634DCW
  ▫ http://www.amazon.com/dp/B000634DCW
• Webmasters construct URL representations with custom delimiters
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2
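To make the session-id case concrete, here is a minimal Python sketch of a hand-written normalization for the two Stanford URLs above. It is not the learned method from the paper: the function name `drop_session_args` and the choice to treat `sid` as a session argument are assumptions made only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption for illustration: "sid" is a session id that does not
# change the page content, so it can be dropped.
SESSION_KEYS = {"sid"}

def drop_session_args(url: str) -> str:
    """Return the URL with session-like query arguments removed."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in SESSION_KEYS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

a = drop_session_args(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8")
b = drop_session_args(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8")
print(a == b)  # both URLs collapse to the same normalized string
```

A rule like this has to be written per site by hand, which is the cost that learning Rules from duplicate clusters is meant to avoid.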

Page 5: Learning URL Patterns for Webpage De-duplication

Problem Definition

• Given a set of duplicate clusters and their corresponding URLs
  ▫ Learning Rules from URL strings which can identify duplicates
  ▫ Utilizing learned Rules for normalizing unseen duplicate URLs into a unique normalized URL
• Applications such as crawlers can apply these generalized Rules on a given URL to generate a normalized URL

Page 6: Learning URL Patterns for Webpage De-duplication

Related Works

• Do Not Crawl in the DUST: Different URLs with Similar Text
  ▫ Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
  ▫ Conference: International Conference on World Wide Web 2007
  ▫ DUST algorithm
    ▪ Discovering substring substitution rules to transform URLs of similar content to one canonical URL
    ▪ Rules are learned from URLs obtained from previous crawl logs or web server logs with a confidence measure
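A substring substitution rule of the kind DUST discovers can be sketched in a few lines of Python. The rule `"/?title="` → `"/wiki/"` is a hypothetical one, picked to match the Wikipedia pair from the Duplicate URLs slide, and `support` is only a crude stand-in for a confidence measure, not DUST's actual definition.

```python
def apply_substitution(url: str, alpha: str, beta: str) -> str:
    """Rewrite the first occurrence of substring alpha to beta."""
    return url.replace(alpha, beta, 1)

def support(urls, alpha, beta):
    """Count URLs in the log whose rewritten form was also observed."""
    seen = set(urls)
    return sum(1 for u in seen
               if alpha in u and apply_substitution(u, alpha, beta) in seen)

# Hypothetical rule and a two-line "log" built from the slide's example.
ALPHA, BETA = "/?title=", "/wiki/"
log = [
    "http://en.wikipedia.org/?title=Casino_Royale",
    "http://en.wikipedia.org/wiki/Casino_Royale",
]
canonical = apply_substitution(log[0], ALPHA, BETA)
print(canonical, support(log, ALPHA, BETA))
```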

Page 7: Learning URL Patterns for Webpage De-duplication

Related Works (cont.)

• De-duping URLs via rewrite rules
  ▫ Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
  ▫ Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  ▫ Considering a broader set of rule types which subsume the DUST rules
    ▪ DUST rules
    ▪ Session-id rules
    ▪ Irrelevant path components
    ▪ Complicated rewrites
  ▫ Algorithm learns rules from a cluster of URLs with similar page content
    ▪ Such a cluster is referred to as a duplicate cluster or a dup cluster

Page 8: Learning URL Patterns for Webpage De-duplication

Algorithms

• URL Preprocessing
  ▫ Basic Tokenization
  ▫ Deep Tokenization
• Rule Generation
  ▫ Pair-wise Rule Generation
  ▫ Rule Generalization

Page 9: Learning URL Patterns for Webpage De-duplication

URL Preprocessing

• Basic Tokenization
  ▫ Using the standard delimiters specified in RFC 1738
  ▫ Extracted tokens:
    ▪ Protocol
    ▪ Hostname
    ▪ Path components
    ▪ Query-args
• Deep Tokenization
  ▫ Using an unsupervised technique to learn custom URL encodings used by webmasters
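Basic tokenization can be approximated with Python's standard URL parser. The sketch below is an assumption about how the four token types might be extracted (the function name `basic_tokenize` and the dictionary layout are illustrative); deep tokenization is not shown because it is learned from data rather than fixed in advance.

```python
from urllib.parse import urlsplit, parse_qsl

def basic_tokenize(url: str) -> dict:
    """Split a URL on the standard delimiters into its basic tokens."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        # Path components, ignoring empty segments from leading slashes.
        "path": [p for p in parts.path.split("/") if p],
        # Query-args as (name, value) pairs, in their original order.
        "query_args": parse_qsl(parts.query, keep_blank_values=True),
    }

tokens = basic_tokenize(
    "http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0"
    "?_fcls=1&_pcatid=1&_pid=43973351&_tab=2")
print(tokens)
```

Applied to the other eBay variant, which packs the same arguments behind `QQ` and `Z` delimiters, this parser would return a single long path component, which is the gap deep tokenization is meant to close.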

Page 10: Learning URL Patterns for Webpage De-duplication

URL Preprocessing (cont.)


Page 11: Learning URL Patterns for Webpage De-duplication

Rule Generation

• Definitions
  ▫ URL
  ▫ Rule
• Example
  ▫ u1: http://360.yahoo.com/friends-lttU7d6kIuGq
    ▪ u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
  ▫ u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
    ▪ u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
  ▫ Rule
    ▪ Context (C): c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
    ▪ Transformation (T): t(k(3.3,1.1)) = lttU7d6kIuGq
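The example above can be reproduced with a short Python sketch. It assumes the URL has already been tokenized (each basic token given as a list of deep tokens) and encodes keys as plain strings; `to_keys` and `pairwise_rule` are hypothetical helper names, and `None` stands in for a key that is absent from the target.

```python
def to_keys(tokens):
    """Map a tokenized URL to {key: value}.

    `tokens` is a list of basic tokens, each a list of deep tokens.
    x/y count basic tokens from the start/end, i/j count deep tokens
    from the start/end. The ".i"/".j" part is omitted when a basic
    token has a single deep token, as in the example above.
    """
    n = len(tokens)
    url = {}
    for x, deep in enumerate(tokens, start=1):
        y = n - x + 1
        m = len(deep)
        for i, value in enumerate(deep, start=1):
            j = m - i + 1
            key = f"k({x},{y})" if m == 1 else f"k({x}.{i},{y}.{j})"
            url[key] = value
    return url

def pairwise_rule(source, target):
    """Context = the source URL; transformation = keys that must change."""
    context = dict(source)
    keys = set(source) | set(target)
    transformation = {k: target.get(k) for k in keys
                      if source.get(k) != target.get(k)}
    return context, transformation

u1 = to_keys([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
u2 = to_keys([["http"], ["360.yahoo.com"], ["friends", "-", "nMfcaJRPUSMQ"]])
context, transformation = pairwise_rule(source=u2, target=u1)
print(transformation)
```

Here u2 plays the source and u1 the target, matching the Context and Transformation listed above.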

Page 12: Learning URL Patterns for Webpage De-duplication

Rule Generation (cont.)

• Pair-wise Rule Generation
  ▫ Target Selection
  ▫ Source Selection
• Rule Generalization
  ▫ Pair 1:
    ▪ http://www.imdb.com/title/tt0810900/photogallery
    ▪ http://www.imdb.com/title/tt0810900/mediaindex
  ▫ Pair 2:
    ▪ http://www.imdb.com/title/tt0053198/photogallery
    ▪ http://www.imdb.com/title/tt0053198/mediaindex
  ▫ Rule 1:
    ▪ c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
  ▫ Rule 2:
    ▪ c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
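One simple way to picture generalization is to merge Rule 1 and Rule 2 by replacing every context value on which they disagree with a wildcard. The Python sketch below does only that; it is a simplification under the assumption that the rules share the same keys and the same transformation, and the paper's actual generalization procedure may differ.

```python
WILDCARD = "*"

def generalize(rules):
    """Merge (context, transformation) rules with identical transformations.

    Context keys whose values differ across the rules become WILDCARD.
    """
    first_context, first_transformation = rules[0]
    if any(t != first_transformation for _, t in rules):
        raise ValueError("rules must share the same transformation")
    context = {}
    for key, value in first_context.items():
        values = {c.get(key) for c, _ in rules}
        context[key] = value if len(values) == 1 else WILDCARD
    return context, first_transformation

def imdb_rule(title_id):
    """Build Rule 1 / Rule 2 from the slide for a given title id."""
    context = {
        "k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
        "k(4.1,2.2)": "tt", "k(4.2,2.1)": title_id,
        "k(5,1)": "photogallery",
    }
    return context, {"k(5,1)": "mediaindex"}

general_context, general_transformation = generalize(
    [imdb_rule("0810900"), imdb_rule("0053198")])
print(general_context["k(4.2,2.1)"])  # the title id is wildcarded
```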

Page 13: Learning URL Patterns for Webpage De-duplication

Evaluation

• Dataset
• Rule numbers after each step

Page 14: Learning URL Patterns for Webpage De-duplication

Evaluation (cont.)

• Small dataset

Page 15: Learning URL Patterns for Webpage De-duplication

Evaluation (cont.)

• Small dataset

Page 16: Learning URL Patterns for Webpage De-duplication

Evaluation (cont.)

• Large dataset

Page 17: Learning URL Patterns for Webpage De-duplication

Evaluation (cont.)

• Large dataset

Page 18: Learning URL Patterns for Webpage De-duplication

Conclusion

• Presented a set of scalable and robust techniques for de-duplication of URLs
  ▫ Basic and deep tokenization
  ▫ Rule generation and generalization
• Easy adaptability to the MapReduce paradigm
• Evaluated effectiveness on both small and large datasets

Page 19: Learning URL Patterns for Webpage De-duplication

•Questions?

Thanks for your attention


Page 20: Learning URL Patterns for Webpage De-duplication

Algorithm 1


Page 21: Learning URL Patterns for Webpage De-duplication

Algorithm 2


Page 22: Learning URL Patterns for Webpage De-duplication

Algorithm 3

Page 23: Learning URL Patterns for Webpage De-duplication

Algorithm 4


Page 24: Learning URL Patterns for Webpage De-duplication

Algorithm 5


Page 25: Learning URL Patterns for Webpage De-duplication

Definitions of URL

• URL: A URL u is defined as a function
  ▫ u : K → V ∪ {⊥}
  ▫ K: keys
    ▪ k(x.i, y.j)
    ▪ x, y represent the position index from the start and end of the URL
    ▪ i, j represent the deep token index
  ▫ V: values
  ▫ A key not present in the URL is denoted by ⊥

Page 26: Learning URL Patterns for Webpage De-duplication

Definitions of Rule

• Rule: A Rule r is defined as a function
  ▫ r : C → T
  ▫ C: context
    ▪ C : K → V ∪ {∗}
  ▫ T: transformation
    ▪ T : K → V ∪ {⊥, K′}
    ▪ K′ = K ∪ ValueConversions
    ▪ ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
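Read operationally, these definitions say that a Rule fires when its context matches a URL and then rewrites the keys named in its transformation. The Python sketch below is one possible reading, not the paper's implementation: `"*"` plays the role of ∗, `None` plays the role of ⊥, a tuple `(conversion, key)` stands for an element of K′, and only an identity copy plus Lowercase and Uppercase are included. Whether ∗ should also match an absent key is not stated on the slide; this sketch assumes it does not.

```python
WILDCARD = "*"   # stands for the wildcard in a context
ABSENT = None    # stands for ⊥: the key is not present

# Subset of value conversions; "Key" copies another key's value unchanged.
CONVERSIONS = {
    "Key": lambda s: s,
    "Lowercase": str.lower,
    "Uppercase": str.upper,
}

def matches(context, url):
    """True if every context key agrees with the URL (wildcard = any value)."""
    for key, expected in context.items():
        actual = url.get(key, ABSENT)
        if expected == WILDCARD:
            if actual is ABSENT:
                return False
        elif actual != expected:
            return False
    return True

def apply_rule(context, transformation, url):
    """Return the rewritten URL map, or an unchanged copy if no match."""
    result = dict(url)
    if not matches(context, url):
        return result
    for key, new_value in transformation.items():
        if new_value is ABSENT:
            result.pop(key, None)              # delete the key
        elif isinstance(new_value, tuple):     # (conversion name, source key)
            name, source_key = new_value
            result[key] = CONVERSIONS[name](url[source_key])
        else:
            result[key] = new_value            # literal replacement value
    return result

# Generalized IMDb rule, with the title id wildcarded.
rule_context = {"k(1,5)": "http", "k(2,4)": "www.imdb.com",
                "k(3,3)": "title", "k(4.1,2.2)": "tt",
                "k(4.2,2.1)": WILDCARD, "k(5,1)": "photogallery"}
rule_transformation = {"k(5,1)": "mediaindex"}

# Hypothetical unseen URL: http://www.imdb.com/title/tt0000001/photogallery
unseen = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
          "k(4.1,2.2)": "tt", "k(4.2,2.1)": "0000001",
          "k(5,1)": "photogallery"}
normalized = apply_rule(rule_context, rule_transformation, unseen)
print(normalized["k(5,1)"])
```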

Page 27: Learning URL Patterns for Webpage De-duplication

Rule Coverage


Page 28: Learning URL Patterns for Webpage De-duplication

MapReduce
