IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA. Thamme Gowda (@thammegowda), Dr. Chris Mattmann (@chrismattmann). CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY. Information Retrieval and Data Science

Upload: thamme-gowda-narayanaswamy

Post on 21-Jan-2017


TRANSCRIPT

Page 1: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA

Thamme Gowda (@thammegowda)

Dr. Chris Mattmann (@chrismattmann)

CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY

Information Retrieval and Data Science

Page 2: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

OUTLINE

• Problem Statement
• Method Overview
• Steps
    • Tree Edit Distance
    • Style Similarity
    • Shared Near Neighbor Clustering
• Evaluation
• Challenges

Page 3: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

PROBLEM STATEMENT

• Scraping data from online marketplaces
• Start with homepage → categories → listing → actual stuff (detail page)

Page 4: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SAMPLE WEB PAGES

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure: eight sample web pages, numbered 1 to 8]

Page 5: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SAMPLE WEB PAGES

[Figure: the eight sample web pages, numbered 1 to 8; two of them are labeled USELESS]

Page 6: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SAMPLE WEB PAGES

[Figure: the eight sample web pages, numbered 1 to 8; two are labeled USELESS and three are labeled "CRAWLER: YES, ANALYSIS: NO"]

Page 7: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SAMPLE WEB PAGES

[Figure: the eight sample web pages, numbered 1 to 8; two are labeled USELESS, three are labeled "CRAWLER: YES, ANALYSIS: NO", and three are labeled USEFUL]

Page 8: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD OVERVIEW

CLUSTERING

Page 9: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

CLUSTERING

• “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups” – Wikipedia
• There are many ways to achieve this.

Page 10: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

HOW DO WE CLUSTER

• Based on similarity between pages
• Semantic similarity
    • Meaning of the web pages (keywords, topics, …)
• Syntactic similarity
    • Web page structure, CSS styles
• This presentation focuses on the syntactic aspect

Page 11: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SIMILARITY CHECK

• HTML ✓
• CSS ✓
• JavaScript ×

Page 12: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : INPUT

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH

Page 13: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : STEP #1

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STRUCTURAL SIMILARITY

Page 14: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

STRUCTURAL SIMILARITY

• Web pages are built with HTML
• HTML Doc → DOM tree
    • a labeled ordered tree
• Structural similarity using tree edit distance (TED)

[Figure: example DOM tree with root HTML, second level HEAD and BODY, third level TITLE, DIV and P]
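The "HTML Doc → DOM tree" step can be sketched as follows. This is a minimal illustration using only Python's standard `html.parser`, not the parser used in the talk; the names `Node`, `DomTreeBuilder` and `build_dom_tree` are hypothetical. It keeps only tag names as labels and preserves child order, and it assumes the document has a single top-level element (anything after the first root closes is ignored).

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A node of a labeled ordered tree: a tag name plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def to_tuple(self):
        return (self.label, [child.to_tuple() for child in self.children])


class DomTreeBuilder(HTMLParser):
    """Builds a tag-only tree from start/end tag events (text is ignored)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        if self.stack:
            self.stack[-1].children.append(node)
        elif self.root is None:
            self.root = node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, plus anything left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def build_dom_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = "<html><head><title>t</title></head><body><div></div><p></p></body></html>"
tree = build_dom_tree(page)
```

Real crawled pages are often malformed, so a production pipeline would more likely rely on a tolerant HTML parser; the sketch only shows the shape of the data that a tree edit distance operates on.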

Page 15: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

MINIMUM TREE EDIT DISTANCE

• An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
• Number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
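To make the definition concrete, here is a small sketch of ordered tree edit distance with unit costs for INSERT, REMOVE and REPLACE. It uses the plain memoized forest recursion, which is easy to read but is not the Zhang-Shasha algorithm and is only practical for small trees. The tuple representation and the names `tree`, `tree_size` and `forest_distance` are assumptions made for this illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """A tree is a hashable pair: (label, tuple of child trees)."""
    return (label, tuple(children))


def tree_size(t):
    return 1 + sum(tree_size(child) for child in t[1])


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits turning forest f1 into forest f2."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(tree_size(t) for t in f2)  # insert every node of f2
    if not f2:
        return sum(tree_size(t) for t in f1)  # remove every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    rest1, rest2 = f1[:-1], f2[:-1]
    # REMOVE the rightmost root of f1: its children move up into the forest.
    remove = forest_distance(rest1 + kids1, f2) + 1
    # INSERT the rightmost root of f2.
    insert = forest_distance(f1, rest2 + kids2) + 1
    # Match the two rightmost roots, REPLACE-ing the label if it differs.
    replace = (forest_distance(rest1, rest2)
               + forest_distance(kids1, kids2)
               + (label1 != label2))
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


page_a = tree("html", tree("head", tree("title")), tree("body", tree("div"), tree("p")))
page_b = tree("html", tree("head", tree("title")), tree("body", tree("div")))
page_c = tree("html", tree("head", tree("title")), tree("body", tree("div"), tree("span")))
```

With these trees, `page_a` and `page_b` differ by one REMOVE (the `p` node), and `page_a` and `page_c` differ by one REPLACE (`p` becomes `span`).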

Page 16: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

MINIMUM TREE EDIT DISTANCE*

• Edit operations
• Normalized distance

[Figure: four numbered panels, 1 to 4]

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
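The slide does not spell out the normalization, so the following is only one plausible sketch, stated as an assumption: with unit costs, the edit distance between two trees can never exceed the sum of their sizes (remove every node of one tree, then insert every node of the other), so dividing by that sum gives a value in [0, 1]. The function names are hypothetical.

```python
def normalized_distance(edit_distance, size1, size2):
    """Scale a unit-cost tree edit distance into [0, 1].

    Assumption: the normalizer is size1 + size2, an upper bound on the
    distance when every operation costs 1. Other normalizers are possible.
    """
    total = size1 + size2
    if total == 0:
        return 0.0  # two empty trees are identical
    return edit_distance / total


def structural_similarity(edit_distance, size1, size2):
    """Similarity is the complement of the normalized distance."""
    return 1.0 - normalized_distance(edit_distance, size1, size2)
```

For example, two trees with 5 and 4 nodes at edit distance 1 have a normalized distance of 1/9 under this choice.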

Page 17: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : STEP #2

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STYLE SIMILARITY

Page 18: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

STYLE SIMILARITY

• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
    • Jaccard similarity on CSS class names
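A minimal sketch of the style measure, assuming the class names are collected with Python's standard `html.parser` rather than an XPath engine (the effect is the same as `//*[@class]/@class` followed by splitting each attribute value on whitespace). The helper names and the two sample snippets are hypothetical.

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collects every CSS class name used in any class attribute."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may list several names: class="ad title"
                self.class_names.update(value.split())


def css_class_names(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def style_similarity(html1, html2):
    return jaccard_similarity(css_class_names(html1), css_class_names(html2))


page_a = '<div class="ad title"><p class="price">x</p></div>'
page_b = '<div class="ad"><span class="price">y</span><i class="seller"></i></div>'
score = style_similarity(page_a, page_b)
```

Here the two pages share `ad` and `price` out of four distinct class names (`ad`, `title`, `price`, `seller`), so the score is 2/4.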

Page 19: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : STEP #3

AGGREGATED = k × STRUCTURAL + (1 - k) × STYLE
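The aggregation is a weighted average of the two similarity scores. A short sketch, assuming both scores are already in [0, 1] and that pairwise scores are held in plain nested lists; the function names are hypothetical:

```python
def aggregated_similarity(structural, style, k=0.5):
    """AGGREGATED = k * STRUCTURAL + (1 - k) * STYLE, with 0 <= k <= 1."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style


def aggregated_matrix(structural_matrix, style_matrix, k=0.5):
    """Combine two pairwise similarity matrices entry by entry."""
    return [
        [aggregated_similarity(s, t, k) for s, t in zip(structural_row, style_row)]
        for structural_row, style_row in zip(structural_matrix, style_matrix)
    ]


structural = [[1.0, 0.5], [0.5, 1.0]]
style = [[1.0, 1.0], [1.0, 1.0]]
combined = aggregated_matrix(structural, style, k=0.5)
```

With k = 0.5 both signals count equally; k = 1 uses structure only and k = 0 uses style only.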

Page 20: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : STEP #4

SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS

Page 21: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SHARED NEAR NEIGHBOR (SNN) ALGORITHM

“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *

* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025-1034.
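The idea can be sketched on a precomputed similarity matrix. This is a simplified, threshold-based variant written for illustration, not the original Jarvis-Patrick procedure (which works on k-nearest-neighbor lists). Its assumptions: a page's neighbors are all pages at or above a similarity threshold, two neighboring pages are merged when the overlap of their neighbor sets (shared neighbors divided by the union) reaches a second threshold, and merges are tracked with a union-find structure. The name `snn_clusters` and both parameter names are hypothetical.

```python
def snn_clusters(sim, similarity_threshold=0.9, shared_threshold=0.9):
    """Group row indices of a square similarity matrix into clusters."""
    n = len(sim)
    # Neighbors of i: every page (including i itself) that is similar enough.
    neighbors = [
        {j for j in range(n) if sim[i][j] >= similarity_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue  # only direct neighbors can be merged
            union = neighbors[i] | neighbors[j]
            shared = neighbors[i] & neighbors[j]
            if len(shared) / len(union) >= shared_threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy matrix: pages 0-2 resemble each other, pages 3-4 resemble each other.
groups = [0, 0, 0, 1, 1]
sim = [
    [1.0 if i == j else (0.95 if groups[i] == groups[j] else 0.1)
     for j in range(5)]
    for i in range(5)
]
clusters = snn_clusters(sim, similarity_threshold=0.9, shared_threshold=0.9)
```

Note that the number of clusters is never specified; it falls out of the two thresholds.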

Page 22: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

WHAT’S GOOD ABOUT SNN ALGORITHM

• Guessing k in k-means is hard
    • Meaningful question: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• Mean / Average of documents in a cluster?
    • Average of DOM trees?
    • Average of CSS styles?
• Circular / Spherical / Globular shapes?

Page 23: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : LAST STEP*

CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS

Page 24: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

METHOD : LAST STEP*

CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS

* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE

Page 25: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

SOME APPLICATIONS?

• Separate the interesting web pages?
• Drop uninteresting/noisy web pages
• Categorical treatment of clusters
    • Extract structured data using XPath
    • Automated extraction using alignment
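As an illustration of the XPath bullet: once a cluster is known to contain pages of one template, a single set of XPath expressions can be applied to every page in it. The sketch below is hypothetical throughout; the sample markup, field names and expressions are invented, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed markup (real HTML usually needs a tolerant parser first).

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for an "ad detail" page.
DETAIL_PAGE = (
    '<html><body><div class="ad">'
    '<h1 class="title">Sample listing</h1>'
    '<span class="price">$100</span>'
    '</div></body></html>'
)

# One XPath per field, shared by every page in the same cluster.
FIELD_XPATHS = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def extract_fields(page, field_xpaths):
    """Return {field: text} for the first element matching each XPath."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in field_xpaths.items():
        element = root.find(xpath)
        has_text = element is not None and element.text
        record[field] = element.text.strip() if has_text else None
    return record


record = extract_fields(DETAIL_PAGE, FIELD_XPATHS)
```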

Page 26: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

WORKFLOW: PART #1

Page 27: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

WORKFLOW: PART #2

Page 28: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

EVALUATION

DATASET: 1310 web pages from http://armslist.com
• 987 Ad detail pages
• 311 Ad listing pages
• 12 others (index, contact, FAQs etc.)

PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments on various thresholds: 85%, 90%, 95%

Page 29: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

EVALUATION

PARAMETERS:
• SIMILARITY = 90%
• SHARED NEIGHBORS = 90%

Page 30: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

EVALUATION

PARAMETERS:
• SIMILARITY = 95%
• SHARED NEIGHBORS = 95%

Page 31: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

EVALUATION

PARAMETERS:
• SIMILARITY = 85%
• SHARED NEIGHBORS = 85%

Page 32: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

CHALLENGES

• TED is very expensive
• Zhang-Shasha’s TED
    • O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
    • That’s O(n^4) in the worst case
    • Approx. 1000 HTML tags per page
    • That’s on the order of 10^12 operations per pair of pages

[Figure: plot of time complexity against number of HTML tags]
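The arithmetic behind these bullets can be written out as follows. The first three lines restate the slide. The last line is an added assumption, not something the slides state: it counts the comparisons needed if every pair of the 1310 pages in the evaluation dataset is compared once to fill a similarity matrix.

```latex
\begin{align*}
T_{\text{pair}} &\in O\bigl(|T_1|\,|T_2|\,
    \min\{\operatorname{depth}(T_1),\operatorname{leaves}(T_1)\}\,
    \min\{\operatorname{depth}(T_2),\operatorname{leaves}(T_2)\}\bigr) \\
  &\subseteq O(n^4) \quad \text{when } |T_1|, |T_2| \le n \\
n \approx 10^{3} &\;\Rightarrow\; n^{4} \approx (10^{3})^{4} = 10^{12} \\
\binom{1310}{2} &= \frac{1310 \cdot 1309}{2} = 857{,}395 \ \text{page pairs}
\end{align*}
```

The O(n^4) figure is a worst-case bound; for shallow or narrow trees the min terms are much smaller than n, so the cost per pair is lower in practice.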

Page 33: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

ACKNOWLEDGMENTS

DARPA MEMEX

* Photo Credits: http://memex.jpl.nasa.gov/

Page 34: IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

THANK YOU

• Source code: https://github.com/USCDataScience/autoextractor
• Tutorial: https://git.io/vwS69
• Follow up
    • Thamme Gowda - @thammegowda
    • Chris Mattmann - @chrismattmann