![Page 1: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/1.jpg)
Clustering the output of Apache Nutch using Apache Spark
Thamme Gowda N., Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada
![Page 2: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/2.jpg)
About
● Thamme Gowda Narayanaswamy - TG in short - @thammegowda
  ○ Contributor to Apache Tika and Apache Nutch
  ○ Now - a grad student @ University of Southern California
  ○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
  ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles
  ○ Director @ Apache Software Foundation
  ○ Chief Architect, NASA JPL
![Page 3: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/3.jpg)
Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and GraphX
● A demo
![Page 4: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/4.jpg)
Audience
● Anyone who crawls the web
● Anyone who extracts data from the web
● Anyone who filters web pages
● Anyone who would like to know about
  ○ web page structure and style similarity
  ○ shared near neighbor clustering
![Page 5: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/5.jpg)
Problem Statement
● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → the actual content (detail pages)
![Page 6: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/6.jpg)
Sample set of web pages. Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
![Page 7: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/7.jpg)
Sample set of web pages (annotated). Labels on the figure: USELESS.
![Page 8: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/8.jpg)
Sample set of web pages (annotated). Labels on the figure: USELESS; REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS.
![Page 9: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/9.jpg)
Sample set of web pages (annotated). Labels on the figure: USELESS; REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS; USEFUL FOR ANALYSIS.
![Page 10: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/10.jpg)
Question: How do we solve this?
Answer: Cluster the web pages
![Page 11: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/11.jpg)
Why Cluster?
● Separate the interesting web pages
  ○ Drop uninteresting/noisy web pages
  ○ Categorical treatment of clusters
● Extract structured data using XPath
  ○ Automated extraction using alignment
![Page 12: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/12.jpg)
Goal
● Group web pages that are similar
● Similar in terms of
  ○ CSS styles
  ○ DOM structure
● Toolkit for experimentation with various thresholds
  ○ % of similarity in style and/or structure
  ○ Nice visualizations
![Page 13: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/13.jpg)
How do we cluster?
● Based on similarity between pages
● Semantic similarity
  ○ Meaning of the web pages
● Syntactic similarity
  ○ Web page structure, CSS styles
● This session focuses on the syntactic aspect
![Page 14: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/14.jpg)
Structural similarity
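● Web pages are built with HTML
● HTML doc → DOM tree
● The DOM is a labeled, ordered tree
● Structural similarity using tree edit distance (TED)

[Figure: example DOM tree. HTML is the root with children HEAD and BODY; TITLE sits under HEAD, and DIV and P sit under BODY.]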
![Page 15: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/15.jpg)
(Minimum) Tree Edit Distance
● An edit distance like the one for strings, but on hierarchical data instead of sequences
● The minimum number of editing operations required to transform one tree into another
● Three basic editing operations: INSERT, REMOVE and REPLACE
● A useful measure to quantify how similar (or dissimilar) two trees are
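
To make the definition concrete, here is a minimal Python sketch of a unit-cost edit distance between ordered, labeled trees. It uses the plain recursive forest formulation with memoization, not the Zhang and Shasha dynamic program cited in this talk, so it is only practical for small trees. The `tree` helper, the function names and the two sample pages are illustrative assumptions.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an immutable ordered tree as (label, (child, child, ...))."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Only INSERTs remain: insert the rightmost root of g, then handle its children.
        _, w_children = g[-1]
        return forest_distance(f, g[:-1] + w_children) + 1
    if not g:
        # Only REMOVEs remain: remove the rightmost root of f, then handle its children.
        _, v_children = f[-1]
        return forest_distance(f[:-1] + v_children, g) + 1
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    return min(
        # REMOVE v: its children take its place in the forest.
        forest_distance(f[:-1] + v_children, g) + 1,
        # INSERT w: its children take its place in the forest.
        forest_distance(f, g[:-1] + w_children) + 1,
        # Match v with w (a REPLACE when the labels differ), then solve the
        # children and the remaining siblings independently.
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (0 if v_label == w_label else 1),
    )

def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))

# Two hypothetical pages that differ by one extra DIV under BODY.
page_a = tree("HTML", tree("HEAD", tree("TITLE")),
              tree("BODY", tree("DIV"), tree("P")))
page_b = tree("HTML", tree("HEAD", tree("TITLE")),
              tree("BODY", tree("DIV"), tree("DIV"), tree("P")))
print(tree_edit_distance(page_a, page_b))  # prints 1 (a single INSERT)
```

Representing trees as nested tuples keeps them hashable, which is what lets `lru_cache` memoize the sub-forest comparisons.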
![Page 16: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/16.jpg)
Example: Tree Edit Distance*
● Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
![Page 17: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/17.jpg)
Style Similarity
● Have you noticed?
  ○ Similar web pages have similar CSS styles
● XPath: "//*[@class]/@class"
● Simple measure -
  ○ Jaccard similarity on CSS class names
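
The style measure can be sketched in a few lines of Python. This version collects class names with the standard library's `html.parser` instead of evaluating the XPath above, and it splits each `class` attribute on whitespace so that individual class names are compared. Both choices, the names used here and the two sample snippets are assumptions made for illustration.

```python
from html.parser import HTMLParser

class ClassNameCollector(HTMLParser):
    """Collects CSS class names from every element that has a class attribute."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may list several names separated by whitespace.
                self.class_names.update(value.split())

def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (an assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical page fragments sharing two of four distinct class names.
listing = '<div class="ad title"><p class="price">10</p></div>'
detail = '<div class="ad body"><span class="price">12</span></div>'
print(jaccard_similarity(css_classes(listing), css_classes(detail)))  # prints 0.5
```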
![Page 18: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/18.jpg)
Web pages consist of:
● HTML ✓
● CSS ✓
● JavaScript ×
![Page 19: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/19.jpg)
Aggregating the Style and Structure
● StructuralSimilarity: based on the normalized tree edit distance
● StyleSimilarity: based on the Jaccard similarity of CSS class names
● Combine on a linear scale
  ○ Aggregated = k · Structural + (1 − k) · Style, with the weight k between 0 and 1
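
A small Python sketch of the linear combination follows. The slide does not say how the tree edit distance is normalized or turned into a similarity, so the sketch assumes one plausible choice: divide by the combined node count of both trees (the cost of removing one tree and inserting the other) and subtract from 1. The function names and sample values are hypothetical.

```python
def structural_similarity(edit_distance, size_a, size_b):
    """Turn a unit-cost tree edit distance into a similarity in [0, 1].

    Assumption: normalize by size_a + size_b, an upper bound on the distance.
    """
    return 1.0 - edit_distance / (size_a + size_b)

def aggregated_similarity(structural, style, k=0.5):
    """Weighted linear combination: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style

# Hypothetical pair of pages: one edit apart, trees of 6 and 7 nodes,
# and a Jaccard style similarity of 0.5.
structural = structural_similarity(edit_distance=1, size_a=6, size_b=7)
print(aggregated_similarity(structural, style=0.5, k=0.5))
```

Setting k closer to 1 makes the DOM structure dominate, while k closer to 0 makes the CSS class names dominate.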
![Page 20: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/20.jpg)
Implementation
![Page 21: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/21.jpg)
Implementation
● Read Nutch’s segments
  ○ sparkContext.sequenceFile(...)
● Filter web pages
  ○ Robust content type detection -- Tika
● Structural similarity
  ○ HTML to DOM tree -- NekoHTML
  ○ Tree edit distance -- Zhang and Shasha’s algorithm
![Page 22: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/22.jpg)
Implementation …
● Style similarity
  ○ Query CSS class names using XPath
● Similarity matrix
  ○ cartesian() on the RDD of pages to get n × n cells
  ○ Spark’s distributed (coordinate) matrix
● Persist the matrix for later experimentation with multiple thresholds
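
The shape of the similarity-matrix step can be shown without a cluster. The following plain-Python stand-in uses `itertools.product` where the Spark job would use a cartesian product of the page RDD with itself, and it emits coordinate-style `(row, column, value)` entries. It is not Spark code; the sample pages and the use of Jaccard similarity as the cell value are assumptions.

```python
from itertools import product

def similarity_entries(items, similarity):
    """Yield a (row, column, value) entry for every ordered pair of items."""
    indexed = list(enumerate(items))
    for (i, a), (j, b) in product(indexed, indexed):
        yield (i, j, similarity(a, b))

def jaccard(a, b):
    # Two empty sets are treated as identical (an assumption).
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical pages, each reduced to its set of CSS class names.
pages = [
    {"ad", "title", "price"},
    {"ad", "body", "price"},
    {"nav", "footer"},
]
entries = list(similarity_entries(pages, jaccard))
print(len(entries))  # n * n cells for n pages
for row, column, value in entries:
    if row < column:
        print(row, column, value)
```

Keeping the entries around, rather than recomputing them, is what makes it cheap to try several thresholds later.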
![Page 23: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/23.jpg)
Clustering
● Shared Near Neighbor Clustering
  ○ Jarvis and Patrick, 1973*
● With improvements
  ○ Graph-based implementation
    ■ Spark GraphX for the win!

* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
![Page 24: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/24.jpg)
What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
  ○ Prior knowledge of the number of clusters?
  ○ Mean/average of documents in a cluster?
    ■ Average of DOM trees?
    ■ Average of CSS styles?
  ○ Circular/spherical/globular shapes?
● Shared Near Neighbor Clustering
  ○ Similarity matrix - pluggable similarity measures - generic
  ○ Thresholds - numbers, percent of match
![Page 25: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/25.jpg)
Shared Near Neighbor Algorithm
“If two data points share a threshold number of neighbors, then they must belong to the same cluster”
![Page 26: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/26.jpg)
Clustering Implementation
● Similarity matrix to graph
  ○ Clusters as nodes, similarity measure as edges
● Check for similar neighbors
  ○ Filter on threshold and merge
    ■ Immutable! - new graph for next iteration
  ○ Repeat
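
The idea behind these steps can be sketched in plain Python. This is a simplified, single-pass version, not the iterative GraphX implementation from the talk: two pages become neighbors when their similarity reaches `similarity_threshold`, and two neighbors are merged into one cluster when they share at least `shared_threshold` other neighbors. The function name, both parameter names and the sample matrix are assumptions.

```python
def shared_near_neighbor_clusters(sim, similarity_threshold, shared_threshold):
    """Cluster items from a symmetric similarity matrix given as a list of lists."""
    n = len(sim)
    # Neighbor sets: every other item that is similar enough.
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= similarity_threshold}
        for i in range(n)
    ]
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbors[i]:
            # Merge neighboring items that share enough near neighbors.
            if i < j and len(neighbors[i] & neighbors[j]) >= shared_threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=min)

# Hypothetical matrix: pages 0-2 are mutually similar, pages 3 and 4 are
# similar only to each other.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
print(shared_near_neighbor_clusters(sim, similarity_threshold=0.5, shared_threshold=1))
```

With `shared_threshold=1`, pages 3 and 4 stay apart because they have no neighbor in common, which shows how the shared-neighbor test is stricter than a plain similarity cut-off.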
![Page 27: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/27.jpg)
Shared Near Neighbor Clustering on Apache Spark GraphX
![Page 28: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/28.jpg)
Challenges
● Tree edit distance is very expensive
![Page 29: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/29.jpg)
What’s ahead on the road?
● Integrate into Apache Nutch
● Auto extraction
  ○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster tree edit distance
  ○ Maybe with approximation techniques
![Page 31: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/31.jpg)
Summary
● Example scenario
● Similarity measures
● Clustering as a solution
● Demo
![Page 32: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/32.jpg)
Acknowledgements
● Dr. Chris Mattmann
  ○ My mentor
  ○ Professor, Director at IRDS @ USC - http://irds.usc.edu
  ○ Director, Apache Software Foundation
● DARPA Memex project
![Page 33: Clustering output of Apache Nutch using Apache Spark](https://reader030.vdocument.in/reader030/viewer/2022021507/58753e581a28abb8208b4695/html5/thumbnails/33.jpg)
Thank You!
● Source Code
● Tutorial
● Follow up
  ○ Thamme Gowda - @thammegowda
  ○ Chris Mattmann - @chrismattmann