AWS Summit Berlin 2012 Talk on Web Data Commons
Large-Scale Analysis of Web Pages - on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group
AWS Summit 2012 | Berlin
Our Starting Point
• Websites now embed structured data in HTML
• Various Vocabularies possible
  • schema.org, Open Graph protocol, ...
• Various Encoding Formats possible
  • μFormats, RDFa, Microdata
Question: How are Vocabularies and Formats used?
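To make "structured data embedded in HTML" concrete, here is a minimal sketch that pulls Microdata properties out of a small page. The sample markup, the `ItempropCollector` class, and the simplified handling (flat `itemprop` text values only, no nested items) are assumptions for illustration and not the extractor used in the talk.

```python
from html.parser import HTMLParser

# Hypothetical sample page: schema.org vocabulary, Microdata encoding.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Espresso Machine</span>
  <span itemprop="brand">Acme</span>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collects item types and the text of elements carrying an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            # Remember the property name until its text content arrives.
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None

def extract_microdata(html):
    parser = ItempropCollector()
    parser.feed(html)
    return parser.item_types, parser.properties

types, props = extract_microdata(SAMPLE_HTML)
print(types)
print(props)
```

The vocabulary (here schema.org) determines which types and property names appear, while the encoding format (here Microdata) determines which HTML attributes carry them.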
Web Indices
• To answer our question, we need access to raw Web data.
• However, maintaining Web indices is insanely expensive
  • Re-crawling, storage, currently ~50 B pages (Google)
• Google and Bing have indices, but do not let outsiders in
Common Crawl
• Non-Profit Organization
• Runs crawler and provides HTML dumps
• Available data:
  • Index 02-12: 1.7 B URLs (21 TB)
  • Index 09/12: 2.8 B URLs (29 TB)
• Available on AWS Public Data Sets
Why AWS?
• Now that we have a web crawl, how do we run our analysis?
• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)
• Preliminary analysis: 1 GB / hour / CPU possible
  • 8-CPU Desktop: 8 months
  • 64-CPU Server: 1 month
  • 100 8-CPU EC2-Instances: ~ 3 days
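The runtime estimates on this slide can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: 1 TB is taken as 1000 GB, throughput is assumed to scale linearly with the number of CPUs, and the names `DATASET_GB`, `GB_PER_CPU_HOUR`, and `days_needed` are illustrative.

```python
# Assumptions: 1 TB = 1000 GB and perfectly linear scaling across CPUs.
DATASET_GB = 50 * 1000       # ~50 TB of crawl data (from the slide)
GB_PER_CPU_HOUR = 1.0        # throughput from the preliminary analysis

def days_needed(cpus):
    """Wall-clock days to process the whole dataset on the given number of CPUs."""
    hours = DATASET_GB / (GB_PER_CPU_HOUR * cpus)
    return hours / 24

setups = {
    "8-CPU desktop": 8,
    "64-CPU server": 64,
    "100 x 8-CPU EC2 instances": 800,
}
for name, cpus in setups.items():
    print(f"{name}: {days_needed(cpus):.1f} days")
```

Under these assumptions the three setups come out at roughly 260, 33, and 2.6 days, in line with the "8 months", "1 month", and "~ 3 days" quoted on the slide.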
Common Crawl Dataset Size
[Figure: Common Crawl dataset size compared with what can be processed in 1 h by 1 CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances]
AWS Setup
• Data Input: Read Index Splits from S3
• Job Coordination: SQS Message Queue
• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)
• Result Output: Write to S3
• Logging: SimpleDB (SDB)
[Diagram: input files 42, 43, ... in the CC bucket on S3; one SQS task per file; EC2 workers; results R42, R43, ... in the WDC bucket on S3]
• Each input file queued in SQS
• EC2 Workers take tasks from SQS
• Workers read and write S3 buckets
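The queue-and-worker pattern described above can be sketched without any AWS dependency. In this illustration, plain dictionaries stand in for the S3 buckets and `queue.Queue` stands in for SQS; these stand-ins, the sample file contents, and the `extract` placeholder are assumptions, not the code used in the talk.

```python
import queue

# In-memory stand-ins (assumptions): a dict per S3 bucket, queue.Queue for SQS.
input_bucket = {"42": "<html>page A</html>", "43": "<html>page B</html>"}  # "CC"
output_bucket = {}                                                          # "WDC"
task_queue = queue.Queue()

def enqueue_inputs():
    """Queue one task per input file."""
    for key in input_bucket:
        task_queue.put(key)

def extract(raw):
    """Placeholder for the real extraction step; it only reports the input size."""
    return f"{len(raw)} bytes processed"

def worker():
    """Take tasks until the queue is empty; read the input, write the result."""
    while True:
        try:
            key = task_queue.get_nowait()
        except queue.Empty:
            return
        output_bucket[f"R{key}"] = extract(input_bucket[key])
        task_queue.task_done()

enqueue_inputs()
worker()  # in the setup from the talk, many EC2 workers would run this loop in parallel
print(sorted(output_bucket))
```

Because workers only pull tasks and keep no shared state, the number of workers can be changed freely. With a real SQS queue, a received message stays hidden only for its visibility timeout and reappears if it is not deleted, which is one way such a setup can tolerate a spot instance disappearing mid-task.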
Results - Types of Data
[Plot: entity count (log scale) per type; series: Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, Microdata 2009/2010]
2012 Microdata Breakdown:
  Website Structure: 23 %
  Products, Reviews: 19 %
  Movies, Music, ...: 15 %
  Geodata: 8 %
  People, Organizations: 7 %
• Available data largely determined by major player support
• “If Google consumes it, we will publish it”
Results - Formats
• URLs with embedded Data: +6%
• Microdata +14% (schema.org?)
• RDFa +26% (Facebook?)
[Plot: percentage of URLs (axis 0 to 4) per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN); series: 2009/2010 and 02/2012]
Results - Extracted Data
• Extracted data available for download at www.webdatacommons.org
• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)
• Have a look!
AWS Costs
• Approx. 5500 machine-hours were required
• AWS billed 1100 € for that
• Cost for other services negligible *
* At first, we underestimated SDB cost
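A quick back-of-the-envelope check puts these cost figures in context. The sketch below derives the effective price per machine-hour from the billed total and compares it with the approximate spot price quoted on the setup slide; using the ~50 TB figure for a per-terabyte cost is an assumption, since the slides do not state exactly how much data the billed run covered.

```python
MACHINE_HOURS = 5500     # approximate, from the slide
BILLED_EUR = 1100        # from the slide
SPOT_PRICE_EUR = 0.17    # approximate c1.xlarge spot price per hour (setup slide)
DATASET_TB = 50          # assumption: the ~50 TB figure from the sizing slide

effective_rate = BILLED_EUR / MACHINE_HOURS        # EUR per machine-hour actually paid
spot_only_cost = MACHINE_HOURS * SPOT_PRICE_EUR    # cost if every hour cost the quoted price
cost_per_tb = BILLED_EUR / DATASET_TB

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"cost at the quoted spot price alone: {spot_only_cost:.0f} EUR")
print(f"cost per TB: {cost_per_tb:.0f} EUR")
```

The effective rate of about 0.20 € per machine-hour is somewhat above the quoted ~0.17 €. The slides do not break down the difference, so spot price fluctuation and the other services are only possible explanations.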
Takeaways
• Web Data Commons now publishes the largest set of structured data from Web pages available
• Large-Scale Web Analysis now possible with Common Crawl datasets
• AWS great for massive ad-hoc computing power and complexity reduction
• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
Thank You!
Web Resources:
http://webdatacommons.org
http://hannes.muehleisen.org
Questions?
Want to hire me?