Hadoop World - Oct 2009
DESCRIPTION
A review of what NYTimes.com has been doing with Hadoop, from the simple to the less simple.
TRANSCRIPT
![Page 1: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/1.jpg)
Cheap Parlor Tricks, Counting, and Clustering
Derek Gottfrid
The New York Times
October 2009
![Page 2: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/2.jpg)
Evolution of Hadoop @ NYTimes.com
![Page 3: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/3.jpg)
Early Days - 2007
Solution looking for a problem
![Page 4: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/4.jpg)
Solution
Wouldn’t it be cool to use lots of EC2 instances
(it’s cheap; nobody will notice)
Wouldn’t it be cool to use Hadoop
(MapReduce Google style is awesome)
![Page 5: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/5.jpg)
Found a Problem
Freeing up historical archives of NYTimes.com, 1851-1922
![Page 6: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/6.jpg)
Problem Bits
Articles are served as PDFs
Really need PDFs from 1851-1981
PDFs are dynamically generated
Free = more traffic
Real deadline
![Page 7: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/7.jpg)
Background
What goes into making a PDF of a NYTimes.com article?
Each article is made up of many different pieces - multiple columns, different sized headings, multiple pages, photos.
![Page 8: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/8.jpg)
Simple Answer
Pre-generate all 11 million PDFs and serve them statically.
![Page 9: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/9.jpg)
Solution
Copy all the source data to S3
Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs
Store the output PDFs in S3
Serve the PDFs out of S3 w/ a signed query string
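The slides do not say how the query strings were signed. As a rough sketch, assuming the legacy S3 query string authentication scheme (HMAC-SHA1 over a short string to sign, with an expiry time), a signed URL could be built as below. The bucket name, object key, and credentials are hypothetical placeholders, and current AWS regions expect the newer Signature Version 4 instead.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_s3_url(bucket, key, access_key_id, secret_key, expires_at):
    """Build a time-limited GET URL in the style of legacy S3 query string auth."""
    # Canonical resource: /bucket/key (slashes in the key are kept as-is).
    resource = "/{}/{}".format(bucket, quote(key))
    # String to sign for a plain GET: verb, empty Content-MD5, empty
    # Content-Type, expiry as Unix seconds, then the canonical resource.
    string_to_sign = "GET\n\n\n{}\n{}".format(expires_at, resource)
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return "https://s3.amazonaws.com{}?AWSAccessKeyId={}&Expires={}&Signature={}".format(
        resource,
        quote(access_key_id, safe=""),
        expires_at,
        quote(signature, safe=""),
    )


# Hypothetical values, for illustration only.
url = sign_s3_url(
    "example-archive", "1851/09/18/article.pdf",
    "AKIDEXAMPLE", "not-a-real-secret", 1256000000,
)
```

Because the expiry is part of the signed string, a reader cannot extend the lifetime of a link by editing the `Expires` parameter.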
![Page 10: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/10.jpg)
A Few Details
Limited HDFS - everything loaded in and out of S3
Reduce = 0 - only used for some stats and error reporting
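One reading of this slide is that the PDF generation was essentially a map-only job: each map task builds PDFs and writes them straight to the output store, and the only aggregation is a handful of counters for stats and errors. The plain-Python simulation below illustrates that shape. `render_pdf`, the article fields, and the dictionary standing in for S3 are all hypothetical, not the Times' code.

```python
from collections import Counter


def render_pdf(article):
    """Hypothetical stand-in for the real PDF composition step."""
    if not article.get("pieces"):
        raise ValueError("article has no source pieces")
    return "%PDF-stub:" + ",".join(article["pieces"])


def map_article(article, store, counters):
    """Map task: build one PDF, write it to the store, and count the outcome."""
    try:
        store[article["id"] + ".pdf"] = render_pdf(article)
        counters["pdfs_written"] += 1
    except ValueError:
        counters["errors"] += 1


def run_map_only(articles):
    """Run every map call; there is no reduce step, only shared counters."""
    store, counters = {}, Counter()  # store stands in for S3
    for article in articles:  # Hadoop would spread these calls across machines
        map_article(article, store, counters)
    return store, counters


store, counters = run_map_only([
    {"id": "a1", "pieces": ["headline.tif", "col1.tif"]},
    {"id": "a2", "pieces": []},
    {"id": "a3", "pieces": ["photo.tif"]},
])
```

In real Hadoop the counters would be job counters reported by the framework, which is roughly what "stats and error reporting" suggests here.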
![Page 11: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/11.jpg)
Breakdown
4.3 TB of source data into S3
11M PDFs - 1.5 TB output
$240 for EC2 - 24hrs x 100 machines
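The figures on this slide can be checked with a little arithmetic. The sketch below derives the hourly rate implied by the slide's own numbers rather than assuming a price list, and it treats 1 TB as 10^12 bytes, which is an assumption.

```python
def breakdown(machines=100, hours=24, total_cost_usd=240.0,
              pdf_count=11_000_000, output_tb=1.5):
    """Derive per-unit figures from the totals quoted on the slide."""
    instance_hours = machines * hours
    return {
        "instance_hours": instance_hours,
        # Implied price per instance-hour, from the slide's own totals.
        "usd_per_instance_hour": total_cost_usd / instance_hours,
        "usd_per_1000_pdfs": total_cost_usd / pdf_count * 1000,
        # Assumes decimal terabytes (1 TB = 10**12 bytes).
        "avg_pdf_kb": output_tb * 10**12 / pdf_count / 1000,
    }


figures = breakdown()
```

This works out to 2,400 instance-hours at an implied $0.10 each, about two cents of compute per thousand PDFs, and an average PDF of roughly 136 KB.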
![Page 12: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/12.jpg)
TimesMachine
http://timesmachine.nytimes.com
![Page 13: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/13.jpg)
Currently - 2009
All that darn data - Web Analytics
![Page 14: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/14.jpg)
Data
Registration / Demographic
Articles 1851 - today
Usage Data / Web Logs
![Page 15: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/15.jpg)
Counting
Classic cookie tracking - let’s add it up
Total PV
Total unique users
PV per user
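The three counts above map naturally onto a single MapReduce pass keyed by cookie. The sketch below simulates that in plain Python. The tab-separated `cookie_id<TAB>url` log format is a hypothetical simplification, and the talk's actual jobs were written in Java.

```python
from collections import defaultdict


def map_hit(log_line):
    """Map: emit (cookie_id, 1) for one page view."""
    cookie_id, _url = log_line.split("\t", 1)
    yield cookie_id, 1


def reduce_user(cookie_id, counts):
    """Reduce: total page views for one cookie."""
    return cookie_id, sum(counts)


def count_usage(log_lines):
    """Simulate map, shuffle, and reduce, then roll up the three totals."""
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle and sort
    for line in log_lines:
        for key, value in map_hit(line):
            grouped[key].append(value)
    per_user = dict(reduce_user(k, v) for k, v in grouped.items())
    total_pv = sum(per_user.values())
    unique_users = len(per_user)
    return {
        "total_pv": total_pv,
        "unique_users": unique_users,
        "pv_per_user": total_pv / unique_users if unique_users else 0.0,
    }


stats = count_usage(["c1\t/world", "c1\t/sports", "c2\t/world"])
```

Keying on the cookie means each reducer sees all page views for one user, so unique users fall out as the number of distinct keys.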
![Page 16: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/16.jpg)
A Few Details
Using EC2 - 20 Machines
Hadoop 0.20.0
12+TB of data
Straight MR in Java
![Page 17: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/17.jpg)
Usage Data
July 2009
???M Page Views
??M Unique Users
![Page 18: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/18.jpg)
Merging Data
Usage data combined with demographic data.
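The slides do not say which join strategy was used. As an illustration only, the sketch below uses a simple in-memory lookup join (the idea behind a map-side join) to attach an age group to each page view and count views from one referrer per age group. All field names and sample records are hypothetical.

```python
from collections import Counter


def count_by_age_group(page_views, demographics, referrer_domain):
    """Join page views to registration data by user id, then count by age group."""
    # Small side of the join held in memory, keyed by the join column.
    age_group_by_user = {d["user_id"]: d["age_group"] for d in demographics}
    counts = Counter()
    for view in page_views:
        if view["referrer"] != referrer_domain:
            continue
        # Users with no registration record fall into an "unknown" bucket.
        counts[age_group_by_user.get(view["user_id"], "unknown")] += 1
    return counts


demographics = [
    {"user_id": "u1", "age_group": "18-24"},
    {"user_id": "u2", "age_group": "35-44"},
]
page_views = [
    {"user_id": "u1", "referrer": "twitter.com"},
    {"user_id": "u1", "referrer": "twitter.com"},
    {"user_id": "u2", "referrer": "google.com"},
    {"user_id": "u3", "referrer": "twitter.com"},
]
clicks = count_by_age_group(page_views, demographics, "twitter.com")
```

When both sides are too large to hold in memory, the usual alternative is a reduce-side join that tags each record with its source and groups on the user id.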
![Page 19: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/19.jpg)
Twitter Click Backs By Age Group
July 2009
![Page 20: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/20.jpg)
Merging Data
Usage data with article meta data
![Page 21: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/21.jpg)
Usage Data combined with Article Data
July 2009
40 Articles
![Page 22: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/22.jpg)
Usage Data combined with Article Data
July 2009
40 Articles
![Page 23: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/23.jpg)
Products
Coming soon...
![Page 24: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/24.jpg)
Clustering
Moving beyond simple counting and joining
Join usage data, demographic information, and article meta data
Apply simple k-means clustering
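For readers unfamiliar with k-means, here is a minimal single-machine sketch of the standard Lloyd's iteration. The points are hypothetical two-dimensional feature vectors, and the starting centroids are passed in explicitly so the run is deterministic. How the Times built its features or distributed the computation is not stated in the slides.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step (the natural "map" side in a MapReduce formulation).
        assignments = [
            min(
                range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
            )
            for point in points
        ]
        # Update step (the "reduce" side): mean of the points in each cluster.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assignments


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each iteration is one assign-then-average pass, which is why k-means fits MapReduce reasonably well: one job per iteration until the centroids settle.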
![Page 25: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/25.jpg)
Clustering
![Page 26: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/26.jpg)
Clustering
![Page 27: Hadoop World - Oct 2009](https://reader036.vdocument.in/reader036/viewer/2022062511/54c4c5134a795943578b45a7/html5/thumbnails/27.jpg)
Conclusion
Large-scale computing is transformative for NYTimes.com.