Hadoop - Simple. Scalable.
TRANSCRIPT
![Page 1: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/1.jpg)
Hadoop
Simple. Scalable.
![Page 3: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/3.jpg)
Java. Clojure. Ruby.
Cloudera Certified
![Page 4: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/4.jpg)
posscon.org
April 15, 16, and 17
![Page 5: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/5.jpg)
Agenda
Overview
Massively Large Data Sets and the problems therein
Distributed File System
MapReduce
Pig
![Page 6: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/6.jpg)
Overview
![Page 7: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/7.jpg)
Doug Cutting
Genius
![Page 8: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/8.jpg)
Favorite Hadoop Story
New York Times
![Page 9: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/9.jpg)
4 Terabytes of Source Articles.
![Page 10: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/10.jpg)
24 Hours.
![Page 11: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/11.jpg)
5.5 Terabytes of PDFs.
![Page 12: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/12.jpg)
Did it again.
![Page 13: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/13.jpg)
$240.
![Page 14: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/14.jpg)
Infoporn from Yahoo
73 hours
490 TB Shuffling
280 TB Output
4000 Nodes
16 PB Disk Space
32K Cores
64 TB RAM
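To put the cluster totals in perspective, here is a rough per-node breakdown. It is only a back-of-the-envelope sketch: it assumes resources are spread evenly over the 4000 nodes, that "32K" means 32,000, and that TB and PB are decimal units. None of those assumptions come from the slide.

```python
# Cluster totals from the slide (decimal units assumed).
NODES = 4000
DISK_BYTES = 16 * 10**15   # 16 PB
CORES = 32_000             # "32K" read as 32,000 (assumption)
RAM_BYTES = 64 * 10**12    # 64 TB

# Even spread across nodes is an assumption, not a stated fact.
disk_tb_per_node = DISK_BYTES // NODES / 10**12
cores_per_node = CORES / NODES
ram_gb_per_node = RAM_BYTES // NODES / 10**9

print(f"{disk_tb_per_node:.0f} TB disk, {cores_per_node:.0f} cores, "
      f"{ram_gb_per_node:.0f} GB RAM per node")
```

Under those assumptions each node is an ordinary server rather than exotic hardware, which fits the "commodity machines" pitch.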
![Page 15: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/15.jpg)
Hadoop solves...
![Page 16: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/16.jpg)
Analyzing Massively Large Datasets
![Page 17: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/17.jpg)
Two Problems
You have to distribute.
![Page 18: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/18.jpg)
Data Storage
Disk capacity has grown much faster than read speeds. Datasets
won't fit on one disk. Tolerate node failure.
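A quick calculation shows why read speed is the bottleneck. The numbers below are illustrative assumptions, not figures from the talk: a 4 TB dataset and roughly 100 MB/s of sustained read throughput per disk.

```python
# Assumed, illustrative numbers (not from the talk).
DATASET_BYTES = 4 * 10**12        # a 4 TB dataset
READ_BYTES_PER_SEC = 100 * 10**6  # ~100 MB/s sustained read per disk

def read_hours(total_bytes, disks):
    """Hours to scan the dataset when it is spread evenly over
    `disks` disks that are all read in parallel."""
    seconds = total_bytes / (READ_BYTES_PER_SEC * disks)
    return seconds / 3600

one_disk = read_hours(DATASET_BYTES, 1)
hundred_disks = read_hours(DATASET_BYTES, 100)
print(f"1 disk: {one_disk:.1f} h, 100 disks: {hundred_disks * 60:.1f} min")
```

With these assumed figures a single disk needs about 11 hours just to read the data once, while 100 disks reading in parallel need under 7 minutes.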
![Page 19: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/19.jpg)
Data Analysis
Combine data from many machines. Tolerate node failure.
![Page 20: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/20.jpg)
How Hadoop solves these problems.
![Page 21: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/21.jpg)
Send Code to Data. Not Data to Code.
![Page 22: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/22.jpg)
Data Storage
HDFS
![Page 23: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/23.jpg)
Name Node. Data Nodes.
Master - Slave Relationship
![Page 24: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/24.jpg)
Shard massive files across multiple machines.
MB, GB, and TB
![Page 25: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/25.jpg)
Tolerant of Node Failure
File blocks replicated across 3 nodes by default.
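The idea of sharding plus replication can be pictured with a toy model. This is a sketch only: the round-robin placement below is a simplification chosen for illustration and is not HDFS's actual placement policy. The 64 MB block size and the node names are assumptions too.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (illustrative)
REPLICATION = 3                # three copies of every block

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: block i goes to `replication`
    distinct, consecutive nodes starting at node i."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

def readable_after_failure(placement, dead_node):
    """True if every block still has a replica on a live node."""
    return all(any(n != dead_node for n in replicas)
               for replicas in placement.values())

nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names
blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
placement = place_replicas(blocks, nodes)
```

Because every block lives on three different machines, losing any one node still leaves two copies of each block.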
![Page 26: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/26.jpg)
HDFS behaves like a normal file system.
No true appends yet.
![Page 27: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/27.jpg)
Demonstration.
![Page 28: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/28.jpg)
Data Analysis
MapReduce
![Page 29: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/29.jpg)
Job Tracker. Task Trackers.
Master - Slave Relationship.
![Page 30: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/30.jpg)
map
![Page 31: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/31.jpg)
Demonstration
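The live demo is not in the transcript. As a stand-in, here is a minimal sketch of `map` in Python (the talk's demos appear to have used Clojure). The sample lines and the `count_words` helper are made up for illustration.

```python
lines = ["send code to data", "not data to code"]

def count_words(line):
    """Number of whitespace-separated words in one line."""
    return len(line.split())

# map applies the function to each element independently.
counts = list(map(count_words, lines))
print(counts)  # [4, 4]
```

Each call depends only on its own input, which is what makes the work easy to spread across machines.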
![Page 32: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/32.jpg)
pmap
![Page 33: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/33.jpg)
Demonstration
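Clojure's `pmap` is a parallel version of `map`. A rough Python analogue, offered as a sketch rather than a reproduction of the demo, can be built on a thread pool. The `pmap` helper and its `workers` parameter are names invented here.

```python
from concurrent.futures import ThreadPoolExecutor

def pmap(fn, items, workers=4):
    """Apply fn to every item using a pool of worker threads.
    Results come back in the same order as the inputs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def count_words(line):
    return len(line.split())

lines = ["send code to data", "not data to code"]
parallel_counts = pmap(count_words, lines)
```

The result is the same as with plain `map`; only the scheduling differs.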
![Page 34: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/34.jpg)
reduce
![Page 35: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/35.jpg)
Demonstration
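As a stand-in for the demo, here is a minimal sketch of `reduce` in Python: it folds a sequence into a single value by combining elements pairwise. The input list is made up for illustration.

```python
from functools import reduce
import operator

counts = [4, 4]

# Start from 0 and add each element in turn: ((0 + 4) + 4).
total = reduce(operator.add, counts, 0)
print(total)  # 8
```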
![Page 36: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/36.jpg)
(reduce (pmap))
![Page 37: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/37.jpg)
Demonstration.
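Putting the two together gives the shape of a MapReduce job: a parallel map produces partial results and a reduce merges them. The sketch below is a single-machine word count in Python with hypothetical helper names. It illustrates the pattern only and does not use Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(chunk):
    """Map step: word frequencies for one chunk of text."""
    return Counter(chunk.split())

def merge(a, b):
    """Reduce step: combine two partial frequency tables."""
    return a + b

def word_count(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, chunks)   # parallel map
        return reduce(merge, partials, Counter())  # fold the results

totals = word_count(["send code to data", "not data to code"])
```

Hadoop applies the same pattern, with the map tasks running on the nodes that already hold the data.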
![Page 38: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/38.jpg)
MapReduce
Java
![Page 39: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/39.jpg)
Nobody likes it.
:-)
![Page 40: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/40.jpg)
MapReduce
Ruby. Python. Unix Utilities.
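Scripting languages plug in through Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below is an assumption-laden illustration, not code from the talk. It writes both steps as Python generators and simulates the sort between them locally, so it runs without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word. Input must be sorted by key,
    which the framework guarantees for reducer input."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce in one process."""
    return list(reducer(sorted(mapper(lines))))

result = run_locally(["send code to data", "not data to code"])
```

In a real streaming job, each function would live in its own script that iterates over `sys.stdin` and prints its output.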
![Page 41: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/41.jpg)
MapReduce
Clojure
![Page 42: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/42.jpg)
Hadoop Ecosystem
Pig. ZooKeeper. Hive. Cascading.
![Page 43: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/43.jpg)
Pig
![Page 44: Hadoop - Simple. Scalable](https://reader033.vdocument.in/reader033/viewer/2022052619/55555163b4c9052b208b4cb4/html5/thumbnails/44.jpg)
HBase