O’Reilly – Hadoop: The Definitive Guide
Ch.1 Meet Hadoop
May 28th, 2010
Taewhi Lee
Outline
Data!
Data Storage and Analysis
Comparison with Other Systems
– RDBMS
– Grid Computing
– Volunteer Computing
The Apache Hadoop Project
‘Digital Universe’ Nears a Zettabyte
Digital Universe: the total amount of data stored in the world’s computers
Zettabyte: 10^21 bytes >> Exabyte >> Petabyte >> Terabyte
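To make the scale of a zettabyte concrete, the unit ladder can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, and the names `UNITS` and `how_many` are made up for this illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def how_many(smaller: str, larger: str) -> int:
    """Return how many `smaller` units fit in one `larger` unit."""
    return UNITS[larger] // UNITS[smaller]


for name in ("exabyte", "petabyte", "terabyte"):
    print(f"1 zettabyte = {how_many(name, 'zettabyte'):,} {name}s")
```

Each step down the ladder is a factor of one thousand, so one zettabyte is a thousand exabytes, a million petabytes, or a billion terabytes.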
Flood of Data
The NYSE generates about 1 TB of new trade data per day
Flood of Data
Facebook hosts 10 billion photos (1 petabyte)
Flood of Data
Internet Archive stores 2 petabytes of data
Individuals’ Data are Growing Apace
It is becoming easier to take more and more photos
Individuals’ Data are Growing Apace
Microsoft Research’s MyLifeBits Project
[Figure: LifeLog, my life in a terabyte (labels: SQL; Capture and encoding)]
Amount of Public Data Increases
Available Public Data Sets on AWS
– Annotated Human Genome
– Public database of chemical structures
– Various census data and labor statistics
Large Data!
How to store & analyze large data?
“More data usually beats better algorithms”
Current HDD
How long does it take to read all the data off the disk?
Capacity: 1 TB
Transfer rate: 100 MB/s
How about using multiple disks?
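The two questions on this slide come down to simple arithmetic, sketched below. The figures of 1 TB and 100 MB/s are from the slide. The choice of 100 disks, the even split of data across them, and the helper name `read_time_seconds` are assumptions made for illustration.

```python
TB = 10**12  # bytes, decimal units
MB = 10**6   # bytes, decimal units


def read_time_seconds(capacity_bytes: float, rate_bytes_per_s: float, disks: int = 1) -> float:
    """Time to read `capacity_bytes` when the data is split evenly
    across `disks` drives that are all read in parallel."""
    return capacity_bytes / (rate_bytes_per_s * disks)


single = read_time_seconds(1 * TB, 100 * MB)
parallel = read_time_seconds(1 * TB, 100 * MB, disks=100)  # assumed disk count

print(f"1 disk:    {single / 3600:.1f} hours")     # 1 disk:    2.8 hours
print(f"100 disks: {parallel / 60:.1f} minutes")   # 100 disks: 1.7 minutes
```

Under these assumptions a full scan drops from close to three hours to under two minutes, which is the motivation for spreading data over many disks.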
Problems with Multiple Disks
Hardware Failure
Most analysis tasks need to combine the distributed data
What Hadoop Provides
– Reliable shared storage (HDFS)
– Reliable analysis system (MapReduce)
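The idea of combining data that lives on different disks can be sketched in plain Python as a word count written in map and reduce steps. This is not Hadoop's API: it runs in a single process with no fault tolerance, and the sample `chunks` and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input: each inner list stands for the records held on one disk.
chunks: List[List[str]] = [
    ["hadoop stores data", "hadoop analyzes data"],
    ["data beats algorithms"],
]


def map_words(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one record."""
    for word in record.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, regardless of which chunk they came from."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> int:
    """Reduce step: combine all values for one key into a single total."""
    return sum(values)


def run_job(data: List[List[str]]) -> Dict[str, int]:
    """Run map over every record in every chunk, then shuffle and reduce."""
    pairs = (pair for chunk in data for record in chunk for pair in map_words(record))
    return {key: reduce_counts(key, values) for key, values in shuffle(pairs).items()}


counts = run_job(chunks)
print(counts)
```

The map step looks at each record in isolation, so it could run wherever the data is stored. Only the grouped intermediate pairs have to be brought together for the reduce step.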
RDBMS
* Low latency for point queries or updates
** Update times of a relatively small amount of data
Grid Computing
Shared storage (SAN)
Works well for predominantly CPU-intensive jobs
Becomes a problem when nodes need to access large data volumes, because network bandwidth becomes the bottleneck
Volunteer Computing
Volunteers donate CPU time from their idle computers
Work units are sent to computers around the world
Suitable for very CPU-intensive work with small data sets
Risky due to running work on untrusted machines
Brief History of Hadoop
Created by Doug Cutting
Originated in Apache Nutch (2002)
– Open source web search engine, a part of the Lucene project
NDFS (Nutch Distributed File System, 2004)
MapReduce (2005)
Doug Cutting joins Yahoo! (Jan 2006)
Official start of Apache Hadoop project (Feb 2006)
Adoption of Hadoop on Yahoo! Grid team (Feb 2006)
The Apache Hadoop Project
Pig, Chukwa, Hive, HBase
MapReduce, HDFS, ZooKeeper
Core, Avro