Distributed File Systems & Hadoop Kevin Queenan

Post on 08-Aug-2020

Distributed File Systems & Hadoop

Kevin Queenan

What is a Distributed File System (DFS)?

Simply...

A distributed file system (DFS) is a file system whose data is stored on one or more servers. The data is accessed and processed as if it were stored on the local client machine.

What is Hadoop?

Apache Hadoop is...

A framework, ecosystem, or set of open-source software tools that allows for the distributed storage and processing of extremely large data sets spread across clusters of commodity-grade hardware.

Why does Hadoop exist?

Consider current industry trends...

Data at a massive scale -> TB and PB

Facebook ingested 20 TB of data per day in 2011

NYSE generated 1TB of data per day in 2010

This data is also heterogeneous:

Images, social network activity, log files, IoT sensors, etc.

TB and PB

80% unstructured

20% structured

Heterogeneous data consisting of log files, audio, video, images, etc.

Good, bad, undefined, incomplete?

Time sensitive, real-time, etc

Challenge: Read 1TB of data

1 machine

4 I/O channels

Each channel operates @ 100 MB/s

Time taken?

Roughly 42 to 45 minutes (1 TB / 400 MB/s ≈ 2,500 s)

10 machines

4 I/O channels

Each channel operates @ 100 MB/s

Time taken?

Roughly 4.2 to 4.5 minutes (1 TB / 4,000 MB/s ≈ 250 s)
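
The arithmetic behind this challenge can be sketched in a few lines of Python. This is a back-of-the-envelope model only: it assumes the data is split evenly across machines, every channel runs at full speed in parallel, and there is no network or coordination overhead. The function name `read_time_minutes` and the two size constants are illustrative, not part of Hadoop.

```python
# Idealized read-time model for the "Read 1TB of data" challenge.
# Assumption: perfect parallelism, no overhead, data split evenly.

ONE_TB_IN_MB = 1_000_000       # decimal terabyte (10^12 bytes)
ONE_TIB_IN_MB = 1024 * 1024    # binary tebibyte, expressed in MiB


def read_time_minutes(data_mb, machines, channels=4, mb_per_s=100.0):
    """Minutes needed to read data_mb when all channels on all machines read in parallel."""
    aggregate_mb_per_s = machines * channels * mb_per_s
    return data_mb / aggregate_mb_per_s / 60.0


if __name__ == "__main__":
    for machines in (1, 10):
        low = read_time_minutes(ONE_TB_IN_MB, machines)
        high = read_time_minutes(ONE_TIB_IN_MB, machines)
        print(f"{machines:>2} machine(s): {low:.1f} to {high:.1f} minutes")
```

Under these assumptions one machine needs roughly 42 to 44 minutes depending on whether a terabyte is counted in decimal or binary units, which the slide rounds to 45. Ten machines need one tenth of that, which is the linear scaling Hadoop is designed around.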

Where was Hadoop developed?

Hadoop Origins

Three Google white papers and their Hadoop counterparts:

1. GFS -> HDFS
2. MapReduce -> MapReduce
3. BigTable -> HBase

Hadoop is an open-source implementation of the ideas in Google’s MapReduce and GFS papers; HBase, which runs on top of HDFS, plays the same role for BigTable

Hadoop’s primary architect is Doug Cutting, who is also credited with creating Apache Lucene

Hadoop grew out of Nutch, an open-source web search project Cutting was working on; it became its own project in 2006, around the time Cutting joined Yahoo!

Cutting’s son named a yellow stuffed elephant Hadoop, and Doug adopted the name for the project

Hadoop’s Design Axioms

1. Store and process massive amounts of data (order of PB)
2. Performance must scale linearly
3. Failure is expected
4. Easily manageable
5. Self-healing file system
6. Run on commodity, off-the-shelf hardware

Hadoop vs RDBMS

A fundamental tenet of relational databases is the db schema -> inherently structured

What about the massive amount of unstructured data we need to house and process?

Scaling commercial relational databases is incredibly expensive and limited

Hadoop cost per user is approx $250/TB

RDBMS cost per user is approx $100,000 - $200,000/TB

Hadoop Architecture

Master/Slave Model

Master

NameNode (HDFS)

JobTracker (MapReduce)

Slave

DataNode (HDFS)

TaskTracker (MapReduce)

NameNode

File metadata: /user/kevin/data1.txt -> 1, 2, 3

r = 3 (replication factor, set in hdfs-site.xml)

DataNode: 2, 3

DataNode: 1, 3

DataNode: 1, 2, 3

DataNode: 1, 2
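
The bookkeeping shown on this slide can be modeled with plain Python dictionaries. This is only a toy sketch, not Hadoop's actual data structures or API: the node names (`datanode-1` and so on) and the helper functions are hypothetical, while the file path, block IDs, block placement, and `r = 3` come from the slide. In a real cluster the replication factor is normally set through the `dfs.replication` property in hdfs-site.xml.

```python
from collections import Counter

REPLICATION = 3  # r = 3 from the slide

# NameNode-style metadata: file path -> ordered list of block IDs.
file_blocks = {"/user/kevin/data1.txt": [1, 2, 3]}

# Which blocks each DataNode holds (node names are hypothetical).
datanode_blocks = {
    "datanode-1": {2, 3},
    "datanode-2": {1, 3},
    "datanode-3": {1, 2, 3},
    "datanode-4": {1, 2},
}


def replica_counts(datanodes):
    """Count how many DataNodes hold a copy of each block."""
    counts = Counter()
    for blocks in datanodes.values():
        counts.update(blocks)
    return counts


def under_replicated(path, datanodes, replication=REPLICATION):
    """Return the blocks of `path` that have fewer than `replication` copies."""
    counts = replica_counts(datanodes)
    return [block for block in file_blocks[path] if counts[block] < replication]


if __name__ == "__main__":
    path = "/user/kevin/data1.txt"
    print("healthy cluster:", under_replicated(path, datanode_blocks))

    # Simulate losing one DataNode and re-check replica counts.
    survivors = {n: b for n, b in datanode_blocks.items() if n != "datanode-3"}
    print("after losing datanode-3:", under_replicated(path, survivors))
```

With the placement from the slide every block has exactly three copies. Dropping one node leaves blocks under-replicated, which is the condition a self-healing file system has to detect and repair by copying those blocks to other nodes.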

Underlying Filesystem

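Each physical drive in each slave DataNode machine is typically formatted with ext3 or ext4

HDFS can be considered an abstract filesystem layered on top of these local filesystems: files are split into fixed-size blocks that are stored on the slave DataNodes, while the master NameNode tracks which DataNodes hold each block (clients transfer block data to and from DataNodes directly)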
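
To make the idea of fixed-size blocks concrete, here is a minimal sketch of how a file's byte range could be cut into blocks. The helper `split_into_blocks` is hypothetical and ignores everything HDFS actually does beyond the arithmetic. The 128 MB block size is an assumption for illustration (it is the usual default in Hadoop 2 and later, and it is configurable).

```python
from typing import List, Tuple

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size, configurable in a real cluster


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes.

    Every block has length block_size except possibly the last one,
    which holds whatever bytes remain.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    for index, (offset, length) in enumerate(split_into_blocks(300 * MB), start=1):
        print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

A 300 MB file under this assumption becomes two full 128 MB blocks plus a 44 MB remainder. Each block would then be stored as an ordinary file on a DataNode's local ext3 or ext4 drive.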

MapReduce

Data Processing Paradigm

MapReduce is a framework for performing high performance distributed data processing using the divide and aggregate programming paradigm
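
The divide and aggregate paradigm is usually illustrated with a word count. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process. It does not use the Hadoop API, and the function names are illustrative. In a real cluster the map and reduce calls would run as tasks on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Divide: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines without the tasks needing to talk to each other.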

Thanks for your time!