hadoopintro
TRANSCRIPT
-
An introduction to Hadoop
-
Hello
Processing against a 156-node cluster
Certified Hadoop Developer
Certified Hadoop System Administrator
-
Goals
Why should you care?
What is it? How does it work?
-
Data Everywhere
Every two days now we create as much information as we did from the dawn of civilization up until 2003.
- Eric Schmidt, then CEO of Google, Aug 4, 2010
-
The Hadoop Project
Originally based on papers published by Google in 2003 and 2004
Hadoop started in 2006 at Yahoo!
Top-level Apache Software Foundation project
Large, active user base and user groups
Very active development, strong development team
-
Who Uses Hadoop?
-
Hadoop Components
HDFS
Storage: self-healing, high-bandwidth clustered storage
MapReduce
Processing: fault-tolerant distributed processing
-
Typical Cluster
3-4,000 commodity servers
Each server: 2x quad-core CPUs, 16-24 GB RAM, 4-12 TB disk
20-30 servers per rack
-
2 Kinds of Nodes
Master Nodes
Slave Nodes
-
Master Nodes
NameNode: only 1 per cluster; metadata server and database
SecondaryNameNode: helps with some housekeeping
JobTracker: only 1 per cluster; job scheduler
-
Slave Nodes
DataNodes: 1-4,000 per cluster; block data storage
TaskTrackers: 1-4,000 per cluster; task execution
-
HDFS Basics
HDFS is a filesystem written in Java
Sits on top of a native filesystem
Provides redundant storage for massive amounts of data
Runs on cheap(ish), unreliable computers
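To make this concrete, here is a minimal sketch of a client talking to HDFS through Hadoop's Java FileSystem API (the path /tmp/hello.txt is only an illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/tmp/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeUTF("hello, HDFS");       // write a small record
        }
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println(in.readUTF());  // read it back
        }
    }
}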
-
HDFS Data
Data is split into blocks and stored on multiple nodes in the cluster
Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
Replicas are stored on different DataNodes
Intended for large files, 100 MB+
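Both knobs are ordinary configuration settings. A sketch, assuming the Hadoop 2+ property names dfs.replication and dfs.blocksize (older releases used dfs.block.size) and a hypothetical file path; the per-file override uses the standard FileSystem.create overload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        // Cluster-wide defaults normally live in hdfs-site.xml;
        // a client can also set them on its Configuration.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // replicas per block
        conf.set("dfs.blocksize", "134217728");  // 128 MB
        FileSystem fs = FileSystem.get(conf);

        // Or override per file: buffer size, replication, block size.
        FSDataOutputStream out = fs.create(
                new Path("/tmp/big.dat"),  // hypothetical path
                true,                      // overwrite
                4096,                      // I/O buffer size
                (short) 3,                 // replication factor
                128L * 1024 * 1024);       // block size in bytes
        out.close();
    }
}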
-
NameNode
A single NameNode stores all metadata: filenames, locations on DataNodes of each block, owner, group, etc.
All information is maintained in RAM for fast lookup
Filesystem metadata size is therefore limited by the amount of RAM available on the NameNode
(A commonly cited rule of thumb is on the order of 150 bytes of NameNode heap per file, directory, or block, so hundreds of millions of objects demand tens of gigabytes of RAM)
-
SecondaryNameNode
The SecondaryNameNode is not a failover NameNode
It performs memory-intensive administrative functions (merging the edit log into the filesystem image) on behalf of the NameNode
It should run on a separate machine
-
DataNode
DataNodes store file contents
Blocks are stored as opaque files on the underlying filesystem
Different blocks of the same file will be stored on different DataNodes
The same block is stored on three (or more) DataNodes for redundancy
-
Self-healing
DataNodes send heartbeats to the NameNode
After a period without any heartbeats, a DataNode is assumed to be lost
The NameNode determines which blocks were on the lost node
The NameNode finds other DataNodes with copies of these blocks
These DataNodes are instructed to copy the blocks to other nodes
Replication is actively maintained
-
HDFS Data Storage
NameNode holds file metadata
DataNodes hold the actual data
Block size is 64 MB, 128 MB, etc.
Each block is replicated three times
[Diagram: the NameNode maps foo.txt to blk_1, blk_2, blk_3 and bar.txt to blk_4, blk_5; the five blocks are spread across five DataNodes so that each block has replicas on three different nodes]
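A client can ask the NameNode where a file's blocks live. A minimal sketch using the standard FileSystem.getFileBlockLocations call (/data/foo.txt is a hypothetical path):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/foo.txt");  // hypothetical path
        FileStatus status = fs.getFileStatus(p);
        // One BlockLocation per block, naming the DataNodes with replicas.
        for (BlockLocation loc :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(),
                    Arrays.toString(loc.getHosts()));
        }
    }
}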
-
What is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Automatic parallelization and distribution
Each node processes data stored on that node (processing goes to the data)
-
Features of MapReduce
Fault tolerance
Status and monitoring tools
A clean abstraction for programmers
-
JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker
The JobTracker resides on a master node
It assigns Map and Reduce tasks to other nodes on the cluster
These nodes each run a software daemon known as the TaskTracker
The TaskTracker is responsible for actually instantiating the Map or Reduce task and reporting progress back to the JobTracker
-
Two Parts
The developer specifies two functions: map() and reduce()
The framework does the rest (a driver sketch follows)
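As a sketch of what "the rest" looks like from the developer's side, here is a minimal driver using the org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are the classes sketched after the word-count slides below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // sketched below
        job.setReducerClass(WordCountReducer.class);  // sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}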
-
map()
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs
map(key_in, value_in) -> (key_out, value_out)
-
reduce()
After the Map phase, all the intermediate values for a given intermediate key are combined together into a list
Each such list is handed to one of the Reducers
The Reducer outputs zero or more final key/value pairs
These are written to HDFS
-
map() Word Count
map(String input_key, String input_value):
  foreach word w in input_value:
    emit(w, 1)
Input:  (1234, to be or not to be)
        (5678, to see or not to see)
Output: (to,1), (be,1), (or,1), (not,1), (to,1), (be,1),
        (to,1), (see,1), (or,1), (not,1), (to,1), (see,1)
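A runnable Java version of this Mapper, sketched against the org.apache.hadoop.mapreduce API (class and field names here are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself.
        for (String w : value.toString().split("\\s+")) {
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }
}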
-
reduce() Word Count
reduce(String output_key, List intermediate_vals):
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
Input:  (to, [1,1,1,1])
        (be, [1,1])
        (or, [1,1])
        (not, [1,1])
        (see, [1,1])
Output: (to, 4)
        (be, 2)
        (or, 2)
        (not, 2)
        (see, 2)
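And the matching Reducer in Java (again, names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word.
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        context.write(key, new IntWritable(count));  // emit (word, total)
    }
}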
-
Resources
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/
http://www.cloudera.com/resources/?media=Video
-
Questions?