hadoopintro
TRANSCRIPT
-
An introduction to Hadoop
-
Hello
Processing against a 156-node cluster
Certified Hadoop Developer
Certified Hadoop System Administrator
-
Goals
Why should you care?
What is it? How does it work?
-
Data Everywhere
Every two days now we create as much information as we did from the dawn of civilization up until 2003.
- Eric Schmidt, then CEO of Google, Aug 4, 2010
-
The Hadoop Project
Originally based on papers published by Google in 2003 and 2004
Hadoop started in 2006 at Yahoo!
Top-level Apache Software Foundation project
Large, active user base and user groups
Very active development, strong development team
-
Who Uses Hadoop?
-
Hadoop Components
HDFS
Storage: self-healing, high-bandwidth clustered storage
MapReduce
Processing: fault-tolerant distributed processing
-
Typical Cluster
3-4,000 commodity servers
Each server: 2x quad-core CPUs, 16-24 GB RAM, 4-12 TB disk
20-30 servers per rack
-
2 Kinds of Nodes
Master Nodes
Slave Nodes
-
Master Nodes
NameNode: only 1 per cluster; metadata server and database
SecondaryNameNode: helps with some housekeeping
JobTracker: only 1 per cluster; job scheduler
-
Slave Nodes
DataNodes: 1-4,000 per cluster; block data storage
TaskTrackers: 1-4,000 per cluster; task execution
-
HDFS Basics
HDFS is a filesystem written in Java
Sits on top of a native filesystem
Provides redundant storage for massive amounts of data
Runs on cheap(ish), unreliable computers
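To make this concrete, here is a minimal sketch of a client talking to HDFS through Hadoop's Java FileSystem API (the path /tmp/hello.txt is only an illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/tmp/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeUTF("hello, HDFS");       // write a small record
        }
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println(in.readUTF());  // read it back
        }
    }
}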
-
HDFS Data
Data is split into blocks and stored on multiple nodes in the cluster
Each block is usually 64 MB or 128 MB (configurable)
Each block is replicated multiple times (configurable)
Replicas are stored on different DataNodes
Intended for large files, 100 MB+
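Both knobs are ordinary configuration settings. A sketch, assuming the Hadoop 2+ property names dfs.replication and dfs.blocksize (older releases used dfs.block.size) and a hypothetical file path; the per-file override uses the standard FileSystem.create overload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        // Cluster-wide defaults normally live in hdfs-site.xml;
        // a client can also set them on its Configuration.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // replicas per block
        conf.set("dfs.blocksize", "134217728");  // 128 MB
        FileSystem fs = FileSystem.get(conf);

        // Or override per file: buffer size, replication, block size.
        FSDataOutputStream out = fs.create(
                new Path("/tmp/big.dat"),  // hypothetical path
                true,                      // overwrite
                4096,                      // I/O buffer size
                (short) 3,                 // replication factor
                128L * 1024 * 1024);       // block size in bytes
        out.close();
    }
}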
-
NameNode
A single NameNode stores all metadata: filenames, locations on DataNodes of each block, owner, group, etc.
All information is maintained in RAM for fast lookup
Filesystem metadata size is therefore limited by the amount of RAM available on the NameNode
(A commonly cited rule of thumb is on the order of 150 bytes of NameNode heap per file, directory, or block, so hundreds of millions of objects demand tens of gigabytes of RAM)
-
SecondaryNameNode
The SecondaryNameNode is not a failover NameNode
It performs memory-intensive administrative functions (merging the edit log into the filesystem image) on behalf of the NameNode
It should run on a separate machine
-
DataNode
DataNodes store file contents
Blocks are stored as opaque files on the underlying filesystem
Different blocks of the same file will be stored on different DataNodes
The same block is stored on three (or more) DataNodes for redundancy
-
Self-healing
DataNodes send heartbeats to the NameNode
After a period without any heartbeats, a DataNode is assumed to be lost
The NameNode determines which blocks were on the lost node
The NameNode finds other DataNodes with copies of these blocks
These DataNodes are instructed to copy the blocks to other nodes
Replication is actively maintained
-
HDFS Data Storage
NameNode holds file metadata
DataNodes hold the actual data
Block size is 64 MB, 128 MB, etc.
Each block is replicated three times
[Diagram: the NameNode maps foo.txt to blk_1, blk_2, blk_3 and bar.txt to blk_4, blk_5; the five blocks are spread across five DataNodes so that each block has replicas on three different nodes]
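A client can ask the NameNode where a file's blocks live. A minimal sketch using the standard FileSystem.getFileBlockLocations call (/data/foo.txt is a hypothetical path):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/foo.txt");  // hypothetical path
        FileStatus status = fs.getFileStatus(p);
        // One BlockLocation per block, naming the DataNodes with replicas.
        for (BlockLocation loc :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(),
                    Arrays.toString(loc.getHosts()));
        }
    }
}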
-
What is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Automatic parallelization and distribution
Each node processes data stored on that node (processing goes to the data)
-
Features of MapReduce
Fault tolerance
Status and monitoring tools
A clean abstraction for programmers
-
JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker
The JobTracker resides on a master node
It assigns Map and Reduce tasks to other nodes on the cluster
These nodes each run a software daemon known as the TaskTracker
The TaskTracker is responsible for actually instantiating the Map or Reduce task and reporting progress back to the JobTracker
-
Two Parts
The developer specifies two functions: map() and reduce()
The framework does the rest (a driver sketch follows)
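As a sketch of what "the rest" looks like from the developer's side, here is a minimal driver using the org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are the classes sketched after the word-count slides below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // sketched below
        job.setReducerClass(WordCountReducer.class);  // sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}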
-
map()
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs
map(key_in, value_in) -> (key_out, value_out)
-
reduce()
After the Map phase, all the intermediate values for a given intermediate key are combined together into a list
Each such list is handed to one of the Reducers
The Reducer outputs zero or more final key/value pairs
These are written to HDFS
-
map() Word Count
map(String input_key, String input_value):
  foreach word w in input_value:
    emit(w, 1)
Input:  (1234, to be or not to be)
        (5678, to see or not to see)
Output: (to,1), (be,1), (or,1), (not,1), (to,1), (be,1),
        (to,1), (see,1), (or,1), (not,1), (to,1), (see,1)
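A runnable Java version of this Mapper, sketched against the org.apache.hadoop.mapreduce API (class and field names here are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself.
        for (String w : value.toString().split("\\s+")) {
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }
}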
-
reduce() Word Count
reduce(String output_key, List intermediate_vals):
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
Input:  (to, [1,1,1,1])
        (be, [1,1])
        (or, [1,1])
        (not, [1,1])
        (see, [1,1])
Output: (to, 4)
        (be, 2)
        (or, 2)
        (not, 2)
        (see, 2)
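And the matching Reducer in Java (again, names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word.
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        context.write(key, new IntWritable(count));  // emit (word, total)
    }
}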
-
Resources
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/
http://www.cloudera.com/resources/?media=Video
-
Questions?