Lecture 2, Part 1

What Hadoop is, what Hadoop is not, and Hadoop's assumptions

What are racks, clusters, nodes, and commodity hardware?

HDFS - Hadoop Distributed File System

Using HDFS commands

MapReduce

Higher-level languages over Hadoop: Pig and Hive

HBase – Overview

HCatalog

What is Hadoop and its components?

What is commodity server hardware?

Why HDFS?

What is the responsibility of NameNode in HDFS?

What is Fault Tolerance?

What is the default replication factor in HDFS?

What is the heartbeat in HDFS?

What are JobTracker and TaskTracker?

Why MapReduce programming model?

Where do we have Data Locality in MapReduce?

Why do we need to use Pig and Hive?

What is the difference between HBase and HCatalog?

Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.

It includes:

Storage: the Hadoop Distributed File System (HDFS)

Access: HBase

Processing: MapReduce

Hadoop is not:

• A replacement for existing data warehouse systems

• A file system

• An online transaction processing (OLTP) system

• A replacement for all programming logic

• A database

Hadoop is written with large clusters of computers in mind and is built around the following assumptions:

• Hardware will fail.

• Applications need a high throughput of data.

• A typical HDFS file is gigabytes to terabytes in size.

• The system should provide high aggregate data bandwidth and scale within a single cluster.

• The system should support tens of millions of files in a single instance.

• Files follow a write-once-read-many access model.

• Moving computation is cheaper than moving data.

• Portability is important.

The three major categories of machine roles in a Hadoop deployment are:

1. Client machines

2. The master nodes: manage the two key functions, distributed data storage (HDFS) and distributed data processing (MapReduce)

3. The slave nodes: do all the dirty work of storing the data and running the computations. They communicate with the master nodes and receive instructions from them.

[Diagram: Client; master nodes (JobTracker, NameNode, Secondary NameNode); slave nodes (DataNode and TaskTracker); processing layer (MapReduce) and storage layer (HDFS)]

The master nodes

NameNode: stores and manages all of the cluster's metadata, so it is the single point of contact for HDFS.

JobTracker: runs on a master node (in small clusters, often the same machine as the NameNode) and schedules and coordinates the MapReduce jobs.

Secondary NameNode: maintains a backup of the metadata held on the NameNode and of the file system change history. It is a checkpoint of that metadata, not a hot standby for the NameNode.

The slave nodes

DataNode: contains the actual data, stored in blocks (default block size: 64 MB).

TaskTracker: performs tasks on the local data, as assigned by the JobTracker.

Client machines

• Have Hadoop installed with all the cluster settings

• Load data into the cluster

• Submit MapReduce jobs

• View the results of the job

In a small cluster, a single physical server may play several roles.

Yahoo! on 1000-node cluster

“Cheap” commodity server hardware

• No need for supercomputers; use commodity, unreliable hardware

• Not desktops!

Clusters are typically built in a two-level architecture:

• Nodes are commodity PCs

• 30-40 nodes/rack

• Uplink from rack is 3-4 gigabit

• Rack-internal is 1 gigabit


The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks.

The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.

After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.

File F (300 MB) is split into five blocks:

Block 1 (64 MB)

Block 2 (64 MB)

Block 3 (64 MB)

Block 4 (64 MB)

Block 5 (Remaining 44 MB)

Files added to HDFS are split into fixed-size blocks.

The block size is configurable, but defaults to 64 MB.

The NameNode maintains metadata about files (which blocks make up each file).

The DataNodes store the actual data (the blocks themselves).

Each block is replicated N times (default N = 3).
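The block arithmetic for File F can be checked with a few lines of Python. This is only an illustrative sketch: the function name `split_into_blocks` is made up for this example and is not part of any Hadoop API, and sizes are kept in whole megabytes for simplicity.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into.

    Hypothetical helper for illustration only; HDFS does this internally.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    blocks = split_into_blocks(300)      # File F from the slide
    print(blocks)                        # [64, 64, 64, 64, 44]
    replication = 3                      # default replication factor
    # Raw storage consumed across the cluster once every block is replicated.
    print(sum(blocks) * replication)     # 900
```

With the default replication factor of 3, the 300 MB file occupies 900 MB of raw storage across the cluster.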

HDFS block


Benefits of replication

Availability: data isn’t lost when a node fails

Reliability: HDFS compares replicas and fixes data corruption

Performance: allows for data locality

[Figure: block 4 is stored on more than one DataNode]

The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block’s availability.
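The two preferences described above can be sketched in Python. This is a simplified illustration, not the NameNode's actual code: the `Replica` class, its fields, and `choose_replica_to_remove` are hypothetical names, and the real implementation considers more state than this.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record of one replica of a block."""
    datanode: str
    rack: str
    free_gb: int  # available disk space on the hosting DataNode


def choose_replica_to_remove(replicas):
    """Pick the replica to delete from an over-replicated block."""
    if not replicas:
        raise ValueError("no replicas to choose from")
    per_rack = Counter(r.rack for r in replicas)
    # First preference: only consider replicas on racks that hold more than
    # one copy, so the number of racks hosting the block does not shrink.
    safe = [r for r in replicas if per_rack[r.rack] > 1]
    candidates = safe or list(replicas)
    # Second preference: remove from the DataNode with the least free space.
    return min(candidates, key=lambda r: r.free_gb)


if __name__ == "__main__":
    replicas = [
        Replica("dn1", "rackA", 500),
        Replica("dn2", "rackA", 120),
        Replica("dn3", "rackB", 40),
        Replica("dn4", "rackC", 300),
    ]
    print(choose_replica_to_remove(replicas).datanode)  # dn2
```

In the example, dn3 has the least free space overall, but removing its replica would leave rackB without a copy, so the sketch picks dn2 instead.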

Block replication

• Many DataNodes (1000s)

• Centralized NameNode

[Figure: blocks 1 to 5 of a file, each replicated across several DataNodes]

Given Data: Huge file containing all emails sent to customer service.

Sample Scenario: How many times did our customers type the word “Refund” into emails sent to customer service?

Typical workflow:

• Load data into the cluster (HDFS writes)

• Analyze the data (MapReduce)

• Store results in the cluster (HDFS writes)

• Read the results from the cluster (HDFS reads)
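The analysis step of this scenario can be imitated in plain Python to show the shape of a map phase and a reduce phase. This is a single-process sketch, not Hadoop code: the function names, the sample emails, and the choice to match “refund” case-insensitively as a whole word are all assumptions made for illustration.

```python
import re
from collections import defaultdict

TARGET = "refund"  # assumed to be matched case-insensitively, as a whole word


def map_phase(email_text):
    """Emit a (word, 1) pair for every occurrence of the target word."""
    for token in re.findall(r"[a-z]+", email_text.lower()):
        if token == TARGET:
            yield (TARGET, 1)


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def count_refunds(emails):
    """Run map, shuffle and reduce over a list of email bodies."""
    pairs = [pair for email in emails for pair in map_phase(email)]
    totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    return totals.get(TARGET, 0)


if __name__ == "__main__":
    sample_emails = [  # made-up data standing in for the huge email file
        "I would like a refund for my order.",
        "Refund please! The refund form is broken.",
        "Thanks for the quick delivery.",
    ]
    print(count_refunds(sample_emails))  # 3
```

On a real cluster, each map task would run on a node that holds a block of the email file, and only the small (word, count) pairs would travel over the network to the reducer.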

• [-ls <path>]

• [-du <path>]

• [-cp <src> <dst>]

• [-rm <path>]

• [-put <localsrc> <dst>]

• [-copyFromLocal <localsrc> <dst>]

• [-moveFromLocal <localsrc> <dst>]

• [-get [-crc] <src> <localdst>]

• [-cat <src>]

• [-copyToLocal [-crc] <src> <localdst>]

• [-moveToLocal [-crc] <src> <localdst>]

• [-mkdir <path>]

• [-touchz <path>]

• [-test -[ezd] <path>]

• [-stat [format] <path>]

• [-help [cmd]]

1. Create a directory in HDFS at given path(s).

2. List the contents of a directory.

3. Remove a file or directory in HDFS. Removes the files specified as arguments; deletes a directory only when it is empty.

Remove a folder and its contents: hadoop fs -rm -r /folder

4. Upload a file to HDFS. Copies a single source file, or multiple source files, from the local file system to the Hadoop distributed file system.

5. Download a file from HDFS. Copies files to the local file system.

6. See contents of a file

./filename.exet
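The six operations above map onto options from the command list (-mkdir, -ls, -rm, -put, -get, -cat). The Python sketch below only builds and prints the corresponding `hadoop fs` command lines; it does not run them. The helper `hdfs_cmd` and the paths `/demo` and `emails.txt` are hypothetical, chosen for illustration.

```python
def hdfs_cmd(option, *args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", option, *args]


# Hypothetical paths; replace them with real ones on an actual cluster.
STEPS = [
    ("Create a directory", hdfs_cmd("-mkdir", "/demo")),
    ("List a directory", hdfs_cmd("-ls", "/demo")),
    ("Upload a file", hdfs_cmd("-put", "emails.txt", "/demo")),
    ("See contents of a file", hdfs_cmd("-cat", "/demo/emails.txt")),
    ("Download a file", hdfs_cmd("-get", "/demo/emails.txt", "emails_copy.txt")),
    ("Remove a folder", hdfs_cmd("-rm", "-r", "/demo")),
]


def describe(steps):
    """Return one printable line per step, without executing anything."""
    return [f"{label}: {' '.join(cmd)}" for label, cmd in steps]


if __name__ == "__main__":
    for line in describe(STEPS):
        print(line)
```

To execute a step on a machine with Hadoop installed, each argument list could be passed to `subprocess.run(cmd, check=True)`.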
