Hadoop Distributed File System (HDFS): Behind the Scenes
DESCRIPTION
This presentation walks you through the core concepts of HDFS architecture: where HDFS fits in Hadoop, what the HDFS architecture is, and what its role is...
TRANSCRIPT
Hadoop Distributed File System: Behind the Scenes
WHAT IS
WHERE HDFS FITS IN HADOOP?
LET’S FIRST UNDERSTAND BUZZWORDS IN THE HADOOP WORLD
REPLICATION
FAULT TOLERANCE
LOAD BALANCING
RELIABILITY
CLUSTERING
IT’S TIME FOR DEEP DIVE…
HDFS ARCHITECTURE
Name Node
Data Node
Task Tracker
Job Tracker
Image and Journal
HDFS Client
Checkpoint Node
Backup Node
[Architecture diagram: HDFS Client; Name Node with Job Tracker, Checkpoint, Image, and Journal; Backup Node; Data Node 1, Data Node 2, ... Data Node N, each with a Task Tracker]
NAME NODE
[Diagram: Name Node containing Job Tracker, Inode, Image, Checkpoint, and Journal]
Inode - Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas.
Image - The inode data and the list of blocks belonging to each file.
Checkpoint - The persistent record of the image, stored in the local host’s native file system.
Journal - Write-ahead commit log for changes to the file system that must be persistent.
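How the image, journal, and checkpoint relate can be sketched in a few lines of Python. This is a toy model and not NameNode code: the class `ToyNameNode`, its method names, and the in-memory structures standing in for on-disk files are all assumptions made for illustration.

```python
import copy
import time


class ToyNameNode:
    """Toy model of image, journal, and checkpoint (all names hypothetical)."""

    def __init__(self):
        self.image = {}       # path -> inode attributes plus block list (in memory)
        self.journal = []     # write-ahead log of namespace changes
        self.checkpoint = {}  # persistent snapshot of the image (a dict stands in for a file)

    def create_file(self, path, permissions, blocks):
        record = {"op": "create", "path": path,
                  "permissions": permissions, "blocks": list(blocks),
                  "mtime": time.time()}
        self.journal.append(record)      # log the change first (write-ahead)
        self._apply(self.image, record)  # then update the in-memory image

    @staticmethod
    def _apply(image, record):
        if record["op"] == "create":
            image[record["path"]] = {"permissions": record["permissions"],
                                     "mtime": record["mtime"],
                                     "blocks": record["blocks"]}

    def save_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.image)
        self.journal.clear()  # logged changes are now covered by the checkpoint

    def recover(self):
        """Rebuild the image from the checkpoint plus the journal."""
        image = copy.deepcopy(self.checkpoint)
        for record in self.journal:
            self._apply(image, record)
        return image


nn = ToyNameNode()
nn.create_file("/a.txt", 0o644, ["blk_1"])
nn.save_checkpoint()
nn.create_file("/b.txt", 0o600, ["blk_2", "blk_3"])
recovered = nn.recover()
```

In this sketch, replaying the journal on top of the last checkpoint yields the same image that was held in memory, which is the reason the journal has to be written before a change is acknowledged.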
DATA NODE
On Start Up…
Data Node, Name Node
DATA NODE
Data Node -> Name Node: total storage capacity, fraction of storage in use, number of data transfers in progress
Name Node -> Data Node: commands
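In HDFS, these fields travel in the periodic heartbeat a Data Node sends to the Name Node, and the Name Node's commands come back in the reply. A minimal sketch of that exchange follows; the `Heartbeat` class, the `handle_heartbeat` function, and the command string are hypothetical names chosen for illustration, not the Hadoop API.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Fields a Data Node reports (field names are illustrative)."""
    node_id: str
    total_capacity_bytes: int
    used_bytes: int
    transfers_in_progress: int

    @property
    def fraction_used(self) -> float:
        return self.used_bytes / self.total_capacity_bytes


def handle_heartbeat(hb, pending_commands):
    """Return the commands queued for this node as the heartbeat reply.

    pending_commands maps node_id -> list of command strings.
    """
    return pending_commands.pop(hb.node_id, [])


hb = Heartbeat("DN1", total_capacity_bytes=1000, used_bytes=250,
               transfers_in_progress=2)
pending = {"DN1": ["replicate blk_7 to DN3"]}  # made-up command text
reply = handle_heartbeat(hb, pending)
```

The Name Node never calls a Data Node directly in this model; it only answers heartbeats, which is why commands are queued per node until the next report arrives.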
HDFS CLIENT
IMAGE & JOURNAL
Flush & Sync Operation
CHECKPOINT NODE
BACKUP NODE
FILE I/O OPERATIONS
Single Writer
Multiple Reader
DATA WRITE OPERATION
[Diagram: Client, Name Node, and Data Nodes DN1, DN2, DN3, DN4]
Pipeline: client -> DN1 -> DN2 -> DN3
Sequence: setup, packet1, packet2, packet3, packet4, packet5, close
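The pipeline in the diagram can be approximated with a short simulation. This is a simplified sketch under stated assumptions: `ToyDataNode` and `write_block` are hypothetical names, CRC32 from Python's `zlib` stands in for the checksum, and each packet is acknowledged before the next one is sent, whereas a real pipeline keeps several packets in flight.

```python
import zlib


class ToyDataNode:
    """Stores packets and forwards them to the next node in the pipeline."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = []  # packets received for the block being written

    def receive(self, payload, checksum):
        if zlib.crc32(payload) != checksum:
            raise ValueError(f"{self.name}: checksum mismatch")
        self.stored.append(payload)
        if self.downstream is not None:
            # Forward the packet; the ack travels back up the pipeline.
            return self.downstream.receive(payload, checksum)
        return True  # the last node in the pipeline acknowledges


def write_block(data, first_node, packet_size):
    """Split data into packets and push each one through the pipeline."""
    acks = 0
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        if first_node.receive(packet, zlib.crc32(packet)):
            acks += 1
    return acks


# Build the pipeline client -> DN1 -> DN2 -> DN3.
dn3 = ToyDataNode("DN3")
dn2 = ToyDataNode("DN2", downstream=dn3)
dn1 = ToyDataNode("DN1", downstream=dn2)
data = b"0123456789" * 5
acks = write_block(data, dn1, packet_size=10)
```

With 50 bytes and 10-byte packets, the client sends five packets, matching packet1 through packet5 in the diagram, and every node ends up with an identical copy.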
DATA WRITE/READ OPERATION
[Diagram: Client, Name Node, DN1]
Single Writer Multiple Reader Model
Lease Management (Soft Limit and Hard Limit)
Pipelining, Buffering and Hflush
Checksum for data integrity
Choosing nodes for read operation
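The soft and hard lease limits can be illustrated with a small state function. In HDFS the writer has exclusive access until the soft limit expires, another client may preempt the lease after that, and once the hard limit expires the file is closed on the writer's behalf. The function name `lease_state` and the two limit values below are assumptions for illustration.

```python
SOFT_LIMIT_S = 60    # assumed value for illustration
HARD_LIMIT_S = 3600  # assumed value for illustration


def lease_state(last_renewal_s, now_s, soft=SOFT_LIMIT_S, hard=HARD_LIMIT_S):
    """Classify a write lease by the time elapsed since its last renewal."""
    age = now_s - last_renewal_s
    if age < soft:
        return "exclusive"    # writer still holds exclusive access
    if age < hard:
        return "preemptible"  # another client may take over the lease
    return "expired"          # file is closed on the writer's behalf
```

A writer that keeps renewing its lease stays in the first state indefinitely; the two limits only matter when the client stops responding.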
BLOCK PLACEMENT
/
RACK1: DN1 DN2 DN3 DN4 DN5
RACK2: DN6 DN7 DN8 DN9 DN10
RACK3: DN11 DN12 DN13 DN14 DN15
Name Node: Inode, Image, Journal, Checkpoint
Client -> Name Node: Add(data)
Name Node -> Client: Data Nodes for Replica
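The slide does not spell out the placement rule, so the sketch below assumes the default HDFS policy for three replicas: the first replica goes on the writer's node, and the second and third go on two different nodes in one other rack. The function `choose_replica_nodes` is a hypothetical name, and the rack layout mirrors the three racks in the diagram.

```python
import random


def choose_replica_nodes(racks, writer_node, rng):
    """Pick three nodes: the writer's node plus two nodes on one remote rack.

    racks maps a rack name to its list of node names.
    """
    writer_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    remote_rack = rng.choice([r for r in racks if r != writer_rack])
    second, third = rng.sample(racks[remote_rack], 2)
    return [writer_node, second, third]


racks = {
    "RACK1": ["DN1", "DN2", "DN3", "DN4", "DN5"],
    "RACK2": ["DN6", "DN7", "DN8", "DN9", "DN10"],
    "RACK3": ["DN11", "DN12", "DN13", "DN14", "DN15"],
}
replicas = choose_replica_nodes(racks, "DN2", random.Random(0))
```

This layout survives the loss of a whole rack while keeping two of the three transfers inside a single rack, which limits cross-rack write traffic.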
REPLICATION MANAGEMENT
/
RACK1: DN1 DN2 DN3 DN4 DN5
RACK2: DN6 DN7 DN8 DN9 DN10
RACK3: DN11 DN12 DN13 DN14 DN15
Name Node: Inode, Image, Journal, Checkpoint
Block states: Over Replicated, Under Replicated
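Detecting over-replicated and under-replicated blocks amounts to comparing each block's replica count with its target. A minimal sketch, assuming a target replication factor of 3; `classify_blocks` and the block names are hypothetical.

```python
def classify_blocks(block_locations, target=3):
    """Split blocks into under- and over-replicated lists.

    block_locations maps a block id to the nodes reporting a replica.
    """
    under, over = [], []
    for block, nodes in sorted(block_locations.items()):
        live = len(set(nodes))  # count distinct nodes only
        if live < target:
            under.append(block)
        elif live > target:
            over.append(block)
    return under, over


locations = {
    "blk_1": ["DN1", "DN6", "DN7"],
    "blk_2": ["DN2"],
    "blk_3": ["DN3", "DN8", "DN9", "DN11"],
}
under, over = classify_blocks(locations)
```

The Name Node would then schedule new copies for blocks in the first list and remove surplus replicas for blocks in the second.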
BALANCER
The balancer evens out disk space utilization across data nodes, based on a utilization threshold. Balancing follows the block placement policy.
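The utilization threshold used by the balancer can be read as follows: a node counts as balanced when its own utilization differs from the cluster-wide utilization by no more than the threshold. The sketch below applies that test; `unbalanced_nodes` and the sample figures are assumptions for illustration.

```python
def unbalanced_nodes(usage, threshold):
    """Return nodes whose utilization deviates from the cluster average.

    usage maps node -> (used_bytes, capacity_bytes); threshold is a
    fraction between 0 and 1.
    """
    total_used = sum(u for u, _ in usage.values())
    total_cap = sum(c for _, c in usage.values())
    cluster_util = total_used / total_cap
    return sorted(node for node, (u, c) in usage.items()
                  if abs(u / c - cluster_util) > threshold)


usage = {"DN1": (90, 100), "DN2": (50, 100), "DN3": (10, 100)}
result = unbalanced_nodes(usage, threshold=0.10)
```

Here the cluster is 50% full, so DN1 (90%) and DN3 (10%) fall outside a 10% threshold, and blocks would be moved from DN1 toward DN3.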
SCANNER
The scanner verifies data integrity based on checksums.
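Checksum-based scanning can be sketched as recomputing a checksum per chunk and comparing it with the stored value. This example assumes CRC32 from Python's `zlib` and a tiny 4-byte chunk size to keep the data readable; `scan_block` is a hypothetical name.

```python
import zlib


def scan_block(data, stored_checksums, chunk_size):
    """Return the indexes of chunks whose CRC32 no longer matches."""
    bad = []
    for i, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        if zlib.crc32(chunk) != stored_checksums[i]:
            bad.append(i)
    return bad


chunk_size = 4
original = b"abcdefghijkl"
# Checksums are computed once, when the block is written.
checksums = [zlib.crc32(original[i:i + chunk_size])
             for i in range(0, len(original), chunk_size)]
corrupted = b"abcdXfghijkl"  # one byte flipped in the second chunk
```

Calling `scan_block(corrupted, checksums, chunk_size)` flags only the second chunk, so a corrupt replica can be reported and replaced from a healthy copy.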