DFS and HDFS - wmich.edu · 2019-11-22
11/21/19
1
1. Definition of DFS
2. How it works
3. Main concepts: distribution, replication, fault tolerance, and high concurrency
CS6030 Zirui Yang
DFS and HDFS
1
• A file management system is used by the operating system to access the files and folders stored on a single computer.
• A distributed file system handles access to data stored across multiple machines (nodes) in a cluster.
What is DFS (Distributed File System)?
2
• A distributed file system manages files and data blocks across different nodes and racks. By replicating data blocks on different nodes, it improves fault tolerance and lets many clients access the data in parallel.
• Distribution, replication
• Advantages: fault tolerance and high concurrency.
The functions of DFS and how it works
3
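The two concepts on this slide, distribution and replication, can be illustrated with a small sketch. It is a toy model only, not a real DFS API: the function names, node names, and the round-robin placement rule are all assumptions made for illustration.

```python
# Toy model: each block is copied to several nodes, so a read still
# succeeds when one node fails. All names here are hypothetical.

def place_replicas(block_ids, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [block for block, holders in placement.items()
            if any(node not in failed_nodes for node in holders)]

nodes = ["node-a", "node-b", "node-c"]
placement = place_replicas(["blk-0", "blk-1", "blk-2"], nodes, replication=2)

# With two replicas per block, losing any single node loses no data.
survivors = readable_blocks(placement, failed_nodes={"node-b"})
print(survivors)
```

With a replication factor of 2, every block survives the loss of any one node; losing two nodes at once can still make a block unreadable, which is why higher replication factors are used when more failures must be tolerated.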
• Introduction
• Architecture
NameNode, DataNodes, HDFS Client
• File I/O Operations and Replica Management
HDFS: Hadoop Distributed File System
8
Introduction
HDFS
The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. This is achieved by replicating file content across multiple machines (DataNodes).
9
Outline
• Introduction
• Architecture
NameNode, DataNodes, HDFS Client
• File I/O Operations and Replica Management
File Read and Write, Block Placement, Replication Management, Balancer
10
• A file is made up of one or more data blocks, which are stored across a cluster of one or more machines with data storage capacity.
• Each block of a file is replicated across a number of machines to prevent loss of data.
Architecture
11
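As a rough worked example of the two bullets above, the sketch below splits a file into fixed-size blocks and counts the replicas the cluster must hold. The block size (128 MiB) and replication factor (3) are assumed values chosen for illustration, and the function names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)
REPLICATION = 3                 # assumed replication factor

def split_into_blocks(file_size):
    """Return the size in bytes of each block of a file of `file_size` bytes."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(BLOCK_SIZE, remaining))
        remaining -= sizes[-1]
    return sizes

def replica_count(file_size):
    """Total number of block replicas stored across the cluster."""
    return len(split_into_blocks(file_size)) * REPLICATION

MIB = 1024 * 1024
blocks = split_into_blocks(300 * MIB)
print([size // MIB for size in blocks])  # [128, 128, 44]
print(replica_count(300 * MIB))          # 9
```

Under these assumptions a 300 MiB file becomes three blocks (the last one only partly filled), and the cluster stores nine block replicas in total.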
Architecture
12
NameNode and DataNodes
• HDFS stores file system metadata and application data separately.
• Metadata refers to file attributes such as permissions, modification and access times, and namespace and disk space quotas.
• HDFS stores metadata on a dedicated server called the NameNode (master).
• Application data are stored on other servers called DataNodes (slaves).
Architecture
13
• Single NameNode:
• Maintains the namespace tree (a hierarchy of files and directories) and handles namespace operations such as opening, closing, and renaming files and directories.
• Determines the mapping of file blocks to DataNodes (the physical location of file data).
• Collects block reports from DataNodes on block locations.
• Initiates re-replication of missing data blocks.
Architecture
14
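The NameNode responsibilities listed above can be sketched as a small in-memory model. This is a minimal illustration and not the Hadoop implementation: the class `ToyNameNode`, its method names, and the paths and node names in the usage lines are all hypothetical.

```python
from collections import defaultdict

class ToyNameNode:
    """Holds metadata only: the namespace and where block replicas live."""

    def __init__(self):
        self.namespace = {}                      # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes with a replica

    def create_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def rename(self, old_path, new_path):
        # A rename touches metadata only; no block data moves.
        self.namespace[new_path] = self.namespace.pop(old_path)

    def receive_block_report(self, datanode, block_ids):
        """Replace what is known about `datanode` with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block in block_ids:
            self.block_locations[block].add(datanode)

    def locate(self, path):
        """Map each block of a file to the DataNodes that reported it."""
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]

nn = ToyNameNode()
nn.create_file("/logs/a.txt", ["blk-1", "blk-2"])
nn.receive_block_report("dn1", ["blk-1", "blk-2"])
nn.receive_block_report("dn2", ["blk-1"])
nn.rename("/logs/a.txt", "/logs/b.txt")
locations = nn.locate("/logs/b.txt")
print(locations)
```

Note that the block-to-DataNode mapping is rebuilt from block reports rather than written by clients, which matches the slide's description of the NameNode collecting reports from DataNodes.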
• DataNodes:
• The DataNodes are responsible for serving read and write requests from the file system’s clients.
• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
• DataNodes periodically send block reports to the NameNode.
Architecture
15
Architecture
16
• Hadoop HDFS Data Read/Write Operation
• To write or read a file in HDFS, a client first contacts the NameNode (master). The NameNode provides the addresses of the DataNodes (slaves), and the client then writes or reads the data directly to or from those DataNodes.
Architecture
17
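The read/write interaction described above can be sketched as follows. This is a simplified toy model under stated assumptions: `MiniNameNode`, `MiniDataNode`, `write_file`, and `read_file` are hypothetical names, the 4-byte block size is chosen only to keep the example small, and the client copies each block to every target itself, whereas real HDFS forwards replicas from one DataNode to the next in a pipeline.

```python
class MiniDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block id -> bytes actually stored here

class MiniNameNode:
    def __init__(self):
        self.files = {}    # path -> list of (block id, [DataNode, ...])

def write_file(namenode, datanodes, path, data, block_size=4, replication=2):
    """Split data into blocks; the NameNode records locations, DataNodes hold bytes."""
    entries = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#blk{index}"
        targets = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        for dn in targets:
            dn.blocks[block_id] = data[offset:offset + block_size]
        entries.append((block_id, targets))
    namenode.files[path] = entries

def read_file(namenode, path):
    """Ask the NameNode for block locations, then fetch bytes from DataNodes."""
    chunks = []
    for block_id, holders in namenode.files[path]:
        chunks.append(holders[0].blocks[block_id])  # first replica is enough
    return b"".join(chunks)

nn = MiniNameNode()
dns = [MiniDataNode("dn1"), MiniDataNode("dn2"), MiniDataNode("dn3")]
write_file(nn, dns, "/demo.txt", b"hello hdfs!")
content = read_file(nn, "/demo.txt")
print(content)
```

The point of the sketch is the division of labor: the NameNode object never holds file bytes, only the list of blocks and where they live.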
Architecture
18
Architecture
19
Thank you!
20
• Heartbeat: the signal that each DataNode periodically sends to the NameNode. If the NameNode stops receiving heartbeats from a DataNode, it considers that DataNode dead.
• Balancing: if a DataNode crashes, the blocks stored on it are lost, which leaves those blocks under-replicated compared to the remaining blocks. The master node (NameNode) then signals the DataNodes that hold replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced.
• Replication: the copying itself is done by the DataNodes.
Architecture
21
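The recovery step described above, where the NameNode notices under-replicated blocks and tells surviving DataNodes to copy them, might be sketched like this. The function names, the target of three replicas, and the choice of the alphabetically first source and destination are assumptions for illustration, not the policy HDFS actually uses.

```python
def find_under_replicated(block_locations, live_nodes, target=3):
    """Return {block: missing replica count} for blocks below `target` live replicas."""
    result = {}
    for block, holders in block_locations.items():
        live = holders & live_nodes
        if len(live) < target:
            result[block] = target - len(live)
    return result

def plan_replication(block_locations, live_nodes, target=3):
    """For each under-replicated block, pick a live source and new destination nodes."""
    plan = []
    missing_by_block = find_under_replicated(block_locations, live_nodes, target)
    for block, missing in missing_by_block.items():
        sources = sorted(block_locations[block] & live_nodes)
        if not sources:
            continue  # every replica is gone; nothing left to copy from
        candidates = sorted(live_nodes - block_locations[block])
        for dest in candidates[:missing]:
            plan.append((block, sources[0], dest))
    return plan

block_locations = {
    "blk-1": {"dn1", "dn2", "dn3"},
    "blk-2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn3", "dn4", "dn5"}  # dn2 has been declared dead

plan = plan_replication(block_locations, live_nodes)
print(plan)
```

Each tuple in the plan reads as (block, source DataNode, destination DataNode): the NameNode only issues the instruction, and the source DataNode performs the copy.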
• NameNode and DataNode communication: Heartbeats.
• DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.
Architecture
22
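Heartbeat-based failure detection can be sketched with a simple timeout check. This is an illustrative model only: the class `HeartbeatMonitor`, the 30-second timeout, and the explicit `now` timestamps (passed in so the example is deterministic) are assumptions, not HDFS settings.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}   # DataNode name -> time of last heartbeat

    def record(self, datanode, now):
        """Called whenever a heartbeat arrives from `datanode`."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen > self.timeout)

monitor = HeartbeatMonitor(timeout_seconds=30)  # assumed timeout
monitor.record("dn1", now=0)
monitor.record("dn2", now=0)
monitor.record("dn1", now=25)
monitor.record("dn1", now=50)   # dn2 has been silent since t=0

dead = monitor.dead_nodes(now=55)
print(dead)  # ['dn2']
```

A DataNode that resumes sending heartbeats simply drops out of the dead list on the next check, since only the most recent heartbeat time is kept.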
Architecture
24