Hadoop Distributed File System (HDFS)
10/05/2018
HDFS Overview
A distributed file system
Built on the architecture of the Google File System (GFS)
Shares a similar architecture with many other common distributed storage engines, such as Amazon S3 and Microsoft Azure
HDFS is a standalone storage engine and can be used independently of the query processing engine
HDFS Architecture
[Diagram: a name node and several data nodes; each B is a block stored on a data node]
What is where?
[Diagram: the name node and the data nodes, with blocks (B) on the data nodes. The items below are stored on one or the other.]
File and directory names
Block ordering and locations
Capacity of data nodes
Architecture of data nodes
Block data
Name node location
Analogy to Unix FS
The logical view is similar
[Diagram: a directory tree rooted at / with the entries user, mary, chu, etc, and hadoop]
The physical model is comparable
Unix: File1 → list of inodes → Block 1, Block 2, Block 3, …
HDFS: File1 → list of block locations (metadata) → blocks (B) on the data nodes
HDFS Create
[Diagram: the file creator, the name node, and the data nodes. The file creator sends Create(…) to the name node.]
The creator process calls the create function, which translates to an RPC call to the name node
The name node allocates the first block with three replicas
1. The first replica is assigned to a random machine
2. The second replica is assigned to another random machine in the same rack as the first machine
3. The third replica is assigned to a random machine in another rack
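The three placement steps above can be sketched in plain Java. This is a toy model, not Hadoop code: the class name, the rack and node names, and the rack-to-nodes map are all hypothetical, and the sketch assumes every rack holds at least two nodes and that there are at least two racks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Toy model of the three-step placement listed above. Not Hadoop code. */
final class ReplicaPlacementSketch {

    /**
     * Returns the nodes chosen for replicas 1, 2, and 3 of one block.
     * rackToNodes maps a (hypothetical) rack name to the node names in that rack.
     */
    static List<String> chooseTargets(Map<String, List<String>> rackToNodes, Random rng) {
        List<String> allNodes = new ArrayList<>();
        Map<String, String> rackOf = new HashMap<>();
        for (Map.Entry<String, List<String>> rack : rackToNodes.entrySet()) {
            for (String node : rack.getValue()) {
                allNodes.add(node);
                rackOf.put(node, rack.getKey());
            }
        }
        if (allNodes.isEmpty()) {
            throw new IllegalArgumentException("no data nodes");
        }

        // 1. First replica: a random machine.
        String first = pick(allNodes, rng);
        String firstRack = rackOf.get(first);

        // 2. Second replica: another random machine in the same rack.
        List<String> sameRack = new ArrayList<>(rackToNodes.get(firstRack));
        sameRack.remove(first);

        // 3. Third replica: a random machine in any other rack.
        List<String> otherRacks = new ArrayList<>();
        for (String node : allNodes) {
            if (!rackOf.get(node).equals(firstRack)) {
                otherRacks.add(node);
            }
        }
        if (sameRack.isEmpty() || otherRacks.isEmpty()) {
            throw new IllegalArgumentException("topology too small for this policy");
        }
        return List.of(first, pick(sameRack, rng), pick(otherRacks, rng));
    }

    private static String pick(List<String> nodes, Random rng) {
        return nodes.get(rng.nextInt(nodes.size()));
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rackA", List.of("a1", "a2", "a3"));
        racks.put("rackB", List.of("b1", "b2"));
        System.out.println(chooseTargets(racks, new Random()));
    }
}
```

Keeping two replicas in one rack and the third in a different rack means a single rack failure cannot remove every copy of the block.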
[Diagram sequence: the file creator receives an OutputStream, then calls OutputStream#write to send data to the data nodes labeled 1, 2, and 3]
When a block is filled up, the creator contacts the name node to create the next block
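The block rollover described above can be modeled with a small counter. This is a minimal sketch, not the HDFS client: the class and method names are hypothetical, the 128 MB block size is an assumed value, and incrementing the counter stands in for the call to the name node.

```java
/** Toy writer that counts how often it asks for a new block. Not Hadoop code. */
final class BlockWriterSketch {
    private final long blockSize;
    private long bytesInCurrentBlock = 0;
    // The first block is allocated when the file is created.
    private int blocksRequested = 1;

    BlockWriterSketch(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    /** Writes length bytes, moving to a new block whenever the current one is full. */
    void write(long length) {
        long remaining = length;
        while (remaining > 0) {
            if (bytesInCurrentBlock == blockSize) {
                blocksRequested++;        // stands in for contacting the name node
                bytesInCurrentBlock = 0;
            }
            long chunk = Math.min(remaining, blockSize - bytesInCurrentBlock);
            bytesInCurrentBlock += chunk;
            remaining -= chunk;
        }
    }

    int blocksRequested() {
        return blocksRequested;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        BlockWriterSketch writer = new BlockWriterSketch(128 * mb); // assumed block size
        writer.write(300 * mb);
        System.out.println(writer.blocksRequested()); // prints 3
    }
}
```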
Notes about writing to HDFS
Data transfers of replicas are pipelined
The data does not go through the name node
Random writing is not supported
Appending to a file is supported, but it creates a new block
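The first two notes can be illustrated with a chain of data nodes in which each node stores a packet and forwards it to the next one. This is a simplified sketch under stated assumptions: the class names are hypothetical, forwarding is synchronous here for clarity whereas a real pipeline streams packets concurrently, and no name node appears because the data never passes through it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Toy replication pipeline: creator -> dn1 -> dn2 -> dn3. Not Hadoop code. */
final class ReplicaPipelineSketch {

    /** A hypothetical data node that keeps a replica and forwards each packet downstream. */
    static final class DataNode {
        final String name;
        final DataNode next; // null for the last node in the pipeline
        private final ByteArrayOutputStream stored = new ByteArrayOutputStream();

        DataNode(String name, DataNode next) {
            this.name = name;
            this.next = next;
        }

        void receive(byte[] packet) {
            stored.write(packet, 0, packet.length); // keep the local replica
            if (next != null) {
                next.receive(packet);               // forward to the next replica
            }
        }

        byte[] contents() {
            return stored.toByteArray();
        }
    }

    public static void main(String[] args) {
        DataNode dn3 = new DataNode("dn3", null);
        DataNode dn2 = new DataNode("dn2", dn3);
        DataNode dn1 = new DataNode("dn1", dn2);

        // The creator only talks to the first data node.
        dn1.receive("hello ".getBytes(StandardCharsets.UTF_8));
        dn1.receive("hdfs".getBytes(StandardCharsets.UTF_8));

        System.out.println(new String(dn3.contents(), StandardCharsets.UTF_8));
    }
}
```

With this layout the creator uploads each packet once, and the cost of the extra copies is spread across the data nodes.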
Self-writing
[Diagram: the file creator runs on one of the data nodes]
If the file creator is running on one of the data nodes, the first replica is always assigned to that node
Reading from HDFS
Reading is simpler than writing
No replication is needed
Replication can be exploited
Random reading is allowed
HDFS Read
[Diagram: the file reader sends open(…) to the name node]
The reader process calls the open function, which translates to an RPC call to the name node
The name node locates the first block of that file and returns the address of one of the nodes that store that block
The name node returns an input stream for the file
[Diagram: the file reader calls InputStream#read(…) to read block data from the data nodes]
When an end-of-block is reached, the name node locates the next block
The InputStream#seek(pos) operation locates the block that contains pos and positions the stream accordingly
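The arithmetic behind a seek can be sketched as follows. This is an illustration rather than the HDFS implementation: it assumes every block except possibly the last has the same size, and the class name, method names, and the 128 MB block size are hypothetical choices.

```java
/** Maps a file position to a block index and an offset inside that block. Not Hadoop code. */
final class SeekSketch {

    /** Index of the block that contains byte position pos, assuming fixed-size blocks. */
    static long blockIndex(long pos, long blockSize) {
        if (pos < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("pos must be >= 0 and blockSize > 0");
        }
        return pos / blockSize;
    }

    /** Offset of pos relative to the start of its block. */
    static long offsetInBlock(long pos, long blockSize) {
        if (pos < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("pos must be >= 0 and blockSize > 0");
        }
        return pos % blockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed block size
        long pos = 300 * mb;
        // prints "block 2, offset 44 MB"
        System.out.println("block " + blockIndex(pos, blockSize)
                + ", offset " + offsetInBlock(pos, blockSize) / mb + " MB");
    }
}
```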
Self-reading
[Diagram: the file reader runs on one of the data nodes]
1. If the block is stored locally on the reader's machine, that replica is chosen
2. If not, a replica on another machine in the same rack is chosen
3. Otherwise, a random replica on any other machine is chosen
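The three rules above amount to a preference order over replicas, which the following sketch makes explicit. It is a toy model, not the HDFS client: the Replica class, the node and rack names, and the choose method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy replica selection for reads: local node, then same rack, then anything. Not Hadoop code. */
final class ReadReplicaSketch {

    /** A hypothetical replica location: the node that holds it and that node's rack. */
    static final class Replica {
        final String node;
        final String rack;

        Replica(String node, String rack) {
            this.node = node;
            this.rack = rack;
        }
    }

    static Replica choose(List<Replica> replicas, String readerNode, String readerRack, Random rng) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("block has no replicas");
        }
        // 1. Prefer a replica stored on the reader's own machine.
        for (Replica r : replicas) {
            if (r.node.equals(readerNode)) {
                return r;
            }
        }
        // 2. Otherwise prefer a replica in the reader's rack.
        List<Replica> sameRack = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.rack.equals(readerRack)) {
                sameRack.add(r);
            }
        }
        if (!sameRack.isEmpty()) {
            return sameRack.get(rng.nextInt(sameRack.size()));
        }
        // 3. Otherwise pick any replica at random.
        return replicas.get(rng.nextInt(replicas.size()));
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("a1", "rackA"),
                new Replica("a2", "rackA"),
                new Replica("b1", "rackB"));
        System.out.println(choose(replicas, "b1", "rackB", new Random()).node); // prints b1
    }
}
```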
When self-reading occurs, HDFS can make it much faster through a feature called short-circuit local reads
Notes About Reading
The API is much richer than the simple open/seek/close API
You can retrieve block locations
You can choose a specific replica to read
The same API is generalized to other file systems including the local FS and S3
Review question: Compare random access read in local file systems to HDFS
HDFS Special Features
Node decommission
Load balancer
Cheap concatenation
Node Decommission
[Diagram: blocks (B) stored on the data nodes]
Load Balancing
[Diagram: blocks (B) stored on the data nodes]
Start the load balancer
Cheap Concatenation
[Diagram: the name node holds the metadata of File 1, File 2, and File 3]
Concatenate: File 1 + File 2 + File 3 → File 4
Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
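A metadata-only concatenation of this kind can be sketched with a map from file names to ordered block IDs. This is a toy stand-in for the name node's metadata, not Hadoop code: the class, its methods, and the block IDs are hypothetical, and no block data is touched at any point.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy name-node metadata: file name -> ordered list of block IDs. Not Hadoop code. */
final class ConcatSketch {
    private final Map<String, List<Long>> blocksOfFile = new HashMap<>();

    void addFile(String name, List<Long> blockIds) {
        blocksOfFile.put(name, new ArrayList<>(blockIds));
    }

    /** Moves the block lists of all sources, in order, to a new target file. */
    void concat(String target, List<String> sources) {
        if (blocksOfFile.containsKey(target)) {
            throw new IllegalArgumentException("target already exists: " + target);
        }
        for (String source : sources) {
            if (!blocksOfFile.containsKey(source)) {
                throw new IllegalArgumentException("no such file: " + source);
            }
        }
        List<Long> merged = new ArrayList<>();
        for (String source : sources) {
            merged.addAll(blocksOfFile.remove(source)); // delete the source entry, keep its blocks
        }
        blocksOfFile.put(target, merged);
    }

    boolean exists(String name) {
        return blocksOfFile.containsKey(name);
    }

    List<Long> blocksOf(String name) {
        return blocksOfFile.get(name);
    }

    public static void main(String[] args) {
        ConcatSketch nameNode = new ConcatSketch();
        nameNode.addFile("File 1", List.of(1L, 2L));
        nameNode.addFile("File 2", List.of(3L));
        nameNode.addFile("File 3", List.of(4L, 5L));
        nameNode.concat("File 4", List.of("File 1", "File 2", "File 3"));
        System.out.println(nameNode.blocksOf("File 4")); // prints [1, 2, 3, 4, 5]
    }
}
```

The cost depends on the number of blocks listed in the metadata, not on the number of bytes stored in them.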
HDFS API
[Class diagram: FileSystem, with subclasses DistributedFileSystem, LocalFileSystem, and S3FileSystem; related classes Path and Configuration]
Create the file system:
Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);
Create a new file:
FSDataOutputStream out = fs.create(path, …);
Delete a file:
fs.delete(path, recursive);
fs.deleteOnExit(path);
Rename a file:
fs.rename(oldPath, newPath);
Open a file:
FSDataInputStream in = fs.open(path, …);
Seek to a different location:
in.seek(pos);
in.seekToNewSource(pos);
Concatenate:
fs.concat(destination, src[]);
Get file metadata:
fs.getFileStatus(path);
Get block locations (the range is given as a start offset and a length):
fs.getFileBlockLocations(path, start, length);
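To make the block-location lookup concrete, the sketch below finds which blocks overlap a byte range given as a start offset and a length. It is only a model of the idea: the class and method names are hypothetical, block lengths are passed in as a plain array, and the real call returns BlockLocation objects that also carry host information.

```java
import java.util.ArrayList;
import java.util.List;

/** Finds the blocks that overlap the byte range [start, start + len). Not Hadoop code. */
final class BlockRangeSketch {

    /** blockLengths lists the length of each block in file order. */
    static List<Integer> blocksInRange(long[] blockLengths, long start, long len) {
        List<Integer> result = new ArrayList<>();
        if (start < 0 || len <= 0) {
            return result; // an empty or invalid range overlaps nothing
        }
        long end = start + len;
        long blockStart = 0;
        for (int i = 0; i < blockLengths.length; i++) {
            long blockEnd = blockStart + blockLengths[i];
            // Two half-open intervals overlap when each starts before the other ends.
            if (blockStart < end && blockEnd > start) {
                result.add(i);
            }
            blockStart = blockEnd;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] lengths = {128, 128, 44}; // toy block lengths in bytes
        System.out.println(blocksInRange(lengths, 100, 50)); // prints [0, 1]
    }
}
```

A scheduler can use this kind of lookup to run a task on a machine that already stores the blocks it needs.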