Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems Kheng Ho Toh / 123RF Stock Photo

Ghislain Fourny

Big Data for Engineers, Fall 2019

4. Distributed file systems

Kheng Ho Toh / 123RF Stock Photo

So far...

We've rehearsed relational databases.
We've looked into scaling out.
We've seen object storage.
We've looked into the key-value model.

Poll

Go now to: https://eduapp-app1.ethz.ch/
or install EduApp 3.x

There is Big Data and Big Data

Anna Liebiedieva / 123RF Stock Photo

Vadym Kurgak / 123RF Stock Photo

Use cases

A huge amount of large files? vs. A large amount of huge files?

Object Storage: billions of TB files
vs.
File Storage: millions of PB files

Where does the data come from?

Raw Data:
• Sensors
• Measurements
• Events
• Logs

Derived Data:
• Aggregated data
• Intermediate data

Oleg Dudko / 123RF Stock Photo

Anton Starikov / 123RF Stock Photo

Technologies and models

Key-Value Model, Object Storage: billions of <TB files
vs.
File System, Block Storage: millions of <PB files

Distributed file systems: inception

GFS genesis

• Characteristics
• Requirements
• File System Design

Fault tolerance and robustness

Local disk: it might fail
Cluster with 100s to 10,000s of machines: nodes will fail

Fault tolerance:
• Monitoring
• Error detection
• Automatic Recovery

Vitaly Korovin / 123RF Stock Photo

Kheng Ho Toh / 123RF Stock Photo

File read model

Random access vs. scan the file

File update model

Random access vs. upsert/append only

Immutable / append, suitable for:
• Sensors
• Logs
• Intermediate data

Appends

Append only
100s of clients in parallel
Atomic

Performance requirements

Top priority: Throughput
Secondary: Latency

The progress made (1956-2018): Logarithmic

Capacity (per unit of volume): 200,000,000,000x
Throughput: 10,000x
Latency: 8x

Capacity has outgrown throughput: Parallelize!
Throughput has outgrown latency: Batch processing!

Picture: Ash Waechter/123RF
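The gap between these growth factors is what motivates "Parallelize!" and "Batch processing!". A minimal sketch of the arithmetic, using only the three figures quoted on the slide (the variable names are ours, and the interpretation in the comments is one reading of the slide, not a quote):

```python
# Growth factors between 1956 and 2018, as quoted on the slide.
capacity_growth = 200_000_000_000   # per unit of volume
throughput_growth = 10_000
latency_growth = 8

# Capacity outgrew throughput by this factor: scanning a full device of
# the same physical volume takes that many times longer, so the work has
# to be spread over many disks read in parallel.
capacity_vs_throughput = capacity_growth // throughput_growth

# Throughput outgrew latency by this factor: each individual access costs
# relatively more, so it pays to fetch large batches per access.
throughput_vs_latency = throughput_growth / latency_growth

print(capacity_vs_throughput)  # 20000000
print(throughput_vs_latency)   # 1250.0
```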

Hadoop

Initiated in 2006

Primarily:
• Distributed File System (HDFS) (covered in this lecture)
• MapReduce
• Wide column store (HBase)

Inspired by Google's:
• GFS (2003) (covered in this lecture)
• MapReduce (2004)
• BigTable (2006)

Size timeline

Date             Size reported by Yahoo
April 2006       188
May 2006         300
October 2006     600
April 2007       1,000
February 2008    10,000 (index generation)
March 2009       24,000 (17 clusters)
June 2011        42,000 (100+ PB)
November 2016    100,000? (600 PB)

Distributed file systems: the model

File Systems (Logical Model)

Key-Value Model (Key-Value Storage) vs. File Hierarchy

Block Storage (Physical Storage)

Object Storage: 111010010110101…
vs.
Block Storage: blocks 1 2 3 4 5 6 7 8

Terminology

HDFS: Block

GFS: Chunk

Files and blocks

[Figure: files in a hierarchy, each divided into numbered blocks (1 to 8, 1 to 3, 1 to 4)]

Why blocks?

1. Files bigger than a disk (PBs!)
2. Simpler level of abstraction

Single machine vs. distributed

The right block size

Poll

Go now to: https://eduapp-app1.ethz.ch/
or install EduApp 2.x

Simple file system: 4 kB
Relational database: 4 kB – 32 kB
Distributed file system: 64 MB – 128 MB
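One consequence of the block sizes above is the number of blocks that have to be tracked per file. A rough sketch follows; the 1 GB file size and the helper name `block_count` are illustrative assumptions, not taken from the slides:

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed for file_size bytes (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


file_size = 1 * GB  # hypothetical file

print(block_count(file_size, 4 * KB))    # 262144
print(block_count(file_size, 128 * MB))  # 8
```

With 4 kB blocks the file would need hundreds of thousands of entries; with 128 MB blocks it needs a handful.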

HDFS Architecture

How do we connect the many machines?

Peer-to-peer architecture

Master-slave architecture: one master, many slaves

HDFS server architecture

Namenode (master)
Datanodes (slaves)

From the file perspective

File...
...divided into 128 MB chunks...
...replicated for fault tolerance

Concurrently accessed

Hadoop implementation (packaged code)

HDFS Architecture: NameNode

NameNode: all system-wide activity

Memory:
1. File namespace (+ access control), e.g., /dir/file1, /dir/file2, /file3
2. File to block mapping
3. Block locations
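The three in-memory structures can be pictured as three dictionaries. This is only a sketch under our own assumptions: the class name `NameNodeState`, the field names, the owner string standing in for access control, and the DataNode names are all hypothetical, and the real NameNode uses different data structures.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeState:
    # 1. File namespace: path -> access-control info (here just an owner name).
    namespace: dict[str, str] = field(default_factory=dict)
    # 2. File to block mapping: path -> ordered list of block IDs.
    file_blocks: dict[str, list[int]] = field(default_factory=dict)
    # 3. Block locations: block ID -> DataNodes holding a replica.
    block_locations: dict[int, set[str]] = field(default_factory=dict)

    def locate(self, path: str) -> list[tuple[int, set[str]]]:
        """Return (block ID, DataNodes) for each block of the file, in file order."""
        return [(b, self.block_locations.get(b, set())) for b in self.file_blocks[path]]


state = NameNodeState()
state.namespace["/dir/file1"] = "alice"
state.file_blocks["/dir/file1"] = [101, 102]
state.block_locations[101] = {"dn1", "dn2", "dn3"}
state.block_locations[102] = {"dn2", "dn4", "dn5"}

print(state.locate("/dir/file1"))
```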

HDFS Architecture: DataNode

DataNode

Blocks are stored on the local disk
Proximity to hardware facilitates disk failure detection

Block IDs

64 bits, e.g., 7586700455251598184

Subblock granularity: Byte Range
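The slide gives no further detail. One reading is that data inside a block is addressed by byte range, so a read of part of a file translates into byte ranges within one or more blocks. A sketch of that translation, assuming 128 MB blocks (the function name and the tuple layout are ours):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, assumed block size


def byte_range_to_blocks(offset: int, length: int, block_size: int = BLOCK_SIZE):
    """Split a file byte range into (block index, start within block, byte count) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // block_size
        start = offset % block_size
        count = min(block_size - start, end - offset)
        pieces.append((index, start, count))
        offset += count
    return pieces


# A 10-byte read that straddles the boundary between blocks 0 and 1.
print(byte_range_to_blocks(BLOCK_SIZE - 4, 10))
# [(0, 134217724, 4), (1, 0, 6)]
```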

Communication

Between the Client, the Namenode and the Datanodes

Summary:
• Client Protocol
• DataTransfer Protocol
• DataNode Protocol

Client Protocol (RPC)

Client <-> Namenode:
• Metadata operations
• DataNode location
• Block IDs

Java API available

DataNode Protocol (RPC)

Datanode <-> Namenode (control)

Datanode always initiates connection!

• Registration
• Heartbeat: every 3s (customizable)
• BlockReport: every 6h (customizable)
• BlockReceived
• Block operations

Java API available
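Heartbeats are what let the Namenode notice that a Datanode has gone silent. A minimal sketch of that bookkeeping, seen from the Namenode side; the class `HeartbeatMonitor` and the 30-second timeout are hypothetical choices for illustration and are not HDFS's actual API or default:

```python
class HeartbeatMonitor:
    """Tracks the time of the last heartbeat per DataNode (illustrative only)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def register(self, datanode: str, now: float) -> None:
        # Registration: the DataNode announces itself first.
        self.last_seen[datanode] = now

    def heartbeat(self, datanode: str, now: float) -> None:
        if datanode not in self.last_seen:
            raise KeyError(f"{datanode} has not registered")
        self.last_seen[datanode] = now

    def dead_nodes(self, now: float) -> list[str]:
        # A node is considered dead once it has been silent longer than the timeout.
        return sorted(d for d, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=30.0)  # hypothetical timeout
monitor.register("dn1", now=0.0)
monitor.register("dn2", now=0.0)
for t in range(3, 40, 3):  # dn1 sends a heartbeat every 3 s; dn2 stays silent
    monitor.heartbeat("dn1", now=float(t))

print(monitor.dead_nodes(now=40.0))  # ['dn2']
```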

DataTransfer Protocol (Streaming)

Client <-> DataNode: data blocks
DataNode <-> DataNode: replication pipelining (write only)

Communication: summary

Client <-> Namenode: Client Protocol (control)
Datanode <-> Namenode: DataNode Protocol (control)
Client <-> Datanode, Datanode <-> Datanode: DataTransfer Protocol (data)

Metadata functionality

Create directory

Delete directory

Write file

Append to file

Read file

Delete file

Client reads a file

1. Asks for file1
2. Get block locations: multiple DataNodes for each block, sorted by distance
3. Read (input stream)
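The read path can be simulated in a few lines. This is a sketch, not the HDFS client: the function `read_file`, its parameters and the toy DataNode names are our assumptions, and here the client does the sorting by distance itself, whereas on the slide the block locations already come back sorted.

```python
def read_file(path, file_blocks, block_locations, distance, fetch):
    """Simulate a client read.

    file_blocks:     path -> list of block IDs (metadata held by the NameNode)
    block_locations: block ID -> list of DataNodes holding a replica
    distance:        DataNode -> distance from the client
    fetch:           (DataNode, block ID) -> bytes
    """
    data = bytearray()
    for block_id in file_blocks[path]:               # steps 1 and 2
        replicas = sorted(block_locations[block_id], key=distance)
        data += fetch(replicas[0], block_id)         # step 3: read from the closest replica
    return bytes(data)


file_blocks = {"/dir/file1": [1, 2]}
block_locations = {1: ["dn3", "dn1"], 2: ["dn2", "dn3"]}
dist = {"dn1": 0, "dn2": 2, "dn3": 4}
stored = {
    ("dn1", 1): b"hello ", ("dn3", 1): b"hello ",
    ("dn2", 2): b"world", ("dn3", 2): b"world",
}

result = read_file("/dir/file1", file_blocks, block_locations,
                   dist.get, lambda dn, b: stored[(dn, b)])
print(result)  # b'hello world'
```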

Page 111: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

111

Client writes a file

Page 112: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

112

Client writes a file

Create1

Page 113: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

113

Client writes a file

DataNodes for first block

2

Page 114: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

114

Client writes a file

Organizes pipeline3

Page 115: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

115

Client writes a file

Sends data over4

Page 116: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

116

Client writes a file

Ack5

Page 117: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

117

Client writes a file

DataNodes for second block

2

Page 118: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

118

Client writes a file

Organizes pipeline3

Page 119: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

119

Client writes a file

Sends data over4

Page 120: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

120

Client writes a file

Ack5

Page 121: Ghislain Fourny Big Data for Engineers Fall 2019 · Ghislain Fourny Big Data for Engineers Fall 2019 4. Distributed file systems KhengHo Toh/ 123RF Stock Photo

121

This is all done simultaneously under DFSOutputStream (streaming through)

Sends data over4

DataNodes for nth block

2

5

3 Pipeline

Ack

Client writes a file

6. Close/release lock

7. Checking for minimal replication (DataNode protocol)

8. Ack

9. Replicates further asynchronously
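The numbered steps of the write path can be sketched as a toy, in-memory simulation. It is not the Hadoop client API: `ToyNameNode`, `ToyDataNode`, `write_file`, the 4-byte block size and the `create` call standing in for the first step (which falls outside this excerpt) are all assumptions made for illustration. The minimal-replication check and the asynchronous re-replication are left out.

```python
import random

BLOCK_SIZE = 4      # bytes per block; tiny on purpose (real block sizes are in MB)
REPLICATION = 3     # number of DataNodes in each pipeline


class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                          # block id -> bytes

    def receive(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, return the acks."""
        self.blocks[block_id] = data              # step 4: data arrives
        acks = []
        if downstream:                            # step 3: the rest of the pipeline
            acks = downstream[0].receive(block_id, data, downstream[1:])
        return acks + [self.name]                 # step 5: acks travel back upstream


class ToyNameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}                           # path -> list of block ids
        self.locked = set()                       # paths currently being written

    def create(self, path):                       # assumed first step: create + lock
        self.files[path] = []
        self.locked.add(path)

    def allocate_block(self, path):               # step 2: DataNodes for the next block
        block_id = f"{path}#{len(self.files[path])}"
        self.files[path].append(block_id)
        return block_id, random.sample(self.datanodes, REPLICATION)

    def close(self, path):                        # step 6: close / release lock
        self.locked.discard(path)


def write_file(namenode, path, payload):
    """Write payload block by block, each block through its own pipeline."""
    namenode.create(path)
    for offset in range(0, len(payload), BLOCK_SIZE):
        block_id, pipeline = namenode.allocate_block(path)
        chunk = payload[offset:offset + BLOCK_SIZE]
        acks = pipeline[0].receive(block_id, chunk, pipeline[1:])
        if len(acks) != REPLICATION:
            raise IOError(f"block {block_id} was not fully acknowledged")
    namenode.close(path)
```

The client only ever talks to the first DataNode of a pipeline; each DataNode forwards the data to the next one, which is what "streaming through" refers to.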

Replicas

Number of replicas specified per file

default: 3

Replica placement: what to consider?

Reliability

Read/Write Bandwidth

Block distribution

Replica placement: Reminder on topology

Cluster

Rack

Node

Replica placement: Distance

Nodes A and B in the same rack: D(A,B) = 2

Nodes A and B in different racks: D(A,B) = 4
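One way to read the two values D(A,B)=2 and D(A,B)=4 is as the number of edges on the path between A and B in the cluster/rack/node tree. The sketch below assumes exactly that interpretation and a single cluster; the `(rack, node)` pair representation is a hypothetical choice made for the example.

```python
def distance(a, b):
    """Number of tree edges between two nodes given as (rack, node) pairs."""
    rack_a, _ = a
    rack_b, _ = b
    if a == b:
        return 0        # same node
    if rack_a == rack_b:
        return 2        # node -> rack -> node
    return 4            # node -> rack -> cluster -> rack -> node
```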

Replica placement

Replica 1: same node as client (or random), rack A

Replica 2: a node in a different rack B

Replica 3: a different node in the same rack B

Replica 4 and beyond: random, but if possible:

• at most one replica per node

• at most two replicas per rack
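The placement rules can be written down as a short function. This is a sketch of the policy as stated on the slide, not Hadoop's block placement code: `place_replicas`, the `topology` dictionary and the `rng` parameter are hypothetical, and the function assumes at least two racks, at least two nodes per rack, and more nodes than replicas.

```python
import random


def place_replicas(topology, client, n_replicas=3, rng=random):
    """Pick n_replicas (rack, node) pairs.

    topology: dict mapping rack name -> list of node names
    client:   (rack, node) of the writer, or None if it is outside the cluster
    """
    all_nodes = [(rack, node) for rack, nodes in topology.items() for node in nodes]

    # Replica 1: same node as the client, or a random node.
    first = client if client in all_nodes else rng.choice(all_nodes)
    chosen = [first]

    # Replica 2: a node in a different rack B.
    rack_b = rng.choice([rack for rack in topology if rack != first[0]])
    second = (rack_b, rng.choice(topology[rack_b]))
    chosen.append(second)

    # Replica 3: a different node in the same rack B.
    others = [node for node in topology[rack_b] if node != second[1]]
    chosen.append((rack_b, rng.choice(others)))

    # Replica 4 and beyond: random, one replica per node,
    # and at most two replicas per rack when that is still possible.
    while len(chosen) < n_replicas:
        free = [x for x in all_nodes if x not in chosen]
        preferred = [x for x in free if sum(c[0] == x[0] for c in chosen) < 2]
        chosen.append(rng.choice(preferred or free))

    return chosen[:n_replicas]
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when experimenting with different topologies.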

Why replicas 2+3 on other rack?

If replicas 1+2 were on same rack...

Block concentration on same rack (2/3)

Performance and availability

The NameNode is a single point of failure

[Diagram: one NameNode holding /dir/file1, /dir/file2 and /file3, above six DataNodes]

What if it fails?

NameNode: all system-wide activity

Memory

1. File namespace (+Access Control): /dir/file1, /dir/file2, /file3

2. File to block mapping

3. Block locations

1. You want to persist

Memory

1. File namespace: /dir/file1, /dir/file2, /file3

2. File to block mapping

3. Block locations (not persisted)

Persistent Storage

Namespace file

Edit log
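A minimal sketch of the edit log idea, assuming a JSON-lines file format chosen only for this example: every change to the namespace is appended to the log on disk before the in-memory state is updated. `ToyNamespace` and the file name `edits.log` are hypothetical and do not reflect HDFS's actual on-disk format.

```python
import json
import os
import tempfile


class ToyNamespace:
    def __init__(self, directory):
        self.edit_log = os.path.join(directory, "edits.log")
        self.files = {}                      # path -> list of block ids (in memory)

    def _log(self, operation):
        """Append one operation to the edit log and force it to disk."""
        with open(self.edit_log, "a") as log:
            log.write(json.dumps(operation) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def create(self, path, blocks):
        self._log({"op": "create", "path": path, "blocks": blocks})
        self.files[path] = blocks            # memory is updated only after logging

    def delete(self, path):
        self._log({"op": "delete", "path": path})
        self.files.pop(path, None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        namespace = ToyNamespace(directory)
        namespace.create("/dir/file1", ["blk_1"])
        namespace.create("/dir/file2", ["blk_2"])
        with open(namespace.edit_log) as log:
            print(log.read())
```

Block locations are deliberately absent from the log, matching the "not persisted" note on the slide.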

2. You want to back up

Persistent Storage: Namespace file, Edit log

Backed up to: Shared drive, Backup drives, Glacier

The NameNode is a single point of failure...

What if it fails?

We need to start it up again!

Namenodes: Startup

Persistent Storage: Namespace file, Edit log

Memory: Filesystem (/dir/file1, /dir/file2, /file3), Block locations

1. The namespace file is loaded from persistent storage into memory.

2. The edit log is replayed on top of it.

3. Block locations are rebuilt from the block reports sent by the DataNodes.
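The startup sequence can be sketched with plain dictionaries. The function below assumes three phases (load the namespace file, replay the edit log, collect block locations from the DataNodes); `start_namenode` and the shapes of its three arguments are hypothetical stand-ins for the real files and protocols.

```python
def start_namenode(namespace_file, edit_log, block_reports):
    """Rebuild the in-memory state of a NameNode.

    namespace_file: dict path -> list of block ids (last saved snapshot)
    edit_log:       list of {"op": ..., "path": ..., "blocks": ...} entries
    block_reports:  dict DataNode name -> list of block ids it stores
    """
    # Phase 1: load the namespace file into memory.
    files = {path: list(blocks) for path, blocks in namespace_file.items()}

    # Phase 2: replay every logged change since that snapshot, in order.
    for edit in edit_log:
        if edit["op"] == "create":
            files[edit["path"]] = list(edit["blocks"])
        elif edit["op"] == "delete":
            files.pop(edit["path"], None)

    # Phase 3: block locations are not persisted, so rebuild them from reports.
    locations = {}
    for datanode, blocks in block_reports.items():
        for block in blocks:
            locations.setdefault(block, set()).add(datanode)

    return files, locations
```

The cost of phase 2 grows with the length of the edit log, which is one reason a restart can take a long time.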

Starting a namenode...

... takes 30 minutes!

Can we do better?

3. Checkpoints with Secondary NameNode

Old namespace file + Edit log → New namespace file
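A checkpoint folds the edit log into the old namespace file so that the next startup has little or nothing left to replay. The sketch below illustrates that merge on in-memory data; `checkpoint` and the `(op, path, blocks)` tuple format are assumptions for the example, not the Secondary NameNode's actual interface.

```python
def checkpoint(old_namespace, edit_log):
    """Merge the edit log into the old namespace; return (new namespace, empty log)."""
    new_namespace = dict(old_namespace)
    for op, path, blocks in edit_log:
        if op == "create":
            new_namespace[path] = blocks
        elif op == "delete":
            new_namespace.pop(path, None)
    # After the merge the log can start over from empty.
    return new_namespace, []
```

However long the edit log was, the result is a single namespace file whose size depends only on the files that still exist.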

4. High Availability (HA): Backup NameNodes

[Diagram: a NameNode and a Backup NameNode, each holding /dir/file1, /dir/file2 and /file3, above six DataNodes]

The Backup NameNode maintains mappings and locations in memory like the NameNode.

5. High Availability (HA): Standby NameNodes

Active Namenode

Standby Namenode

Standby Namenode

6. Federated DFS

Namenode /foo: /foo/file1, /foo/file2

Namenode /bar: /bar/file1, /bar/file2

[Diagram: both Namenodes above the same six Datanodes]
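With federation, each NameNode is responsible for one part of the namespace, so a client first has to decide which NameNode to ask. A minimal sketch of that routing step, assuming a simple prefix table: the host names `namenode-foo` and `namenode-bar` and the function `namenode_for` are hypothetical.

```python
# Hypothetical mount table: namespace prefix -> NameNode responsible for it.
MOUNTS = {"/foo": "namenode-foo", "/bar": "namenode-bar"}


def namenode_for(path, mounts=MOUNTS):
    """Return the NameNode owning path, using the longest matching prefix."""
    matches = [
        prefix for prefix in mounts
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no NameNode is responsible for {path}")
    return mounts[max(matches, key=len)]
```

Matching on whole path components keeps a path such as /foobar/x from being routed to the NameNode for /foo.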

Using HDFS

HDFS Shell

$ hadoop fs <args>

$ hdfs dfs <args>

local filesystem

HDFS Shell: POSIX-like

$ hadoop fs -ls

$ hadoop fs -cat /dir/file

$ hadoop fs -rm /dir/file

$ hadoop fs -mkdir /dir

HDFS Shell: upload and download

$ hadoop fs -copyToLocal /user/hadoop/file localfile

$ hadoop fs -copyFromLocal localfile1 localfile2 /user/hadoop/hadoopdir

HDFS Shell: Configuration

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:8020</value>
    <description>NameNode hostname</description>
  </property>
</configuration>

HDFS Shell: Configuration

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Replication factor</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/grid/hadoop/hdfs/nn</value>
    <description>NameNode directory</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/grid/hadoop/hdfs/dn</value>
    <description>DataNode directory</description>
  </property>
</configuration>
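Configuration files of this shape are plain XML lists of name/value properties, so they are easy to inspect with a few lines of code. The helper below is a sketch using only Python's standard library; `read_properties` is a hypothetical name, and the sample text reuses two properties from the slide. It looks for `property` elements wherever they occur, so it does not depend on the name of the root element.

```python
import xml.etree.ElementTree as ET

SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Replication factor</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/grid/hadoop/hdfs/nn</value>
    <description>NameNode directory</description>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each property name to its value (as strings)."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


if __name__ == "__main__":
    for name, value in read_properties(SAMPLE_HDFS_SITE).items():
        print(f"{name} = {value}")
```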

Populating HDFS: Apache Flume

Collects, aggregates, moves log data (into HDFS)

Populating HDFS: Apache Sqoop

Imports from a relational database

GFS

GFS vs. HDFS: Terminology

HDFS              GFS
NameNode          Master
DataNode          Chunkserver
Block             Chunk
FS Image          Checkpoint image
Edit log          Operation log

HDFS vs. GFS: Block size

GFS/Apache HDFS: 64 MB

Cloudera HDFS: 128 MB
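The block size determines how many blocks a file is cut into, and therefore how many entries the NameNode has to track. A small worked example, assuming 1 MB = 1024 × 1024 bytes; `split_into_blocks` is a hypothetical helper, not part of any HDFS API.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes in bytes of the blocks a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


if __name__ == "__main__":
    one_gb = 1024 * MB
    print(len(split_into_blocks(one_gb, 64 * MB)))    # 16 blocks
    print(len(split_into_blocks(one_gb, 128 * MB)))   # 8 blocks
```

Doubling the block size halves the number of blocks for a large file; the last block of a file only takes as much space as the data it holds.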

Pointers

Official documentation
http://hadoop.apache.org/docs/r3.2.1/

GFS Paper
On course website

Java API
https://hadoop.apache.org/docs/r3.2.1/api/index.html