Native Erasure Coding Support in HDFS

TRANSCRIPT

HDFS Erasure Coding

Zhe Zhang
zhezhang@cloudera.com

Replication is Expensive

[Diagram: the NameNode maps each block to three replicas, one each on DataNode0, DataNode1, and DataNode2.]

§ HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
§ 200% storage overhead
§ Secondary replicas rarely accessed

Erasure Coding Saves Storage

§ Erasure coding: a type of error-correction coding
§ Simplified example: storing 2 bits

Replication: 1 0 → 1 0, 1 0 (2 extra bits)
XOR coding: 1 0 → parity 1 ⊕ 0 = 1 (1 extra bit)

§ Same data durability - can lose any 1 bit
§ Half the storage overhead
§ Slower recovery
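The XOR example above can be run end to end; a minimal sketch in Java (illustrative, not HDFS code):

    // Store bits d0, d1 plus one XOR parity bit instead of two extra replicas.
    public class XorParityDemo {
        public static void main(String[] args) {
            int d0 = 1, d1 = 0;
            int parity = d0 ^ d1;              // the single extra bit

            // Lose any one bit and the other two recover it.
            int recoveredD1 = d0 ^ parity;     // = 0, the lost d1
            System.out.println("recovered d1 = " + recoveredD1);
        }
    }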

Erasure Coding Saves Storage

§ Facebook
- f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
§ Google File System
- Large portion of data stored in EC

Roadmap

§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems

§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction

§ Hardware-accelerated Codec Framework

Durability and Efficiency

Data Durability = how many simultaneous failures can be tolerated?
Storage Efficiency = what portion of storage holds useful data?

3-way Replication:
[Diagram: the NameNode maps a block to replicas on DataNode0, DataNode1, and DataNode2; one replica is useful data, the other two are redundant data.]
Data Durability = 2
Storage Efficiency = 1/3 (33%)

XOR Coding:

  X   Y   X ⊕ Y
  0   0   0
  0   1   1
  1   0   1
  1   1   0

X and Y are useful data; X ⊕ Y is redundant data. Any one lost cell is recovered by XOR-ing the other two, e.g. Y = 0 ⊕ 1 = 1.

Data Durability = 1
Storage Efficiency = 2/3 (67%)

Reed-Solomon (RS):

Data Durability = 2
Storage Efficiency = 4/6 (67%)

Very flexible! An RS(k, m) scheme pairs k data cells with m parity cells (here 4 + 2) and tolerates any m simultaneous failures.

Durability and Efficiency

Scheme                  Data Durability   Storage Efficiency
Single Replica          0                 100%
3-way Replication       2                 33%
XOR with 6 data cells   1                 86%
RS (6,3)                3                 67%
RS (10,4)               4                 71%
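The table's values follow from two small formulas: an RS(k, m) group survives any m failures and k of every k + m cells are useful, while n-way replication survives n - 1 failures at 1/n efficiency. A minimal sketch (illustrative, not HDFS code):

    public class EcSchemeMath {
        // k data cells + m parity cells: tolerates any m failures;
        // k out of (k + m) stored cells are useful data.
        static void describe(String name, int k, int m) {
            double efficiency = 100.0 * k / (k + m);
            System.out.printf("%s: durability=%d, efficiency=%.0f%%%n", name, m, efficiency);
        }

        public static void main(String[] args) {
            describe("XOR with 6 data cells", 6, 1);  // 1, 86%
            describe("RS (6,3)", 6, 3);               // 3, 67%
            describe("RS (10,4)", 10, 4);             // 4, 71%
        }
    }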

EC in Distributed Storage

Block Layout:

Contiguous Layout:
[Diagram: a 768 MB file stored as 128 MB blocks; block 0 (0~128M) on DataNode 0, block 1 (128~256M) on DataNode 1, ..., block 5 (640~768M) on DataNode 5, with parity blocks on DataNode 6 ~ DataNode 8.]
✓ Data Locality
✗ Small Files

Striped Layout:
[Diagram: the same file split into 1 MB cells striped round-robin across the six data blocks (cell 0~1M on DataNode 0, 1~2M on DataNode 1, ..., 5~6M on DataNode 5, then 6~7M wrapping back to DataNode 0, through 127~128M per block), with parity cells on DataNode 6 ~ DataNode 8.]
✗ Data Locality
✓ Small Files
✓ Parallel I/O
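Under striping, a logical file offset maps to a (stripe, internal block, offset-in-cell) coordinate. A minimal sketch of that arithmetic, assuming 1 MB cells and six data blocks as in the diagram (illustrative, not HDFS code):

    public class StripeMapping {
        static final long CELL_SIZE = 1L << 20;  // 1 MB cells
        static final int DATA_BLOCKS = 6;        // e.g. an RS(6,3) group

        public static void main(String[] args) {
            long offset = 7 * CELL_SIZE + 12345;              // some logical offset
            long cellIndex = offset / CELL_SIZE;              // 7th cell overall
            int blockIndex = (int) (cellIndex % DATA_BLOCKS); // wraps back to block 1
            long stripeIndex = cellIndex / DATA_BLOCKS;       // second stripe
            long offsetInCell = offset % CELL_SIZE;
            System.out.printf("offset %d -> stripe %d, block %d, +%d%n",
                    offset, stripeIndex, blockIndex, offsetInCell);
        }
    }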

EC in Distributed Storage

Spectrum:

[Diagram: systems placed along two axes, replication vs. erasure coding and striped vs. contiguous layout; Ceph and Quantcast File System appear under both layouts, while HDFS, Facebook f4, and Windows Azure appear under the contiguous layout.]

Roadmap

§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems

§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction

§ Hardware-accelerated Codec Framework

Choosing Block Layout

• Assuming (6,3) coding
• Small files: < 1 block
• Medium files: 1~6 blocks
• Large files: > 6 blocks (1 group)

File-size profiles of three production clusters (small / medium / large):

               file count                space usage               note
Cluster A      96.29% / 1.86% / 1.85%    26.06% / 9.33% / 64.61%   top 2% of files occupy ~65% of space
Cluster B      86.59% / 11.38% / 2.03%   23.89% / 36.03% / 40.08%  top 2% of files occupy ~40% of space
Cluster C      99.64% / 0.36% / 0.00%    76.05% / 20.75% / 3.20%   dominated by small files
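The small/medium/large buckets above reduce to a simple rule; a sketch assuming the default 128 MB block size (illustrative):

    public class FileSizeBuckets {
        static final long BLOCK = 128L << 20;  // 128 MB HDFS block
        static final int DATA_BLOCKS = 6;      // one (6,3) block group

        static String bucket(long fileSize) {
            if (fileSize < BLOCK) return "small";                  // < 1 block
            if (fileSize <= DATA_BLOCKS * BLOCK) return "medium";  // 1~6 blocks
            return "large";                                        // > 1 group
        }
    }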

Choosing Block Layout

[Quadrant chart: block layout (contiguous vs. striped) against redundancy form (replication vs. erasure coding). Current HDFS occupies contiguous + replication; Phase 1.1 implements striped erasure coding and Phase 2 (future work) contiguous erasure coding, with Phase 1.2 and Phase 3 (future work) covering the remaining combinations.]

NameNode: Generalizing the Block Concept

- Mapping logical and storage blocks
- Too many storage blocks? Hierarchical naming protocol
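The slide names the protocol but not its mechanics; one plausible reading (an assumption, not the exact HDFS scheme) is deriving every storage block's ID from its block group's ID plus a small index, so the NameNode tracks a single ID per logical block group:

    // Assumption-based sketch, not the exact HDFS implementation:
    // group IDs are aligned so their low bits are free to hold an index.
    public class HierarchicalNaming {
        static final int INDEX_BITS = 4;  // room for up to 16 blocks per group

        static long storageBlockId(long groupId, int indexInGroup) {
            return groupId | (indexInGroup & ((1 << INDEX_BITS) - 1));
        }

        static long groupIdOf(long storageBlockId) {
            return storageBlockId & ~(((long) 1 << INDEX_BITS) - 1);
        }

        public static void main(String[] args) {
            long group = 0x1000;                 // low 4 bits are zero
            for (int i = 0; i < 9; i++) {        // RS(6,3): 6 data + 3 parity blocks
                System.out.println(Long.toHexString(storageBlockId(group, i)));
            }
        }
    }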

Client Parallel Writing

[Diagram: DFSStripedOutputStream drives one DataStreamer per internal block of a blockGroup (DataStreamer 0 ~ DataStreamer 4 writing blk_1009 ~ blk_1013), each fed by its own dataQueue; a Coordinator allocates new block groups and synchronizes the streamers.]
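A minimal sketch of the fan-out in the diagram, assuming 1 MB cells and six data streamers (illustrative; parity streamers and failure handling omitted, and this is not the real DFSStripedOutputStream):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Cells are handed round-robin to per-streamer queues; each streamer
    // would drain its queue to one DataNode.
    public class StripedWriterSketch {
        static final int DATA_STREAMERS = 6;

        final List<BlockingQueue<byte[]>> queues = new ArrayList<>();
        int next = 0;

        StripedWriterSketch() {
            for (int i = 0; i < DATA_STREAMERS; i++) {
                queues.add(new LinkedBlockingQueue<>());
            }
        }

        void writeCell(byte[] cell) throws InterruptedException {
            queues.get(next).put(cell);             // enqueue for one DataNode
            next = (next + 1) % DATA_STREAMERS;     // rotate to the next streamer
        }
    }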

Client Parallel Reading

[Diagram: stripes 0 ~ 2 laid across DataNodes holding the data blocks and parity blocks; requested cells are read in parallel, a failed DataNode triggers recovery reads of surviving data and parity cells in the stripe, and cells past the end of the file are treated as all zero.]
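When a read hits a failed DataNode, the client only needs any k surviving cells of the stripe to decode. A minimal sketch of source selection under RS(6,3) (assumption-based; the real client also weighs locality and load):

    import java.util.ArrayList;
    import java.util.List;

    public class RecoveryReadPlanner {
        // Pick the first k alive cells (data cells first, then parity).
        static List<Integer> pickSources(boolean[] alive, int k) {
            List<Integer> sources = new ArrayList<>();
            for (int i = 0; i < alive.length && sources.size() < k; i++) {
                if (alive[i]) sources.add(i);
            }
            if (sources.size() < k) throw new IllegalStateException("unrecoverable stripe");
            return sources;
        }

        public static void main(String[] args) {
            boolean[] alive = {true, false, true, true, true, true, true, true, true};
            System.out.println(pickSources(alive, 6));  // [0, 2, 3, 4, 5, 6]
        }
    }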

Reconstruction on DataNode

§ Important to avoid delay on the critical path
- Especially if original data is lost
§ Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
§ New ErasureCodingWorker component on DataNode
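To make reconstruction concrete, here is the simplest codec case, XOR, where a missing cell is the XOR of all survivors (HDFS's codec framework is pluggable, with Reed-Solomon as the usual production coder; this sketch covers only XOR):

    public class XorReconstruct {
        // XOR all surviving cells (data + parity) to rebuild the missing one.
        static byte[] reconstruct(byte[][] survivors, int cellSize) {
            byte[] out = new byte[cellSize];
            for (byte[] cell : survivors) {
                for (int i = 0; i < cellSize; i++) {
                    out[i] ^= cell[i];
                }
            }
            return out;
        }
    }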

Roadmap

§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems

§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction

§ Hardware-accelerated Codec Framework

Acceleration with Intel ISA-L

§ 1 legacy coder
- From Facebook’s HDFS-RAID project
§ 2 new coders
- Pure Java coder (code improvement over HDFS-RAID)
- Native coder using Intel’s Intelligent Storage Acceleration Library (ISA-L)
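The coder gap shown in the next two slides is typically measured with a tight encode loop; an illustrative harness (not the benchmark behind these charts) that times the kind of byte-wise work a pure-Java coder does and a native coder vectorizes:

    import java.util.Arrays;

    public class CodecBench {
        public static void main(String[] args) {
            int cellSize = 1 << 20;                // 1 MB cells
            byte[][] data = new byte[6][cellSize]; // 6 data cells
            byte[] parity = new byte[cellSize];

            long start = System.nanoTime();
            for (int round = 0; round < 100; round++) {
                Arrays.fill(parity, (byte) 0);
                for (byte[] cell : data) {
                    for (int i = 0; i < cellSize; i++) {
                        parity[i] ^= cell[i];      // the hot loop
                    }
                }
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("100 XOR encodes of 6 MB: " + ms + " ms");
        }
    }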

Microbenchmark: Codec Calculation

Microbenchmark: HDFS I/O

Conclusion

§ Erasure coding expands effective storage space by ~50%!
§ HDFS-EC phase I implements erasure coding in the striped block layout
§ Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
§ Phase II will support contiguous block layout for better locality

Acknowledgements

§ Cloudera - Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
§ Intel - Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
§ Hortonworks - Jing Zhao, Tsz Wo Nicholas Sze
§ Huawei - Walter Su, Rakesh R, Xinwei Qin
§ Yahoo (Japan) - Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng

Just merged to trunk!

Questions?
