2012 Storage Developer Conference. © 2012 GoDaddy.com. All Rights Reserved.
Improving Cloud Storage Cost and Data Resiliency with Erasure Codes
Michael Penick
Commodity Storage
Hosting storage
FTP backup
Goals
  Inexpensive (use “commodity” hardware)
  Resilient to failures
  Highly available
  Customizable
MogileFS
Open source distributed filesystem
Written by Brad Fitzpatrick
No single point of failure
Automatic/asynchronous file replication
Shared-nothing design (disks)
Local filesystem agnostic
Flat namespace
[Diagram: Tracker, Storage Node, MetadataDB]
MogileFS
[Diagram: Clients, Tracker, Storage Node, MetadataDB]
“NebulaFS”
Large file support
Offsite replication
Self-healing
Data retention
C++ client (PHP and Perl SWIG wrappers)
Metadata sharding
Range GETs
“NebulaFS”
[Diagram: Clients, Tracker / Storage Node (with MySQL), Storage Node]
FTP Backup
[Diagram: FTP Presentation (Net::FTPServer), VFS DB, NebulaFS API, NebulaFS (Metadata DB, Super Nodes, Storage Nodes)]
Widely Applicable
“Storage service” (REST) layer
New product integrations
  Online File Folder (videos and images)
  Website Builder / Photo Album
  Go Daddy Cloud Servers (snapshots)
  Email
  …
Object Storage
[Diagram: RESTful Presentation (S3, GDCS), VFS, VFS DB, User DB, NebulaFS API, NebulaFS (Metadata DB, Super Nodes, Storage Nodes)]
Why?
Why?
[Pie chart, ~3.25 PB total. Categories: Aries, FTP, WST/PA, OFF, VDC, Other. Slice labels: 5.39%, 8.51%, 83.89%, 1.20%, 1.01%]
[Pie chart, ~10.8 PB total. Categories: Aries, FTP, WST/PA, OFF, VDC, Email, Other. Slice labels: 1.80%, 2.56%, 38.44%, 1.44%, 55.44%, 0.30%]
The Problem
NebulaFS = Inexpensive, resilient, highly available storage
Problem: Disk drives fail... a lot.
  F = mean time to failure of one device
  In a system of n devices, the mean time to failure is F/n
Solution: Replicate the data
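The F/n rule above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name is made up for illustration, and the 1,000,000-hour device MTTF is an assumed figure, not one from the talk. The 180-drive count is borrowed from the performance test setup later in the deck.

```python
def system_mttf_hours(device_mttf_hours: float, n_devices: int) -> float:
    """Mean time to the first failure among n identical, independent devices (F/n)."""
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    return device_mttf_hours / n_devices


# Assumed example values: F = 1,000,000 hours per drive, n = 180 drives.
hours = system_mttf_hours(1_000_000, 180)
print(f"Expect a drive failure roughly every {hours / 24:.0f} days")
```

Even with an optimistic per-drive figure, a modest cluster sees failures often enough that redundancy has to be the normal operating mode.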
Replication
[Diagram: the data is duplicated into Copy 1 through Copy 4 and replicated across Disk 1 … Disk n. Success!]
Replication
Simple and effective
Durability:
  99.99999% over 1 year (or 0.1 lost per 1 million objects)
  99.99% over 3 years (or 100 lost per 1 million objects)
Problem: 100% overhead per copy
  +300% overhead for 3 onsite copies and 1 offsite copy
There has to be a better way.
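The durability and overhead figures above follow from simple arithmetic, sketched here. The helper names are illustrative only. Exact fractions are used so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction


def expected_losses(durability_percent: str, objects: int = 1_000_000) -> Fraction:
    """Expected number of lost objects for a durability given as a percentage string."""
    return (1 - Fraction(durability_percent) / 100) * objects


def replication_overhead_percent(total_copies: int) -> int:
    """Extra storage beyond the first copy, as a percentage of the original size."""
    return (total_copies - 1) * 100


print(float(expected_losses("99.99999")))   # 1-year figure from the slide
print(float(expected_losses("99.99")))      # 3-year figure from the slide
print(replication_overhead_percent(4))      # 3 onsite + 1 offsite copies
```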
Erasure Codes
Forward error correction code
  Add redundant data (codes) to a message so that it can be recovered
Where’s EC used?
  Optical media
  Media streaming
  File systems (RAID-6, several distributed FS, …)
Erasure code (write)
[Diagram: the data is divided into k strips and encoded to produce m additional code strips; the strips are stored on Disk 1 … Disk n. The k data strips are labeled "Copy 1" and the m code strips "Copy .75".]
Erasure code (read)
[Diagram: k strips are read from Disk 1 … Disk n, verified, and decoded back into the original data.]
Erasure codes
What?
  k – number of original pieces
  m – number of redundant pieces (codes)
How?
  k = 4, m = 3: only 75% overhead (tolerates 3 failures)
  k = 10, m = 6: only 60% overhead (tolerates 6 failures)
  k = 9, m = 3: only 33% overhead (tolerates 3 failures)
AKA: k = 10, m = 6 is "10 of 16"
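The overhead percentages above are simply m / k. A minimal sketch, with function names chosen here for illustration:

```python
def ec_overhead_percent(k: int, m: int) -> float:
    """Redundant storage as a percentage of the original data size."""
    return 100 * m / k


def ec_label(k: int, m: int) -> str:
    """The 'k of n' shorthand used on the slides, where n = k + m."""
    return f"{k} of {k + m}"


for k, m in [(4, 3), (10, 6), (9, 3)]:
    print(f"{ec_label(k, m)}: {ec_overhead_percent(k, m):.0f}% overhead, "
          f"tolerates {m} lost pieces")
```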
Trade-offs (positive)
Better resilience to failure
Durability for 10 of 16:
  99.9999999999% over 1 year (or 0.000001 lost per 1 million objects)
  99.99999% over 3 years (or 0.1 lost per 1 million objects)
Durability for 9 of 12:
  99.99999% over 1 year (or 0.1 lost per 1 million objects)
  99.99% over 3 years (or 100 lost per 1 million objects)
Significant savings (includes a full offsite copy)
  10 of 16: (4 – 2.60) / 4 = 35% savings (60% w/o offsite)
  9 of 12: (4 – 2.33) / 4 = 42% savings (67% w/o offsite)
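The savings figures can be reproduced as follows. This sketch assumes, as the slide's numbers imply, that the baseline is always four replicated copies (3 onsite plus 1 offsite), and that the erasure-coded layout stores (k + m) / k of the data plus, optionally, one full offsite copy. The function names are illustrative.

```python
def storage_factor(k: int, m: int, offsite_copies: int = 0) -> float:
    """Total stored bytes per byte of user data for a k-of-(k+m) layout."""
    return (k + m) / k + offsite_copies


def savings_percent(k: int, m: int, offsite_copies: int,
                    replica_copies: int = 4) -> float:
    """Savings relative to keeping replica_copies full copies."""
    used = storage_factor(k, m, offsite_copies)
    return 100 * (replica_copies - used) / replica_copies


for k, m in [(10, 6), (9, 3)]:
    with_offsite = savings_percent(k, m, offsite_copies=1)
    without_offsite = savings_percent(k, m, offsite_copies=0)
    print(f"{k} of {k + m}: {with_offsite:.0f}% savings "
          f"({without_offsite:.0f}% w/o offsite)")
```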
Trade-offs (negative)
Computationally expensive
Increased number of IOPS
Complexity (additional metadata)
More nodes and connections
Erasure Codes
Optimal erasure code
  Any k pieces of the message can recover the message
  Reed-Solomon (and Cauchy Reed-Solomon)
Libraries (Jerasure, zfec, Luby, librs, …)
  Stability/performance evaluation
  Paper – “A Performance Comparison of Open-Source Erasure Coding Libraries for Storage Applications”
  http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf
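The "any k pieces recover the message" property can be illustrated with a toy Reed-Solomon-style code. This is a sketch under simplifying assumptions, not how Jerasure or zfec are implemented: it works over the prime field of integers mod 257 rather than the GF(2^w) fields the real libraries use, it treats each byte as one symbol, and all names are made up for the example. The k data symbols define a polynomial of degree below k; the pieces are its values at k + m points, and any k of them determine the polynomial.

```python
from itertools import combinations

P = 257  # prime modulus; a toy stand-in for GF(2^w)


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # modular inverse, Python 3.8+
    return total


def encode(data, m):
    """Return k + m pieces as (position, value); the first k values are the data itself."""
    k = len(data)
    points = list(enumerate(data))
    return [(x, _interpolate(points, x)) for x in range(k + m)]


def decode(pieces, k):
    """Recover the k data symbols from any k surviving pieces."""
    if len(pieces) < k:
        raise ValueError("need at least k pieces")
    points = pieces[:k]
    return [_interpolate(points, x) for x in range(k)]


def all_subsets_recover(data, m):
    """Check that every choice of k pieces out of k + m decodes to the original data."""
    k = len(data)
    pieces = encode(data, m)
    return all(decode(list(s), k) == list(data) for s in combinations(pieces, k))


print(all_subsets_recover(list(b"SDC!"), 3))  # k = 4, m = 3
```

Because code values can reach 256, this toy does not pack pieces back into bytes; working in GF(2^8) avoids that problem, which is one reason production libraries use it.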
EC Libraries - zfec
Reed-Solomon
Written in C (Python and Haskell bindings)
Download: http://pypi.python.org/pypi/zfec
Documentation is the source code
EC Libraries – zfec Encoding
EC Libraries – zfec Decoding
EC Libraries – zfec Decoding contd.
EC Libraries – zfec Decoding contd.
EC Libraries – zfec Decoding contd.
k = 6, m = 2, erasures = { 0, 2, -1 }
index = { 6, 1, 7, 3, 4, 5 }
inpkts:  coding 0, data 1, coding 1, data 3, data 4, data 5
outpkts: data 0, data 2
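The index array in this example follows a simple rule: each surviving data block stays in its own slot, and each missing data block's slot is filled with the next available coding block (coding blocks are numbered k, k + 1, …). The sketch below reproduces the slide's values. The helper is hypothetical and is not part of zfec's API; treating -1 as a list terminator is an assumption based on how the erasures list is written on the slide.

```python
def build_decode_index(k: int, m: int, erasures):
    """Build the k-entry block index for decoding.

    erasures lists the missing block numbers (data blocks are 0..k-1, coding
    blocks are k..k+m-1); a negative value is treated as a terminator.
    """
    missing = set()
    for e in erasures:
        if e < 0:
            break
        missing.add(e)

    spare_coding = [i for i in range(k, k + m) if i not in missing]
    index = []
    for i in range(k):
        if i not in missing:
            index.append(i)                    # surviving data block keeps its slot
        elif spare_coding:
            index.append(spare_coding.pop(0))  # fill the hole with a coding block
        else:
            raise ValueError("more blocks lost than the code can tolerate")
    return index


print(build_decode_index(6, 2, [0, 2, -1]))
```

With the slide's inputs the slots hold coding 0, data 1, coding 1, data 3, data 4, data 5, matching the inpkts list, and the decoder only has to produce data 0 and data 2.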
EC Libraries - Jerasure
Reed-Solomon, Cauchy Reed-Solomon, and Minimal Density codes
Written in C (no bindings)
Download: http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html
Good documentation and examples
EC Libraries – Jerasure Encoding
EC Libraries – Jerasure Decoding
EC Libraries – Performance
[Bar chart: Encoding throughput in MB/s (axis 0 to 2500) for w = 8, w = 16, w = 32. Series: Jerasure RS, Jerasure CRS, zfec]
EC Libraries – Performance
[Bar chart: Decoding throughput in MB/s (axis 0 to 900) for w = 8, w = 16, w = 32. Series: Jerasure RS, Jerasure CRS, zfec]
Integration – EC Library
EC library (Phase I)
  Read/Copy
  Write
  Repair
Integration – EC Library
Inputs/outputs abstracted
boost::asio (HTTP)
PHP/Perl bindings
Random access reads (i.e. Range GET)
Data validated/corrected on-the-fly
Integration – EC Library Writes
[Diagram: the data is split into k strips, m code strips are added, and the strips are written to Disk 1 … Disk n.]
Integration – EC Library Failures
Integration – EC Library Reads
[Diagram: k strips are read from Disk 1 … Disk n and decoded into the original data.]
Integration – EC Library
Integration – EC Library Copy
[Diagram: strips on Disk 1 … Disk n, with k of them labeled, alongside a second set of disks (Disk 1, Disk 2).]
Integration – EC Library
Integration – EC Library Repair
[Diagram: data and code strips on Disk 1 … Disk n, with k strips labeled.]
Integration – EC Library
Integration – Reads/Writes
Integration – Reads/Writes DB
Increased number of “file_device” entries
Decreased number of “file” entries
Change the meaning of “class”
Integration – Reads/Writes DB
Integration – Write
Integration – Read
Integration – Recovery
Lessons Learned
CRC32 can be slow
  Intel’s Slicing-by-8 algorithm
Block size can limit your smallest file size
Lighttpd doesn’t support “Transfer-Encoding: chunked”
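The slide only names the Slicing-by-8 algorithm. As an illustration of the idea, the sketch below builds eight lookup tables and consumes eight input bytes per step instead of one. It is written for clarity, not speed: in pure Python it will not outperform zlib's built-in CRC, and it is not the code used in NebulaFS. It uses the reflected CRC-32 polynomial 0xEDB88320 so the result can be checked against `zlib.crc32`.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial (same CRC as zlib)


def _make_tables():
    """Table 0 is the classic byte-at-a-time table; table n shifts it by n more bytes."""
    t0 = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        t0.append(crc)
    tables = [t0]
    for _ in range(7):
        prev = tables[-1]
        tables.append([(c >> 8) ^ t0[c & 0xFF] for c in prev])
    return tables


TABLES = _make_tables()


def crc32_slicing_by_8(data: bytes) -> int:
    t = TABLES
    crc = 0xFFFFFFFF
    n8 = len(data) - len(data) % 8
    for off in range(0, n8, 8):
        # Fold the running CRC into the first four bytes, then look up all eight.
        lo = crc ^ int.from_bytes(data[off:off + 4], "little")
        hi = int.from_bytes(data[off + 4:off + 8], "little")
        crc = (t[7][lo & 0xFF] ^ t[6][(lo >> 8) & 0xFF] ^
               t[5][(lo >> 16) & 0xFF] ^ t[4][lo >> 24] ^
               t[3][hi & 0xFF] ^ t[2][(hi >> 8) & 0xFF] ^
               t[1][(hi >> 16) & 0xFF] ^ t[0][hi >> 24])
    for byte in data[n8:]:  # leftover tail, one byte at a time
        crc = t[0][(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


sample = b"erasure-coded strip payload " * 10
assert crc32_slicing_by_8(sample) == zlib.crc32(sample)
```

In C the gain comes from replacing eight dependent table lookups with eight independent ones per 8-byte block, which the CPU can overlap.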
Performance Test Setup
6 super nodes (tracker and storage node)
180 drives
Drives not evenly distributed, i.e. not 30 drives per node
EC strips maximally distributed
Performance Test Results
[Chart: Writes. Throughput in MB/s (axis 0 to 80) by file size (1KB, 1MB, 16MB, 32MB, 64MB, 128MB, 256MB). Series: ec_1_of_2, ec_6_of_9, ec_9_of_12, ec_10_of_16, replication]
Performance Test Results
[Chart: Reads. Throughput in MB/s (axis 0 to 200) by file size (1KB, 1MB, 16MB, 32MB, 64MB, 128MB, 256MB). Series: ec_1_of_2, ec_6_of_9, ec_9_of_12, ec_10_of_16, replication]
Migrations
[Diagram: RESTful Presentation (S3, GDCS), VFS, VFS DB, User DB, Migration Script, NebulaFS API, NebulaFS (Metadata DB, Super Nodes, Storage Nodes)]
Future
Finish Phase III
  Repairs
  Offsite copy
Net new growth
Optimizations
Open source
Questions
Thank You!