scalable, reliable, and efficient object storage for hadoop · replacing hdfs option 2: drop-in...
TRANSCRIPT
![Page 1: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/1.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Scalable, reliable, and efficient object storage for Hadoop
Greg Dhuse Cleversafe
![Page 2: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/2.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
The case for object storage with Hadoop Specifically, dispersed object storage
Integration options Object interface vs. drop-in replacement for
HDFS Data-local computation and information
dispersal
2
Outline
![Page 3: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/3.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Popular open-source framework for distributed computation Commercialized by Cloudera and others
Originally just MapReduce, but branching out to other paradigms with YARN
Flourishing ecosystem Open source & commercial products
Mantra: Take the computation to the data
3
Why ?
![Page 4: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/4.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Hadoop Architecture
4
Compute
Storage
![Page 5: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/5.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Master-slave architecture: NameNode Point of failure: Previously a single point of
failure, now a clustered point of failure with HA
Scalability bottleneck: In the I/O path. NameNode federation helps, but introduces administrative headaches and increases failure footprint
Designed around replication High storage overhead
5
Limitations of HDFS
![Page 6: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/6.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
What is Object Storage?
Object storage = Data payload + (optional) Metadata + (optional) Indexing Data payload Typically large (1MB++) Efficient streaming writes, no random offset
writes Efficient reads Streaming, range, offset
6
![Page 7: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/7.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
What is Object Storage?
Object storage = Data payload + (optional) Metadata + (optional) Indexing Metadata (optional) Object name If indexed, can be treated like a file system
path at lookup time Typical file system metadata User-defined metadata
7
![Page 8: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/8.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
What is Object Storage?
Object storage = Data payload + (optional) Metadata + (optional) Indexing Indexing (optional) Index objects by arbitrary key Name, metadata field, compound keys
Index within objects Build indexes at write time or later ( ) No centralized database or index controllers
8
![Page 9: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/9.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Why Object Storage?
Flexibility Mix & match payload, metadata, indexing Many access methods (vs. HDFS)
Concurrency Payload is write once, read many Metadata & index can use lockless algorithms
Object interfaces are simpler But developers are used to POSIX
Many applications don’t need a file system May already use a database
9
![Page 10: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/10.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Object Access Flexibility
10
Site 1 Site 2 Site 3 Site n
Storage & compute nodes
Access Method
Application Layer
Embedded Client
Access Gateways
Object ID
Write by Name
Write by Object-ID
Application Server
or
or
Object & Metadata
Object & Metadata Object
data
Object data
REST S3, CDMI, legacy fs
(optional)Metadata
Database
Application Server
![Page 11: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/11.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Scalability Flat namespace Namespace is a mapping from node to
name range Used by the client and servers for lookup Objects are randomly named and evenly
distributed No master controllers, no databases, no
NameNode Nodes that are physically close appear
nearby in the namespace Store data on opposite sides of the
namespace for maximum separation Inspired by Distributed Hash Tables
11
SS1 SS2
SS3
Pillar 1
Pillar 3 Pillar 2
Pillar 4 SS4
![Page 12: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/12.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Dispersed Object Storage
Improved Efficiency & Reliability for Object Storage
Instead of replication or RAID, Information Dispersal An Information Dispersal Algorithm (IDA)
Receives K inputs, produces N outputs Can recover input from any K of the N outputs Where 1 ≤ K ≤ N
In essence: an IDA is forward error correction combined with a slicing and spreading operation
12
![Page 13: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/13.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
IDA example: K = 10, N = 16
13
Site 1
Site 2
Site 3
Site 4
8h$1 vD@- fMq& Z4$’ >hip )aj% l[au T0kQ %~fa Uh(k My)v 9hU6 >kiR &i@n pYvQ 4Wco
8h$1 vD@- >hip )aj% l[au %~fa 9hU6 >kiR pYvQ 4Wco
IDA
IDA
![Page 14: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/14.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Storage Efficiency
Bytes stored: N/K * object size 1.0 < factor< 2.0
Factor approaches 1.0 for fixed failure tolerance (N-K) as N increases Limited memory
constrains N in practice
14
1
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
10 30 50 70 90
N-K = 6
![Page 15: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/15.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Reliability
15
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
100% 1,
080
5,40
0
10,8
00
21,6
00
27,0
00
32,4
00
37,8
00
43,2
00
48,6
00
54,0
00
59,4
00
64,8
00
70,2
00
75,6
00
81,0
00
86,4
00
91,8
00
97,2
00
102,
600
Prob
abili
ty
System Size (TB)
Annual Chance of Data Loss (Enterprise Drives)
20/16 Dispersal 8+2 RAID 6 4+1 RAID 5 3x Replication
![Page 16: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/16.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Apply the IDA to Everything
Metadata Metadata stored as dispersed objects Lockless updates
Indexing Index data stored as dispersed objects Many concurrent readers & writers Client-side caching
For details, come to: Scaling Metadata into the Exabytes
16
![Page 17: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/17.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Object Storage vs. HDFS
HDFS-0.23 and beyond Dispersed Object Storage
Scalability
Federation allows administrators to manually split namespace among a small number of nodes
Dispersed metadata allows all nodes to be treated equally. No special metadata nodes to spec, configure & monitor
Reliability / Availability High Availability fail-over to a backup NameNode
IDA offers configurable reliability & availability with seamless handling of multiple simultaneous failures
Cost Efficiency Replication (3x default) IDA provides better reliability and less cost
![Page 18: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/18.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
IDA Configuration: 26/18, 1.44x expansion factor
Mean Time to Data Loss (MTTDL): 390,000 years
Availability: 9 9’s of availability
Efficiency: 1.44x expansion factor = 28.2 PB raw Reduction of cluster to 9600 nodes from 20K
Reliability • 10 clusters with a total of 20K data nodes in 2009 (19.6 PB usable) Lost 19/329M blocks due to NameNode failures Yahoo! lost over 1 GB of data in 1 year, even with replication
Availability 25 clusters with 22 NameNode outages in 2009, only 8 of which would be avoided
with HA (2 9’s of availability) Efficiency Assuming default 3x replication and 64MB block size: 58.8 PB raw storage
HDFS at Yahoo! Based on a case study presented at Hadoop World 2011 by Suresh Srinivas of Hortonworks
![Page 19: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/19.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Object Storage & Hadoop
Option 1: Object-based access Benefit: Allows full flexibility of data, metadata,
and indexing Drawback: Interface is not compatible with
HDFS Requires changes to existing MapReduce jobs
Accomplished using custom InputFormat & OutputFormat
![Page 20: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/20.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so
existing jobs can run without modification Drawback: File system paradigm is less
flexible & powerful Accomplished by implementing Hadoop’s
FileSystem interface File system semantics are emulated using
object storage, metadata, and indexing For more details come to: Bridging POSIX-like
APIs and Cloud Storage
![Page 21: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/21.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Data Local Computation & the IDA
One problem: How can we do data-local computation on dispersed objects? Each object is split into N slices and can be
restored from any K Need to adhere to the mantra: Map tasks
should read locally, not from the network
Solution: At write time, project input so that after dispersal raw data falls in contiguous chunks on K storage nodes
21
![Page 22: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/22.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Data Local Computation & the IDA
At read time: Hadoop tasks read locally from raw slices, bypassing IDA reconstruction Fall back to full IDA reconstruction on error
Best of both worlds On a healthy system, reads for each
MapReduce task can be satisfied locally, almost identically to HDFS
Full reliability/availability of dispersal in case of failure
22
![Page 23: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/23.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Dispersal Pipeline for Hadoop
Segmentation IDA
Raw data stream
Segmentation metadata & 1MB+
segments
Storage nodes
Computationally useful slices
Data projection
Write cache
Compute optimized data
chunks
![Page 24: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/24.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
HDFS Data Layout
Chunk 1 Write 1 (64MB * 3x)
Chunk 1 Read for Task 1 (64MB)
![Page 25: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/25.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Dispersed Data Projection
Segment 1 Write 1 (1MB)
Chunk 1 Read for Task 1(64MB)
![Page 26: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/26.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Storage node
Storage Node Architecture
Hadoop MapReduce computation runs directly on storage nodes
Jobs are assigned to nodes for local data access Calculate custom InputSplits based on object namespace and
data projection
Dispersed Storage Object Storage
API
Hadoop MapReduce
Local data access
![Page 27: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/27.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Dispersed Object Storage for Hadoop
What changes With Option 1: Object storage interfaces used by
MapReduce jobs What stays the same With Option 2: Same FileSystem interface as
HDFS. Dispersed Object Storage used transparently from MapReduce jobs
Compatible with existing Hadoop distributions and many existing tools & projects
Compute administration done using existing tools
![Page 28: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/28.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Drawbacks & Challenges
IDA reduces the effective chunk size for objects that are smaller than one full chunk set: K*chunk size Example: K=10, 64MB chunk size, 640MB
chunk set size. A 500MB object would only have 50MB chunks
Also applies to the last chunk set of any object Write projection requires substantial memory or
an on-disk cache, which may hurt performance
28
![Page 29: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/29.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Questions & Answers
Questions?
29
![Page 30: Scalable, reliable, and efficient object storage for Hadoop · Replacing HDFS Option 2: Drop-in replacement for HDFS Benefit: Fully compatible with HDFS, so existing jobs can run](https://reader030.vdocument.in/reader030/viewer/2022011907/5f4f5e03aea5683bc36a5e74/html5/thumbnails/30.jpg)
2012 Storage Developer Conference. Copyright © 2012 Cleversafe. All Rights Reserved.
Bonus: Indexing & Hadoop
One bonus feature: Build & use Object Storage indexes from Hadoop jobs
Build indexes on data using Indexing APIs from MapReduce jobs Analyze and index data in
parallel using index APIs Search and query your
indexed data
Use indexes in MapReduce jobs to efficiently find the data you need to process Index data and metadata at
ingest or later using MapReduce
Query the index directly from MapReduce jobs to find the data you need to analyze
Perform targeted analysis on only the relevant data