Why Scale-Out Big Data Apps Need a New Scale-Out Storage
Modern storage for modern business
Rob Whiteley, VP, Marketing, Hedvig
April 9, 2015
Agenda
• Big data pressures on storage infrastructure
• The rise of elastic software-defined storage (SDS)
• 6 SDS capabilities for big data
• 3 case studies of SDS for big data
Copyright 2015 Hedvig Inc. Confidential.
Big data pressures on storage infrastructure
Big data requires flexible infrastructure
[Diagram: business executives drive big data initiatives, developers demand time-to-market, and IT infrastructure & DevOps must deliver flexible infrastructure.]
According to Forrester . . .
• Enterprise data is growing 10x faster than storage budgets.
• 58% of orgs take days, weeks, or months to provision storage.
• 14% of orgs have "cloud-like" provisioning capabilities.
Source: Forrester Technology Adoption Profile: Meet Evolving Business Demands With Software-Defined Storage, March 2015. Visit hedviginc.com for the full research report.
Three truths and a lie about storage & big data
• Software-defined storage is the right direction.
• Hyperconverged provides the best economics.
• Big data apps are repeating the sins of the 90s.
• Hyperscale helps virtualize Hadoop and NoSQL.
The rise of elastic software-defined storage (SDS)
A big data inflection point in storage
[Chart: price/performance vs. storage capabilities, progressing from traditional to scale-up to scale-out to elastic architectures]

The big data software storage inflection point

                         Before (hardware-defined)    After (software-defined)
Storage flavors          Scale-out                    Elastic
Storage features         High-availability + RAID     Distributed + replication
Deployment flexibility   Hyperconverged               Hyperconverged + Hyperscale
Three legs to the big data requirements stool
[Diagram positioning monolithic arrays, virtual SANs, hyperconverged, and software-defined storage against the big data requirements.]
Hyperscale vs. hyperconverged
[Diagram: In a hyperscale deployment, Linux, hypervisor, and Windows hosts in the Hadoop/NoSQL cluster run a storage client and talk to a separate storage cluster of storage nodes. In a hyperconverged deployment, the Hadoop/NoSQL cluster and the storage cluster share the same nodes. Storage nodes can span DC1, DC2, and Cloud3.]
How SDS provides elastic storage for big data
Big data hosts connect to the storage cluster via iSCSI, NFS, or object interfaces.
1. Admin provisions virtual volumes and scripts or applies storage policies.
2. Virtual volumes present block, file, and object storage to big data hosts.
3. Storage client captures guest I/O and communicates with the underlying cluster.
4. Cluster distributes and replicates data, and applies compression and dedupe.
5. Cluster auto-tiers and balances to optimize data locality and availability.
Storage cluster nodes are commodity x86 or ARM servers.
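Step 1 above, provisioning a virtual volume with an attached storage policy, can be sketched as plain data that an admin could script or push through a REST API. This is a minimal illustrative model; the field and class names here are assumptions for the sketch, not Hedvig's actual API.

```python
# Hypothetical sketch of volume provisioning: a volume plus a storage policy
# expressed as a JSON-ready payload. Field names are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass
class StoragePolicy:
    replication_factor: int = 3
    dc_aware: bool = True          # one copy per datacenter
    compression: bool = True
    dedupe: bool = True

@dataclass
class VirtualVolume:
    name: str
    size_gb: int
    protocol: str                  # "block" (iSCSI), "file" (NFS), or "object"
    policy: StoragePolicy

vol = VirtualVolume("cassandra-01", size_gb=2048, protocol="block",
                    policy=StoragePolicy())
payload = asdict(vol)              # nested dataclasses become a plain dict,
                                   # ready to serialize as a provisioning call
```

Because the policy travels with the volume as data, the same definition works interactively, in scripts, or from orchestration tools, which is the self-provisioning pattern the case studies later describe.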
6 SDS capabilities for big data
6 big-data-friendly SDS capabilities
1. I/O sequentialization
2. Tunable replication
3. DR replication
4. Disk failures and rebuilds
5. Data efficiency methods
6. Flash caching & flash pinning
1. Random I/O to sequential writes
1. Application writes data in random blocks and gets an immediate ack from the cluster.
2. Storage cluster sequentializes incoming blocks (in RAM + SSD) into larger chunks.
3. Storage cluster writes the larger sequentialized chunks to the underlying disks in an auto-balanced, auto-distributed manner according to policy.
(The big data node runs the storage client; the cluster runs the storage nodes.)
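The three steps above can be sketched as a small write buffer: random writes are acked immediately, accumulated in memory, and flushed as one larger, offset-sorted chunk. A minimal sketch with illustrative names and thresholds, not Hedvig's implementation:

```python
# Sketch of write sequentialization: random application writes are buffered
# (as the cluster does in RAM/SSD) and flushed to disk as one larger,
# offset-sorted sequential chunk. Threshold and structure are illustrative.
class WriteBuffer:
    def __init__(self, flush_threshold=4):
        self.pending = {}            # offset -> block payload
        self.flushed_chunks = []     # each flush emits one sequential chunk
        self.flush_threshold = flush_threshold

    def write(self, offset, data):
        """Accept a random write, ack immediately; flush when the buffer fills."""
        self.pending[offset] = data  # latest write to an offset wins
        if len(self.pending) >= self.flush_threshold:
            self.flush()
        return "ack"

    def flush(self):
        # Sort buffered blocks by offset so the chunk lands on disk sequentially.
        chunk = [self.pending[o] for o in sorted(self.pending)]
        self.flushed_chunks.append(chunk)
        self.pending.clear()

buf = WriteBuffer()
for off, payload in [(40, "d"), (0, "a"), (24, "c"), (8, "b")]:
    buf.write(off, payload)
# One chunk was flushed, ordered by offset: [["a", "b", "c", "d"]]
```

The payoff is that spinning disks only ever see large sequential writes, even when the application (Cassandra compactions, HDFS block writes) issues them out of order.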
Example: single write operation (policy: 3 copies; datacenter-agnostic)
1. Application sends the write to any storage cluster node (round-robin).
2. That cluster node writes the first aggregated blocks locally; the second copy is written to the first responding cluster node.
3. A checksummed ack is sent back to the big data node after a majority quorum of acks (two acks in the case of three copies).
4. The third copy is written semi-synchronously; it could also be synchronous if all servers are equidistant.
Copies land on SSD/flash and SAS/SATA tiers via the Hedvig Controller and Cluster software.
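The quorum behavior in this example can be sketched as follows: the chunk is checksummed, sent to three replicas, and the application is unblocked as soon as a majority (two of three) have acknowledged, with the third copy completing afterwards. A simplified, single-threaded sketch, not the actual replication protocol:

```python
# Sketch of the quorum write path: checksummed chunk goes to three replicas;
# the app is acked after a majority (2 of 3). Simplified and single-threaded,
# so the "semi-synchronous" third copy just completes later in the loop.
import hashlib

def write_with_quorum(chunk, replicas, copies=3, quorum=2):
    checksum = hashlib.sha256(chunk).hexdigest()   # CHECKSUMMED!
    acks = 0
    acked_to_app = False
    for replica in replicas[:copies]:
        replica.append((checksum, chunk))          # durably store one copy
        acks += 1
        if acks == quorum:
            acked_to_app = True                    # application unblocked here
    return acked_to_app, acks

replicas = [[], [], []]
ok, acks = write_with_quorum(b"chunk-1", replicas)
# ok is True once 2 acks arrive; all 3 replicas end up holding the copy.
```

Acking at quorum rather than at all three copies keeps write latency bounded by the two fastest replicas instead of the slowest one.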
2. Granular replication of data
[Diagram: a big data node writes granular data chunks into storage containers on each disk platter, via the Hedvig Controller and Cluster software.]
Chunks are distributed across all servers and containers in the storage cluster.
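Distributing chunks across all servers and containers is typically done by hashing each chunk's identity to a placement. A minimal illustrative sketch (real systems use consistent hashing plus the replication policy; the modulo scheme and names here are assumptions):

```python
# Sketch of chunk placement: hash the chunk id to pick a (server, container)
# pair, so chunks spread across the whole cluster rather than one disk.
# Simplified modulo placement, stated as an illustrative assumption.
import hashlib

def place_chunk(chunk_id, servers, containers_per_server=3):
    digest = int(hashlib.md5(chunk_id.encode()).hexdigest(), 16)
    server = servers[digest % len(servers)]
    container = digest // len(servers) % containers_per_server
    return server, container

servers = ["node-a", "node-b", "node-c", "node-d"]
placements = {place_chunk(f"chunk-{i}", servers) for i in range(100)}
# 100 chunks fan out across many (server, container) pairs.
```

Spreading chunks this widely is what later makes rebuilds and rebalancing a whole-cluster activity instead of a single-disk bottleneck.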
3. DR replication: datacenter-aware with 3 copies
Policy: datacenter-aware (one copy per DC); data copies: 3; sync acknowledgements: 2
[Diagram: Data Centers A, B, and C running active-active, each holding one copy via the Hedvig Controller and Cluster software.]
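The datacenter-aware part of the policy reduces to a placement rule: never put two of the three copies in the same DC. A minimal sketch of that rule, with hypothetical node and DC names:

```python
# Sketch of a DC-aware policy: 3 copies, at most one per datacenter.
# Node selection within a DC is simplified to "first node"; a real cluster
# would pick by load, capacity, or locality.
def dc_aware_placement(nodes_by_dc, copies=3):
    """Pick one node from each datacenter until `copies` replicas are chosen."""
    placement = []
    for dc, nodes in nodes_by_dc.items():
        if len(placement) == copies:
            break
        placement.append((dc, nodes[0]))
    if len(placement) < copies:
        raise ValueError("need at least as many DCs as copies for this policy")
    return placement

nodes_by_dc = {"DC-A": ["a1", "a2"], "DC-B": ["b1"], "DC-C": ["c1", "c2"]}
placement = dc_aware_placement(nodes_by_dc)
# → [("DC-A", "a1"), ("DC-B", "b1"), ("DC-C", "c1")]
```

With sync acknowledgements set to 2, a write survives the loss of an entire datacenter before the application even sees the ack.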
4. Disk failures and rebuilds
• Disks are managed in protection groups.
• Disk rebuilds initiate automatically upon disk failure, across the entire cluster.
• No spare disks needed.
• Quick wide-stripe rebuilds allow for the largest disks.
• Average 4TB disk rebuild time is under 20 minutes.
• Easily supports 6TB, 8TB, and 10TB drives.
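The "under 20 minutes for 4TB" claim follows from wide striping arithmetic: rebuilding onto a single spare is bounded by one disk's throughput, but when every node rebuilds a small slice in parallel, the time divides by the node count. A back-of-the-envelope sketch; the throughput figure is an illustrative assumption:

```python
# Back-of-the-envelope check on wide-stripe rebuilds. The ~130 MB/s per-node
# rebuild rate is an assumed, illustrative figure, not a vendor spec.
def rebuild_minutes(disk_tb, nodes, per_node_mb_s=130):
    total_mb = disk_tb * 1_000_000          # decimal TB -> MB
    return total_mb / (nodes * per_node_mb_s) / 60

single = rebuild_minutes(4, nodes=1)        # ~513 minutes (~8.5 hours)
wide = rebuild_minutes(4, nodes=30)         # ~17 minutes across 30 nodes
```

Under these assumptions, a 30-node cluster lands under the slide's 20-minute figure, and the same math is why ever-larger drives (6TB, 8TB, 10TB) stay practical.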
5. Thin provisioning, deduplication, and compression
• Thin provisioning for every virtual volume.
• Inline compression and deduplication.
• Global, system-wide deduplication: all attached storage nodes participate.
• 60-75% data reduction; dedupe rates vary based on data type.
• Dedupe cache can reside on Controller SSD/flash in the application server.
• Eliminates duplicate I/O from the network, dramatically lowering latency and increasing IOPS.
• Clone a non-deduped volume with dedupe enabled.
[Diagram: client-side SSD/flash dedupe read cache with dedupe map on the big data node + storage client; cluster SSD/flash read+write cache on the storage nodes.]
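The mechanics of inline, global dedupe can be sketched in a few lines: each chunk is fingerprinted by a content hash, unique chunks are stored once cluster-wide, and duplicate writes become metadata-only references. A simplified sketch with illustrative names:

```python
# Sketch of inline, global dedupe: chunks are fingerprinted by content hash
# and stored once; duplicate writes only add a reference. Structures are
# illustrative, not Hedvig's metadata layout.
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}     # fingerprint -> payload (stored once, globally)
        self.refs = []       # per-write metadata: the fingerprint written

    def write(self, payload):
        fp = hashlib.sha256(payload).hexdigest()
        duplicate = fp in self.chunks
        if not duplicate:
            self.chunks[fp] = payload      # only unique data hits disk
        self.refs.append(fp)               # duplicates are metadata-only
        return duplicate

store = DedupStore()
writes = [b"os-image"] * 3 + [b"user-data"]
dupes = sum(store.write(w) for w in writes)
# 4 writes, 2 unique chunks stored, 2 duplicate I/Os eliminated.
```

When the fingerprint map is cached client-side (the "dedupe map" in the diagram), a duplicate write never has to cross the network at all, which is where the latency and IOPS gains come from.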
6. Three ways Hedvig uses SSD and PCIe flash
• Client-side read and dedupe cache on the big data node.
• Primary storage as a dedicated volume on storage nodes, plus flash pinning.
• Read/write cache on storage nodes.
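The first bullet, a client-side flash read cache, can be sketched as an LRU cache keyed by chunk fingerprint, with a pinned set standing in for flash pinning (pinned chunks are never evicted). A simplified single-node sketch with illustrative names; the eviction policy is an assumption:

```python
# Sketch of a client-side flash read cache: LRU eviction models limited SSD
# capacity, and a "pinned" set models flash pinning. Illustrative only.
from collections import OrderedDict

class FlashReadCache:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.cache = OrderedDict()   # fingerprint -> data, in LRU order
        self.pinned = set()          # fingerprints exempt from eviction

    def get(self, fp, fetch_from_cluster):
        if fp in self.cache:
            self.cache.move_to_end(fp)       # refresh LRU position
            return self.cache[fp], "hit"
        data = fetch_from_cluster(fp)        # miss: read from storage cluster
        self.cache[fp] = data
        while len(self.cache) > self.capacity:
            for victim in self.cache:        # evict oldest unpinned entry
                if victim not in self.pinned:
                    del self.cache[victim]
                    break
            else:
                break                        # everything pinned; allow overflow
        return data, "miss"

cache = FlashReadCache(capacity=2)
fetch = lambda fp: "data-" + fp
_, first = cache.get("c1", fetch)    # first read misses, goes to the cluster
_, second = cache.get("c1", fetch)   # repeat read is served from flash
```

Reads of hot index or working-set data are then served locally from flash, which is the effect the law firm case study below relies on for fast index queries.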
3 case studies of SDS for big data
Three case studies
Fortune 100 bank
Deploying Cassandra and MongoDB for developers, with infrastructure self-provisioning for a DevOps model. Multiple NoSQL deployments were leading to islands of (elastic) storage and an inability to self-provision or plug into the bank's orchestration tools. Now building an elastic SDS cluster on commodity infrastructure to lower cost per bit and drive self-provisioning through RESTful APIs.
4th largest US law firm
Needs quick, reliable indexing of 100M active client docs in HP Autonomy. Needed a scale-out, flash-friendly solution to replace local SSDs, which were required to achieve sub-second index queries. Getting 6x performance with SDS versus a traditional hybrid array that included a flash tier; now has incremental commodity scalability.
Fortune 50 telecom
Seeks centralized, shared storage to virtualize 3 Hadoop distributions: Hortonworks, Cloudera, and MapR. Multiple Hadoop deployments were leading to islands of (also elastic) storage and blocking IT's virtualization-first policy. Now virtualizing all three Hadoop distributions and deploying SDS as the "data lake" for scale-out storage and global dedupe across data sets.
Thank you!