Why Scale-Out Big Data Apps Need a New Scale-Out Storage
Modern storage for modern business
Rob Whiteley, VP, Marketing, Hedvig
April 9, 2015
Agenda
• Big data pressures on storage infrastructure
• The rise of elastic software-defined storage (SDS)
• 6 SDS capabilities for big data
• 3 case studies of SDS for big data
Copyright 2015 Hedvig Inc. Confidential.
Big data pressures on storage infrastructure
Big data requires flexible infrastructure
[Diagram: business executives drive big data initiatives, developers demand time-to-market, and IT infrastructure & DevOps must deliver flexible infrastructure.]
According to Forrester . . .
• Enterprise data is growing 10x faster than storage budgets.
• 58% of orgs take days, weeks, or months to provision storage.
• 14% of orgs have "cloud-like" provisioning capabilities.
Source: Forrester Technology Adoption Profile: Meet Evolving Business Demands With Software-Defined Storage, March 2015. Visit hedviginc.com for the full research report.
Three truths and a lie about storage & big data
• Software-defined storage is the right direction.
• Hyperconverged provides the best economics.
• Big data apps are repeating the sins of the 90s.
• Hyperscale helps virtualize Hadoop and NoSQL.
The rise of elastic software-defined storage (SDS)
A big data inflection point in storage
[Chart: price/performance vs. storage capabilities, progressing from traditional to scale-up to scale-out to elastic architectures]

The big data software storage inflection point

                         Before (hardware-defined)    After (software-defined)
Storage flavors          Scale-out                    Elastic
Storage features         High-availability + RAID     Distributed + replication
Deployment flexibility   Hyperconverged               Hyperconverged + Hyperscale
Three legs to the big data requirements stool
[Diagram positioning monolithic arrays, virtual SANs, hyperconverged, and software-defined storage against the big data requirements.]
Hyperscale vs. hyperconverged
[Diagram: In a hyperscale deployment, Linux, hypervisor, and Windows hosts in the Hadoop/NoSQL cluster run a storage client and talk to a separate storage cluster of storage nodes. In a hyperconverged deployment, the Hadoop/NoSQL cluster and the storage cluster share the same nodes. Storage nodes can span DC1, DC2, and Cloud3.]
How SDS provides elastic storage for big data
Big data hosts connect to the storage cluster via iSCSI, NFS, or object interfaces.
1. Admin provisions virtual volumes and scripts or applies storage policies.
2. Virtual volumes present block, file, and object storage to big data hosts.
3. Storage client captures guest I/O and communicates with the underlying cluster.
4. Cluster distributes and replicates data, and applies compression and dedupe.
5. Cluster auto-tiers and balances to optimize data locality and availability.
Storage cluster nodes are commodity x86 or ARM servers.
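Step 1 above, provisioning a virtual volume with an attached storage policy, can be sketched as plain data that an admin could script or push through a REST API. This is a minimal illustrative model; the field and class names here are assumptions for the sketch, not Hedvig's actual API.

```python
# Hypothetical sketch of volume provisioning: a volume plus a storage policy
# expressed as a JSON-ready payload. Field names are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass
class StoragePolicy:
    replication_factor: int = 3
    dc_aware: bool = True          # one copy per datacenter
    compression: bool = True
    dedupe: bool = True

@dataclass
class VirtualVolume:
    name: str
    size_gb: int
    protocol: str                  # "block" (iSCSI), "file" (NFS), or "object"
    policy: StoragePolicy

vol = VirtualVolume("cassandra-01", size_gb=2048, protocol="block",
                    policy=StoragePolicy())
payload = asdict(vol)              # nested dataclasses become a plain dict,
                                   # ready to serialize as a provisioning call
```

Because the policy travels with the volume as data, the same definition works interactively, in scripts, or from orchestration tools, which is the self-provisioning pattern the case studies later describe.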
6 SDS capabilities for big data
6 big-data-friendly SDS capabilities
1. I/O sequentialization
2. Tunable replication
3. DR replication
4. Disk failures and rebuilds
5. Data efficiency methods
6. Flash caching & flash pinning
1. Random I/O to sequential writes
1. Application writes data in random blocks and gets an immediate ack from the cluster.
2. Storage cluster sequentializes incoming blocks (in RAM + SSD) into larger chunks.
3. Storage cluster writes the larger sequentialized chunks to the underlying disks in an auto-balanced, auto-distributed manner according to policy.
(The big data node runs the storage client; the cluster runs the storage nodes.)
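The three steps above can be sketched as a small write buffer: random writes are acked immediately, accumulated in memory, and flushed as one larger, offset-sorted chunk. A minimal sketch with illustrative names and thresholds, not Hedvig's implementation:

```python
# Sketch of write sequentialization: random application writes are buffered
# (as the cluster does in RAM/SSD) and flushed to disk as one larger,
# offset-sorted sequential chunk. Threshold and structure are illustrative.
class WriteBuffer:
    def __init__(self, flush_threshold=4):
        self.pending = {}            # offset -> block payload
        self.flushed_chunks = []     # each flush emits one sequential chunk
        self.flush_threshold = flush_threshold

    def write(self, offset, data):
        """Accept a random write, ack immediately; flush when the buffer fills."""
        self.pending[offset] = data  # latest write to an offset wins
        if len(self.pending) >= self.flush_threshold:
            self.flush()
        return "ack"

    def flush(self):
        # Sort buffered blocks by offset so the chunk lands on disk sequentially.
        chunk = [self.pending[o] for o in sorted(self.pending)]
        self.flushed_chunks.append(chunk)
        self.pending.clear()

buf = WriteBuffer()
for off, payload in [(40, "d"), (0, "a"), (24, "c"), (8, "b")]:
    buf.write(off, payload)
# One chunk was flushed, ordered by offset: [["a", "b", "c", "d"]]
```

The payoff is that spinning disks only ever see large sequential writes, even when the application (Cassandra compactions, HDFS block writes) issues them out of order.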
Example: single write operation (policy: 3 copies; datacenter-agnostic)
1. Application sends the write to any storage cluster node (round-robin).
2. That cluster node writes the first aggregated blocks locally; the second copy is written to the first responding cluster node.
3. A checksummed ack is sent back to the big data node after a majority quorum of acks (two acks in the case of three copies).
4. The third copy is written semi-synchronously; it could also be synchronous if all servers are equidistant.
Copies land on SSD/flash and SAS/SATA tiers via the Hedvig Controller and Cluster software.
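The quorum behavior in this example can be sketched as follows: the chunk is checksummed, sent to three replicas, and the application is unblocked as soon as a majority (two of three) have acknowledged, with the third copy completing afterwards. A simplified, single-threaded sketch, not the actual replication protocol:

```python
# Sketch of the quorum write path: checksummed chunk goes to three replicas;
# the app is acked after a majority (2 of 3). Simplified and single-threaded,
# so the "semi-synchronous" third copy just completes later in the loop.
import hashlib

def write_with_quorum(chunk, replicas, copies=3, quorum=2):
    checksum = hashlib.sha256(chunk).hexdigest()   # CHECKSUMMED!
    acks = 0
    acked_to_app = False
    for replica in replicas[:copies]:
        replica.append((checksum, chunk))          # durably store one copy
        acks += 1
        if acks == quorum:
            acked_to_app = True                    # application unblocked here
    return acked_to_app, acks

replicas = [[], [], []]
ok, acks = write_with_quorum(b"chunk-1", replicas)
# ok is True once 2 acks arrive; all 3 replicas end up holding the copy.
```

Acking at quorum rather than at all three copies keeps write latency bounded by the two fastest replicas instead of the slowest one.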
2. Granular replication of data
[Diagram: a big data node writes granular data chunks into storage containers on each disk platter, via the Hedvig Controller and Cluster software.]
Chunks are distributed across all servers and containers in the storage cluster.
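Distributing chunks across all servers and containers is typically done by hashing each chunk's identity to a placement. A minimal illustrative sketch (real systems use consistent hashing plus the replication policy; the modulo scheme and names here are assumptions):

```python
# Sketch of chunk placement: hash the chunk id to pick a (server, container)
# pair, so chunks spread across the whole cluster rather than one disk.
# Simplified modulo placement, stated as an illustrative assumption.
import hashlib

def place_chunk(chunk_id, servers, containers_per_server=3):
    digest = int(hashlib.md5(chunk_id.encode()).hexdigest(), 16)
    server = servers[digest % len(servers)]
    container = digest // len(servers) % containers_per_server
    return server, container

servers = ["node-a", "node-b", "node-c", "node-d"]
placements = {place_chunk(f"chunk-{i}", servers) for i in range(100)}
# 100 chunks fan out across many (server, container) pairs.
```

Spreading chunks this widely is what later makes rebuilds and rebalancing a whole-cluster activity instead of a single-disk bottleneck.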
3. DR replication: datacenter-aware with 3 copies
Policy: datacenter-aware (one copy per DC); data copies: 3; sync acknowledgements: 2
[Diagram: Data Centers A, B, and C running active-active, each holding one copy via the Hedvig Controller and Cluster software.]
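The datacenter-aware part of the policy reduces to a placement rule: never put two of the three copies in the same DC. A minimal sketch of that rule, with hypothetical node and DC names:

```python
# Sketch of a DC-aware policy: 3 copies, at most one per datacenter.
# Node selection within a DC is simplified to "first node"; a real cluster
# would pick by load, capacity, or locality.
def dc_aware_placement(nodes_by_dc, copies=3):
    """Pick one node from each datacenter until `copies` replicas are chosen."""
    placement = []
    for dc, nodes in nodes_by_dc.items():
        if len(placement) == copies:
            break
        placement.append((dc, nodes[0]))
    if len(placement) < copies:
        raise ValueError("need at least as many DCs as copies for this policy")
    return placement

nodes_by_dc = {"DC-A": ["a1", "a2"], "DC-B": ["b1"], "DC-C": ["c1", "c2"]}
placement = dc_aware_placement(nodes_by_dc)
# → [("DC-A", "a1"), ("DC-B", "b1"), ("DC-C", "c1")]
```

With sync acknowledgements set to 2, a write survives the loss of an entire datacenter before the application even sees the ack.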
4. Disk failures and rebuilds
• Disks are managed in protection groups.
• Disk rebuilds initiate automatically upon disk failure, across the entire cluster.
• No spare disks needed.
• Quick wide-stripe rebuilds allow for the largest disks.
• Average 4TB disk rebuild time is under 20 minutes.
• Easily supports 6TB, 8TB, and 10TB drives.
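The "under 20 minutes for 4TB" claim follows from wide striping arithmetic: rebuilding onto a single spare is bounded by one disk's throughput, but when every node rebuilds a small slice in parallel, the time divides by the node count. A back-of-the-envelope sketch; the throughput figure is an illustrative assumption:

```python
# Back-of-the-envelope check on wide-stripe rebuilds. The ~130 MB/s per-node
# rebuild rate is an assumed, illustrative figure, not a vendor spec.
def rebuild_minutes(disk_tb, nodes, per_node_mb_s=130):
    total_mb = disk_tb * 1_000_000          # decimal TB -> MB
    return total_mb / (nodes * per_node_mb_s) / 60

single = rebuild_minutes(4, nodes=1)        # ~513 minutes (~8.5 hours)
wide = rebuild_minutes(4, nodes=30)         # ~17 minutes across 30 nodes
```

Under these assumptions, a 30-node cluster lands under the slide's 20-minute figure, and the same math is why ever-larger drives (6TB, 8TB, 10TB) stay practical.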
5. Thin provisioning, deduplication, and compression
• Thin provisioning for every virtual volume.
• Inline compression and deduplication.
• Global, system-wide deduplication: all attached storage nodes participate.
• 60-75% data reduction; dedupe rates vary based on data type.
• Dedupe cache can reside on Controller SSD/flash in the application server.
• Eliminates duplicate I/O from the network, dramatically lowering latency and increasing IOPS.
• Clone a non-deduped volume with dedupe enabled.
[Diagram: client-side SSD/flash dedupe read cache with dedupe map on the big data node + storage client; cluster SSD/flash read+write cache on the storage nodes.]
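The mechanics of inline, global dedupe can be sketched in a few lines: each chunk is fingerprinted by a content hash, unique chunks are stored once cluster-wide, and duplicate writes become metadata-only references. A simplified sketch with illustrative names:

```python
# Sketch of inline, global dedupe: chunks are fingerprinted by content hash
# and stored once; duplicate writes only add a reference. Structures are
# illustrative, not Hedvig's metadata layout.
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}     # fingerprint -> payload (stored once, globally)
        self.refs = []       # per-write metadata: the fingerprint written

    def write(self, payload):
        fp = hashlib.sha256(payload).hexdigest()
        duplicate = fp in self.chunks
        if not duplicate:
            self.chunks[fp] = payload      # only unique data hits disk
        self.refs.append(fp)               # duplicates are metadata-only
        return duplicate

store = DedupStore()
writes = [b"os-image"] * 3 + [b"user-data"]
dupes = sum(store.write(w) for w in writes)
# 4 writes, 2 unique chunks stored, 2 duplicate I/Os eliminated.
```

When the fingerprint map is cached client-side (the "dedupe map" in the diagram), a duplicate write never has to cross the network at all, which is where the latency and IOPS gains come from.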
6. Three ways Hedvig uses SSD and PCIe flash
• Client-side read and dedupe cache on the big data node.
• Primary storage as a dedicated volume on storage nodes, plus flash pinning.
• Read/write cache on storage nodes.
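The first bullet, a client-side flash read cache, can be sketched as an LRU cache keyed by chunk fingerprint, with a pinned set standing in for flash pinning (pinned chunks are never evicted). A simplified single-node sketch with illustrative names; the eviction policy is an assumption:

```python
# Sketch of a client-side flash read cache: LRU eviction models limited SSD
# capacity, and a "pinned" set models flash pinning. Illustrative only.
from collections import OrderedDict

class FlashReadCache:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.cache = OrderedDict()   # fingerprint -> data, in LRU order
        self.pinned = set()          # fingerprints exempt from eviction

    def get(self, fp, fetch_from_cluster):
        if fp in self.cache:
            self.cache.move_to_end(fp)       # refresh LRU position
            return self.cache[fp], "hit"
        data = fetch_from_cluster(fp)        # miss: read from storage cluster
        self.cache[fp] = data
        while len(self.cache) > self.capacity:
            for victim in self.cache:        # evict oldest unpinned entry
                if victim not in self.pinned:
                    del self.cache[victim]
                    break
            else:
                break                        # everything pinned; allow overflow
        return data, "miss"

cache = FlashReadCache(capacity=2)
fetch = lambda fp: "data-" + fp
_, first = cache.get("c1", fetch)    # first read misses, goes to the cluster
_, second = cache.get("c1", fetch)   # repeat read is served from flash
```

Reads of hot index or working-set data are then served locally from flash, which is the effect the law firm case study below relies on for fast index queries.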
3 case studies of SDS for big data
Three case studies
Fortune 100 bank
Deploying Cassandra and MongoDB for developers, with infrastructure self-provisioning for a DevOps model. Multiple NoSQL deployments were leading to islands of (elastic) storage and an inability to self-provision or plug into the bank's orchestration tools. Now building an elastic SDS cluster on commodity infrastructure to lower cost per bit and drive self-provisioning through RESTful APIs.
4th largest US law firm
Needs quick, reliable indexing of 100M active client docs in HP Autonomy. Needed a scale-out, flash-friendly solution to replace local SSDs, which were required to achieve sub-second index queries. Getting 6x performance with SDS versus a traditional hybrid array that included a flash tier; now has incremental commodity scalability.
Fortune 50 telecom
Seeks centralized, shared storage to virtualize 3 Hadoop distributions: Hortonworks, Cloudera, and MapR. Multiple Hadoop deployments were leading to islands of (also elastic) storage and blocking IT's virtualization-first policy. Now virtualizing all three Hadoop distributions and deploying SDS as the "data lake" for scale-out storage and global dedupe across data sets.
Thank you!