
Solving Big Data Problems: Storage to the Rescue?

John Webster

Evaluator Group

2015 Data Storage Innovation Conference.

Agenda

Big Data Analytics Storage Maxims

The Fundamental JBOD and DAS Architecture

Overview of Disk-based Alternatives

What are the Advantages and Disadvantages?

The Solid State and In-memory Alternatives

Summary and Q&A

Note: References to specific vendors and products are used as real-world examples and do not imply an endorsement

04/15/23 2


Big Data Storage Maxim #1

Deliver storage performance at large scale and at low cost, all at the same time (think early-stage Google, Facebook, Twitter)




Minimize the “distance” between processing and data storage
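One way to picture this maxim is locality-aware scheduling: run each task on a node that already holds the data block it needs, and fall back to a remote read only when no such node is free. The following is a minimal Python sketch under that assumption; the block map, the node names, and the `place_task` helper are hypothetical and are not taken from Hadoop's scheduler.

```python
from typing import Dict, List, Set, Tuple


def place_task(block: str,
               replicas: Dict[str, List[str]],
               free_nodes: Set[str]) -> Tuple[str, bool]:
    """Pick a node for a task that reads `block`.

    Returns (node, is_local). Prefers a free node that already stores a
    replica of the block; otherwise falls back to any free node, which
    implies a read over the network.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in replicas.get(block, []):
        if node in free_nodes:
            return node, True            # data-local: no network transfer
    # No replica holder is free: choose deterministically among free nodes.
    return sorted(free_nodes)[0], False  # remote read required


# Hypothetical block-to-node map (illustrative names only).
replicas = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
}

print(place_task("blk_1", replicas, {"node2", "node4"}))  # ('node2', True)
print(place_task("blk_2", replicas, {"node1"}))           # ('node1', False)
```

The second call shows the cost of "distance": the only free node holds no replica, so the block has to cross the network before processing can start.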




Big Data analytics is dominated by open source




Big Data analytics software developers manage data at the clustered server level. Storage vendors manage data at the storage system level.



Shared Nothing, Asymmetrical Distributed Computing

[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n, each with its own DAS, connected through an Ethernet switch]

Network Layer: 1 Gb Ethernet

Compute Layer: Commodity Servers

Storage Layer: 6-12 disks in each server, typically JBOD

Scale to thousands of nodes

Only the Ethernet network is shared

In Hadoop, Control = Name Node; Node 1,2… = Data Node


Apache Hadoop: A Platform for All Applications?

Presentation & Application: Enable both existing and new applications to provide value to the organization

Operations: Empower existing operations and security tools to manage Hadoop

Data Access

Metadata Management: HCatalog

Batch, Online, Real-Time, In-Memory, Others, SQL, Script

MapReduce, Pig, Hive, HBase, Accumulo, Storm, Spark

Data Management

Multitenant Processing: YARN (Hadoop Operating System)

Storage: HDFS (Hadoop Distributed File System)

Data Integration & Governance

Data Workflow, Data Lifecycle: Falcon

Real-time and Batch Ingest: Flume, Sqoop, WebHDFS, NFS

Security

Authentication, Authorization, Accountability, Data Protection across:

Storage: HDFS

Resources: YARN

Access: Hive,…

Pipeline: Falcon

Cluster: Knox

Operations

Provision, Manage & Monitor: Ambari

Scheduling: Oozie

Environment

Linux, Windows

On Premise, Virtualize

Commodity HW, Appliance

Cloud/Hosted

Source: Hortonworks


HDFS as a Persistent Storage Layer

Advantages

Storage performance at large scale and low cost

Minimize distance between data and compute

Node failures tolerated

Open Source

Disadvantages

Hadoop NameNode lacks active/active failover (i.e. it’s a SPOF)

For data integrity and protection, HDFS creates three full clone copies of data

3x the storage for each file – slow and inefficient

If all three copies are corrupted, you’re still hosed (reload and start over)

No storage tiering (recognition of different storage types now available in 2.3)

Limited ways to respond to corporate security and data governance policies

Data in/out processes can take longer than the actual query process

What is the single source of the truth?

Inability to dis-aggregate storage from compute so that the two can be scaled independently
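The 3x storage cost noted above is simple arithmetic. A minimal sketch, assuming a replication factor of 3 (the default value of HDFS's `dfs.replication` setting) and a hypothetical cluster size chosen only for illustration:

```python
def usable_capacity_tb(nodes: int, disks_per_node: int, disk_tb: float,
                       replication: int = 3) -> float:
    """Usable capacity when every block is stored `replication` times.

    Ignores filesystem overhead and space reserved for temporary data.
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replication


# Hypothetical cluster: 100 nodes, 12 disks each, 4 TB per disk.
raw = 100 * 12 * 4.0
usable = usable_capacity_tb(100, 12, 4.0)
print(f"raw: {raw:.0f} TB, usable at 3x replication: {usable:.0f} TB")
# raw: 4800 TB, usable at 3x replication: 1600 TB
```

Two thirds of the purchased disk capacity in this example goes to copies rather than to new data.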


Shared Storage as Primary Storage

[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n in the compute layer, a network layer, and a shared storage layer]

Network Layer

Compute Layer

Storage Layer: SAN or NAS, but more commonly Scale-out NAS



Shared Storage as Secondary Storage

[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n in the compute layer, a network layer, and a storage layer]

Network Layer

Compute Layer

Storage Layer

SAN/NAS/Object Storage


Hadoop On Scale-out Storage

Scale-out storage replaces node-level DAS

HDFS implemented as “over the wire” protocol or CDMI interface to underlying FS

NameNode SPOF eliminated

Decoupled storage and compute layers

Data services, data protection, and DR by storage-resident services

Examples include EMC Isilon, IBM Elastic Storage, Ceph
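To see why decoupling the storage and compute layers matters, compare how many servers a workload needs when each node must supply both capacity and compute with how many it needs when the two are sized separately. This is a rough sketch; all per-node figures are hypothetical, and the functions are illustrative helpers rather than part of any product.

```python
import math


def coupled_nodes(data_tb: float, cores_needed: int,
                  tb_per_node: float, cores_per_node: int) -> int:
    """Nodes needed when storage and compute scale together (DAS model).

    The cluster must be sized for whichever resource runs out first.
    """
    for_storage = math.ceil(data_tb / tb_per_node)
    for_compute = math.ceil(cores_needed / cores_per_node)
    return max(for_storage, for_compute)


def decoupled_compute_nodes(cores_needed: int, cores_per_node: int) -> int:
    """Compute nodes needed when capacity lives on shared storage."""
    return math.ceil(cores_needed / cores_per_node)


# Hypothetical storage-heavy workload: 2,000 TB of data, 320 cores.
print(coupled_nodes(2000, 320, tb_per_node=16, cores_per_node=16))  # 125
print(decoupled_compute_nodes(320, cores_per_node=16))              # 20
```

In the coupled case the node count is driven entirely by capacity. In the decoupled case the capacity still has to be purchased, but as storage rather than as full servers.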



Shared Primary/Secondary Storage

Advantages

Addresses the enterprise storage management requirements

Data protection/disaster recovery/business continuance

Data governance/compliance/archiving

Single source of the truth

Disadvantages

Additional cost

Potential performance impact

Using a vendor specific solution introduces proprietary data/storage management software



What About SSD?

[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n, each with its own DAS, connected through an Ethernet switch]

Network Layer: 10+ Gb Ethernet

Compute Layer: Commodity Servers

Storage Layer: SSD in/attached to each server

Scale to thousands of nodes

Only the Ethernet network is shared

In Hadoop, Control = Name Node; Node 1,2… = Data Node


What About SSD?

[Diagram: a CONTROL node and NODE 1, NODE 2, NODE 3 ... NODE n in the compute layer, a network layer, and a shared storage layer]

Network Layer

Compute Layer

Storage Layer: Scale-out Flash Storage


What About In-Memory Computing?

Tachyon

UC Berkeley AMPLab project

“Reliable, memory-centric storage for Big Data Analytics clusters” (i.e. memory as persistent data store across cluster nodes)

One in-memory data copy inside the JVM; operation “lineage” is used to recompute data after a failure

Initial use in Apache Spark environments
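The lineage idea can be sketched in a few lines: keep one in-memory copy of each dataset plus a record of the operation and inputs that produced it, and replay that operation if the copy is lost. The class below is a toy illustration of the concept only; it is not Tachyon's API, and every name in it is hypothetical.

```python
from typing import Any, Callable, Dict, Tuple


class LineageStore:
    """Toy in-memory store that recomputes lost data from its lineage."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        # name -> (function, names of input datasets)
        self._lineage: Dict[str, Tuple[Callable[..., Any], Tuple[str, ...]]] = {}

    def put_source(self, name: str, value: Any) -> None:
        """Register source data (assumed reloadable from durable storage)."""
        self._data[name] = value

    def derive(self, name: str, fn: Callable[..., Any], *inputs: str) -> None:
        """Compute a dataset and remember how it was produced."""
        self._lineage[name] = (fn, inputs)
        self._data[name] = fn(*(self.get(i) for i in inputs))

    def lose(self, name: str) -> None:
        """Simulate a failure that drops the single in-memory copy."""
        self._data.pop(name, None)

    def get(self, name: str) -> Any:
        """Return the dataset, replaying its lineage if the copy is gone."""
        if name not in self._data:
            if name not in self._lineage:
                raise KeyError(f"{name} lost and has no lineage")
            fn, inputs = self._lineage[name]
            self._data[name] = fn(*(self.get(i) for i in inputs))
        return self._data[name]


store = LineageStore()
store.put_source("raw", [1, 2, 3, 4])
store.derive("squares", lambda xs: [x * x for x in xs], "raw")
store.derive("total", sum, "squares")

store.lose("squares")
store.lose("total")
print(store.get("total"))  # 30, recomputed from "raw" via the lineage
```

The trade-off this illustrates is that a single copy saves memory compared with replication, at the price of recomputation time when a copy is lost.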



What About In-memory Computing?

Apache Ignite

In-memory “data fabric”

Distributed in-memory platform for computing and transacting on large-scale data sets in real-time

“Orders of magnitude faster than possible with traditional disk-based or flash technologies.”

Tier -1 storage?

Originated as GridGain Data Fabric

In-Memory Computing Summit 6/29-30 imcsummit.org



Summary and Q&A

The need for a longer-term, persistent storage layer is now recognized

For Hadoop, HDFS may or may not be that storage layer

Enterprise storage architects and administrators will be more directly involved in managing Big Data analytics storage over time

Now is the time to research and understand the options

