modern infrastructure for business data lake

© Copyright 2015 EMC Corporation. All rights reserved.

Upload: emc

Post on 06-Aug-2015


Category:

Business



TRANSCRIPT

Page 1: Modern infrastructure for business data lake

Modern Infrastructure for Business Data Lake

Page 2: Modern infrastructure for business data lake

Scale-out Converged Solutions for Analytics

Julianna DeLua, VCE

Dan Beres, EMC Isilon

Page 3: Modern infrastructure for business data lake


AGENDA

History of Analytic Infrastructure

Why Scale-Out, Converged Solutions

Analytic Workflow vHadoop Test Results

Customer Use Cases and Feedback

Conclusion / Next Steps

Page 4: Modern infrastructure for business data lake

A Brief History of Analytic Infrastructure

Page 5: Modern infrastructure for business data lake


VCE Confidential© 2015 VCE Company, LLC. All rights reserved.

2013 – Shared infrastructure? Let me know when you know “for sure” it works. In the meantime, a few industry pioneers / early adopters start POCs with EMC / VCE

2014 – Extend converged system benefits with Isilon scale-out – augment enterprise app/data with Hadoop, Splunk, NoSQL. Great performance!

2015 – Internet of Things initiatives accelerate. Rapid technological advancements with architectural flexibility - Vscale

Page 6: Modern infrastructure for business data lake

The Private/Public Cloud: “Infinite, inexpensive compute and storage”

ENABLED BY

Agile Product Development Culture

ANATOMY OF A MODERN DIGITAL BUSINESS

CAPABILITIES NEEDED

BUSINESS DRIVERS

• New systems of engagement
• New business models
• Internet of Things

Platform

Data, Algorithms (Code)

“Catch people or things in the act and affect the outcome”

= $$$

Compelling, Unique User Experience/Model

Existing Systems

A MAJOR PRESSING CHALLENGE

Analytics/BI

• How do we architect for agile data-driven business?

• Can we manage big, fast data?

• Value driven

CIO

• Meet future business needs while simplifying and taking cost out of legacy?

• Avoid lock-in again?

• People and organization

CEO/CMO

• How do we become an agile, digital business?

• Anticipate and delight customers?

• Partner collaboration

• Where/how do we start?


Page 7: Modern infrastructure for business data lake


Sub-optimal environment—data locked in high volume, variety, or velocity.

Lack of service-enablement—difficulties in optimizing virtualized, multi-tenant service approach.

Compliance/security exposure—lack of encryption, exposure, and data loss.

Limited standardization—not using data center standards.

Downtime/SLA issues—not readily configurable to handle mixed workloads.

System utilization—inefficient islands of storage and systems, inability to reuse data for multiple solutions.

Long cycles for accessing and sharing information locked in unstructured data.

Cannot rapidly create value via technology-enabled XaaS.

Explicitly demonstrate security, compliance, and governance.

Inability to plan a system progression that combines structured and unstructured data, exacerbating silos of appliances and hardware.

Insufficient posture against outages and peak periods of IT use.

Escalating deployment management and maintenance costs for growing data.

CUSTOMER PAINS TECHNICAL PROBLEMS

Typical Customer Pains and Technical Problems

Page 8: Modern infrastructure for business data lake

CONVERSATIONS LEAD TO PLATFORM EVOLUTION

Conversations

Downtime and response time issues missing business SLA

• Increased flash use

• Continuous need for migration

• Network scale points

• Data mobility

• Hadoop, Splunk, PaaS, Cassandra, MongoDB, Legacy DB

• Aggregate/disaggregate pool of resources

• Control required for application proliferation

Faster time to drive value from innovation

multitude of applications

• Mobile and social offers

• Turn 360 degree insight to customer acquisitions

• Fulfillment, inventory and customer management

AWS is costing too much but business wants faster go live and flexibility

Page 9: Modern infrastructure for business data lake

VCE VSCALE™ ARCHITECTURE: FLEXIBLE SCALE-OUT THROUGH EXPANDED MULTI-SYSTEM ARCHITECTURE

VCE VSCALE™ FABRIC

Diagram labels (workloads attached to the fabric):

MPP DB

Hadoop PROD & DR

In memory DB

BI / DW

Enterprise App - SAP

Microsoft Email, collaboration

Hadoop POC

Pivotal Cloud Foundry

Video Surveillance

Page 10: Modern infrastructure for business data lake

Edge & Central Analytics Workflow

Isilon OneFS
• Easy to Grow, Manage & Administer
• Additional Clients to More Content
• Multiprotocol Access to Same Data

Diagram labels: Swift, HTTP, RAN, DAV, FTP, HDFS, NFS, SMB, SyncIQ, Glance, Log, OneFS, Oracle, Mediation, App Server, External WAN, Internal WAN

Page 11: Modern infrastructure for business data lake


vHadoop+Isilon Install & Deployment Guide

Page 12: Modern infrastructure for business data lake

“Fix These Problems….Prove it Out!”

Expensive and Won’t Scale
– Hundreds of Servers to support less than 2PB Usable Storage (1:7 ratio)
– “We have a guy with shopping carts walking down the rows replacing parts”
– Additional Staging Area for Data before Ingesting into Hadoop
– Can’t Scale Storage without Compute – Locked & Not Elastic

Lacks Enterprise Features
– No Cost Effective Data Redundancy
– Limited File-system Security, only Simple Authentication
– Multiple Points of Failure
– Maintaining Hadoop “PODs” involves significant downtime

Time To Results
– Requires Significant time to ingest and copy Data
– Building Production Hadoop “PODs” can take months
– Network Infrastructure Saturation & Expense

Page 13: Modern infrastructure for business data lake

EMC Isilon Enabled Workflows

Diagram labels: NFS, SMB, SWIFT, HDFS, RAN, FTP

Page 14: Modern infrastructure for business data lake

EMC Isilon Enabled Hadoop

Diagram labels: HDFS, SMB, NFS, HTTP, FTP; nodeinfo, Nodereply, file; name node, data node, MAPReduce; Isilon, OneFS, Original Data, 1X; Name node, Data, Compute

Page 15: Modern infrastructure for business data lake

Hadoop Test Findings

Created and tuned Hadoop VMs to maximize Throughput
– >90% Utilization of CPUs for Compute
– Memory footprint reduced (MEM Page sharing across VMs)
– Hadoop 2.0 with YARN does not need FLASH for HDFS

Incremental testing to validate Scalability
– Validated 2:1 ratio Compute Node to Isilon Node (can also support 3:1)
– 2 VMs per Compute Node for Optimal Performance on Dual Socket
– Linear Scalability in performance by incrementally adding more compute

Validated Enterprise/Production Ready
– Security Greater with AD Authorization and Access
– No need to anonymize data
– Whitepaper Created
– Deployment & Upgrade Of Hardware and Software in hours not days/weeks
– Validated reduced data-center footprint & environmentals with UCS Blade Servers, vHadoop & Isilon
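The validated ratios in the test findings (2:1 or 3:1 compute nodes per Isilon node, 2 VMs per compute node) can be turned into a quick sizing calculation. The following is a minimal sketch only: the `ClusterSize` type and `size_cluster` function are hypothetical names introduced for illustration, and the slides do not describe any sizing tool.

```python
from dataclasses import dataclass


@dataclass
class ClusterSize:
    isilon_nodes: int
    compute_nodes: int
    hadoop_vms: int


def size_cluster(isilon_nodes: int,
                 compute_per_isilon: int = 2,
                 vms_per_compute: int = 2) -> ClusterSize:
    """Apply the ratios quoted in the test findings to an Isilon node count.

    compute_per_isilon: 2 (validated) or 3 (stated as also supported).
    vms_per_compute: 2 VMs per dual-socket compute node, per the findings.
    """
    if isilon_nodes < 1:
        raise ValueError("need at least one Isilon node")
    if compute_per_isilon not in (2, 3):
        raise ValueError("the findings only mention 2:1 and 3:1 ratios")
    compute_nodes = isilon_nodes * compute_per_isilon
    return ClusterSize(isilon_nodes, compute_nodes,
                       compute_nodes * vms_per_compute)


if __name__ == "__main__":
    print(size_cluster(4))                        # 2:1 ratio
    print(size_cluster(4, compute_per_isilon=3))  # 3:1 ratio
```

With four Isilon nodes, the 2:1 ratio gives 8 compute nodes and 16 Hadoop VMs; the 3:1 ratio gives 12 compute nodes and 24 VMs.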

Page 16: Modern infrastructure for business data lake

1TB Hadoop Job Cycle Comparison: Isilon Significantly Reduces Time To Results

Traditional Hadoop+DAS: 17:32, 30:18, 20:50, 20:50 (TTR: 89.3 Minutes!)

Isilon Enabled vHadoop: 18:51

Terasort Test on 1TB    DAS      Isilon   Benefit
MB/s Per Node           55.00    85.00    55%
Compute Min             30.18    18.51    -39%
TTR Min                 89.30    18.51    -79%

Isilon Advantages
• Eliminates All Data Movement
• Allows for Virtualized Compute
• Significantly Less Cost
• 79% Faster TTR!
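The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below sums the four Traditional Hadoop+DAS phase times and recomputes the Benefit column. It assumes the "Min" values in the table are treated as plain decimals, which is what reproduces the slide's percentages. The helper names are illustrative only.

```python
def mmss_to_seconds(value: str) -> int:
    """Convert a 'mm:ss' string to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_change(baseline: float, new: float) -> int:
    """Relative change from baseline to new, rounded to a whole percent."""
    return round((new - baseline) / baseline * 100)


# Phase times shown for Traditional Hadoop+DAS on the slide.
das_phases = ["17:32", "30:18", "20:50", "20:50"]
das_total = sum(mmss_to_seconds(p) for p in das_phases)
das_minutes, das_seconds = divmod(das_total, 60)

# Benefit column, recomputed from the DAS and Isilon columns.
throughput_benefit = percent_change(55.00, 85.00)
compute_benefit = percent_change(30.18, 18.51)
ttr_benefit = percent_change(89.30, 18.51)

if __name__ == "__main__":
    print(f"DAS time to results: {das_minutes}:{das_seconds:02d}")
    print(f"MB/s per node: {throughput_benefit:+d}%")
    print(f"Compute time:  {compute_benefit:+d}%")
    print(f"TTR:           {ttr_benefit:+d}%")
```

The four DAS phases sum to 89:30, matching the 89.3-minute TTR callout, and the recomputed changes are +55%, -39% and -79%.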

Page 17: Modern infrastructure for business data lake

EMC Isilon – Only Security Compliant Datastore for Hadoop

Highly resilient architecture
– Robust data protection options (DR, Snapshots, SyncIQ)
– Clustered Multi-Point Name Node with Kerberos
– SEC 17a-4 compliant WORM
– Hadoop multi-tenancy with dedicated network and access zones

Hadoop on Isilon provides full ACLs for NFS, SMB, and HDFS
– Each file/directory has an Access Control List (ACL) consisting of one or more Access Control Entries (ACE).
– Each ACE assigns a set of permissions (read, write, delete) to a specific security identifier (user or group).
– Deny ACEs, which remove permissions, override any “Allow ACEs”

Standard Hadoop only provides basic Unix-type “Simple” permissions
– Effective permissions are determined based on the file owner (single user, single group, other/world)
– Read and/or write permissions can be assigned to the owner, the group, and “everyone else”
– What do you do when you need to assign read access to multiple groups (A, B & C)?
– What do you do when you need to assign read access to the group A and read+write access to group B?
– How do you maintain permissions when files are copied from Windows NTFS shares?
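The ACL model described above (a list of ACEs per file, with deny entries overriding allow entries) can be illustrated with a small, simplified sketch. This is not the OneFS implementation: the `Ace` class, the `effective_permissions` function and the group names are hypothetical, and the evaluation rule is reduced to "deny always wins" for clarity.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Ace:
    principal: str               # user or group identifier (illustrative names)
    permissions: FrozenSet[str]  # e.g. {"read", "write", "delete"}
    allow: bool = True           # False marks a deny ACE


def effective_permissions(acl: List[Ace], principals: Set[str]) -> Set[str]:
    """Union the allow ACEs that match the caller, then subtract every
    matching deny ACE (simplified 'deny overrides allow' rule)."""
    allowed: Set[str] = set()
    denied: Set[str] = set()
    for ace in acl:
        if ace.principal in principals:
            (allowed if ace.allow else denied).update(ace.permissions)
    return allowed - denied


# The cases the slide says simple owner/group/other permissions cannot
# express: read for groups A, B and C, with read+write for group B.
acl = [
    Ace("group_a", frozenset({"read"})),
    Ace("group_b", frozenset({"read", "write"})),
    Ace("group_c", frozenset({"read"})),
    Ace("contractors", frozenset({"write"}), allow=False),
]

if __name__ == "__main__":
    print(effective_permissions(acl, {"alice", "group_a"}))
    print(effective_permissions(acl, {"bob", "group_b"}))
    print(effective_permissions(acl, {"carol", "group_b", "contractors"}))
```

A member of group A gets read only, a member of group B gets read and write, and a group B member who is also in the denied `contractors` group is reduced to read.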

Page 18: Modern infrastructure for business data lake


Supporting Documentation

Page 19: Modern infrastructure for business data lake

HCFS Certification: Process Detail

Certification Step    Duration

Partner Prep

Partner defines HDP test matrix (platforms, HDP components, HDFS APIs, HDP version and partner product version)

Partner provides sample product to Hortonworks so Engineering and Field teams are familiar with partner technology

Testing

HDFS Test Suite training - at Hortonworks HQ and online

Partner deploys, runs, analyzes, and reports HDFS Test Suite with technical support from Hortonworks

HDP Core Test Suite (MapReduce, YARN, Tez, HBase, Hive and Pig) training – at Hortonworks HQ and online

Partner deploys, runs, analyzes, and reports HDP Core Test Suite with technical support from Hortonworks

Partner deploys, runs, analyzes, and reports on remaining HDP Component Test Suites with technical support from Hortonworks

Testing time allocation

Documentation

Joint review of test suite execution results

Hortonworks creates functional gap analysis document, which needs partner sign-off

Documentation time allocation

Validation

Hortonworks validates test suite execution results and certifies HCFS for specified HDP version and partner product version

Total certification time allocation 90-180 days

Page 20: Modern infrastructure for business data lake

Scale-out Isilon for Scale-out Hadoop

Compute Nodes

Isilon is a scale-out system; Hadoop HDFS is partially similar

HDFS on Isilon functions as a Parallel file system

Each compute node performs I/O on every Isilon node in the Rack

I/O bandwidth and storage capacity can be increased linearly simply by adding Isilon nodes

Compute can be increased or decreased on the fly and can easily be virtualized

With a mesh network that is faster than the disks, data locality is irrelevant

Isilon Nodes

Page 21: Modern infrastructure for business data lake

Hadoop Architecture – Traditional DAS

Dozens of Hadoop Racks Requires Significant Investment in Network Infrastructure

Diagram labels: Core Ethernet Switch; Rack Ethernet Switch; Compute; Shuffle+HDFS; SATA; 10 Gbps; 10+ Gbps

The ratio of compute and disk space/performance is fixed.

Non-local HDFS I/O (30-90% of HDFS I/O) will go through Ethernet.

Local disk usage is shared between shuffle I/O (60% of all I/O during terasort) and HDFS I/O.

Core Network Switches Are Additional Cost for Hadoop+DAS (more Network traffic required)
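The percentages on this slide imply a rough range for how much of a DAS cluster's total I/O is HDFS traffic crossing Ethernet. The sketch below is a back-of-the-envelope estimate only, and it assumes that all I/O that is not shuffle I/O is HDFS I/O, which the slide does not state explicitly. The function name is illustrative.

```python
def das_ethernet_hdfs_share(shuffle_share: float, non_local_share: float) -> float:
    """Fraction of all I/O that is non-local HDFS I/O (and so crosses Ethernet).

    shuffle_share: fraction of all I/O that is shuffle I/O (0.60 on the slide).
    non_local_share: fraction of HDFS I/O that is non-local (0.30 to 0.90).
    Assumption: everything that is not shuffle I/O is HDFS I/O.
    """
    for name, value in (("shuffle_share", shuffle_share),
                        ("non_local_share", non_local_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    hdfs_share = 1.0 - shuffle_share
    return hdfs_share * non_local_share


low = das_ethernet_hdfs_share(0.60, 0.30)
high = das_ethernet_hdfs_share(0.60, 0.90)

if __name__ == "__main__":
    print(f"HDFS I/O over Ethernet: {low:.0%} to {high:.0%} of all I/O")
```

Under that assumption, between 12% and 36% of all I/O during terasort would be HDFS traffic on the Ethernet network.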

Page 22: Modern infrastructure for business data lake

Hadoop Architecture – Isilon for HDFS

Reduced traffic across the Core Ethernet switch: HDFS traffic will only travel within a rack and across IB.

Diagram labels: Core Ethernet Switch, 10+ Gbps (used only for MR copy phase); Rack Ethernet Switch; Isilon InfiniBand Switch (IB); Compute; Shuffle; SATA; 10 Gbps; Isilon HDFS

The number of compute and Isilon nodes can be adjusted independently to achieve the optimal ratio of compute and I/O bandwidth.

HDFS I/O ALWAYS comes through a rack-local Isilon node, which collects data blocks from all other Isilon nodes across the InfiniBand fabric.

Shuffle I/O (65% of all I/O during terasort) remains on local storage.

Page 23: Modern infrastructure for business data lake


Traditional Hadoop - Layers

Page 24: Modern infrastructure for business data lake


Isilon+Hadoop – NO Layers

Page 25: Modern infrastructure for business data lake

ESG LAB REVIEW – VBLOCK SYSTEMS WITH VCE TECHNOLOGY EXTENSION FOR EMC ISILON

• Objectives

• Underscore business challenges and opportunities for progressing to enterprise Hadoop

• Establish requirements to be ready for production – Extensibility, Governance, Security, Availability, Performance and Multi-Use

• Perform benchmarks on Vblock System 340 with EMC Isilon using the TeraGen suite

“By leveraging an industry-proven Integrated Computing Platform (ICP) in VCE Vblock Systems and combining it with EMC Isilon and VMware vSphere Big Data Extensions, organizations get a fully integrated platform that meets and grows with their big data and analytics requirements.”
- Tony Palmer, Senior Lab Analyst, ESG

Page 26: Modern infrastructure for business data lake

ESG LAB OBSERVATION ON TERAGEN BENCHMARKS

Chart: Comparing Performance of Traditional Hadoop to VCE Vblock System with EMC Isilon (TeraSort Suite)

Series: 16 Traditional Hadoop Nodes (combined Compute and DAS); 16 VCE Compute Nodes and EMC Isilon Storage

X axis: TeraGen, TeraSort, TeraValidate

Y axis: Job Duration (seconds), 0 to 1,400

Page 27: Modern infrastructure for business data lake


VCE CUSTOMER BENEFITS

Page 28: Modern infrastructure for business data lake

VCE LOWERS OPERATIONAL COSTS

Chart: Cost before and after Vblock System deployment, by category (IT Staff, Facilities, Infrastructure). Values shown: 41%, 13%, 38%.

IDC Research Study of VCE Customers, September 2013

Page 29: Modern infrastructure for business data lake

GAS AND UTILITY LEADER

Drive to clean energy transformation while managing cost and risk

• Situation
  • Largest provider of gas and electric energy in the US. Innovate to drive a clean, sustainable future. Better management of costs and risks using predictive models. Operational improvement and compliance management. Expected data growth and application complexity with smart meter data management.

• Solution
  • Vblock System 340 to be used for private and public cloud in the hybrid cloud model, to keep custom applications and sensitive data in-house while pushing others to public cloud. Initiated work with Pivotal to become a software-led company with Pivotal CF.

• Anticipated Business Benefits
  – Increase agility for application deployment using Platform as a Service (PaaS) and big data solution
  – Support 600+ new applications planned annually, faster and at lower cost
  – Improve disaster recovery readiness and data protection
  – Lower costs and detect issues by enabling field personnel
  – Increase customer satisfaction, including cost savings via meter data

Differentiators: Suited to Hybrid Cloud Model and future expansion (upgrades and scaling). Extending VCE-Pivotal-EMC relationship while being open to tap eco-system.

Page 30: Modern infrastructure for business data lake

FOOD AND BEVERAGE GIANT

Financial reporting and marketing analysis: Back to Private Cloud

• Situation
  • Global food and beverage conglomerate looking to accelerate financial reporting and reflect customer behaviors. Seeking a better alternative to a third-party cloud-based model. Operational improvement and customer intimacy with leading brand recognition throughout the world. Data loading, processing and end-user impact are crucial.

• Solution
  • Use Vblock System for a shuffle and extend with VCE technology extension for EMC Isilon to run Pivotal Hadoop and HAWQ. For Pivotal Greenplum, use VCE technology extension for compute (Cisco C240). Bring some of the core applications to corporate IT.

• Anticipated Business Benefits
  – Streamline financial reporting process for goods coming from multiple geographies while keeping up to date and supporting broadening user queries
  – Exploit mobile applications for customer preferences and inventory management
  – Support product launches and marketing campaigns based on consumption logs, brand preferences and social media
  – Improve disaster recovery readiness and data protection
  – Start with one project, gain momentum while ensuring readiness for the future

Differentiators: Ability to match architecture to workloads. Reuse existing environment. Extensible for future growth.

Page 31: Modern infrastructure for business data lake

WHY VCE AND EMC FOR SCALE-OUT CONVERGED ANALYTIC SOLUTION?

• Adaptable, modular, and mission critical
• Incremental scaling with your demand from the broad VCE and EMC portfolio
• Pre-tested, validated and certified by EMC and VCE
• Exploit end-to-end analytics on the SAME VCE and EMC platform
• Take advantage of broadening EMC partner eco-system
• Contact your EMC or VCE representatives
• Contact: EMC – [email protected]; VCE - [email protected]
