1. beyond mission critical virtualizing big data and hadoop

© 2009 VMware Inc. All rights reserved Beyond Mission Critical: Virtualizing Big-Data and Hadoop Michael West Global Architect Big Data and Storage Field Engineering, VMware

Upload: chiou-nan-chen

Post on 26-Jan-2015


DESCRIPTION

VMware Big Data Forum

TRANSCRIPT

Page 1:

© 2009 VMware Inc. All rights reserved

Beyond Mission Critical: Virtualizing Big-Data and Hadoop

Michael West

Global Architect

Big Data and Storage Field Engineering, VMware

Page 2:

Applications And Storage Are Becoming Increasingly Diverse

Virtual Storage Arrays

vSphere

SAN/NAS Object / BLOB

Traditional Applications

• Traditional enterprise storage
• HW-based resiliency, QoS

Next Gen Cloud Apps

• Scale out, flash, DAS
• Application specific storage

All SSD Array

Server-side Flash

Page 3:

The complexity enterprise IT and developers face today

An Idea for a cool app

Spec a server config

Justify server costs

Procurement process

Wait for HW to arrive

Wait for IT ops to Image the server

Install a Database

LOB Architecture approval

Central IT Architectural approval

Justify more servers for scale testing

Wait for more HW

Configure ACLs and LBs

4-6 Months from Idea to Production!

New infrastructures

New Languages and Frameworks

New Devices and Domains

New Data types and requirements

Page 4:

Big Data: Not Just for the Web Giants – Now the Intelligent Enterprise

Page 5:

Real-time analysis allows instant understanding of market dynamics.

Retailers can have an intimate understanding of their customers' needs and use direct, targeted marketing.

Market Segment Analysis

Personalized Customer Targeting

Page 6:

The Emerging Pattern of Big Data Systems: Retail Example

Real-Time Streams

Exa-scale Data Store

Parallel Data Processing

Real-Time Processing

Machine Learning

Data Science

Cloud Infrastructure

Analytics

Page 7:

Storage: Plan for Peta-scale Data Storage and Processing

[Chart: PB of data (log scale, 0.01 to 1000) by year, 2000 to 2015, for two series: Online Apps and Analytics]

Analytics Rapidly Outgrows Traditional Data Size by 100x

Page 8:

Unprecedented Scale

“Data transparency, amplified by Social Networks generates data at a scale never seen before”

- The Human Face of Big Data

We are creating an Exabyte of data every minute in 2013

Yottabyte by 2030

Page 9:

A single GE Jet Engine produces 10 Terabytes of data in one hour, or 90 Petabytes per year.

Enabling early detection of faults, common mode failures, product engineering feedback.

Post Mortem

Proactively Maintained

Connected Product
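
The yearly figure on this slide can be sanity-checked with simple arithmetic. The sketch below assumes the engine runs continuously (24 hours a day, 365 days a year) and uses decimal units (1 PB = 1,000 TB); the slide states neither, so both are assumptions. Under them the result is 87.6 PB, which the slide appears to round to 90 PB.

```python
# Sanity check of the "10 TB per hour, 90 PB per year" figure.
TB_PER_HOUR = 10            # from the slide
HOURS_PER_YEAR = 24 * 365   # assumption: the engine produces data continuously
TB_PER_PB = 1000            # assumption: decimal units


def petabytes_per_year(tb_per_hour=TB_PER_HOUR, hours=HOURS_PER_YEAR):
    """Convert an hourly data rate in TB into a yearly volume in PB."""
    return tb_per_hour * hours / TB_PER_PB


if __name__ == "__main__":
    print(petabytes_per_year())  # 87.6
```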

Page 10:

Cloud Infrastructure Supports Mixed Big Data Workloads

Machine Learning

Hadoop

Real-Time Analytics

Cloud Infrastructure

Management

Network/Security

Storage/Availability

Compute

Page 11:

Cloud Infrastructure Supports Multiple Tenants

Cloud Infrastructure

Management

Network/Security

Storage/Availability

Compute

Web User Analytics

Financial Analysis

Historical Customer Behavior

Page 12:

Software-defined Datacenter: Compute

1. Agility / Rapid deployment
2. Lower Capex
3. Isolation for resource control and security
4. Operational efficiency

Management

The Core Values of Virtualization Apply to Big Data

Network/Security

Storage/Availability

Compute

Page 13:

Virtualizing Hadoop

Elasticity & Multi-tenancy

• Shrink and expand cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy

High Availability

• High availability for entire Hadoop stack
• One click to set up
• Battle-tested

Operational Simplicity

• Rapid deployment
• One-stop command center
• Easy to configure/reconfigure

Page 14:

Big Data Extensions: Core Components

• Serengeti: open source core; a tool to simplify virtualized Hadoop deployment & operations

• Hadoop Virtualization Extensions (HVE): virtualization changes for core Hadoop, contributed back to Apache Hadoop

• Virtual Hadoop Manager (VHM): advanced resource management on vSphere

Page 15:

Strong Isolation between Workloads is Key

Hungry Workload 1

Reckless Workload 2

Nosy Workload 3

Cloud Infrastructure

Page 16:

Big Data Family of Frameworks

Compute layer

• Hadoop: batch analysis
• HBase: real-time queries
• NoSQL: Cassandra, Mongo, etc.
• Big SQL: Impala, Pivotal HAWQ
• Other: Spark, Shark, Solr, Platfora, etc., …

File System/Data Store

Virtualization

Host Host Host Host Host Host Host

Page 17:

Traditional Hadoop vs. Elastic Hadoop

Traditional Hadoop: Converged Compute/Storage

Elastic Compute

Scale-out Network Storage

Page 18:

Management

Software-defined Datacenter: Storage

Requirements of Next Generation Storage

Network/Security

Storage/Availability

Compute

1. 10x lower cost of storage
2. Handle explosive data growth
3. Support a variety of application types
4. Solve the privacy and security issues

Page 19:

Software-defined Storage Enables Fundamental Economics

[Chart: cost per GB ($0 to $5.50) against petabytes deployed (0.5 to 128) for three storage types: Traditional SAN/NAS; Scale-out NAS (Isilon, NTAP); Distributed Object Storage (HDFS, MAPR, CEPH)]

Page 20:

HDFS Model

ESX ESX ESX

JT

HDFS or MAPR VM HDFS or MAPR VM HDFS or MAPR VM

Local Disks

SAN/NAS Non-Hadoop VMs

Hadoop Compute VMs

JT: JobTracker
TT: TaskTracker
NN: NameNode
VHM: Virtual Hadoop Manager

NN

TT

TT

TT

VirtualCenter Management Server

DRS DRS DRS DRS DRS

VHM

Hadoop HDFS VMs

TT

TT

TT

JT

Page 21:

Virtual Hadoop Manager

State, stats (Slots used, Pending work)

Commands (Decommission, Recommission)

Stats and VM configuration

Serengeti Job Tracker

vCenter DB

Manual/Auto

Power on/off

Virtual Hadoop Manager (VHM)

Job Tracker

Task Tracker

Task Tracker

Task Tracker

vCenter Server

Serengeti Configuration

VC state and stats

Hadoop state and stats

VC actions

Hadoop actions

Algorithms

Cluster Configuration
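
The control loop drawn on this slide (read slot usage and pending work, then decommission or recommission TaskTrackers) can be illustrated with a small decision function. This is only a sketch under stated assumptions: the `ClusterStats` fields, the `plan_action` name, and the scaling rule are hypothetical and are not taken from VHM, whose actual algorithms the slide does not describe.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Hypothetical snapshot of the stats named on the slide."""
    active_trackers: int    # TaskTracker VMs currently in service
    total_trackers: int     # TaskTracker VMs that exist (on or off)
    slots_per_tracker: int  # task slots each TaskTracker offers
    slots_used: int         # "Slots used" from the JobTracker
    pending_tasks: int      # "Pending work" from the JobTracker


def plan_action(stats, min_trackers=1):
    """Return (action, count) where action is one of
    'recommission', 'decommission', or 'none'.

    Illustrative rule only: grow when pending work exceeds free slots,
    shrink when nothing is pending and whole trackers sit idle.
    """
    capacity = stats.active_trackers * stats.slots_per_tracker
    free_slots = capacity - stats.slots_used

    if stats.pending_tasks > free_slots:
        shortfall = stats.pending_tasks - free_slots
        needed = -(-shortfall // stats.slots_per_tracker)  # ceiling division
        available = stats.total_trackers - stats.active_trackers
        count = min(needed, available)
        return ("recommission", count) if count > 0 else ("none", 0)

    if stats.pending_tasks == 0:
        idle_trackers = free_slots // stats.slots_per_tracker
        count = min(idle_trackers, stats.active_trackers - min_trackers)
        return ("decommission", count) if count > 0 else ("none", 0)

    return ("none", 0)


if __name__ == "__main__":
    busy = ClusterStats(active_trackers=4, total_trackers=10,
                        slots_per_tracker=2, slots_used=8, pending_tasks=5)
    print(plan_action(busy))
```

A real manager would also have to weigh the vCenter-side statistics shown on the slide (for example host contention), which this sketch leaves out.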

Page 22:

Big-Data using Local Disks

Host

Host

Host

Host

Host

Host

Host

Top of Rack Switch

Servers with Local Disks

• 16-24 core server
• 12-24 SATA 2-4TB disks
• 10 GbE adapter
• iSCSI/NFS for shared storage for vMotion etc., …

High Performance 10 GbE Switch per Rack
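
The disk counts and sizes above bound the raw capacity of one host. The sketch below works out that range and, as an assumption, the usable capacity if the data is stored in HDFS with its default replication factor of 3; the slide does not state the replication factor, and the helper name `host_capacity_tb` is made up for illustration.

```python
def host_capacity_tb(disks, tb_per_disk, replication=3):
    """Return (raw_tb, usable_tb) for one host.

    replication=3 is the HDFS default; whether the deck's clusters use it
    is an assumption. Space reserved for temp data and the OS is ignored.
    """
    raw = disks * tb_per_disk
    return raw, raw / replication


if __name__ == "__main__":
    # Smallest and largest configurations listed on the slide.
    print(host_capacity_tb(12, 2))  # (24, 8.0)
    print(host_capacity_tb(24, 4))  # (96, 32.0)
```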

Page 23:

Standard Deployment Configuration for D/C Separation

Virtualization Host

OS Image – VMDK

Hadoop Virtual Node 1

Task-tracker

Shared storage SAN/NAS

Local disks

OS Image – VMDK

VMDK VMDK VMDK VMDK VMDK VMDK VMDK

Hadoop Virtual Node 2

Datanode

Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4 Ext4

VMDK

VMDK VMDK VMDK VMDK VMDK VMDK VMDK VMDK

… …

Page 24:

Big Data Storage

Elastic Compute

Scale-out Network Storage

• Hadoop Protocol
• Snapshots
• POSIX Apps
• Full NFS Access
• Replication
• Erasure Coding

Page 25:

Big Data with Scale-out-NAS

Big-Data using Scale-out NAS

Host

Host

Host

Host

Host

Host

Top of Rack Switch

Scale-out NAS

Host

Host

Host

Host

Host

Host

Top of Rack Switch

Scale-out NAS

Temp Data: Local disk or SSD in each host, for transient data

Shared Data: Isilon Scale-out NAS

Page 26:

Hadoop Virtual Node 2

NN NN NN NN NN NN

data node

Isilon

Storage Configuration for Data/Compute Separation With Isilon

Virtualization Host

VMDK

OS Image – VMDK

Shared storage SAN/NAS

OS Image – VMDK

VMDK

VMDK

Hadoop Virtual Node 1

Ext4

Job-tracker

Ext4

Temp

OS Image – VMDK

Ext4

Task-tracker

Ext4

Hadoop Virtual Node 3

Ext4

Task-tracker

Ext4

Page 27:

Management

Software-defined Datacenter: Network and Security

Network & Security Requirements for Big Data

1. New high bandwidth network designs
2. Leverage Software-defined network security
3. Automate secure network provisioning

Network/Security

Storage/Availability

Compute

Page 28:

Customer Success: Hadoop as a Service at FedEx

Scale-out Isilon Cluster
- Shared Data
- NAS + Hadoop

Elastic vSphere Cluster
- Mixed Workloads
- vSphere
- Existing Rack Mount Servers

Page 29:

Agile Big Data at FedEx

Security

• Trusted Isolation
• Well known auditable platform

Agility

• Deploy in minutes
• Optimize for shift in workload characteristics

Elasticity

• Create true multi-tenancy
• Mixed workloads

Page 30:

Breakthrough Use Cases

Web Log Analysis: Initial exploration was around detection of mobile devices accessing the website. Analysis of 570 billion web server log entries took approximately 9 minutes to complete on a small cluster.

ZIP Code Analysis: Analysis of data to determine which ZIP codes are the highest source or destination for shipments.

Shipment Analysis: Analysis of shipment information to determine patterns that may delay a package.
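
The web log use case (spotting mobile devices in server logs) maps naturally onto a map step and a reduce step. The sketch below illustrates that shape on a single machine; it is not the job FedEx ran. It assumes the common "combined" access log format, where the user agent is the last quoted field, and the keyword list in `MOBILE_HINTS` is a hypothetical heuristic.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Hypothetical heuristic; real device detection uses richer rules.
MOBILE_HINTS = ("iphone", "ipad", "android", "mobile")


def map_line(line):
    """Map step: emit (category, 1) for one log line."""
    match = UA_PATTERN.search(line)
    if match is None:
        return ("unparsed", 1)
    user_agent = match.group(1).lower()
    is_mobile = any(hint in user_agent for hint in MOBILE_HINTS)
    return ("mobile" if is_mobile else "other", 1)


def reduce_counts(pairs):
    """Reduce step: sum the counts per category."""
    totals = Counter()
    for category, count in pairs:
        totals[category] += count
    return totals


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [10/Oct/2013:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 '
        '"-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)"',
        '5.6.7.8 - - [10/Oct/2013:13:55:37 -0700] "GET /track HTTP/1.1" 200 512 '
        '"-" "Mozilla/5.0 (Windows NT 6.1; WOW64)"',
    ]
    print(dict(reduce_counts(map_line(line) for line in sample)))
```

On a Hadoop cluster the same two functions would run as mapper and reducer over log splits, which is what lets the work spread across many nodes.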

Page 31:

Agility: Automation of Hadoop Cluster Management

Deploy

Resize

Elastic scaling

Customize

Incorporate best practices

Manage

Tune configuration

Run

Execute jobs

Access HDFS

Page 32:

Monitoring

Agility: Ease of Management Due to Consolidation

Cluster setup and provisioning

Monitoring

HW procurement and sizing

Cluster setup and provisioning

HW procurement and sizing

Page 33:

Elasticity: Mixed Workloads on a Shared Platform

Production

Test

Experimentation

Dept A: recommendation engine

Dept B: ad targeting

Production

Test

Experimentation

Log files

Social data

Transaction data

Historical customer behavior

Page 34:

Customer Success: Identified ROI in 2 Months

Company Profile: Recruiting through Social Media Search

• Big Data Shop: Acquire and classify data, then present to customers

Hadoop Virtualization Stages:

• Stage 1: Used AWS EMR to quickly spin up clusters and shut them down to save cost

• Stage 2: Needed to run Hadoop jobs 24/7. EMR was too expensive, and performance was spiky depending on the VM and backing storage assigned.

• Stage 3: Cut costs with on-premise virtual Hadoop with BDE

  • 1 hour to read the doc, then a 30-node cluster available in a few minutes

• Stage 4: Mix physical and virtual:

  • Physical servers with local spinning disks for compute and HDFS

  • VMs with HA for support nodes (ZooKeeper, NameNode, JobTracker, Journal, Clients)

• Stage 5: Compute nodes in VMs for CPU-intensive workloads

• Summary: 192 cores, 1 TB RAM, 96 spindles, 288 TB storage
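
The summary figures imply a few per-unit ratios that are useful when comparing this cluster with other designs. The sketch below derives them; it assumes 1 TB of RAM means 1,024 GB and treats the 288 TB as raw capacity, neither of which the slide states. The function name `cluster_ratios` is made up for illustration.

```python
def cluster_ratios(cores=192, ram_tb=1, spindles=96, storage_tb=288):
    """Derive per-unit ratios from the summary line on the slide."""
    return {
        "tb_per_spindle": storage_tb / spindles,
        "cores_per_spindle": cores / spindles,
        # Assumption: 1 TB RAM = 1024 GB.
        "ram_gb_per_core": ram_tb * 1024 / cores,
    }


if __name__ == "__main__":
    for name, value in cluster_ratios().items():
        print(f"{name}: {value:.2f}")
```

With the slide's numbers this gives 3 TB per spindle and 2 cores per spindle, in line with the 2-4 TB SATA disks listed for the local-disk configuration earlier in the deck.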

Page 35:

Summary

Page 36:

Customers Winning from Consolidated Big Data Platforms

“Dedicated hardware makes no sense”

“Software-defined Datacenter enables rapid deployment multiple tenants and labs”

“Our mixed workloads include Hadoop, Database, ETL and App-servers”

“Any performance penalties are minor”

Management

Network/Security

Storage/Availability

Compute

Page 37:

Cloud Infrastructure is Ready for Big Data – Are you?

Cloud Infrastructure

Page 38:

Q&A