vmware serengeti - based on infochimps ironfan


Upload: jim-kaskade

Post on 27-Jan-2015

DESCRIPTION

VMware's virtualized Hadoop, based on the Infochimps open source project, Ironfan.

TRANSCRIPT

Page 1: Vmware Serengeti - Based on Infochimps Ironfan

© 2012 VMware Inc. All rights reserved

Confidential

Hadoop-as-a-Service

CXO Big Data Seminar

September 26, 2012

Page 2: Vmware Serengeti - Based on Infochimps Ironfan

Agenda

VMware Data Portfolio

Big Data and Virtualization Trends

Enterprise Hadoop Needs

Virtualized Hadoop for the Enterprise

Summary

Page 3: Vmware Serengeti - Based on Infochimps Ironfan

Trends Driving Change in Enterprise IT

Cloud

• Offered “as-a-Service”

• Virtualization

New Application Types

• Mobile, SaaS, social

• Apps released early and often

Frameworks

• New application frameworks driving an increase in application development

Data Disruption

• Web orientation drives exponential data volumes

• Reduced latency and new types of data

Page 4: Vmware Serengeti - Based on Infochimps Ironfan

The Database is Being Stretched

Big Data

Cloud Delivery

Flexible Data

Virtualized

Offered “-as-a-Service”

Petabytes vs. Gigabytes

Democratize BI

Multi-structured data

Developer productivity

Fast Data

Global access patterns

Mobile app proliferation

Page 5: Vmware Serengeti - Based on Infochimps Ironfan

Big, Fast and Flexible Data

[Diagram: VMware data portfolio grouped as Big, Fast, and Flexible data]

Big: Big Data Processing, Big Data Analytics, analytic workloads

Fast: OLTP workloads

Flexible: OSS Relational, Document, Object, Key / Value

Products shown: Serengeti, GemFire, vPostgres

Cloud Delivery Model: Data as a service for private and public clouds

Page 6: Vmware Serengeti - Based on Infochimps Ironfan

Agenda

VMware Data Portfolio

Big Data and Virtualization Trends

Enterprise Hadoop Needs

Virtualized Hadoop for the Enterprise

Summary

Page 7: Vmware Serengeti - Based on Infochimps Ironfan

Data is exploding & Hadoop is driving growth

Unstructured data driving growth

[Chart: structured vs. unstructured data volume by year, 2011 to 2020]

Complex unstructured data forecasted to outpace structured relational data by 10x by 2020

Hadoop adoption is ramping

• Evaluating: 53%

• In production: 23%

• Piloting: 18%

• Testing: 2%

• Don't know: 2%

• Other: 2%

Source: Forrester Survey of 60 CIOs, September 2011

• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy

• Gartner predicts +800% data growth over next 5 years

• Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs
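To put the "+800% over 5 years" figure in annual terms, the short Python sketch below converts a cumulative growth percentage into a compound annual rate. It assumes "+800%" means the data volume ends at 9x its starting size and that growth compounds evenly each year; both are assumptions of this illustration, not statements from the slide. The function name is hypothetical.

```python
def annual_growth_rate(total_growth_pct: float, years: int) -> float:
    """Return the compound annual growth rate implied by a cumulative increase.

    total_growth_pct: cumulative increase in percent (800 means +800%).
    years: number of years over which the increase occurs.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    # +800% is interpreted here as ending at 9x the starting volume.
    growth_factor = 1 + total_growth_pct / 100
    return growth_factor ** (1 / years) - 1


if __name__ == "__main__":
    rate = annual_growth_rate(800, 5)
    print(f"Implied annual growth: {rate:.1%}")
```

Under these assumptions, +800% over five years works out to roughly 55% growth per year.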

Page 8: Vmware Serengeti - Based on Infochimps Ironfan

Broad Application of Hadoop technology

Horizontal Use Cases

• Log Processing / Click Stream Analytics

• Machine Learning / sophisticated data mining

• Web crawling / text processing

• Extract Transform Load (ETL) replacement

• Image / XML message processing

• General archiving / compliance

Vertical Use Cases

• Financial Services

• Mobile / Telecom

• Internet Retailer

• Scientific Research

• Pharmaceutical / Drug Discovery

• Social Media

Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.

Page 9: Vmware Serengeti - Based on Infochimps Ironfan

The Future of Virtualization

[Chart: progression toward VDC and Software-defined Datacenter Services]

                                 2008     2012          FUTURE
Time to Provision New Services   Weeks    Days/Hours    Minutes/Seconds
Workloads Virtualized            25%      60%+          >90%

Page 10: Vmware Serengeti - Based on Infochimps Ironfan

Virtualization enables a Common Infrastructure for Big Data

Single-purpose clusters for various business applications lead to cluster sprawl.

Virtualization Platform

Simplify

• Single Hardware Infrastructure

• Unified operations

Optimize

• Shared Resources = higher utilization

• Elastic resources = faster on-demand access

[Diagram: Cluster Sprawling (separate MPP DB, Hadoop, and HBase clusters) vs. Cluster Consolidation (MPP DB, Hadoop, and HBase sharing one Virtualization Platform)]

Page 11: Vmware Serengeti - Based on Infochimps Ironfan

Agenda

VMware Data Portfolio

Big Data and Virtualization Trends

Enterprise Hadoop Needs

Virtualized Hadoop for the Enterprise

Summary

Page 12: Vmware Serengeti - Based on Infochimps Ironfan

Hadoop Users

Data scientists, analysts, developers

• Line of business users

• Intimate with data and analysis, not IT

• Tasked with providing actionable intelligence that impacts the business

Concerns

• Obtain a Hadoop cluster on demand

• Minimize time to insight

• Require reasonable performance from Hadoop cluster

Page 13: Vmware Serengeti - Based on Infochimps Ironfan

The IT Guy

Admins, architects, CIO

• Responsible for technology infrastructure, compliance, budget management

• Evaluates new technologies and recommends best practices

Concerns

• Keeping up with demands of the business

• Cost savings and consolidation

• Reliability

• Complexity of running and tuning Hadoop clusters

• Shortage of skills to do the above

Page 14: Vmware Serengeti - Based on Infochimps Ironfan

Hadoop Journey in Enterprises

Stage 1: Piloting

• Often start with line of business

• Try 1 or 2 use cases to explore the value of Hadoop

Stage 2: Hadoop Production

• Serve a few departments

• A few more use cases

• Core Hadoop + components

Stage 3: Big Data Production

• Serve many departments

• Often part of mission critical workflow

• Integrated with other big data services

[Chart axes: Scale (labels: 20, 3000 node) and Integrated]

Page 15: Vmware Serengeti - Based on Infochimps Ironfan

Agenda

VMware Data Portfolio

Big Data and Virtualization Trends

Enterprise Hadoop Needs

Virtualized Hadoop for the Enterprise

Summary

Page 16: Vmware Serengeti - Based on Infochimps Ironfan

Why Virtualize Hadoop?

Elasticity & Multi-tenancy

• Shrink and expand cluster on demand

• Independent scaling of compute and data

• Strong multi-tenancy

High Availability

• High availability for entire Hadoop stack

• One click to set up

• Battle-tested

Operational Simplicity

• Rapid deployment

• One stop command center

• Easy to configure/reconfigure

Page 17: Vmware Serengeti - Based on Infochimps Ironfan

Project Serengeti

Open source project launched in June 2012

Toolkit that leverages virtualization to simplify Hadoop deployment and operations

To learn more, visit projectserengeti.org

Deploy a Hadoop cluster in 10 Minutes

Customize Hadoop cluster

Use Your Favorite Hadoop Distribution

One stop command center

Serengeti

Page 18: Vmware Serengeti - Based on Infochimps Ironfan

Rapid Deployment of a Hadoop Cluster with Serengeti

Step 1: Deploy Serengeti virtual appliance on vSphere.

Step 2: A few simple commands to stand up Hadoop Cluster.

Done

Page 19: Vmware Serengeti - Based on Infochimps Ironfan

A Walk Through Serengeti

Page 20: Vmware Serengeti - Based on Infochimps Ironfan

A Walk Through Serengeti

Page 21: Vmware Serengeti - Based on Infochimps Ironfan

A Walk Through Serengeti

Scaling out a cluster

Advanced cluster creation

Page 22: Vmware Serengeti - Based on Infochimps Ironfan

Customizing Your Hadoop Cluster

Choice of distros

Storage configuration

• Choice of shared storage or local disk

Resource configuration

High availability option

# of nodes

Also used to tune Hadoop config

…
"distro": "apache",
"groups": [
  {
    "name": "master",
    "roles": ["hadoop_namenode", "hadoop_jobtracker"],
    "storage": {"type": "SHARED", "sizeGB": 20},
    "instanceType": "MEDIUM",
    "instanceNum": 1,
    "haFlag": "on"
  },
  {
    "name": "worker",
    "roles": ["hadoop_datanode", "hadoop_tasktracker"],
    "instanceType": "SMALL",
    "instanceNum": 5,
    "haFlag": "off"
…

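The cluster specification on this slide can also be assembled programmatically. The Python sketch below builds a dictionary with the same field names the slide shows (`distro`, `groups`, `roles`, `storage`, `instanceType`, `instanceNum`, `haFlag`) and serializes it to JSON. The helper names `node_group` and `total_nodes` are hypothetical, and the sketch covers only the fields visible on the slide; the elided parts of the real file are not reconstructed.

```python
import json


def node_group(name, roles, instance_type, instance_num, ha_flag, storage=None):
    """Build one node-group entry using the field names shown on the slide."""
    group = {
        "name": name,
        "roles": list(roles),
        "instanceType": instance_type,
        "instanceNum": instance_num,
        "haFlag": ha_flag,
    }
    # The slide only shows a storage block for the master group, so it is optional here.
    if storage is not None:
        group["storage"] = storage
    return group


def total_nodes(spec):
    """Sum the instance counts across all node groups."""
    return sum(group["instanceNum"] for group in spec["groups"])


# Values copied from the slide; anything elided there is left out.
spec = {
    "distro": "apache",
    "groups": [
        node_group(
            "master",
            ["hadoop_namenode", "hadoop_jobtracker"],
            "MEDIUM",
            1,
            "on",
            storage={"type": "SHARED", "sizeGB": 20},
        ),
        node_group(
            "worker",
            ["hadoop_datanode", "hadoop_tasktracker"],
            "SMALL",
            5,
            "off",
        ),
    ],
}

if __name__ == "__main__":
    print(json.dumps(spec, indent=2))
    print("total nodes:", total_nodes(spec))
```

Generating the file from code makes it easy to vary the worker count or the HA flag per environment while keeping the rest of the definition fixed.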
Page 23: Vmware Serengeti - Based on Infochimps Ironfan

Freedom of Choice and Open Source

Community Projects

Distributions

• Flexibility to choose from major distributions

• Support for multiple projects (work in progress)

• Open architecture to welcome industry participation

• Contributing Hadoop Virtualization Extensions (HVE) to open source community

Page 24: Vmware Serengeti - Based on Infochimps Ironfan

Use Local Disk where it’s Needed

Storage type     Cost per Gigabyte   $1M gets        IOPS      Bandwidth
SAN Storage      $2 - $10            0.5 Petabytes   200,000   8 Gbytes/sec
NAS Filers       $1 - $5             1 Petabyte      200,000   10 Gbytes/sec
Local Storage    $0.05               10 Petabytes    400,000   250 Gbytes/sec
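The capacity figures can be sanity-checked with simple arithmetic. The Python sketch below divides a budget by a per-gigabyte price, assuming decimal units (1 Petabyte = 1,000,000 Gigabytes) and counting raw storage price only. The function and constant names are illustrative. With these assumptions the SAN and NAS capacities on the slide match the low end of each price range, while local storage at $0.05/Gigabyte yields 20 Petabytes of raw capacity against the 10 Petabytes on the slide; the slide does not say what accounts for the difference.

```python
GB_PER_PB = 1_000_000  # decimal units, an assumption of this sketch


def capacity_pb(budget_usd: float, cost_per_gb_usd: float) -> float:
    """Return how many petabytes a budget buys at a given price per gigabyte."""
    if cost_per_gb_usd <= 0:
        raise ValueError("cost per gigabyte must be positive")
    return budget_usd / cost_per_gb_usd / GB_PER_PB


BUDGET_USD = 1_000_000

# (low, high) price per gigabyte in USD, taken from the slide.
PRICE_RANGES = {
    "SAN Storage": (2.0, 10.0),
    "NAS Filers": (1.0, 5.0),
    "Local Storage": (0.05, 0.05),
}

if __name__ == "__main__":
    for name, (low_price, high_price) in PRICE_RANGES.items():
        best = capacity_pb(BUDGET_USD, low_price)    # cheapest price, most capacity
        worst = capacity_pb(BUDGET_USD, high_price)  # highest price, least capacity
        print(f"{name}: {worst:.2f} to {best:.2f} PB for ${BUDGET_USD:,}")
```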

Page 25: Vmware Serengeti - Based on Infochimps Ironfan

Virtual Storage Architecture Includes Local Disk

Shared Storage: SAN or NAS

• Easy to provision

• Automated cluster rebalancing

• Leverage high availability protection

Local Storage: Local Disks

• Local disk for Hadoop

• Scalable bandwidth, lower cost/GB

[Diagram: six hosts, each running Hadoop VMs alongside other VMs, attached to shared storage and to local storage]

Page 26: Vmware Serengeti - Based on Infochimps Ironfan

Hadoop Runs Well on Virtualization

[Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs; y-axis from 0 to 450]

Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf

Page 27: Vmware Serengeti - Based on Infochimps Ironfan

Why Virtualize Hadoop?

Elasticity & Multi-tenancy

• Shrink and expand cluster on demand

• Independent scaling of compute and data

• Strong multi-tenancy

High Availability

• High availability for entire Hadoop stack

• One click to set up

• Battle-tested

Operational Simplicity

• Rapid deployment

• One stop command center

• Easy to configure/reconfigure

Page 28: Vmware Serengeti - Based on Infochimps Ironfan

High Availability for the Hadoop Stack

[Diagram: Hadoop stack layers]

• ETL Tools, BI Reporting, RDBMS

• Pig (Data Flow), Hive (SQL), HCatalog

• MapReduce (Job Scheduling/Execution System)

• HBase (Key-Value store)

• HDFS (Hadoop Distributed File System)

• ZooKeeper (Coordination)

• Management Server

Components called out: Namenode, Jobtracker, Hive MetaDB, HCatalog MDB, Server

HA for Hadoop stack is more than Name node HA

Page 29: Vmware Serengeti - Based on Infochimps Ironfan

vMotion Reduces Planned Downtime

Description:

Enables the live migration of virtual machines from one host to another with continuous service availability.

Benefits:

• Revolutionary technology that is the basis for automated virtual machine movement

• Meets service level and performance goals

Page 30: Vmware Serengeti - Based on Infochimps Ironfan

Hadoop Aware HA - Protection Against Unplanned Downtime

Overview

• Protection against host and VM failures

• Added application-aware HA for Hadoop NameNode (NN) and JobTracker (JT), protecting against NN and JT failures

• Automatic failure detection and virtual machine restart in minutes, on any available host in the cluster

• In-progress Hadoop jobs pause and resume once the NameNode is back up
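The detect-and-restart behavior described above can be illustrated with a toy model. The Python sketch below is not the vSphere HA algorithm; it is a simplified simulation in which every VM on a failed host is restarted on the healthy host currently running the fewest VMs. The `Host` class, the `restart_failed_vms` function, and the least-loaded placement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    healthy: bool = True
    vms: list = field(default_factory=list)


def restart_failed_vms(hosts):
    """Restart VMs from failed hosts on the least-loaded healthy host.

    Returns a dict mapping each restarted VM to its new host name.
    """
    healthy_hosts = [h for h in hosts if h.healthy]
    if not healthy_hosts:
        raise RuntimeError("no healthy host available for restart")

    placements = {}
    for host in hosts:
        if host.healthy:
            continue
        while host.vms:
            vm = host.vms.pop(0)
            # Illustrative policy: pick the healthy host with the fewest VMs.
            target = min(healthy_hosts, key=lambda h: len(h.vms))
            target.vms.append(vm)
            placements[vm] = target.name
    return placements


if __name__ == "__main__":
    cluster = [
        Host("host-a", healthy=False, vms=["namenode"]),
        Host("host-b", vms=["datanode-1", "datanode-2"]),
        Host("host-c", vms=["jobtracker"]),
    ]
    print(restart_failed_vms(cluster))
```

In this model the NameNode VM moves off the failed host without any Hadoop-level failover logic, which is the point the slide makes about handling HA at the virtualization layer.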

Page 31: Vmware Serengeti - Based on Infochimps Ironfan

vSphere Fault Tolerance Provides Continuous Protection

[Diagram: App/OS virtual machines on two VMware ESX hosts, labeled FT and HA, with failed VMs marked X]

Overview

• Single identical VMs running in lockstep on separate hosts

• Zero downtime, zero data loss failover for all virtual machines in case of hardware failures

• Integrated with VMware HA/DRS

• No complex clustering or specialized hardware required

• Single common mechanism for all applications and operating systems

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

Page 32: Vmware Serengeti - Based on Infochimps Ironfan

Achieve HA for the Entire Hadoop Stack

[Diagram: Hadoop stack layers]

• ETL Tools, BI Reporting, RDBMS

• Pig (Data Flow), Hive (SQL), HCatalog

• MapReduce (Job Scheduling/Execution System)

• HBase (Key-Value store)

• HDFS (Hadoop Distributed File System)

• ZooKeeper (Coordination)

• Management Server

Components called out: Namenode, Jobtracker, Hive MetaDB, HCatalog MDB, Server

• Battle-tested high availability technology

• Single mechanism to achieve HA for the entire Hadoop stack

• One click to enable HA and/or FT

Page 33: Vmware Serengeti - Based on Infochimps Ironfan

Why Virtualize Hadoop?

Elasticity & Multi-tenancy

• Shrink and expand cluster on demand

• Independent scaling of compute and data

• Strong multi-tenancy

High Availability

• High availability for entire Hadoop stack

• One click to set up

• Battle-tested

Operational Simplicity

• Rapid deployment

• One stop command center

• Easy to configure/reconfigure

Page 34: Vmware Serengeti - Based on Infochimps Ironfan

Evolution of Hadoop on VMs

Current Hadoop: Combined Storage/Compute

Hadoop in VM

• VM lifecycle determined by Datanode

• Limited elasticity

• Limited to Hadoop Multi-Tenancy

Separate Storage

• Separate compute from data

• Elastic compute

• Enable shared workloads

• Raise utilization

Separate Compute Clusters

• Separate virtual clusters per tenant

• Stronger VM-grade security and resource isolation

• Enable deployment of multiple Hadoop runtime versions

[Diagram: three layouts of compute and storage VMs on a slave node, the last with separate compute VMs for tenants T1 and T2]
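The separation of compute from data can be pictured with a small model. The Python sketch below is a hypothetical illustration, not Serengeti code: a fixed set of data nodes is shared, while each tenant's compute cluster grows or shrinks independently. The class name `SharedDataLayer` and its methods are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SharedDataLayer:
    """A fixed pool of data nodes shared by per-tenant compute clusters."""

    data_nodes: int
    tenants: dict = field(default_factory=dict)  # tenant name -> compute node count

    def scale_tenant(self, tenant: str, compute_nodes: int) -> None:
        """Set a tenant's compute node count; zero removes the tenant's cluster."""
        if compute_nodes < 0:
            raise ValueError("compute_nodes must not be negative")
        if compute_nodes == 0:
            self.tenants.pop(tenant, None)
        else:
            self.tenants[tenant] = compute_nodes

    def total_vms(self) -> int:
        """Count data-node VMs plus every tenant's compute VMs."""
        return self.data_nodes + sum(self.tenants.values())


if __name__ == "__main__":
    layer = SharedDataLayer(data_nodes=6)
    layer.scale_tenant("T1", 4)   # tenant T1 starts a compute cluster
    layer.scale_tenant("T2", 2)   # tenant T2 starts a smaller one
    print(layer.total_vms())      # data nodes plus both compute clusters
    layer.scale_tenant("T1", 0)   # T1 finishes; the data layer is untouched
    print(layer.data_nodes, layer.tenants)
```

Because scaling a tenant never changes `data_nodes`, compute capacity can come and go without rebalancing stored data, which is the elasticity argument the slide makes.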

Page 35: Vmware Serengeti - Based on Infochimps Ironfan

In-house Hadoop as a Service “Enterprise EMR” – (Hadoop + Hadoop)

[Diagram: virtualization platform spanning six hosts, with a compute layer above a data layer]

Compute layer:

• Ad hoc data mining

• Production recommendation engine

• Production ETL of log files

Data layer:

• HDFS (shown twice)

Page 36: Vmware Serengeti - Based on Infochimps Ironfan

Integrated Big Data Production – (Hadoop + other big data)

[Diagram: virtualization platform spanning six hosts, with a compute layer and a data layer (HDFS)]

Workloads shown:

• Hadoop batch analysis

• HBase real-time queries

• NoSQL – Cassandra key-value store

• MPP DBMS – Analysis of structured data

Page 37: Vmware Serengeti - Based on Infochimps Ironfan

Integrated Hadoop and Webapps – (Hadoop + Other Workloads)

[Diagram: virtualization platform spanning six hosts, with a compute layer and a data layer (HDFS)]

Workloads shown:

• Short-lived Hadoop compute cluster

• Hadoop compute cluster

• Web servers for ecommerce site

Page 38: Vmware Serengeti - Based on Infochimps Ironfan

Agenda

VMware Data Portfolio

Big Data and Virtualization Trends

Enterprise Hadoop Needs

Virtualized Hadoop for the Enterprise

Summary

Page 39: Vmware Serengeti - Based on Infochimps Ironfan

Simple, Reliable, Elastic Hadoop on Demand

Elasticity & Multi-tenancy

• Shrink and expand cluster on demand

• Independent scaling of compute and data

• Strong multi-tenancy

High Availability

• High availability for entire Hadoop stack

• One click to set up

• Battle-tested

Operational Simplicity

• Rapid deployment

• One stop command center

• Easy to configure/reconfigure

Hadoop-as-a-Service (Enterprise Grade EMR)

Page 40: Vmware Serengeti - Based on Infochimps Ironfan

Virtualization Benefits across Hadoop Maturity Spectrum

Stage 1: Piloting

• Rapid deployment

• Time to insight

Stage 2: Hadoop Production

• High availability

• Ease of operation

• Differentiated level of services

Stage 3: Big Data Production

• Multi-tenancy

• Elasticity

• Beyond core Hadoop

[Chart axes: Scale (labels: 20, 3000 node) and Integrated]

Page 42: Vmware Serengeti - Based on Infochimps Ironfan

Thank You!