VMware Serengeti - based on Infochimps Ironfan
DESCRIPTION
VMware's virtualized Hadoop, based on the Infochimps open source project, Ironfan.
TRANSCRIPT
© 2012 VMware Inc. All rights reserved
Confidential
Hadoop-as-a-Service
CXO Big Data Seminar
September 26, 2012
2 Confidential
Agenda
VMware Data Portfolio
Big Data and Virtualization Trends
Enterprise Hadoop Needs
Virtualized Hadoop for the Enterprise
Summary
3 Confidential
Trends Driving Change in Enterprise IT
Cloud
• Offered “as-a-Service”
• Virtualization
New Application Types
• Mobile, SaaS, social
• Apps released early and often
Frameworks
• New application frameworks driving an increase in application development
Data Disruption
• Web orientation drives exponential data volumes
• Reduced latency and new types of data
4 Confidential
The Database is Being Stretched
Big Data
• Petabytes vs. Gigabytes
• Democratize BI
Cloud Delivery
• Virtualized
• Offered “as-a-Service”
Flexible Data
• Multi-structured data
• Developer productivity
Fast Data
• Global access patterns
• Mobile app proliferation
5 Confidential
Big, Fast and Flexible Data
[Diagram: VMware data products arranged by Big, Fast, and Flexible data]
Categories: Big, Fast, Flexible
Workloads: Big Data Processing, Big Data Analytics, OLTP workloads, Analytic workloads
Data models: OSS Relational, Document, Object, Key / Value
Products: Serengeti, vPostgres, GemFire
Cloud Delivery Model: Data as a service for private and public clouds
7 Confidential
Data is exploding & Hadoop is driving growth
Unstructured data driving growth
[Chart: structured vs. unstructured data, 2011 to 2020]
Complex unstructured data forecasted to outpace structured relational data by 10x by 2020
Hadoop adoption is ramping
• Evaluating: 53%
• In production: 23%
• Piloting: 18%
• Testing: 2%
• Don't know: 2%
• Other: 2%
Source: Forrester Survey of 60 CIOs, September 2011
• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy
• Gartner predicts +800% data growth over next 5 years
• Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs
8 Confidential
Broad Application of Hadoop technology
Horizontal Use Cases
• Log Processing / Click Stream Analytics
• Machine Learning / sophisticated data mining
• Web crawling / text processing
• Extract Transform Load (ETL) replacement
• Image / XML message processing
• General archiving / compliance
Vertical Use Cases
• Financial Services
• Mobile / Telecom
• Internet Retailer
• Scientific Research
• Pharmaceutical / Drug Discovery
• Social Media
Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
9 Confidential
The Future of Virtualization
VDC
Software-defined Datacenter Services
Time to Provision New Services
• 2008: Weeks
• 2012: Days/Hours
• Future: Minutes/Seconds
Workloads Virtualized
• 2008: 25%
• 2012: 60%+
• Future: >90%
10 Confidential
Virtualization enables a Common Infrastructure for Big Data
Single-purpose clusters for various business applications lead to cluster sprawl.
Simplify
• Single Hardware Infrastructure
• Unified operations
Optimize
• Shared Resources = higher utilization
• Elastic resources = faster on-demand access
[Diagram: Cluster Sprawling (separate MPP DB, Hadoop, and HBase clusters) vs. Cluster Consolidation (MPP DB, Hadoop, and HBase on one Virtualization Platform)]
12 Confidential
Hadoop Users
Data scientists, analysts, developers
• Line of business users
• Intimate with data and analysis, not IT
• Tasked with providing actionable intelligence that impacts the business
Concerns
• Obtain a Hadoop cluster on demand
• Minimize time to insight
• Require reasonable performance from Hadoop cluster
13 Confidential
The IT Guy
Admins, architects, CIO
• Responsible for technology infrastructure, compliance, budget management
• Evaluates new technologies and recommends best practices
Concerns
• Keeping up with demands of the business
• Cost savings and consolidation
• Reliability
• Complexity of running and tuning Hadoop clusters
• Shortage of skills to do the above
14 Confidential
Hadoop Journey in Enterprises
Stage 1: Piloting
• Often start with line of business
• Try 1 or 2 use cases to explore the value of Hadoop
Stage 2: Hadoop Production
• Serve a few departments
• A few more use cases
• Core Hadoop + components
Stage 3: Big Data Production
• Serve many departments
• Often part of mission critical workflow
• Integrated with other big data services
[Chart axes: Scale (20 to 3000 nodes) and Integrated]
16 Confidential
Why Virtualize Hadoop?
Elasticity & Multi-tenancy
• Shrink and expand cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for entire Hadoop stack
• One click to set up
• Battle-tested
Operational Simplicity
• Rapid deployment
• One stop command center
• Easy to configure/reconfigure
17 Confidential
Project Serengeti
Open source project launched in June 2012
Toolkit that leverages virtualization to simplify Hadoop deployment and operations
To learn more, visit projectserengeti.org
• Deploy a Hadoop cluster in 10 minutes
• Customize Hadoop cluster
• Use your favorite Hadoop distribution
• One stop command center
18 Confidential
Rapid Deployment of a Hadoop Cluster with Serengeti
Step 1: Deploy the Serengeti virtual appliance on vSphere.
Step 2: Run a few simple commands to stand up a Hadoop cluster.
Done
19 Confidential
A Walk Through Serengeti
Scaling out a cluster
Advanced cluster creation
22 Confidential
Customizing Your Hadoop Cluster
Choice of distros
Storage configuration
• Choice of shared storage or local disk
Resource configuration
High availability option
# of nodes
Also used to tune Hadoop config
…
"distro": "apache",
"groups": [
  {
    "name": "master",
    "roles": ["hadoop_namenode", "hadoop_jobtracker"],
    "storage": {"type": "SHARED", "sizeGB": 20},
    "instanceType": "MEDIUM",
    "instanceNum": 1,
    "haFlag": "on"
  },
  {
    "name": "worker",
    "roles": ["hadoop_datanode", "hadoop_tasktracker"],
    "instanceType": "SMALL",
    "instanceNum": 5,
    "haFlag": "off"
…
23 Confidential
Freedom of Choice and Open Source
Community Projects
Distributions
• Flexibility to choose from major distributions
• Support for multiple projects (work in progress)
• Open architecture to welcome industry participation
• Contributing Hadoop Virtualization Extensions (HVE) to open source community
24 Confidential
Use Local Disk where it’s Needed
SAN Storage
$2 - $10/Gigabyte
$1M gets:
• 0.5 Petabytes
• 200,000 IOPS
• 8 Gbytes/sec
NAS Filers
$1 - $5/Gigabyte
$1M gets:
• 1 Petabyte
• 200,000 IOPS
• 10 Gbytes/sec
Local Storage
$0.05/Gigabyte
$1M gets:
• 10 Petabytes
• 400,000 IOPS
• 250 Gbytes/sec
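The capacity figures on this slide can be cross-checked with simple price arithmetic. The sketch below is illustrative only: it assumes a flat price per gigabyte and decimal units (1 PB = 1,000,000 GB), and the `capacity_pb` helper is a name introduced here, not something from the deck.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def capacity_pb(budget_usd, usd_per_gb):
    """Raw capacity in petabytes that a budget buys at a flat price per GB."""
    return budget_usd / usd_per_gb / GB_PER_PB


# Price ranges per gigabyte as listed on the slide: (low, high).
PRICES = {
    "SAN Storage": (2.00, 10.00),
    "NAS Filers": (1.00, 5.00),
    "Local Storage": (0.05, 0.05),
}

if __name__ == "__main__":
    budget = 1_000_000
    for tier, (low, high) in PRICES.items():
        best = capacity_pb(budget, low)    # cheapest price gives the most capacity
        worst = capacity_pb(budget, high)  # highest price gives the least
        print(f"{tier}: {worst:.1f} to {best:.1f} PB per $1M")
```

Under these assumptions the SAN and NAS figures on the slide (0.5 PB and 1 PB) match the low end of each price range. Flat arithmetic at $0.05/GB gives about 20 PB for local storage, twice the slide's 10 PB; the slide does not say what else the $1M covers.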
25 Confidential
Virtual Storage Architecture Includes Local Disk
Shared Storage: SAN or NAS
• Easy to provision
• Automated cluster rebalancing
• Leverage high availability protection
Local Storage: Local Disks
• Local disk for Hadoop
• Scalable bandwidth, lower cost/GB
[Diagram: six hosts, each running Hadoop VMs alongside other VMs, backed by shared storage and local storage]
26 Confidential
Hadoop Runs Well on Virtualization
[Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate, comparing Native, 1 VM, 2 VMs, and 4 VMs; axis range 0 to 450]
Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
28 Confidential
High Availability for the Hadoop Stack
[Diagram: the Hadoop stack]
• BI Reporting, ETL Tools
• Pig (Data Flow), Hive (SQL), HCatalog
• MapReduce (Job Scheduling/Execution System)
• HBase (Key-Value store)
• HDFS (Hadoop Distributed File System)
• Management Server
• ZooKeeper (Coordination)
• RDBMS
• Namenode, Jobtracker, Hive MetaDB, HCatalog MDB, Server
HA for Hadoop stack is more than Name node HA
29 Confidential
vMotion Reduces Planned Downtime
Description:
Enables the live migration of virtual machines from one host to another with continuous service availability.
Benefits:
• Revolutionary technology that is the basis for automated virtual machine movement
• Meets service level and performance goals
30 Confidential
Hadoop Aware HA - Protection Against Unplanned Downtime
Overview
• Protection against host and VM failures
• Added application-aware HA for Hadoop NameNode (NN) and JobTracker (JT), protecting against NN and JT failures
• Automatic failure detection and virtual machine restart within minutes, on any available host in the cluster
• In-progress Hadoop jobs pause, then resume once the NameNode is back up
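The restart behavior described on this slide (a failed host's VMs coming back on any available host) can be pictured with a toy model. The sketch below is an assumption-laden illustration, not vSphere HA's actual algorithm: the `Host` class, the host names, and the least-loaded placement rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    healthy: bool = True
    vms: list = field(default_factory=list)


def restart_failed_vms(hosts):
    """Move VMs off failed hosts onto healthy ones; return the moves made."""
    healthy = [h for h in hosts if h.healthy]
    if not healthy:
        raise RuntimeError("no healthy host available for restart")
    moves = []
    for host in hosts:
        if host.healthy:
            continue
        for vm in list(host.vms):
            # Placement policy assumed for this sketch: least-loaded healthy host.
            target = min(healthy, key=lambda h: len(h.vms))
            host.vms.remove(vm)
            target.vms.append(vm)
            moves.append((vm, host.name, target.name))
    return moves


if __name__ == "__main__":
    cluster = [
        Host("esx-1", vms=["namenode"]),
        Host("esx-2", vms=["jobtracker"]),
        Host("esx-3"),
    ]
    cluster[0].healthy = False  # simulate a host failure
    print(restart_failed_vms(cluster))
```

The model leaves out failure detection itself and the pause/resume of running jobs; it only shows the placement step after a failure has been detected.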
31 Confidential
vSphere Fault Tolerance Provides Continuous Protection
Overview
[Diagram: VMs (App/OS) on two VMware ESX hosts, labeled FT and HA, with failed VMs marked X]
• Identical VMs running in lockstep on separate hosts
• Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
• Integrated with VMware HA/DRS
• No complex clustering or specialized hardware required
• Single common mechanism for all applications and operating systems
Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters
32 Confidential
Achieve HA for the Entire Hadoop Stack
[Diagram: the Hadoop stack (HDFS, HBase, MapReduce, Pig, Hive, HCatalog, BI Reporting, ETL Tools, Management Server, ZooKeeper, RDBMS) with Namenode, Jobtracker, Hive MetaDB, HCatalog MDB, and Server]
• Battle-tested high availability technology
• Single mechanism to achieve HA for the entire Hadoop stack
• One click to enable HA and/or FT
34 Confidential
Evolution of Hadoop on VMs
Current Hadoop: Combined Storage/Compute
Hadoop in VM
- VM lifecycle determined by Datanode
- Limited elasticity
- Limited to Hadoop Multi-Tenancy
Separate Storage
- Separate compute from data
- Elastic compute
- Enable shared workloads
- Raise utilization
Separate Compute Clusters
- Separate virtual clusters per tenant
- Stronger VM-grade security and resource isolation
- Enable deployment of multiple Hadoop runtime versions
[Diagram: slave node VMs with storage and compute, first combined and then separated, serving tenants T1 and T2]
35 Confidential
In-house Hadoop as a Service “Enterprise EMR” – (Hadoop + Hadoop)
• Ad hoc data mining
• Production recommendation engine
• Production ETL of log files
[Diagram: the workloads above in a compute layer over an HDFS data layer, on a virtualization platform spanning six hosts]
36 Confidential
Integrated Big Data Production – (Hadoop + other big data)
• Hadoop batch analysis
• HBase real-time queries
• NoSQL – Cassandra key-value store
• MPP DBMS – Analysis of structured data
[Diagram: the workloads above in a compute layer over an HDFS data layer, on a virtualization platform spanning six hosts]
37 Confidential
Integrated Hadoop and Webapps – (Hadoop + Other Workloads)
• Short-lived Hadoop compute cluster
• Web servers for ecommerce site
• Hadoop compute cluster
[Diagram: the workloads above in a compute layer over an HDFS data layer, on a virtualization platform spanning six hosts]
39 Confidential
Simple, Reliable, Elastic Hadoop on Demand
Elasticity & Multi-tenancy
• Shrink and expand cluster on demand
• Independent scaling of compute and data
• Strong multi-tenancy
High Availability
• High availability for entire Hadoop stack
• One click to set up
• Battle-tested
Operational Simplicity
• Rapid deployment
• One stop command center
• Easy to configure/reconfigure
Hadoop-as-a-Service (Enterprise Grade EMR)
40 Confidential
Virtualization Benefits across Hadoop Maturity Spectrum
Stage 1: Piloting
• Rapid deployment
• Time to insight
Stage 2: Hadoop Production
• High availability
• Ease of operation
• Differentiated level of services
Stage 3: Big Data Production
• Multi-tenancy
• Elasticity
• Beyond core Hadoop
[Chart axes: Scale (20 to 3000 nodes) and Integrated]
41 Confidential
Serengeti Resources
Download and try Serengeti
• projectserengeti.org
VMware Hadoop site
• vmware.com/hadoop
Hadoop performance on vSphere
• vmware.com/files/pdf/VMW-Hadoop-Performance-vSphere5.pdf
Hadoop High Availability solution
• vmware.com/files/pdf/Apache-Hadoop-VMware-HA-solution.pdf
42 Confidential
Thank You!