Big Data/Hadoop Infrastructure Considerations
DESCRIPTION
Planning big-data infrastructure and its implications.
TRANSCRIPT
© 2009 VMware Inc. All rights reserved
Big Data in a Software-Defined Datacenter
Richard McDougall
Chief Architect, Storage and Application Services
VMware, Inc
@richardmcdougll
Trend: New Data Growing at 60% Y/Y
Source: The Information Explosion, 2009
Data sources: medical imaging, sensors; CAD/CAM, appliances, machine data, digital movies; digital photos; digital TV; audio; camera phones, RFID; satellite images, logs, scanners, Twitter.
Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030. Yes, you are part of the yotta generation…
Trend: Big Data – Driven by Real-World Benefit
Big Data Family of Frameworks
Compute layer (running on Distributed Resource Management across hosts):
• Hadoop batch analysis
• HBase real-time queries
• NoSQL: Cassandra, Mongo, etc.
• Big SQL: Impala, Pivotal HawQ
• Other: Spark, Shark, Solr, Platfora, etc.
File System/Data Store, spanning the hosts.
Big (Data) problems: making management easier
Broad Application of Hadoop Technology
Horizontal use cases: log processing / click-stream analytics; machine learning / sophisticated data mining; web crawling / text processing; extract-transform-load (ETL) replacement; image / XML message processing; general archiving / compliance.
Vertical use cases: financial services; mobile / telecom; internet retail; scientific research; pharmaceutical / drug discovery; social media.
Hadoop’s ability to handle large volumes of unstructured data affordably and efficiently makes it a valuable toolkit for enterprises across a range of applications and fields.
How does Hadoop enable parallel processing?
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
! A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
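That "simple programming model" can be illustrated with a minimal in-process word count. This is a sketch of the MapReduce idea only, not Hadoop's actual API: the map phase emits key/value pairs, a shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input split."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

In Hadoop the same three stages run in parallel across the cluster, with map tasks reading local HDFS blocks and the shuffle moving intermediate data over the network.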
Hadoop System Architecture
! MapReduce: Programming framework for highly parallel data processing
! Hadoop Distributed File System (HDFS): Distributed data storage
! Other distributed storage options: • Alternatives to HDFS • e.g., MapR (software), Isilon (hardware)
Job Tracker Schedules Tasks Where the Data Resides
[Diagram: the Job Tracker receives a job whose input file is divided into three 64 MB splits. Each corresponding 64 MB block is stored by a DataNode on one of three hosts (Block 1 on Host 1, Block 2 on Host 2, Block 3 on Host 3), and the Job Tracker assigns Task 1, Task 2, and Task 3 to the Task Tracker on the host that already holds the matching block.]
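The scheduling step in the diagram can be sketched as a greedy placement loop. This is a simplified illustration of data-local scheduling, not Hadoop's actual scheduler code; the host and split names are hypothetical.

```python
def schedule_tasks(splits, block_locations, free_slots):
    """Greedy locality-aware placement: prefer a host that already stores
    the split's block (data-local task); otherwise fall back to the host
    with the most free task slots (remote read)."""
    assignments = {}
    for split_id in splits:
        local = [h for h in block_locations[split_id] if free_slots.get(h, 0) > 0]
        host = local[0] if local else max(free_slots, key=free_slots.get)
        assignments[split_id] = host
        free_slots[host] -= 1
    return assignments

# Block placement as in the diagram: one 64 MB split per host
locations = {"split1": ["host1"], "split2": ["host2"], "split3": ["host3"]}
slots = {"host1": 2, "host2": 2, "host3": 2}
placement = schedule_tasks(["split1", "split2", "split3"], locations, slots)
print(placement)  # every task lands on the host holding its block
```

When every block has a free slot on its home host, all tasks run data-local and no block crosses the network.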
Hadoop Data Locality and Replication
Hadoop Topology Awareness
Rules of Thumb: Sizing for Hadoop
! Disk • Provide about 50 MB/s of disk bandwidth per core • If using SATA, that’s about one disk per core
! Network • Provide about 200 Mbit/s of aggregate network bandwidth per core
! Memory • Use a memory:core ratio of about 4 GB per core
! Example: 100-node cluster • 100 x 16 cores = 1,600 cores
• 1,600 x 50 MB/s = 80 GB/s(!) of disk bandwidth • 1,600 x 200 Mbit/s = 320 Gbit/s of network traffic
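These rules of thumb are easy to fold into a small calculator. A minimal sketch (the function name and return keys are illustrative, not from any tool):

```python
def size_cluster(nodes, cores_per_node,
                 disk_mb_per_core=50, net_mbit_per_core=200, mem_gb_per_core=4):
    """Apply the rules of thumb above: ~50 MB/s disk, ~200 Mbit/s network,
    and ~4 GB of memory per core."""
    cores = nodes * cores_per_node
    return {
        "cores": cores,
        "disk_gbytes_per_sec": cores * disk_mb_per_core / 1000,
        "network_gbits_per_sec": cores * net_mbit_per_core / 1000,
        "memory_tbytes": cores * mem_gb_per_core / 1000,
    }

sizing = size_cluster(nodes=100, cores_per_node=16)
print(sizing)  # 1600 cores, 80 GB/s disk, 320 Gbit/s network, 6.4 TB memory
```

The 100-node example above falls out directly: 1,600 cores, 80 GB/s of disk bandwidth, 320 Gbit/s of network bandwidth, and 6.4 TB of memory.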
Big Data Frameworks and Characteristics

| Framework | Scale of data | Scale of cluster | Computable data? | Local disks? |
|---|---|---|---|---|
| File system: Gluster, Isilon, HDFS, etc. | 10s of PB | 10s to 100s | Some | Yes, for cost |
| MapReduce: Hadoop | 100s of PB | 10s to 1,000s | Yes | Yes, for cost, bandwidth and availability |
| Big SQL: HawQ, Aster Data, Impala, … | PBs | 10s to 100s | Some | Yes, for cost and bandwidth |
| NoSQL: Cassandra, HBase, … | Trillions of rows | 10s to 100s | Some | Yes, for cost and availability |
| In-memory: Redis, GemFire, Membase, … | Billions of rows | 10s to 100s | Yes | Primarily memory |
Big Data Virtual Infrastructure
Virtualization enables a Common Infrastructure for Big Data
Single-purpose clusters for various business applications (MPP DB, Hadoop, HBase) lead to cluster sprawl; consolidating scale-out SQL, Hadoop, and HBase onto one virtualization platform avoids it.
! Simplify • Single hardware infrastructure • Unified operations
! Optimize • Shared resources = higher utilization • Elastic resources = faster on-demand access
Native versus Virtual Platforms, 24 hosts, 12 disks/host
[Bar chart: elapsed time in seconds (lower is better), 0–450 s, for TeraGen, TeraSort, and TeraValidate, comparing native execution against 1, 2, and 4 VMs per host.]
Why Virtualize Hadoop?
! Mixed workloads • Shrink and expand the cluster on demand • Resource guarantees • Independent scaling of compute and data
! Highly available • No more single point of failure • One click to set up • High availability for MapReduce jobs
! Simple to operate • Rapid deployment • Unified operations across the enterprise • Easy cloning of clusters
In-house Hadoop as a Service (“Enterprise EMR”) – Hadoop + Hadoop
Compute layer: separate virtual clusters for ad hoc data mining, a production recommendation engine, and production ETL of log files.
Data layer: HDFS on the virtualization platform, spanning the hosts.
Integrated Hadoop and Webapps – Hadoop + Other Workloads
Compute layer: a short-lived Hadoop compute cluster runs alongside the web servers for an e-commerce site.
Data layer: HDFS on the virtualization platform, spanning the hosts.
Integrated Big Data Production – Hadoop + Other Big Data
Compute layer: Hadoop batch analysis, HBase real-time queries, NoSQL (Cassandra, Mongo, etc.), Big SQL (Impala), and others (Spark, Shark, Solr, Platfora, etc.).
Data layer: HDFS on the virtualization platform, spanning the hosts.
Serengeti: Deploy a Hadoop Cluster in Under 30 Minutes
Step 1: Deploy the Serengeti virtual appliance (vHelper OVF) on vSphere.
Step 2: A few simple commands stand up the Hadoop cluster: select a configuration template, select compute, memory, storage, and network, and automate the deployment. Done.
vMotion, HA, FT enable High Availability for the Hadoop Stack
[Stack diagram: BI reporting and ETL tools over Pig (data flow) and Hive (SQL); MapReduce (job scheduling/execution system); HBase (key-value store); HDFS (Hadoop Distributed File System); alongside a management server, ZooKeeper (coordination), HCatalog, and an RDBMS. The single-instance components to protect include the NameNode, JobTracker, Hive MetaDB, HCatalog MDB, and management server.]
Containers with Isolation are a Tried and Tested Approach
[Diagram: vSphere with DRS runs a hungry workload, a reckless workload, and a nosy workload side by side, isolated from one another, across a pool of hosts.]
Mixing Workloads: Three Big Types of Isolation are Required
! Resource isolation • Control the greedy, noisy neighbor • Reserve resources to meet needs
! Version isolation • Allow concurrent OS, application, and distro versions
! Security isolation • Provide privacy between users/groups • Runtime and data privacy required
Elasticity Enables Sharing of Resources
“Time Share”: while existing applications run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of the data.
[Diagram: VMware vSphere hosts running HDFS, with other-workload VMs sharing each host with Hadoop VMs under vHelper control.]
Big Data Impact to Storage
Traditional Storage – VMDKs on SAN/NAS
[Diagram: an app VM and a DB VM with boot and data disks as block VMDKs on VMFS (SAN or NFS), backed by EMC or NetApp arrays, with VM-level archiving to an archive tier.]
• VMs have just a few self-contained VMDKs
• Data is not shared between VMs
• VMs have limited individual bandwidth needs
Rapid Growth of Big Data Capacity Necessitates New storage
Source: The Information Explosion, 2009. (The growth data from the opening slide: driven by medical imaging, sensors, machine data, digital media, RFID, satellite images, logs, and Twitter; 20 zettabytes stored by 2015, 1 yottabyte by 2030.)
Use Local Disk where it’s Needed

| | SAN storage | NAS filers | Local storage |
|---|---|---|---|
| Cost | $2–$10/GB | $1–$5/GB | $0.05/GB |
| $1M buys | 0.5 PB | 1 PB | 10 PB |
| Disk IOPS | 200,000 | 200,000 | 400,000 |
| Bandwidth | 8 GB/s | 10 GB/s | 250 GB/s |
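The capacity column follows from the $/GB figures. A small sketch of the arithmetic (the `overhead` factor is my assumption: raw math at $0.05/GB gives 20 PB per $1M, so the slide's 10 PB figure is consistent with roughly 2x lost to data protection such as mirroring or replication):

```python
def usable_pb_per_million(dollars_per_gb, overhead=1.0):
    """Petabytes a $1M budget buys at a given $/GB price, optionally
    divided by a protection-overhead factor (assumed, e.g. mirroring)."""
    raw_gb = 1_000_000 / dollars_per_gb
    return raw_gb / overhead / 1_000_000

print(usable_pb_per_million(2.0))                 # SAN at $2/GB   -> 0.5 PB
print(usable_pb_per_million(1.0))                 # NAS at $1/GB   -> 1.0 PB
print(usable_pb_per_million(0.05, overhead=2.0))  # local $0.05/GB -> 10.0 PB
```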
Storage Economics
[Chart: cost per GB ($0–$5.50) versus petabytes deployed (0.5 to 128 PB). Traditional SAN/NAS and VMware vSAN sit at the high-cost, low-scale end; distributed object storage (HDFS, MapR, Ceph) and scale-out NAS (Isilon, NetApp) extend to far larger deployments at lower cost per GB.]
Scalable Storage Architecture: Hadoop Demands
! Each node needs approximately one disk of bandwidth per core • A 16-core node needs ~1 GB/s of bandwidth
! A 1,000-node Hadoop cluster needs 1 TB/s of aggregate bandwidth • Local disks: about $800 of local disk per node
• Traditional SAN: would require an estimated $50K of SAN storage per node
[Diagram: data nodes, each feeding several compute tasks.]
Storage Servers: Configurations and Capabilities

| | Typical storage server | High-capacity server |
|---|---|---|
| CPU | 16–24 cores | 16–24 cores |
| Disks | 12–24 SATA 2–4 TB | 80 SATA 2–4 TB |
| Network | 10 GbE adapter | 10 GbE adapter |
| Throughput | 2.5 GB/s | 2.5 GB/s |
| Capacity | 96 TB | 320 TB |
Scalable Storage Bandwidth is Important for Big Data
[Chart: aggregate bandwidth in GB/s versus number of hosts (1–50). Scalable storage grows linearly with host count to over 100 GB/s, while a single SAN/NAS stays flat.]
Hadoop has Significant Ephemeral Data
[Diagram: a job’s map tasks read DFS input data from HDFS (~12% of disk bandwidth) and write spills, logs, and map output files (spill*.out, file.out); the shuffle, sort, and combine stages produce intermediate files (map_*.out, intermediate.out) that account for ~75% of disk bandwidth; reduce tasks write DFS output data back to HDFS (~12% of bandwidth).]
Impact of Temporary Data
! Leverage local disks when shared storage is used • As much as 75% of the bandwidth is transient • No need for reliable, replicated storage for it
• Just use unprotected local disk
Shared data goes on shared, network storage (HDFS); temporary data goes to local disks.
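The payoff of this split is easy to quantify. A minimal sketch, assuming the ~75% transient figure from the previous slide and using the 80 GB/s cluster from the sizing example (function and key names are illustrative):

```python
def bandwidth_split(total_gbytes_per_sec, transient_fraction=0.75):
    """Divide aggregate disk bandwidth between unprotected local disks
    (transient spill/shuffle data) and the shared, replicated tier
    (HDFS input/output), per the ~75% transient profile."""
    transient = total_gbytes_per_sec * transient_fraction
    return {"local_temp": transient,
            "shared_hdfs": total_gbytes_per_sec - transient}

print(bandwidth_split(80))  # 80 GB/s cluster: 60 GB/s local, 20 GB/s shared
```

Only the ~25% of bandwidth touching durable data needs to be served (and paid for) by the shared, protected storage tier.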
Extend the Virtual Storage Architecture to Include Local Disk
! Shared storage: SAN or NAS • Easy to provision • Automated cluster rebalancing
! Hybrid storage • SAN for boot images, VMs, and other workloads
• Local disk for Hadoop and HDFS • Scalable bandwidth, lower cost/GB
[Diagram: each host runs a mix of Hadoop VMs and other VMs over both storage tiers.]
Scale-out Storage for Big Data: Local Disk or Scale-out NAS
[Diagram: left, racks of servers with local disks behind a top-of-rack switch; right, big data using scale-out NAS, where hosts keep temp data on local disks and shared data on the scale-out NAS behind each top-of-rack switch.]
Big Data using Local Disks
[Diagram: servers with local disks behind a top-of-rack switch.]
16–24 core servers with 12–24 SATA 2–4 TB disks and a 10 GbE adapter; iSCSI/NFS provides shared storage for vMotion, etc. A high-performance 10 GbE switch per rack.
Big Data with Scale-out NAS
[Diagram: hosts behind top-of-rack switches, with an Isilon scale-out NAS holding the shared data; local disk or SSD in each host holds the transient data.]
Big Data Impact to Networking
Hadoop has Specific Network Demands that must be Considered
A Typical Network Architecture
[Diagram: hosts connect to top-of-rack switches; top-of-rack switches feed L2 switches, which feed a pair of aggregating switches.]
A Typical Network Architecture
! Hadoop requirements • Hadoop can generate up to 200 Mbit/s per core of network traffic • Bursts of bandwidth during remote data reads and shuffle phases
• Data locality can help minimize network reads, but shuffle and reduce are global
! Inside-rack network • Today’s core counts require 10 GbE to hosts inside the rack
• High-throughput switches are required to meet aggregate bandwidth
• Switch network buffer sizing is important due to coordinated bursts
! Rack-to-rack network • Two-level switch aggregation can be a bottleneck
• Easy to saturate, especially during shuffle phases
! Link-speed sizing • Hadoop was originally designed around 4 cores and a 1 Gbit uplink
• Now, 16–24 cores per box mean 10 GbE is required for the uplink
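The link-speed and saturation points above follow from simple arithmetic. A sketch, assuming the ~200 Mbit/s-per-core rule; the 20-host rack with a 40 Gbit/s inter-rack uplink in the last line is an illustrative configuration, not from the slides:

```python
def uplink_gbit_needed(cores_per_host, mbit_per_core=200):
    """Per-host uplink demand from the ~200 Mbit/s-per-core rule."""
    return cores_per_host * mbit_per_core / 1000

def rack_oversubscription(hosts_per_rack, host_uplink_gbit, rack_uplink_gbit):
    """Ratio of in-rack demand to rack-to-rack capacity; a ratio above 1
    means the inter-rack links can saturate during shuffle."""
    return hosts_per_rack * host_uplink_gbit / rack_uplink_gbit

print(uplink_gbit_needed(4))   # 0.8 -> the original 1 Gbit design point
print(uplink_gbit_needed(20))  # 4.0 -> needs a 10 GbE uplink
print(rack_oversubscription(20, 10, 40))  # 20 hosts x 10 GbE into 4x10 GbE -> 5.0x
```

A 5x oversubscribed rack uplink is exactly the bottleneck the shuffle phase exposes, since shuffle traffic is all-to-all across racks.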
Two-Level Aggregated Network Characteristics
[Same two-level topology as above.] Top of rack: overprovisioning is possible with a 10 GbE high-end switch (>500 Gbit/s). Between racks: bandwidth can be saturated.
Future Network Designs, Optimized for Big Data
Traditional tree-structured networks • Cons: high oversubscription of upper-level (aggregation) and core switches; hard to get full-bisection bandwidth
Clos network between aggregation and intermediate (core) switches • LA – (physical) locator address; AA – (flat, virtual) application address • Pros: full-bisection bandwidth, non-blocking switching fabric, path diversity. Formalized in 1953.
Images courtesy Greenberg et al.: http://research.microsoft.com/pubs/80693/vl2-sigcomm09-final.pdf and http://www.stanford.edu/class/ee384y/Handouts/clos_networks.pdf
In Conclusion
! The Software-Defined Datacenter enables a unified big data platform
! Plan to build a software-defined “big data” datacenter • Enable Hadoop and other big-data ecosystem components
! Values • Consolidated hardware, reduced SKUs
• Higher utilization: leverage a shared cluster for bursty workloads • Rapid deployment: use APIs to deploy software-defined clusters rapidly
• Flexible cluster use: share the same infrastructure across workload types
! References • www.projectserengeti.org • vmware.com/hadoop • VMware Hadoop Performance Whitepaper • VMware Hadoop HA/FT Paper