Post on 13-Dec-2014
Optimizing PowerEdge Configurations for Hadoop
Michael Pittaro Principal Architect, Big Data Solutions Dell
Big Data is when the data itself is part of the problem.
Volume
• A large amount of data, growing at large rates
Velocity
• The speed at which the data must be processed
Variety
• The range of data types and data structure
What is Big Data?
Dell | Cloudera Apache Hadoop Solution
Retail Telco Media Web Finance
• A Proven Big Data Platform
  – Cloudera CDH4 Hadoop Distribution with Cloudera Manager
  – Validated and Supported Reference Architecture
  – Production deployments across all verticals
• Dell Crowbar provides deployment and management at scale
  – Integrated with Cloudera Manager
  – Bare metal to deployed cluster in hours
  – Lifecycle management for ongoing operations
• Dell Partner Ecosystem
  – Pentaho for Data Integration
  – Pentaho for Reporting and Visualization
  – Datameer for spreadsheet-style analytics and visualization
  – Clarity and Dell Implementation Services
Dell | Cloudera Apache Hadoop Solution
• Customers want results
  – Performance
  – Predictability
  – Reliability
  – Availability
  – Management
  – Monitoring
• Customers want value
• Big Data has many options
  – Servers
  – Networking
  – Software
  – Tools
  – Application Code
  – Fast Evolution
• Wide range of applications
The Problem with Big Data Projects
• Tested Server Configurations
• Tested Network Configurations
• Base Software Configuration
  – Big Data Software
  – OS Infrastructure
  – Operational Infrastructure
• Predefined configuration
  – Recommended starting point
• Patterns, Use Cases, and Best Practices are emerging in Big Data
• Reference Architectures help package this knowledge for reuse
A Reference Architecture Fills The Gap
• PowerEdge R720, R720XD
  – Balanced Compute and Storage
• PowerEdge C6105
  – Scale-Out Computing
  – Large Disk Capacity
• PowerEdge C8000
  – Scale-Out Computing
  – Flexible Configuration
Reference Architecture: Servers
• Top of Rack: Force10 S60 (1GbE) or Force10 S4810 (10GbE)
• Cluster Aggregation: Force10 S4810 pair
• Bonded connections
• Redundant networking
Reference Architecture: Networking
• Hadoop
  – Cloudera CDH 4
  – Cloudera Manager
  – Hadoop Tools
• Infrastructure Management
  – Nagios
  – Ganglia
• Configuration Management
  – Predefined parameters
  – Role-based configuration
Reference Architecture: Software
Hive
Pig
HBase
Sqoop
Oozie
Hue
Flume
Whirr
Zookeeper
Tying it all Together: Crowbar
Dell "Crowbar" Ops Management
[Architecture diagram: the Crowbar stack]
• Layers (top to bottom): APIs, User Access, & Ecosystem Partners; Big Data Infrastructure & Dell Extensions; Core Components & Operating Systems; Physical Resources
• Components shown: Crowbar, Deployer, Provisioner, Network, RAID, BIOS, IPMI, NTP, DNS, Logging, HDFS, HBase, Hive, Pig, Nagios, Ganglia, Pentaho, Cloudera, Force10
Revolutionary Cloud Solutions Confidential
Hadoop Node Architecture
[Diagram: Hadoop cluster node roles]
• Admin Node: Crowbar, Cloudera Manager, Nagios, Ganglia
• Edge Node: Hadoop clients
• Master Name Node and Secondary Name Node
• Data Nodes: each runs a Data Node and a Task Tracker
• Job Tracker
• High Availability option: Active Name Node, Standby Name Node, Journal Nodes, Job Tracker
Hadoop Cluster Scaling
Learning The Reference Architecture
• Read it!
  – Read it again
  – Keep it under your pillow
• Three documents
  – Reference Architecture
  – Deployment Guide
  – User's Guide
• Deploy it
  – Works on 4 or 5 nodes
• Available through the Dell Sales Team
Leveraging the Reference Architecture
• Start with the base configuration
  – It works, and eliminates mix-and-match problems
  – There are a lot of subtle details hidden behind the configurations
• Easy changes: processor, memory, disk
  – Will generally not break anything
  – Will affect performance, however
• Harder changes: Hadoop configuration
  – Mainly, you need to know what you're doing here
  – We have experience and recommendations
• Hardest changes: optimization for workloads
  – The default configuration is a general-purpose one
  – Specific workloads must be tested and benchmarked
• Assume 1.5 Hadoop tasks per physical core
  – Turn Hyper-Threading on
  – This allows headroom for other processes
• Configure Hadoop task slots
  – 2/3 map tasks
  – 1/3 reduce tasks
• Dual-socket 6-core Xeon example
  › mapred.tasktracker.map.tasks.maximum: 12
  › mapred.tasktracker.reduce.tasks.maximum: 6
• Faster is better
  – Hadoop compression uses processor cycles
  – Most Hadoop jobs are I/O bound, not processor bound
  – The map/reduce balance depends on the actual workload
  – It's hard to optimize further without knowing the actual workload
Selecting Processors
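The slide's rule of thumb can be written out as a small calculation. A minimal sketch (the function name and rounding choices are mine, not Dell's): roughly 1.5 tasks per physical core, split two-thirds map and one-third reduce.

```python
def hadoop_task_slots(sockets, cores_per_socket, tasks_per_core=1.5,
                      map_fraction=2 / 3):
    """Estimate MRv1 task-slot settings from the physical core count,
    following the deck's rule of thumb: ~1.5 tasks per physical core
    (Hyper-Threading on), split 2/3 map and 1/3 reduce."""
    physical_cores = sockets * cores_per_socket
    total_slots = int(physical_cores * tasks_per_core)
    map_slots = round(total_slots * map_fraction)
    reduce_slots = total_slots - map_slots
    return map_slots, reduce_slots

# Dual-socket 6-core Xeon from the slide:
map_slots, reduce_slots = hadoop_task_slots(sockets=2, cores_per_socket=6)
print(map_slots, reduce_slots)  # 12 6
```

These two numbers correspond to `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` in the example above.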
• Hadoop scales processing and storage together
  – The cluster grows by adding more data nodes
  – The ratio of processor to storage is the main adjustment
• Generally, aim for a 1 spindle / 1 core ratio
  – I/O is large blocks (64 MB to 256 MB)
  – Primarily sequential read/write, very little random I/O
  – 8 tasks will be reading or writing 8 individual spindles
• Drive sizes and types
  – NL-SAS or Enterprise SATA, 6 Gb/s
  – Drive size is mainly a price decision
• Depth per node
  – Up to 48 TB/node is common
  – 112 TB/node is possible
  – Consider how much data is 'active'
  – Very deep storage impacts recovery performance
Spindle / Core / Storage Depth Optimization
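Storage depth per node translates into usable HDFS capacity only after replication. A rough sketch, assuming 3x HDFS replication and a 25% reserve for OS and intermediate data (the reserve figure is my assumption, not from the deck):

```python
def node_raw_tb(spindles, drive_tb):
    """Raw storage per data node: spindle count x drive size."""
    return spindles * drive_tb

def usable_hdfs_tb(nodes, spindles, drive_tb, replication=3, reserve=0.25):
    """Rough usable HDFS capacity: raw capacity minus a reserve for
    OS / intermediate data, divided by the HDFS replication factor."""
    raw = nodes * node_raw_tb(spindles, drive_tb)
    return raw * (1 - reserve) / replication

# 12 x 4 TB drives gives the slide's "common" 48 TB/node figure
print(node_raw_tb(12, 4))                # 48
# A 20-node cluster of such nodes under 3x replication
print(round(usable_hdfs_tb(20, 12, 4)))  # 240
```

The gap between raw and usable capacity is one reason very deep nodes need care: losing one 48 TB node forces re-replication of a large amount of data.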
PowerEdge C8000 Hadoop Scaling - 16-core Xeon
[Chart: cores, TB of storage, and IOPS plotted against node count (1 to ~211 nodes), comparing a (1) 12-spindle 3 TB configuration against a (3) 6-spindle 3 TB configuration]
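The chart's point can be approximated numerically: spreading the same drive size over more, smaller nodes adds spindles, and therefore aggregate IOPS. A sketch assuming roughly 75 IOPS per 7.2K RPM NL-SAS spindle (that per-spindle figure is an assumption, not from the deck):

```python
def config_totals(nodes, spindles_per_node, drive_tb, iops_per_spindle=75):
    """Aggregate storage and rough IOPS for a cluster configuration.
    iops_per_spindle=75 approximates a 7.2K RPM NL-SAS drive (assumption)."""
    spindles = nodes * spindles_per_node
    return {"storage_tb": spindles * drive_tb,
            "iops": spindles * iops_per_spindle}

# One dense 12-spindle node versus three 6-spindle nodes, all 3 TB drives:
dense = config_totals(nodes=1, spindles_per_node=12, drive_tb=3)
spread = config_totals(nodes=3, spindles_per_node=6, drive_tb=3)
print(dense)   # {'storage_tb': 36, 'iops': 900}
print(spread)  # {'storage_tb': 54, 'iops': 1350}
```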
• Workload optimization requires profiling and benchmarking
• HBase versus pure Map/Reduce workloads are different
  – I/O patterns are different
  – HBase requires more memory
  – Cloudera RTQ (Impala) is I/O intensive
• Map/Reduce usage varies
  – from I/O intensive to CPU intensive
• Ingestion and transfer impact the edge (gateway) nodes
• Heterogeneous cluster versus dedicated clusters?
  – Cloudera has added support for heterogeneous clusters and nodes
  – A dedicated cluster makes sense if the workload is consistent
    › Primarily for 'data' businesses

Workload Optimization: Hadoop has widely varying workloads
Reference Architecture Options
• High Availability
  – Networking configuration
  – Master / Secondary Name Node configuration
• Alternative switches
  – It's possible
  – Contact us for advice
• Cluster size
  – The Reference Architecture scales easily to around 720 nodes
  – Beyond that, a network engineer needs to take a closer look
• Node size
  – Memory recommendations are a starting point
  – Disk / core balance is a never-ending debate
Model         | Data Node Configuration                     | Comments
R720xd        | Dual socket, 12 cores, 24 x 2.5" spindles   | Most popular platform for Hadoop
C8000         | Dual socket, 16 cores, 16 x 3.5" spindles   | Popular for deep/dense Hadoop applications
C6100 / C6105 | Dual socket, 8/12 cores, 12 x 3.5" spindles | Two-node version; C6100 is hardware EOL
C2100         | Dual socket, 12 cores, 12 x 3.5" spindles   | Popular; hardware EOL but often repurposed for Hadoop
R620          | Dual socket, 8 cores, 10 x 2.5" spindles    | 1U form factor
C6220         | Dual socket, 8 cores, 6 x 2.5" spindles     | Core/spindle ratio is not ideal for Hadoop
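The table's "not ideal" verdict on the C6220 follows from the 1 spindle / 1 core rule of thumb from the earlier slide. A quick sketch (it treats each model's listed core count as the per-node total, which is my reading of the table):

```python
# (total cores, spindles) per data node, from the table above
models = {
    "R720xd": (12, 24),
    "C8000":  (16, 16),
    "C6105":  (12, 12),
    "C2100":  (12, 12),
    "R620":   (8, 10),
    "C6220":  (8, 6),
}

# Target from the earlier slide: roughly 1 spindle per core.
for name, (cores, spindles) in models.items():
    print(f"{name}: {spindles / cores:.2f} spindles/core")
```

The C6220 comes out at 0.75 spindles per core, below the 1:1 target, so its tasks would contend for drives on I/O-bound jobs.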
In the Wild – Dell Customer Hadoop Configurations
SecureWorks : Based on R720xd Reference Architecture
SecureWorks operates 24 hours a day, 365 days a year, helping protect the security of its customers' assets in real time.
Challenge: collecting, processing, and analyzing massive amounts of data from customer environments
Results:
• Reduced cost of data storage to ~21 cents per gigabyte
• 80% savings over the previous proprietary solution
• 6 months faster deployment
• < 1 year payback on the entire investment
• Data doubles every 18 months, magnifying savings
Further Information
• Dell Hadoop Home Page
  – http://www.dell.com/hadoop
• Dell Cloudera Apache Hadoop install with Crowbar (video)
  – http://www.youtube.com/watch?v=ZWPJv_OsjEk
• Cloudera CDH4 Documentation
  – http://ccp.cloudera.com/display/CDH4DOC/CDH4+Documentation
• Crowbar homepage and documentation on GitHub
  – http://github.com/dellcloudedge/crowbar/wiki
• Open Source Crowbar Installers
  – http://crowbar.zehicle.com/
Q&A

Thank you!
Notices & Disclaimers
Copyright © 2013 by Dell, Inc.
No part of this document may be reproduced or transmitted in any form without the written permission from Dell, Inc.
This document could include technical inaccuracies or typographical errors. Dell may make improvements or changes in the product(s) or program(s) described herein at any time without notice. Any statements regarding Dell’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
References in this document to Dell products, programs, or services do not imply that Dell intends to make such products, programs, or services available in all countries in which Dell operates or does business. Any reference to a Dell Program Product in this document is not intended to state or imply that only that program product may be used. Any functionally equivalent program that does not infringe Dell's intellectual property rights may be used.
The information provided in this document is distributed "AS IS" without any warranty, either express or implied. Dell expressly disclaims any warranties of merchantability, fitness for a particular purpose, or non-infringement. Dell shall have no responsibility to update this information.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any Dell patents or copyrights.
Dell, Inc. 300 Innovative Way Nashua, NH 03063 USA