
Intel® Distribution for Apache

Hadoop™ on Dell PowerEdge

Servers

A Dell Technical White Paper

Armando Acosta

Hadoop Product Manager

Dell Revolutionary Cloud and Big Data Group

Kris Applegate

Solution Architect

Dell Solution Centers

Dave Jaffe, Ph.D.

Solution Architect

Dell Solution Centers

Rob Wilbert

Solution Architect

Dell Solution Centers


2 Dell | Intel® Distribution for Apache Hadoop

Executive Summary

This document details the deployment of Intel® Distribution for Apache Hadoop* software on the

PowerEdge R720XD. The intended audiences for this document are customers and system architects

looking for information on implementing Apache Hadoop clusters within their information technology

environment for Big Data analytics.

The reference configuration introduces all the high-level components, hardware, and software that

are included in the stack. Each high-level component is then described individually.

Dell developed this document to help streamline deployment, provide best practices and improve the

overall customer experience.

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN

TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS,

WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.

© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without

the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.

Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Intel and Xeon are registered

trademarks of Intel Corp. Red Hat is a registered trademark of Red Hat Inc. Linux is a registered

trademark of Linus Torvalds. Other trademarks and trade names may be used in this document to

refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any

proprietary interest in trademarks and trade names other than its own.

July 2013


Table of Contents

1 Introduction
2 Dell Solution Centers
3 Dell’s Point Of View on Big Data
4 Intel Distribution for Apache Hadoop
   Hadoop Use-Cases
   Intel’s Contributions to Open Source
5 Intel Hadoop Solution Software Components
   Server Roles
6 Best Practices for Running Intel Distribution of Apache Hadoop on Dell
   Node Count Recommendations
   Hardware Recommendations
      Monitoring
      Resiliency
      Performance
   Software Considerations
      Installation Environment Assumptions
      High Availability
   Installation Considerations
7 Testing
   HiBench
   Teragen / Terasort
   Tested Configuration
   Tuning and Optimization of Workloads
8 Conclusions
9 Resources
   Links
   Additional Whitepapers

Tables

Table 1. Recommended Cluster Sizes
Table 2. Software Revisions
Table 3. PowerEdge R720 Infrastructure Node As-Tested Configuration
Table 4. PowerEdge R720XD Datanode As-Tested Configuration
Table 5. Key Hadoop Configuration Parameters

Figures

Figure 1. Dell Solution Centers Locations
Figure 2. Big Data Demands
Figure 3. Intel Foundational Technologies for Hadoop Performance
Figure 4. Dell Big Data Cluster Logical Diagram
Figure 5. Ganglia Performance Monitor Tool (Included with IDH)
Figure 6. Cluster Network Diagram
Figure 7. Dell’s OpenManage Power Center
Figure 8. Dell R720XD models with 2.5” and 3.5” drives
Figure 9. The Role Assignment dropdown for HDFS roles
Figure 10. Mount Points are configured below for the dfs.data.dir directories
Figure 11. Intel Active Tuning Technology


1 Introduction

Hadoop is an Apache open source project, written in Java, that is built and used by a global community of contributors. Hadoop’s architecture is designed to scale nearly linearly. By harnessing the power of this tool, many customers who previously struggled to sort through their complex data can now deliver value faster, gain deeper insight, and even develop new business models based on the speed and flexibility these analytics provide.

However, installing, configuring and running Hadoop is not trivial. There are different roles

and configurations that need to be deployed on various host computers. Designing,

deploying and optimizing the network layer to match Hadoop’s scalability requires

consideration for the type of workloads that will be running on the Hadoop cluster. These

issues are complicated by both the fast-moving pace of the core Hadoop project and the

challenges of managing a system designed to scale to thousands of nodes in a cluster.

Dell’s customer-centered approach is to create rapidly deployable and highly optimized

end-to-end Hadoop solutions running on highly scalable hardware. Dell listened to its

customers and partnered with Intel to design a Hadoop solution that is unique in the

marketplace, combining optimized hardware, software, and services to streamline

deployment and improve the customer experience.

Intel has created a high quality, controlled distribution of Hadoop and offers commercial

management software, updates, support and consulting services.

The Intel® Distribution for Apache Hadoop (IDH) software includes:

• The Intel® Manager for Apache Hadoop software to install, configure, monitor, and administer the Apache Hadoop cluster
• Enhancements to HBase and Hive for improved query performance and end-user experience
• Resource monitoring capability using Nagios and Ganglia in the Intel® Manager
• Superior security and performance through tightly integrated encryption and compression, authentication, and access control
• A packaged Apache Hadoop ecosystem that includes HBase, Hive, and Apache Pig, among other tools

This solution provides a foundational platform for Intel to offer additional solutions as the Apache Hadoop ecosystem evolves and expands. Aside from the Apache Hadoop core technology (HDFS, MapReduce, etc.), Intel has designed additional capabilities to address specific customer needs for Big Data applications, such as:

• Optimal installation and configuration of the Apache Hadoop cluster
• Monitoring, reporting, and alerting on the hardware and software components
• Job-level metrics for analyzing specific workloads deployed in the cluster
• Infrastructure configuration automation


In recent tests in the Dell Solution Center, the Intel® Distribution for Apache Hadoop

Release 2.4.1 was installed and tested on a cluster of Dell® PowerEdge® R720 servers,

resulting in a set of best practices for installing IDH on Dell clusters.

The next sections describe the role of the Dell Solution Centers and Dell’s point of view on

Big Data, followed by details of the IDH solution and IDH software components. Finally the

best practices developed by the Solution Center and the results of the IDH on Dell tests

are described.


2 Dell Solution Centers

The Dell Solution Centers (DSC) are a global network of connected labs that allow Dell to

help customers architect, validate and build solutions across Dell’s entire enterprise

portfolio. The Dell | Intel Cloud Acceleration Program (DICAP), a team within the Dell

Solution Centers, has the mission of providing customer engagements on the topics of

Cloud and Big Data.

With centers in every region, the DSC engages customers through informal 30-60 minute briefings, longer half-day architectural design sessions, and one- to two-week proof-of-concept tests that enable customers to “kick the tires” of Dell solutions prior to purchase.

Interested customers should engage with their Dell account team to access the services of

the DSC.

Figure 1. Dell Solution Centers Locations

Sao Paulo and Dubai coming in the second half of 2013


3 Dell’s Point Of View on Big Data

“Big Data” is a term often hyped in the IT press, and there are many different interpretations of what exactly it means. In Dell’s point of view, the methods and principles of Big Data are not new to the computer industry: Dell has been providing such solutions for years in High Performance Clustered Computing (HPCC), data warehouses, and traditional databases. What has changed is the scale at which these tools need to operate. Every new device in use in today’s society gathers more and more data, and the need to store, report on, and analyze it is paramount. The term “big” can apply on a variety of different scales (see Figure 2):

• Volume – no longer in the realm of gigabytes, but rather terabytes or petabytes.
• Velocity – devices can now generate more data in a short time than can be ingested using traditional means.
• Variety – with the data types and schemas of the various datasets differing so much, being able to use a common datastore and query across them provides tremendous value.

Figure 2. Big Data Demands


4 Intel® Distribution for Apache Hadoop

Dell continues to hear from customers about their Big Data challenges, specifically a need for solutions that allow flexibility and choice while enabling key insights from their data.

Based on customer conversations and Dell’s experience in providing Hadoop solutions,

one size does not fit all. Each Hadoop distribution offers unique features and benefits. For

this very reason, Dell is introducing the partnership with Intel for the Intel® Distribution for

Apache Hadoop* software on the PowerEdge R720XD.

The Dell and Intel partnership is good for all customers that want value from their data.

Both companies share a common goal to help build a robust Apache Hadoop ecosystem

that is enterprise ready, allowing all customers to take advantage of this disruptive

technology. The partnership provides stability to the Apache Hadoop open source project;

both companies have long term strategies that will help drive the right capabilities and

features bringing the most value to customers.

Intel brings a unique value proposition for customers: the ability to deliver an optimized solution from the CPU silicon all the way up to the Hadoop distribution. Intel is the only vendor that can marry CPU technologies, SSD technology, and 10Gb Ethernet to benefit Hadoop performance. The Intel® Distribution for Apache Hadoop software focuses on performance and security. The Dell and Intel strategy is to reinforce the Hadoop distribution by making it more enterprise ready and providing a viable platform for big data workloads in all IT environments. The Intel® Distribution for Apache Hadoop software is especially suited to use cases where security, performance, and ease of data management are key needs.

Figure 3. Intel Foundational Technologies for Hadoop Performance


Hadoop Use-Cases

The Intel® Distribution for Apache Hadoop has been deployed in many different customer

scenarios. A few use cases that stand out are in healthcare, telecommunications and

smart-grid technology:

Healthcare – Customers use the massive database capabilities of IDH to store and process the human genome, evaluate pharmaceutical results, and make patient care decisions. In genomic research, the fact that each human genome consists of 3.2 billion base pairs with upwards of 4 million variants drives the need for a cost-effective, high-performance, scalable data processing engine. At the same time, the deep security enhancements IDH provides are of major importance given the healthcare industry’s strict compliance regulations.

Telecommunications – More and more mobile devices are getting into the hands of

people all over the world. The billing systems for mobile providers need to be able to track

call lengths and durations, text messages and data usage. More importantly they need to

be able to report on this in near real-time. Hadoop is used instead of traditional massively

parallel processing (MPP) and data-warehouse (DW) technologies due to its lower total

cost of ownership (TCO) and inherent fault-tolerance.

Energy Smart-Grid – Mobile devices aren’t the only thing generating new data streams.

Smart power meters generate large streams of sensor data that can be used by energy and

utility companies to optimize service delivery. The ability to efficiently store this data is

allowing these companies to increase the rate of collection and provide additional, more

granular detail. Traditional databases are proving to be incapable of handling the ingestion

rate of this data at an affordable cost.

Intel’s Contributions to Open Source

As with many other open source projects, Hadoop’s power owes itself to the community that developed it. Contributing to open source projects, either directly or by enhancing the ecosystem, drives further adoption and deepens utilization. Intel has a long history of

both contributing to core open source projects (Linux Kernel, Hadoop and KVM) as well as

creation of complementary projects. Two key programs to note in the context of Hadoop

are:


Project Rhino – This Intel-driven project enhances the data protection capabilities of Hadoop

to address the security and compliance challenges around emerging use-cases. More details

can be found at https://github.com/intel-hadoop/project-rhino/

Project Panthera – This project’s goal is to provide full SQL support to help companies

integrate Hadoop more deeply with their existing data analytics processes. More details can

be found at https://github.com/intel-hadoop/project-panthera.


5 Intel Hadoop Solution Software Components

Hadoop Distributed File System (HDFS) – This is the clustered file system at the core of the Hadoop software stack. Data stored on this file system is automatically distributed for both resiliency and redundancy; in the default configuration, every file is stored three times, on three different nodes. With Intel Hadoop, tunable parameters can increase or decrease a file’s replication level as its access frequency increases or decreases.
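As a back-of-the-envelope illustration (not a formula from this paper), the storage cost of that replication can be sketched; the node and disk counts below are arbitrary examples:

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3, reserve=0.25):
    """Rough usable HDFS capacity: raw disk capacity, minus a reserve for
    temporary and intermediate data, divided by the replication factor."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve) / replication

# e.g. 15 data nodes, each with 12 x 3 TB drives, at the default 3x replication
print(usable_capacity_tb(15, 12, 3.0))  # 135.0
```

The example makes clear why tuning replication matters: dropping a cold dataset from three copies to two frees roughly a third of the space it occupies.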

MapReduce – This is the distributed batch-oriented parallel processing framework that

enables data analysis at a large scale. This framework is accessed by writing Java-based

MapReduce jobs that get executed against datasets in HDFS.
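The programming model itself is compact enough to sketch in a few lines. The toy word count below is an illustration only, not the Java API the framework actually exposes; it mimics the map, shuffle, and reduce phases that the framework runs in a distributed fashion:

```python
from collections import defaultdict

def map_phase(records):
    # map: emit a (word, 1) pair for every word in every input record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle/sort: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

data = ["big data on hadoop", "hadoop scales"]
print(reduce_phase(shuffle(map_phase(data))))
```

In a real job, each phase runs in parallel across the Data Nodes, with HDFS supplying the input splits and storing the final output.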

Hive – Hive makes accessing the power of MapReduce more familiar to existing database

customers. It exposes the data that resides on HDFS as a SQL-like database. Standard SQL

queries run against this data will be translated into MapReduce by Hive and executed

behind the scenes. With Intel Hadoop, Hive queries can run faster on data sets in HBase.

HBase – Some use-cases dictate the need for faster response times than a batch-based

job through Hive or MapReduce. For these use cases, HBase provides a non-relational,

column-based, distributed database that resides directly on top of HDFS. This allows users

to leverage HDFS’s massive scalability to provide service to emerging non-traditional

databases. The HBase distribution in IDH is tuned to perform ad hoc queries faster via Hive for large datasets.
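As a rough sketch of the data model (a toy illustration, not the HBase API), HBase can be pictured as a sparse map from row key to column-family:qualifier to timestamped versions, where a read returns the latest version:

```python
# Toy model of HBase storage:
#   row key -> "family:qualifier" -> list of (timestamp, value) versions.
table = {}

def put(row, column, value, ts):
    # writes append a new timestamped version rather than overwrite
    table.setdefault(row, {}).setdefault(column, []).append((ts, value))

def get(row, column):
    # the highest timestamp wins, mirroring HBase's default read behavior
    versions = table.get(row, {}).get(column, [])
    return max(versions)[1] if versions else None

put("user#42", "info:name", "Ada", ts=1)
put("user#42", "info:name", "Ada L.", ts=2)
print(get("user#42", "info:name"))  # Ada L.
```

The sparse, versioned layout is what lets HBase serve low-latency point reads and writes on top of HDFS's batch-oriented storage.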

Server Roles

Name Node/JobTracker(s) – These nodes serve as control nodes for the HDFS,

MapReduce, and HBase processes. For HDFS, they own the block map and directory tree

for all the data on the cluster. With MapReduce, they own the Job Tracker daemon that

handles job execution and monitoring. Lastly with HBase, these servers are responsible for

running the monitoring processes as well as owning any metadata operations. Production

environments should have a primary and at least one standby Name Node.

Data Node(s) – These are the nodes that hold the data as well as execute the MapReduce

jobs. They are generally filled with large amounts of local disks, enabling the parallel

processing and distributed storage features of Hadoop. The number of Data Nodes is

dictated by use case. Adding additional Data Nodes increases both performance and

capacity simultaneously.

Edge Node(s) – These servers lie on the perimeter of the dedicated Hadoop network. They

are where external users and business processes interact with the cluster. Often times they

will have a number of Network Interface Cards (NICs) attached to the Hadoop network as

well as separate NICs attached to the enterprise’s production IT network. More Edge

nodes can be added as external access requirements increase.

Intel® Manager Node – This node is where the installation of the Intel Manager software

will reside. It runs the configuration management processes, web server software, and

performance monitoring software. In production installations, a dedicated server should

fulfill this task. In smaller installations, such as the one employed by Dell in these tests, this

role was shared with the Edge Node.


Figure 4. Dell Big Data Cluster Logical Diagram

Figure 5. Ganglia Performance Monitor Tool (Included with IDH)


6 Best Practices for Running Intel Distribution of Apache Hadoop on Dell

Node Count Recommendations

Dell recognizes that use-cases for Hadoop range from small development clusters all the

way through large, multi-petabyte production installations. Dell has a Professional Services team that sizes Hadoop clusters for a customer’s particular use. As a starting point, three cluster configurations can be defined for typical use:

Minimum Development Cluster – This is targeted at functional testing and may even be

built from existing equipment. However, the performance of these types of clusters can be

significantly lower as they don’t benefit from the highly distributed nature of HDFS.

Recommended Small Cluster – This is a good starting point for customers taking the

initial steps into running IDH in production. It provides some of the layers of resiliency that are expected in today’s production IT world.

Recommended Production Cluster – This configuration provides all the available options

for resiliency both at a hardware layer and software layer. In addition, it allows for an

adequate number of data nodes to demonstrate the performance benefits of distributed

storage and parallel computing.

Table 1. Recommended Cluster Sizes

                        Minimum Development   Recommended Small   Recommended
                        Cluster               Cluster             Production Cluster
Name Node(s)²           1¹                    2                   2
Edge Node(s)            0¹                    1                   1
Data Node(s)            3                     5                   15
Intel Manager Node      0¹                    1                   1
1 GbE Switches          1                     1                   2
10 GbE Switches         0                     2                   2
Rack Units              9U                    20U                 42U

¹ In this case a single node serves as the Name, Job Tracker, Edge, and Intel Manager Node.
² In some cases a single server can serve as both the Name Node and Job Tracker.

Figure 6. Cluster Network Diagram

Hardware Recommendations

Dell’s complete portfolio really shines when building comprehensive solutions. From the servers to the switches, and even down to the racks and monitoring tools, the value of deploying on Dell is readily apparent.

Monitoring

Using the Dell Remote Access Controllers (DRACs) in the servers, Dell customers can identify increases in power consumption and temperature as they exercise the disks and CPUs. One great tool to aid with this is Dell’s OpenManage Power Center. This tool uses the Intel Node Manager technology accessed through the DRAC to provide metrics and trigger alert events based on customer criteria.


Figure 7. Dell’s OpenManage Power Center

Resiliency

In production clusters it’s imperative to keep an eye towards mitigating as many points of

failure as possible. However, it is important to keep in mind that Hadoop (both through

HDFS and MapReduce) is meant to be natively tolerant of failures and will take care of

much of the needed underlying work. That said, when investing in building a robust and resilient configuration, here are the key areas to focus on:

Switches – Multiple stacked Force10 switches should be used for high availability. Force10 S60 1GbE switches utilize stacking modules, which provide easier switch management and faster inter-switch communication. On the Force10 S4810s there is the option of either stacking via the 10 or 40 GbE ports (firmware 8.3.12+) or implementing Virtual Link Trunking if you plan to scale beyond the stacking limitations (see the switch documentation for configuration maximums).

NICs – Either two single-port NIC cards or two dual-port cards are recommended in the

administration servers to guard against PCI-E slot failures. This is not as crucial on

datanodes due to datanode redundancy.

Disks – RAID is recommended only in the administration servers, such as the Name Node. In the data nodes it is strongly recommended to use as many separate disks as possible (no RAID). The flexibility of the PowerEdge R720XD really shines here, since it can hold either twelve 3.5” drives or twenty-four 2.5” drives.


Figure 8. Dell R720XD models with 2.5” and 3.5” drives

Performance

Performance optimization is a matter that varies greatly from customer to customer. There

are a few principles that should be considered in order to optimize cluster performance.

Network – While 10 GbE isn’t required, multiple bonded NICs of the fastest speed possible

are strongly recommended for the data network. Workloads vary on whether or not they

can truly benefit from a fast network, but with the prevalence of 10 GbE, it would be a wise

idea to invest ahead of the curve. You’ll also want enterprise-grade switches with deep

per-port packet buffers in order to handle the volume and density of traffic Hadoop can

generate. For 1 GbE Dell Force10 Series 60 work well and at 10 GbE Dell Force10 S4810s

are optimal.

Disks – A key principle of performance tuning is to eliminate input/output (IO) starvation

at the CPU layer and contention at the disk level. From this comes the initial

recommendation of a 1:1 ratio of disk spindles to physical processor cores (with hyper-threading counting as half of one physical core for this purpose). The correct choice of disks and processors depends entirely on the workload, which can vary from the heavily storage-centric, with massive disks and few processors, to the heavily processor-centric, with many cores and PCI-E SSDs. The Dell Professional Services team can provide consultation and assessment to help customers achieve the proper balance. The Dell PowerEdge R720XD provides excellent flexibility with regard to drive and socket configurations.
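One possible reading of the 1:1 spindle-to-core guideline can be sketched as follows; the exact accounting of hyper-threads is an assumption of this sketch, not a formula from the paper:

```python
def target_spindles(sockets, cores_per_socket, hyperthreading=True):
    """1:1 spindles to physical cores as a starting point; each hyper-thread
    sibling is counted at half the weight of a physical core (this sketch's
    interpretation of the guideline)."""
    physical = sockets * cores_per_socket
    extra = physical // 2 if hyperthreading else 0
    return physical + extra

# a two-socket, 8-core-per-socket node with hyper-threading enabled
print(target_spindles(2, 8))  # 24
```

Under this reading, a two-socket, 8-core node with hyper-threading lands at 24 spindles, which happens to match the 24 x 2.5” drive configuration of the R720XD.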

Memory – Few Hadoop use-cases will be memory constrained, but administration servers should have sufficient memory for index caching (128GB for a robust configuration). For the data nodes, while there are emerging use-cases that call for high amounts of memory, Hadoop customer engagements in the Dell Solution Centers have shown that 64GB is a good initial target.

CPUs – As mentioned above, the use-case determines the correct balance of CPU, memory, and disk speed. For performance-oriented use-cases, the most cores (balanced against spindle count if not using SSDs) and the highest available CPU frequency are recommended. For storage-capacity-oriented use-cases, the more energy-efficient Intel Xeon E5-2600L series processors are worth considering.


Software Considerations

Installation Environment Assumptions

Updated Operating System – the selected OS should have the appropriate updates applied prior to IDH installation. The IDH documentation lists supported OS versions as well as required updates.

Package Management – The installation references an existing OS package repository, and a new repository for the IDH software must be created. In some cases (such as Red Hat Enterprise Linux) this may mean registering the OS with the proper credentials.

DNS – Forward and reverse name resolution are required for installation. Host-to-host communication is handled by hostname, so this is imperative. It can be accomplished via /etc/hosts or a DNS server.

NIC Bonding – To get as much bandwidth and resiliency as possible, Dell recommends implementing bonding on the NICs. In these tests, mode 6 (balance-alb) was used.
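As a concrete sketch, a balance-alb bond on Red Hat Enterprise Linux 6 might look like the fragments below. Device names and addresses are illustrative placeholders only; adjust them for the actual NICs and subnet in use.

```ini
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative values)
DEVICE=bond0
IPADDR=192.168.10.11
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=balance-alb miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-em1  (repeat for each slave NIC)
DEVICE=em1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

With balance-alb, outbound and inbound load are balanced across the slave NICs without requiring switch-side link aggregation configuration.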

Production Network Connectivity – The edge node needs to be connected to the user’s existing network to facilitate access to the cluster. The speed of this link should meet the needs of the planned inbound data ingestion (both the number of users/processes and the volume of data).

High Availability

Production Hadoop workloads require a high degree of resiliency to achieve desired uptime goals. In IDH 2.4.1, High Availability (HA) is handled in an Active/Passive manner using a number of components:

Distributed Replicated Block Device (DRBD) – allows a logical device to be mirrored between two disparate systems

Pacemaker – a Cluster Resource Management (CRM) framework that starts, stops, monitors, and migrates resources automatically

Corosync – a messaging framework, used by Pacemaker, for internode communication

These tools, when used together, provide layers of redundancy for both the HDFS NameNode service and the MapReduce JobTracker. Enabling HA may require additional hardware in the namenodes, including extra NICs, more memory, and additional disks. While failover for both the NameNode and JobTracker HA services is completely automatic, in-flight jobs must be resubmitted once the failover completes.

High availability also requires some additional network configuration. Virtual hostnames and IP addresses for both the NameNode and the JobTracker HA functions must be identified and recorded in all /etc/hosts files or DNS tables.
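For example, the /etc/hosts entries on each node might look like the following (all hostnames and addresses here are hypothetical placeholders, not values prescribed by IDH):

```
# Physical namenodes
192.168.10.11   nn1.example.local     nn1
192.168.10.12   nn2.example.local     nn2

# Virtual HA hostnames (float between nn1 and nn2 on failover)
192.168.10.20   nn-ha.example.local   nn-ha    # NameNode HA address
192.168.10.21   jt-ha.example.local   jt-ha    # JobTracker HA address
```

Clients and cluster services address the virtual names, so a failover does not require any client-side reconfiguration.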

It is worth noting that the IDH 2.4 release is based on the 1.x Apache Hadoop open source project, which has no inherent HA option; Intel’s distribution adds this capability.


Installation Considerations

Role Assignments

During the installation, the setup wizard prompts for specific role assignments of the

cluster servers. It’s a good idea to use the “Edit Roles” button on the last page of the wizard

to double-check that each of the parameters was set correctly, as shown in Figure 9.

Figure 9. The Role Assignment dropdown for HDFS roles

Mount Points

Mount points are key to properly configuring an optimized cluster. It is always best practice to follow the installation guide and, prior to starting HDFS or any of the services, make sure that dfs.data.dir (Figure 10) and mapred.local.dir are set to the appropriate mount points. In the case below, there is one mount point allocated per physical spindle.
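For illustration, a dfs.data.dir setting with one directory per spindle-backed mount point might look like the following hdfs-site.xml fragment (the paths shown are hypothetical):

```xml
<!-- One dfs.data.dir directory per physical spindle, each on its own
     mount point; HDFS spreads block writes across all of them. -->
<property>
  <name>dfs.data.dir</name>
  <value>/data/disk01/dfs,/data/disk02/dfs,/data/disk03/dfs,/data/disk04/dfs</value>
</property>
```

Pointing multiple entries at directories on the same device defeats the purpose, so verify each path is a distinct mount before starting HDFS.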


Figure 10. Mount Points are configured below for the dfs.data.dir directories


7 Testing Setup

HiBench

HiBench is a Hadoop benchmark framework consisting of nine workloads representing common Hadoop use-cases: micro-benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks. For this paper the most well-known subset of the HiBench suite, the Teragen/Terasort benchmark, was employed to test system I/O.

Teragen / Terasort

These two HDFS/MapReduce benchmarks are used in conjunction with each other to stress Hadoop systems and provide valuable metrics on network, disk, and CPU utilization. By starting with these as a baseline, Hadoop administrators can tune Hadoop’s wide variety of parameters to get the desired performance. Teragen generates flat text files containing pseudo-random data, which Terasort then sorts. This type of sort/shuffle exercise is similar to what customers do over and over as they manipulate data through MapReduce jobs.
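To make the generate-then-sort pattern concrete, here is a minimal single-process Python sketch of what Teragen and Terasort do at cluster scale. It is a stand-in for illustration only, not the actual distributed Hadoop jobs, and the fixed-width record layout loosely mirrors Teragen's key-plus-payload rows.

```python
import random
import string

def teragen_like(num_records: int, seed: int = 42) -> list:
    """Generate fixed-width pseudo-random records: 10-byte key + payload."""
    rng = random.Random(seed)
    records = []
    for _ in range(num_records):
        key = "".join(rng.choice(string.ascii_uppercase) for _ in range(10))
        payload = "".join(rng.choice(string.ascii_lowercase) for _ in range(90))
        records.append(key + payload)
    return records

def terasort_like(records: list) -> list:
    """Sort records by their 10-byte key, as Terasort does at scale."""
    return sorted(records, key=lambda r: r[:10])

if __name__ == "__main__":
    data = teragen_like(1000)
    ordered = terasort_like(data)
    keys = [r[:10] for r in ordered]
    assert keys == sorted(keys)  # output is globally key-ordered
    print(len(ordered))
```

The real benchmarks add what this sketch omits: distributed generation, the map-side partial sort, the shuffle across the network, and the reduce-side merge, which is exactly why they exercise disk, network, and CPU together.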

Tested Configuration

In these tests a small Hadoop cluster was employed, as recommended in Table 1. The specific software revisions used in the test are shown in Table 2. The PowerEdge R720 and R720XD hardware configurations are shown in Table 3 and Table 4. The hardware listed should be used as initial guidance only; additional configurations are possible and will likely be required, as each customer’s environment and use-case is unique.

Table 2. Software Revisions

Component                               Revision
Red Hat Enterprise Linux                6.4
Intel Distribution for Apache Hadoop    2.4.1 (Build 16962)
Apache Hadoop (on which IDH is based)   1.0.3
HBase                                   0.94.1
Hive                                    0.9.0
ZooKeeper                               3.4.5
HiBench                                 2.2

Table 3. PowerEdge R720 Infrastructure Node As-Tested Configuration

Component         Detail
Height            2 rack units (3.5”)
Processor         2x Intel Xeon E5-2650 2 GHz 8-core
Memory            128 GB
Disk              6x 600 GB 15K SAS drives


Network           4x 1 GbE LOMs, 2x 10 GbE NICs
RAID Controller   PowerEdge RAID Controller H710 (PERC)
Management Card   Integrated Dell Remote Access Controller (iDRAC)

Table 4. PowerEdge R720XD Datanode As-Tested Configuration

Component         Detail
Height            2 rack units (3.5”)
Processor         2x Intel Xeon E5-2667 2.9 GHz 6-core
Memory            64 GB
Disk              24x 500 GB 7200 RPM Nearline SAS drives
Network           4x 1 GbE LOMs, 2x 10 GbE NICs
RAID Controller   PowerEdge RAID Controller H710 (PERC)
Management Card   Integrated Dell Remote Access Controller (iDRAC)

Tuning and Optimization of Workloads

The cluster configuration variables used in these tests (Table 5) are simply a starting point. Parameters like dfs.block.size are highly contingent on the type of data being stored and the use-case. A Dell Professional Services engagement is recommended to achieve configurations optimized for the user’s workload.

Table 5. Key Hadoop Configuration Parameters

Name                                        Value
dfs.block.size                              134217728
ipc.server.tcpnodelay                       FALSE
ipc.client.tcpnodelay                       FALSE
io.sort.factor                              100
io.sort.mb                                  400
io.sort.spill.percent                       0.8
io.sort.record.percent                      0.05
mapred.child.java.opts                      1024m
mapreduce.tasktracker.outofband.heartbeat   TRUE
mapred.job.reuse.jvm.num.tasks              1
mapred.min.split.size                       134217728
mapred.reduce.parallel.copies               20
mapred.reduce.tasks.speculative.execution   TRUE
mapred.reduce.tasks                         30 x (number of TaskTrackers)
mapred.map.tasks                            20 x (number of TaskTrackers)
mapred.compress.map.output                  TRUE
tasktracker.http.threads                    60


io.buffer.file.size                         4096
io.bytes.per.checksum                       4096
mapred.task.timeout                         1800000
mapred.tasktracker.map.tasks.maximum        30
mapred.tasktracker.reduce.tasks.maximum     20
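To make the per-TaskTracker multipliers in Table 5 concrete, here is a small hypothetical helper (not part of IDH or its tooling) that scales the map and reduce task counts with cluster size:

```python
# Illustrative helper for the scaled entries in Table 5: the map and
# reduce task counts grow with the number of TaskTracker nodes.

DFS_BLOCK_SIZE = 134217728  # dfs.block.size from Table 5: 128 MB

def scaled_task_counts(num_tasktrackers: int) -> dict:
    """Apply Table 5's per-TaskTracker multipliers (20 map, 30 reduce)."""
    return {
        "mapred.map.tasks": 20 * num_tasktrackers,
        "mapred.reduce.tasks": 30 * num_tasktrackers,
    }

if __name__ == "__main__":
    assert DFS_BLOCK_SIZE == 128 * 1024 * 1024
    # Example: a four-datanode cluster
    print(scaled_task_counts(4))
    # -> {'mapred.map.tasks': 80, 'mapred.reduce.tasks': 120}
```

As the surrounding text notes, these multipliers are only a starting point; the right values depend on the per-node core and spindle counts and on the workload itself.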

Intel® Active Tuner

As part of IDH, Intel provides a unique tool that can help users optimize configuration parameters. A small MapReduce job is created and uploaded along with command-line parameters. The Active Tuner runs it for a pre-determined number of iterations, adjusting known performance-enhancing parameters to arrive at an optimal configuration tuned for that workload.

Figure 11. Intel® Active Tuner


8 Conclusions

For enterprises looking to take advantage of the wealth of available data, the Intel® Distribution for Apache Hadoop running on Dell PowerEdge server clusters provides a robust platform for Big Data applications. The Intel distribution stands out from others with its high availability features and the Intel Active Tuner tool, and it takes key steps into emerging areas of customer interest around encryption and security of Big Data. With a proven track record of supporting large genomics and telecommunications customers, IDH is an attractive Hadoop solution offering. Deploying the Intel® Distribution for Apache Hadoop on Dell’s award-winning hardware results in a high-quality, cost-effective Hadoop platform for everyone. This Hadoop solution from Dell and Intel benefits all types of customers, from those just starting their investigation into Hadoop technology to those ready to build out large clusters for petabyte-scale applications.


9 Resources

Links

Intel® Distribution for Apache Hadoop – http://hadoop.intel.com

Intel HiBench - https://github.com/intel-hadoop/HiBench/

Project Rhino – https://github.com/intel-hadoop/project-rhino/

Project Panthera - https://github.com/intel-hadoop/project-panthera/

Reference Architecture for Intel Distribution for Apache Hadoop – http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf

Security without compromising performance – https://hadoop.intel.com/pdfs/IntelEncryptionforHadoopSolutionBrief.pdf

Additional Whitepapers

Genomic Analytics - Next Bio - http://hadoop.intel.com/pdfs/IntelNextBioCaseStudy.pdf

Smart Energy Analytics - Pecan Street - http://hadoop.intel.com/pdfs/smart-energy-analytics-pecan-street.pdf

Telco Analytics - China Mobile - http://hadoop.intel.com/pdfs/IntelChinaMobileCaseStudy.pdf

Healthy City Analytics – China - http://hadoop.intel.com/pdfs/IntelChinaHealthyCityAnalyticsCaseStudy.pdf

Smart City Video Analytics – Shanghai - http://hadoop.intel.com/pdfs/IntelSmartCityVideoAnalyticsShanghaiIdealCaseStudy.pdf