BEST PRACTICES GUIDE
Nimble Storage for Hadoop 2.x on Oracle Linux and Red Hat Enterprise Linux 6
Document Revision
Table 1.
Date Revision Description
9/5/2014 1.0 Initial Draft
11/17/2014 1.1 Updated iSCSI & Multipath
THIS TECHNICAL TIP IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN
TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS,
WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.
Nimble Storage: All rights reserved. Reproduction of this material in any manner whatsoever
without the express written permission of Nimble is strictly prohibited.
Table of Contents
Introduction
Audience
Scope
Nimble Storage Features
Nimble Benefits for Hadoop
Nimble Recommended Settings for Hadoop Nodes
Creating Nimble Volumes for Hadoop HDFS
Nimble Reference Architecture
Hadoop 2.x Recommended Settings for Nimble Storage
Introduction
The purpose of this technical white paper is to walk through, step by step, setting up and tuning the Linux operating
system for Hadoop running on Nimble Storage.
Audience
This guide is intended for Hadoop solution architects, storage engineers, system administrators, and IT managers who
analyze, design, and maintain a robust Hadoop environment on Nimble Storage. It is assumed that the reader has a
working knowledge of iSCSI SAN network design and basic Nimble Storage operations. Knowledge of Oracle Linux and
Red Hat Enterprise Linux is also required.
Scope
Most traditional Hadoop implementations today use local JBOD for storage, mainly because Hadoop started out
leveraging cheap commodity servers. Today, large enterprise Hadoop implementations can consist of hundreds or
thousands of nodes, each with one or more local disk drives. For high availability, Hadoop relies on its replication
feature to tolerate node failures.
During the design phase of a new Hadoop implementation, architects and storage administrators often work together
to determine the best server and storage configuration. They have to weigh the number of compute nodes against
storage requirements to deliver high performance, high availability, and sufficient capacity.
This white paper explains the Nimble technology and how it can lower the TCO of your Hadoop environment while still
achieving the required performance. It also covers best practices for configuring the Linux operating system for
Hadoop on Nimble Storage.
Nimble Storage Features
Cache Accelerated Sequential Layout (CASL™)
Nimble Storage arrays are the industry’s first flash-optimized storage designed from the ground up to maximize
efficiency. CASL accelerates applications by using flash as a read cache coupled with a write-optimized data
layout. It offers high performance and capacity savings, integrated data protection, and easy lifecycle
management.
Flash-Based Dynamic Cache
Accelerate access to application data by caching a copy of active “hot” data and metadata in flash for reads.
Customers benefit from high read throughput and low latency.
Write-Optimized Data Layout
Data written by a host is first aggregated or coalesced, then written sequentially as a full stripe with checksum
and RAID parity information to a pool of disks. CASL’s sweeping process also consolidates freed-up disk space
for future writes. Customers benefit from fast sub-millisecond writes and very efficient disk utilization.
Inline Universal Compression
Compress all data inline before storing using an efficient variable-block compression algorithm. Store 30 to 75
percent more data with no added latency. Customers gain much more usable disk capacity with zero
performance impact.
Instantaneous Point-in-Time Snapshots
Take point-in-time copies, which do not require data to be copied on future changes (redirect-on-write). Fast
restores without copying data. Customers benefit from a single, simple storage solution for primary and
secondary data, frequent and instant backups, fast restores and significant capacity savings.
Efficient Integrated Replication
Maintain a copy of data on a secondary system by only replicating compressed changed data on a set schedule.
Reduce bandwidth costs for WAN replication and deploy a disaster recovery solution that is affordable and easy
to manage.
Zero-Copy Clones
Instantly create fully functioning copies or clones of volumes. Customers get great space efficiency and
performance on cloned volumes, making them ideal for test, development, and staging environments.
Nimble Benefits for Hadoop
With today’s data types, such as sensor data, web logs, social data, and other exhaust data, considered too
expensive to store and analyze in a traditional RDBMS, Hadoop has become a means to do just that in a cost-
effective manner. Hadoop utilizes off-the-shelf commodity servers and local JBOD to store that data and
perform analytics on it for businesses.
These data types are often unstructured in nature, and the amount can be vast. Storing them on local JBOD
requires adding additional servers. These JBODs lack the intelligence to compress the data for space savings,
so as the data grows, more nodes are needed. When capacity is the only requirement, adding nodes to a Hadoop
cluster also adds compute power, which can increase operating expenses such as power and cooling.
Nimble Storage features for Hadoop:
• Compression
• Caching
• Data Protection such as Replication and Snapshot
• Price/Performance
• Higher Density to lower TCO
• Sequential I/O throughput (MB/s)
• Random I/O performance (IOPS)
Having Nimble Storage as the storage device in a Hadoop cluster provides space savings, performance, and data
protection. When running MapReduce jobs, not all I/O is sequential in nature; there is quite a bit of
randomness, which can be beneficial on a Nimble array since random reads are served from flash. Aside from
the performance gain, the Nimble inline compression feature provides space savings of anywhere from 1.5x to 2x
depending on the data type. This allows more storage without adding compute nodes, hence reducing power and
cooling costs.
Another benefit for Hadoop is data protection. Any node in a Hadoop cluster, whether NameNode or DataNode, can
fail. Leveraging the Nimble snapshot feature to take backups of the NameNode is critical in a Hadoop
deployment. The NameNode is the centerpiece of an HDFS file system: it keeps the directory tree of all files
in the file system and tracks where across the cluster each file’s data is kept. If a snapshot is available,
recovering a NameNode in a cluster can be done in a matter of seconds. In addition to snapshots, Nimble also
offers replication, which can be used to replicate data to another data center for disaster recovery. Mother
Nature can be unpredictable, so planning for DR is critical.
Nimble Recommended Settings for Hadoop Nodes
Nimble Array
• Nimble OS should be at least 2.1.4 on either CS500 or CS700 series array
Hadoop Cluster Nodes
• Nimble Storage highly recommends that all Hadoop cluster nodes have a CPU speed of 2.9GHz or higher.
Linux Operating System
• iSCSI Timeout and Performance Settings
Understanding the meaning of these iSCSI timeouts allows administrators to set them appropriately. These
parameters in the /etc/iscsi/iscsid.conf file should be set as follows:

node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.nr_sessions = 4
node.session.cmds_max = 2048
node.session.queue_depth = 1024

= = = NOP-Out Interval/Timeout = = =

node.conn[0].timeo.noop_out_timeout = [ value ]
The iSCSI layer sends a NOP-Out request to each target. If a NOP-Out request times out (default: 10 seconds),
the iSCSI layer responds by failing any running commands and instructing the SCSI layer to requeue those
commands when possible. If dm-multipath is being used, the SCSI layer will fail those running commands and
defer them to the multipath layer, which then retries them on another path. If dm-multipath is not being used,
those commands are retried five times before failing altogether.

node.conn[0].timeo.noop_out_interval = [ value ]
Once set, the iSCSI layer will send a NOP-Out request to each target every [ interval value ] seconds.

= = = SCSI Error Handler = = =

If the SCSI Error Handler is running, commands running on a path will not be failed immediately when a NOP-Out
request times out on that path. Instead, they will be failed after replacement_timeout seconds.

node.session.timeo.replacement_timeout = [ value ]
Important: Controls how long the iSCSI layer should wait for a timed-out path/session to reestablish itself
before failing any commands on it. The recommended setting of 120 seconds allows ample time for controller
failover. The default is 120 seconds.

Note: If set to 120 seconds, IO will be queued for 2 minutes before it can resume. The “1 queue_if_no_path”
option in /etc/multipath.conf sets iSCSI timers to immediately defer commands to the multipath layer. This
setting prevents IO errors from propagating to the application; because of this, replacement_timeout can be set
to 60-120 seconds.

Note: Nimble Storage strongly recommends using dm-multipath for all volumes.
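The edits above can also be scripted. The sketch below is illustrative only (the set_opt helper is not part of any Nimble or open-iscsi tooling) and operates on a scratch copy so it can be dry-run; on a real node, CONF would point at /etc/iscsi/iscsid.conf, and the iscsi service would be restarted afterwards.

```shell
#!/bin/sh
# Sketch: apply the recommended iSCSI settings as "key = value" rewrites.
# Works on a scratch copy here; on a node, set CONF=/etc/iscsi/iscsid.conf
# and restart the iscsi service afterwards. The helper is illustrative.
CONF=$(mktemp)
printf 'node.session.timeo.replacement_timeout = 30\n' > "$CONF"

set_opt() {  # replace the line starting with $1 if present, else append
    awk -v k="$1" -v v="$2" '
        index($0, k) == 1 { print k " = " v; seen = 1; next }
        { print }
        END { if (!seen) print k " = " v }' "$CONF" > "$CONF.tmp" && mv "$CONF.tmp" "$CONF"
}

set_opt 'node.session.timeo.replacement_timeout' 120
set_opt 'node.conn[0].timeo.noop_out_interval' 5
set_opt 'node.conn[0].timeo.noop_out_timeout' 10
set_opt 'node.session.nr_sessions' 4
set_opt 'node.session.cmds_max' 2048
set_opt 'node.session.queue_depth' 1024
cat "$CONF"
```

Using a literal prefix match (awk index) rather than a regex avoids problems with the `[0]` brackets in the connection parameter names.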
• Multipath Configurations
The multipath parameters in the /etc/multipath.conf file should be set as follows in order to sustain a failover.
Nimble recommends the use of aliases for mapped LUNs.
defaults {
    user_friendly_names yes
    find_multipaths yes
}
devices {
    device {
        vendor "Nimble"
        product "Server"
        path_grouping_policy group_by_serial
        path_selector "round-robin 0"
        features "1 queue_if_no_path"
        path_checker tur
        rr_min_io_rq 10
        rr_weight priorities
        failback immediate
    }
}
multipaths {
    multipath {
        wwid 20694551e4841f4386c9ce900dcc2bd34
        alias hdfs-vol1
    }
}
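Each multipaths stanza needs the volume's WWID. One way to generate the stanzas is to parse `multipath -ll` output; the sketch below uses embedded sample output (the second WWID is an illustrative value, not a real Nimble volume) so it can be dry-run.

```shell
#!/bin/sh
# Sketch: derive multipaths { } stanzas from `multipath -ll` output.
# SAMPLE stands in for real output (the WWIDs are illustrative); on a node:
#   SAMPLE=$(multipath -ll | grep Nimble)
SAMPLE='mpatha (20694551e4841f4386c9ce900dcc2bd34) dm-2 Nimble,Server
mpathb (250a1b2c3d4e5f60718293a4b5c6d7e8f) dm-3 Nimble,Server'

# The WWID sits between parentheses; number the aliases in order seen.
n=0
printf '%s\n' "$SAMPLE" | awk -F'[()]' '{ print $2 }' | while read -r wwid; do
    n=$((n + 1))
    printf 'multipath {\n    wwid %s\n    alias hdfs-vol%d\n}\n' "$wwid" "$n"
done > stanzas.txt
cat stanzas.txt
```

The generated stanzas can then be pasted inside the multipaths { } section of /etc/multipath.conf.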
• Disk IO Scheduler
The IO scheduler needs to be set to “noop”.
To set the IO scheduler for all LUNs online, run the command below. Note: multipath must be set up before
running this command. Newly added LUNs and rebooted servers will not pick up this setting automatically; run
the same command again after adding LUNs or rebooting.
[root@mktg04 ~]# multipath -ll | grep sd | awk -F":" '{print $4}' | awk '{print $2}' | while read LUN; do echo
noop > /sys/block/${LUN}/queue/scheduler ; done
To set this parameter automatically at boot, append elevator=noop to the kernel line in the /etc/grub.conf file.
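For illustration, a kernel line with the parameter appended might look like the following (the kernel version and root device are placeholders; the actual line will differ per system):

```
kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root LANG=en_US.UTF-8 elevator=noop
```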
• CPU Scaling Governor
The CPU scaling governor needs to be set to “performance”.
To set the CPU scaling governor, run the below command.
[root@mktg04 ~]# for a in $(ls -ld /sys/devices/system/cpu/cpu[0-9]* | awk '{print $NF}') ; do echo
performance > $a/cpufreq/scaling_governor ; done
Note: The setting above is not persistent across reboots; hence the command needs to be executed when the
server comes back online. To avoid running the command after a reboot, place the command in the
/etc/rc.local file.
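A sketch of what the /etc/rc.local addition could look like follows. The write guard is an addition of this sketch (not in the original command), which makes the fragment a harmless no-op on systems without cpufreq support and safe to dry-run:

```shell
# Illustrative /etc/rc.local fragment: re-apply the "performance" governor
# at boot. The -w guard skips CPUs without a writable cpufreq interface,
# so this is a no-op on systems (e.g. VMs) lacking cpufreq support.
applied=0
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -w "$g" ] || continue
    echo performance > "$g" && applied=$((applied + 1))
done
echo "governors set: $applied"
```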
• iSCSI Data Network
Nimble recommends using 10GbE iSCSI for all Hadoop data traffic:
• Two separate subnets
• Two 10GbE iSCSI NICs
• Jumbo frames (MTU 9000) on the iSCSI networks
Example of the MTU setting for eth1 (/etc/sysconfig/network-scripts/ifcfg-eth1):

DEVICE=eth1
HWADDR=00:25:B5:00:00:BE
TYPE=Ethernet
UUID=31bf296f-5d6a-4caf-8858-88887e883edc
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=static
IPADDR=172.18.127.134
NETMASK=255.255.255.0
MTU=9000

To change the MTU on an already running interface:

[root@bigdata1 ~]# ifconfig eth1 mtu 9000
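Jumbo frames only help if every hop honors MTU 9000. A common check is a non-fragmenting ping sized to the MTU; the sketch below only prints the command to run, and the target IP is hypothetical (substitute the array's data IP):

```shell
# Sketch: verify end-to-end jumbo frames with a do-not-fragment ping.
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header) = 8972 bytes.
# With -M do the packet is dropped instead of fragmented, so replies
# prove the whole path carries MTU 9000. The target IP is hypothetical.
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
echo "ping -M do -c 3 -s $PAYLOAD 172.18.127.1"   # run the printed command on the node
```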
• /etc/sysctl.conf
net.core.wmem_max = 16780000
net.core.rmem_max = 16780000
net.ipv4.tcp_rmem = 10240 87380 16780000
net.ipv4.tcp_wmem = 10240 87380 16780000
Run the sysctl -p command after editing the /etc/sysctl.conf file.
• max_sectors_kb
Change max_sectors_kb on all volumes to 1024 (the default is 512).
To change max_sectors_kb to 1024 for a single volume:

[root@bigdata1 ~]# echo 1024 > /sys/block/sd?/queue/max_sectors_kb

To change all volumes:

multipath -ll | grep sd | awk -F":" '{print $4}' | awk '{print $2}' | while read LUN
do
echo 1024 > /sys/block/${LUN}/queue/max_sectors_kb
done
Note: To make this change persistent after a reboot, add the commands to the /etc/rc.local file.
• VM dirty writeback and expire
Change the VM dirty writeback and expire intervals to 100 (the defaults are 500 and 3000, respectively).
To change the VM dirty writeback and expire settings:

[root@bigdata1 ~]# echo 100 > /proc/sys/vm/dirty_writeback_centisecs
[root@bigdata1 ~]# echo 100 > /proc/sys/vm/dirty_expire_centisecs
Note: To make this change persistent after a reboot, add the commands to the /etc/rc.local file.
Creating Nimble Volumes for Hadoop HDFS
Table 2: Nimble volume configuration for HDFS.

File Type: HDFS volumes
Number of Volumes: 4 (system with 8 cores); 8 (system with 16 cores or more)
# of Mountpoints: One per volume
OS File System: EXT4
Nimble Storage Caching Policy: Yes
Nimble Block Size Setting: 32KB
Example of 8 HDFS volumes
[hduser@bigdata1 ~]$ df -h
/dev/mapper/hdfs1 493G 30G 438G 7% /hdfs1
/dev/mapper/hdfs2 493G 30G 438G 7% /hdfs2
/dev/mapper/hdfs3 493G 29G 439G 7% /hdfs3
/dev/mapper/hdfs4 493G 29G 439G 7% /hdfs4
/dev/mapper/hdfs5 493G 30G 439G 7% /hdfs5
/dev/mapper/hdfs6 493G 30G 438G 7% /hdfs6
/dev/mapper/hdfs7 493G 30G 438G 7% /hdfs7
/dev/mapper/hdfs8 493G 30G 439G 7% /hdfs8
EXT4 File System
When creating an EXT4 file system on a logical volume, the stride and stripe-width options must be used.
For example:
stride=2,stripe-width=16 (for Nimble performance policy 8KB block size with 8 volumes)
stride=4,stripe-width=32 (for Nimble performance policy 16KB block size with 8 volumes)
stride=8,stripe-width=64 (for Nimble performance policy 32KB block size with 8 volumes)
Note: The stripe-width value depends on the number of volumes and the stride size. A calculator can be found
at http://busybox.net/~aldot/mkfs_stride.html
For example, a single Nimble volume with the 8KB block size performance policy works out to stride=2,stripe-width=2.
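The arithmetic behind these values can be sketched as a small shell helper (the stride_opts function name is illustrative): stride is the Nimble block size divided by the 4KB EXT4 block size, and stripe-width is stride times the number of HDFS volumes.

```shell
# Sketch of the stride/stripe-width arithmetic (helper name is illustrative).
#   stride       = Nimble block size (KB) / EXT4 block size (4KB)
#   stripe-width = stride * number of HDFS volumes
stride_opts() {  # usage: stride_opts <nimble_block_kb> <num_volumes>
    stride=$(( $1 / 4 ))
    echo "stride=$stride,stripe-width=$(( stride * $2 ))"
}
stride_opts 32 8   # 32KB policy, 8 volumes -> stride=8,stripe-width=64
stride_opts 8 1    # 8KB policy, 1 volume  -> stride=2,stripe-width=2
```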
Creating Nimble Performance Policy
On the Nimble Management GUI, click on “Manage/Performance Policies” and click on the “New Performance
Policy” button. Enter the appropriate settings then click “OK”.
Examples of EXT4 Setup with 8 Volumes:
Create the EXT4 file system
[root@mktg04 ~]# for a in {1..8} ; do mkfs.ext4 /dev/mapper/hdfs$a -b 4096 -E stride=8,stripe-width=64; done
Mount options in the /etc/fstab file
/dev/mapper/hdfs1 /hdfs1 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs2 /hdfs2 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs3 /hdfs3 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs4 /hdfs4 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs5 /hdfs5 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs6 /hdfs6 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs7 /hdfs7 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
/dev/mapper/hdfs8 /hdfs8 ext4 _netdev,noatime,nodiratime,discard,barrier=0 0 0
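Before mounting, the mount points themselves must exist. The sketch below is side-effect free (ROOT is a scratch directory so it can be dry-run); on a real node, the directories would be created directly under / and followed by mount -a as root:

```shell
# Sketch: create the eight HDFS mount points, then mount everything in fstab.
# ROOT is a scratch directory so the sketch has no side effects; on a real
# node, create /hdfs1../hdfs8 directly and run as root.
ROOT=$(mktemp -d)
for a in 1 2 3 4 5 6 7 8; do
    mkdir -p "$ROOT/hdfs$a"
done
ls "$ROOT"
# On the node, follow with:  mount -a
```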
Nimble Reference Architecture
Note: The Hadoop nodes need only a single local disk for the Linux operating system.
Hadoop 2.x Recommended Settings for Nimble Storage
core-site.xml

Parameter                 Value
file.stream-buffer-size   32768
io.file.buffer.size       32768
hdfs-site.xml

Parameter        Value
dfs.blocksize    512MB
dfs.replication  2
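For illustration, the two HDFS values above would land in hdfs-site.xml as follows (512MB expressed in bytes; the property names are the standard Hadoop 2.x ones):

```xml
<!-- Illustrative hdfs-site.xml fragment matching the table above;
     536870912 bytes = 512MB. -->
<property>
  <name>dfs.blocksize</name>
  <value>536870912</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```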
yarn-site.xml

Parameter                             Value
yarn.scheduler.minimum-allocation-mb  Minimum RAM per container (system memory dependent)
yarn.scheduler.maximum-allocation-mb  25% higher than mapreduce.reduce.memory.mb
yarn.nodemanager.resource.memory-mb   Maximum memory YARN can use on the node (system memory dependent)
mapred-site.xml

Parameter                   Value
mapreduce.map.memory.mb     Twice yarn.scheduler.minimum-allocation-mb
mapreduce.map.java.opts     75% of mapreduce.map.memory.mb
mapreduce.reduce.memory.mb  4 times yarn.scheduler.minimum-allocation-mb
mapreduce.reduce.java.opts  75% of mapreduce.reduce.memory.mb
Nimble Storage, Inc.
211 River Oaks Parkway, San Jose, CA 95134
Tel: 877-364-6253 | www.nimblestorage.com | [email protected]
© 2014 Nimble Storage, Inc. Nimble Storage, InfoSight, SmartStack, NimbleConnect, and CASL are trademarks or registered trademarks of Nimble Storage, Inc. All other trademarks are the property of their respective owners. BPG-Hadoop-1114