
An Introduction to Cloudera’s Administrator Training for Apache Hadoop

Ian Wrigley, Sr. Curriculum [email protected]

© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Topics

• Why Take Cloudera Training?
• Administrator Course Contents
• A Deeper Dive: An overview of HDFS High Availability
• A Deeper Dive: Some of Hadoop’s advanced configuration options
• Question time

Why Cloudera Training?

1 Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2 Most Experienced Instructors: Over 15,000 students trained since 2009
3 Leader in Certification: Over 5,000 accredited Cloudera professionals
4 State of the Art Curriculum: Classes updated regularly as Hadoop evolves
5 Widest Geographic Coverage: Most classes offered: 50 cities worldwide plus online
6 Most Relevant Platform & Community: CDH deployed more than all other distributions combined
7 Depth of Training Material: Hands-on exercises and VMs support live instruction
8 Ongoing Learning: Video tutorials and e-learning complement training

Learning Path: System Administrators

• Administrator Training
–Configure, install, and monitor clusters for optimal performance
–Implement security measures and multi-user functionality
• Enterprise Training
–Use Cloudera Manager to speed deployment and scale the cluster
–Learn which tools and techniques improve cluster performance
• HBase Training
–Implement massively distributed, columnar storage at scale
–Enable random, real-time read/write access to all data
• Data Analyst Training
–Vertically integrate basic analytics into data management
–Transform and manipulate data to drive high-value utilization




Administrator Course Objectives

During the Administrator course, you learn:

• The core technologies of Hadoop
• How to populate HDFS from external sources
• How to plan your Hadoop cluster hardware and software
• How to deploy a Hadoop cluster
• What issues to consider when installing Pig, Hive, and Impala
• What issues to consider when deploying Hadoop clients
• How Cloudera Manager can simplify Hadoop administration
• How to configure HDFS for high availability
• What issues to consider when implementing Hadoop security


Administrator Course Objectives (cont’d)

• How to schedule jobs on the cluster
• How to maintain your cluster
• How to monitor, troubleshoot, and optimize the cluster


Hands-On Exercises

The course features many Hands-On Exercises, including:
–Deploying Hadoop in pseudo-distributed mode
–Deploying a complete, multi-node Hadoop cluster
–Importing data into HDFS using Sqoop and Flume
–Installing Hive and Impala
–Using Hue to control user access
–Configuring HDFS High Availability
–Configuring the FairScheduler
–Troubleshooting problems on the cluster
–… and more


Course Chapters

Course Introduction
–Introduction

Introduction to Apache Hadoop
–The Case for Apache Hadoop
–HDFS
–Getting Data Into HDFS
–MapReduce

Planning, Installing, and Configuring a Hadoop Cluster
–Planning Your Hadoop Cluster
–Hadoop Installation and Initial Configuration
–Installing and Configuring Hive, Impala, and Pig
–Hadoop Clients
–Cloudera Manager
–Advanced Cluster Configuration
–Hadoop Security

Cluster Operations and Maintenance
–Managing and Scheduling Jobs
–Cluster Maintenance
–Cluster Monitoring and Troubleshooting

Course Conclusion and Appendices
–Conclusion
–Kerberos Configuration
–Configuring HDFS Federation




HDFS High Availability Overview

• A single NameNode is a single point of failure (SPOF)

• Two ways a NameNode can cause HDFS downtime
–Unexpected NameNode crash (rare)
–Planned maintenance of the NameNode (more common)

• HDFS High Availability (HA) eliminates this SPOF
–Available in CDH4 (or related Apache Hadoop 0.23.x and 2.x)


HDFS High Availability Architecture

• HDFS High Availability uses a pair of NameNodes
–One Active and one Standby
–Clients only contact the Active NameNode
–DataNodes send heartbeats to both NameNodes
–Active NameNode writes its metadata to a quorum of JournalNodes
–Standby NameNode reads from the JournalNodes to remain in sync with the Active NameNode


HDFS High Availability Architecture (cont’d)

• Active NameNode writes edits to the JournalNodes
–Software to do this is the Quorum Journal Manager (QJM)
  –Built in to the NameNode
  –Waits for a success acknowledgment from the majority of JournalNodes
  –Majority commit means a single crashed or lagging JournalNode will not impact NameNode latency
  –Uses the Paxos algorithm to ensure reliability even if edits are being written as a JournalNode fails

• Note that there is no Secondary NameNode when implementing HDFS High Availability
–The Standby NameNode periodically performs checkpointing
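The majority-commit rule described on this slide can be illustrated with a short Python sketch. This is not the Quorum Journal Manager itself (which is built in to the NameNode); the function names `majority` and `edit_is_committed` are hypothetical and exist only to show the arithmetic.

```python
from typing import List


def majority(journal_nodes: int) -> int:
    """Smallest number of acknowledgments that forms a strict majority."""
    return journal_nodes // 2 + 1


def edit_is_committed(acks: List[bool]) -> bool:
    """True once a majority of JournalNodes have acknowledged the edit.

    `acks` holds one flag per JournalNode (hypothetical representation).
    """
    return sum(acks) >= majority(len(acks))


# Three JournalNodes, one crashed or lagging: the edit still commits.
print(edit_is_committed([True, True, False]))
# Only one of three responded: no majority, so no commit.
print(edit_is_committed([True, False, False]))
```

With three JournalNodes the writer only needs two acknowledgments, which is why one slow node does not hold up the NameNode.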


Failover

• Only one NameNode may be active at any given time
–The other is in standby mode

• The standby maintains a copy of the active NameNode’s state
–So it can take over when the active NameNode goes down

• Two types of failover
–Manual (detected and initiated by a user)
–Automatic (detected and initiated by HDFS itself)


Automatic Failover

• Automatic failover is based on Apache ZooKeeper
–A coordination service also used by HBase
–An open source Apache project
–One of the components in CDH

• A daemon called the ZooKeeper Failover Controller (ZKFC) runs on each NameNode machine

• ZooKeeper needs a quorum of nodes
–Typical installations use three or five nodes
–Low resource usage
–Can install alongside existing master daemons
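Why three or five nodes? A quorum is a strict majority, so the number of node failures an ensemble survives follows directly from its size. The sketch below is illustrative arithmetic only; `tolerated_failures` is a hypothetical helper, not a ZooKeeper API.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Nodes that can fail while a strict majority remains available."""
    return (ensemble_size - 1) // 2


for size in (3, 4, 5):
    print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
```

A four-node ensemble tolerates no more failures than a three-node one, which is one reason odd sizes are the usual choice.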


HDFS HA With Automatic Failover – Deployment




hdfs-site.xml

dfs.namenode.handler.count

The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: ln(number of cluster nodes) * 20. Symptoms of this being set too low: ‘connection refused’ messages in DataNode logs as they try to transmit block reports to the NameNode. Used by the NameNode.
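As a worked example of the recommendation ln(number of cluster nodes) * 20, here is a minimal Python sketch. Truncating to a whole number and never going below the default of 10 are assumptions of this sketch, not part of the stated recommendation; the function name is hypothetical.

```python
import math

DEFAULT_HANDLER_COUNT = 10  # default quoted on the slide


def recommended_handler_count(cluster_nodes: int) -> int:
    """ln(cluster_nodes) * 20, truncated to an integer.

    Assumption: never return less than the default of 10.
    """
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    suggested = int(math.log(cluster_nodes) * 20)
    return max(DEFAULT_HANDLER_COUNT, suggested)


for nodes in (10, 100, 1000):
    print(nodes, "nodes ->", recommended_handler_count(nodes))
```

Because the formula is logarithmic, a cluster ten times larger needs only about 46 more handler threads.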

dfs.datanode.failed.volumes.tolerated

The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Used by DataNodes.


core-site.xml

fs.trash.interval

When a file is deleted, it is placed in a .Trash directory in the user’s home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Used by clients and the NameNode.
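Hadoop configuration files hold each setting as a `<property>` element with `<name>` and `<value>` children inside a `<configuration>` root. The sketch below uses Python's standard library to render the recommended trash interval in that shape; the helper `hadoop_property` is hypothetical, and a real core-site.xml would normally contain other properties as well.

```python
import xml.etree.ElementTree as ET

ONE_DAY_MINUTES = 24 * 60  # 1440, the recommended value


def hadoop_property(name: str, value: str) -> ET.Element:
    """Build one <property> element in Hadoop's configuration layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


configuration = ET.Element("configuration")
configuration.append(hadoop_property("fs.trash.interval", str(ONE_DAY_MINUTES)))

xml_text = ET.tostring(configuration, encoding="unicode")
print(xml_text)
```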


mapred-site.xml

mapred.job.tracker.handler.count

Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: ln(number of cluster nodes) * 20. Used by the JobTracker.

mapred.reduce.parallel.copies

Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: ln(number of cluster nodes) * 4 with a floor of 10. Used by TaskTrackers.
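The recommendation for mapred.reduce.parallel.copies adds a floor to the logarithmic formula. A minimal sketch, assuming the result is truncated to a whole number (the slide does not say how to round); the function name is hypothetical.

```python
import math

PARALLEL_COPIES_FLOOR = 10  # floor quoted in the recommendation


def recommended_parallel_copies(cluster_nodes: int) -> int:
    """ln(cluster_nodes) * 4, truncated, with a floor of 10."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    suggested = int(math.log(cluster_nodes) * 4)
    return max(PARALLEL_COPIES_FLOOR, suggested)


for nodes in (10, 100, 1000):
    print(nodes, "nodes ->", recommended_parallel_copies(nodes))
```

Under these assumptions the floor dominates on small clusters: the formula alone does not exceed 10 until the cluster has more than about 12 nodes.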

tasktracker.http.threads

The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Used by TaskTrackers.


mapred-site.xml (cont’d)

mapred.reduce.slowstart.completed.maps

The fraction of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05 (5 percent). Recommendation: 0.8 (80 percent). Used by the JobTracker.
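To see what the slow-start setting means for a concrete job, the sketch below converts the fraction into a count of Map tasks. Rounding up to a whole task is an assumption of this sketch, and `maps_before_reducers` is a hypothetical helper rather than anything in Hadoop.

```python
import math


def maps_before_reducers(total_maps: int, slowstart: float) -> int:
    """Map tasks that must finish before Reducers are scheduled.

    Assumption: a fractional result is rounded up to a whole task.
    """
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0 and 1")
    return math.ceil(total_maps * slowstart)


total = 200
print("default 0.05:", maps_before_reducers(total, 0.05))
print("recommended 0.8:", maps_before_reducers(total, 0.8))
```

For a 200-Map job, the default lets Reducers start after 10 Maps, while the recommended value holds them back until 160 have finished, so Reducer slots are not occupied while they wait for Map output.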



• Submit questions in the Q&A panel

• Watch on-demand video of this webinar and many more at http://cloudera.com

• Follow Ian on Twitter @iwrigley

• Follow Cloudera University @ClouderaU

• Learn more at Strata + Hadoop World: http://tinyurl.com/hadoopworld

• Thank you for attending!

Register now for Cloudera training at http://university.cloudera.com

Use discount code Admin_10 to save 10% on new enrollments in Administrator Training classes delivered by Cloudera until December 1, 2013*

Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until December 1, 2013*

* Excludes classes sold or delivered by Cloudera partners

