Introduction to Cloudera's Administrator Training for Apache Hadoop

Upload: cloudera-inc

Post on 06-May-2015

DESCRIPTION

Learn who is best suited to attend the full Administrator Training, what prior knowledge you should have, and what topics the course covers. Cloudera Senior Curriculum Manager, Ian Wrigley, will discuss the skills you will attain during Admin Training and how they will help you move your Hadoop deployment from strategy to production and prepare for the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.

TRANSCRIPT

Page 1: Introduction to Cloudera's Administrator Training for Apache Hadoop

An Introduction to Cloudera’s Administrator Training for Apache Hadoop

Ian Wrigley
Sr. Curriculum Manager
[email protected]

Page 2: Topics

© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

• Why Take Cloudera Training?
• Administrator Course Contents
• A Deeper Dive: An overview of HDFS High Availability
• A Deeper Dive: Some of Hadoop’s advanced configuration options
• Question time

Page 3: Why Cloudera Training?

1. Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors: Over 15,000 students trained since 2009
3. Leader in Certification: Over 5,000 accredited Cloudera professionals
4. State of the Art Curriculum: Classes updated regularly as Hadoop evolves
5. Widest Geographic Coverage: Most classes offered, in 50 cities worldwide plus online
6. Most Relevant Platform & Community: CDH deployed more than all other distributions combined
7. Depth of Training Material: Hands-on exercises and VMs support live instruction
8. Ongoing Learning: Video tutorials and e-learning complement training

Page 4: Learning Path: System Administrators

• Administrator Training
– Configure, install, and monitor clusters for optimal performance
– Implement security measures and multi-user functionality

• Enterprise Training
– Use Cloudera Manager to speed deployment and scale the cluster
– Learn which tools and techniques improve cluster performance

• HBase Training
– Implement massively distributed, columnar storage at scale
– Enable random, real-time read/write access to all data

• Data Analyst Training
– Vertically integrate basic analytics into data management
– Transform and manipulate data to drive high-value utilization

Page 6: Administrator Course Objectives

During the Administrator course, you learn:

• The core technologies of Hadoop
• How to populate HDFS from external sources
• How to plan your Hadoop cluster hardware and software
• How to deploy a Hadoop cluster
• What issues to consider when installing Pig, Hive, and Impala
• What issues to consider when deploying Hadoop clients
• How Cloudera Manager can simplify Hadoop administration
• How to configure HDFS for high availability
• What issues to consider when implementing Hadoop security

Page 7: Administrator Course Objectives (cont’d)

• How to schedule jobs on the cluster
• How to maintain your cluster
• How to monitor, troubleshoot, and optimize the cluster

Page 8: Hands-On Exercises

The course features many Hands-On Exercises, including:

– Deploying Hadoop in pseudo-distributed mode
– Deploying a complete, multi-node Hadoop cluster
– Importing data into HDFS using Sqoop and Flume
– Installing Hive and Impala
– Using Hue to control user access
– Configuring HDFS High Availability
– Configuring the FairScheduler
– Troubleshooting problems on the cluster
– … and more

Page 9: Course Chapters

• Course Introduction
– Introduction

• Introduction to Apache Hadoop
– The Case for Apache Hadoop
– HDFS
– Getting Data Into HDFS
– MapReduce

• Planning, Installing, and Configuring a Hadoop Cluster
– Planning Your Hadoop Cluster
– Hadoop Installation and Initial Configuration
– Installing and Configuring Hive, Impala, and Pig
– Hadoop Clients
– Cloudera Manager
– Advanced Cluster Configuration
– Hadoop Security

• Cluster Operations and Maintenance
– Managing and Scheduling Jobs
– Cluster Maintenance
– Cluster Monitoring and Troubleshooting

• Course Conclusion and Appendices
– Conclusion
– Kerberos Configuration
– Configuring HDFS Federation

Page 11: HDFS High Availability Overview

• A single NameNode is a single point of failure (SPOF)
• Two ways a NameNode can result in HDFS downtime
– Unexpected NameNode crash (rare)
– Planned maintenance of NameNode (more common)
• HDFS High Availability (HA) eliminates this SPOF
– Available in CDH4 (or the related Apache Hadoop 0.23.x and 2.x)

Page 12: HDFS High Availability Architecture

• HDFS High Availability uses a pair of NameNodes
– One Active and one Standby
– Clients only contact the Active NameNode
– DataNodes heartbeat in to both NameNodes
– Active NameNode writes its metadata to a quorum of JournalNodes
– Standby NameNode reads from the JournalNodes to remain in sync with the Active NameNode

Page 13: HDFS High Availability Architecture (cont’d)

• Active NameNode writes edits to the JournalNodes
– The software that does this is the Quorum Journal Manager (QJM)
– Built in to the NameNode
– Waits for a success acknowledgment from the majority of JournalNodes
– Majority commit means a single crashed or lagging JournalNode will not impact NameNode latency
– Uses the Paxos algorithm to ensure reliability even if edits are being written as a JournalNode fails
• Note that there is no Secondary NameNode when implementing HDFS High Availability
– The Standby NameNode periodically performs checkpointing
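The majority-commit rule described on this slide can be illustrated with a few lines of Python. This is only a sketch of the counting rule, assuming each JournalNode reports a simple success or failure; it is not the Quorum Journal Manager's code and it does not implement Paxos. The function names `majority` and `edit_is_committed` are made up for this illustration.

```python
from typing import Iterable


def majority(journal_nodes: int) -> int:
    """Smallest number of JournalNodes that forms a majority."""
    return journal_nodes // 2 + 1


def edit_is_committed(acks: Iterable[bool]) -> bool:
    """Return True once a majority of JournalNodes acknowledged the edit.

    `acks` holds one entry per JournalNode: True for a success
    acknowledgment, False for a node that crashed or is lagging.
    """
    acks = list(acks)
    return sum(acks) >= majority(len(acks))


# Three JournalNodes, one of them lagging: the edit still commits.
print(edit_is_committed([True, True, False]))   # True
# Two of three unavailable: no majority, so the edit is not committed.
print(edit_is_committed([True, False, False]))  # False
```

With three JournalNodes the writer only has to wait for the two fastest acknowledgments, which is why one slow node does not add latency.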

Page 14: Failover

• Only one NameNode can be active at any given time
– The other is in standby mode
• The standby maintains a copy of the active NameNode’s state
– So it can take over when the active NameNode goes down
• Two types of failover
– Manual (detected and initiated by a user)
– Automatic (detected and initiated by HDFS itself)
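As a rough mental model of the rule that only one NameNode is active at a time, the toy Python class below swaps the roles of two nodes during a failover. It is a teaching sketch only: the class names, the node names `nn1` and `nn2`, and the demote-then-promote ordering are assumptions for illustration, and nothing here reflects how HDFS detects failures or performs the transition internally.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    name: str
    state: str  # "active" or "standby"


class NameNodePair:
    """Toy model of an HA pair; not how HDFS implements failover."""

    def __init__(self, first: str, second: str) -> None:
        self.nodes = [NameNode(first, "active"), NameNode(second, "standby")]

    def active(self) -> NameNode:
        actives = [n for n in self.nodes if n.state == "active"]
        # Invariant from the slide: only one NameNode is active at a time.
        assert len(actives) == 1, "expected exactly one active NameNode"
        return actives[0]

    def failover(self) -> NameNode:
        """Demote the current active node, then promote the standby."""
        old = self.active()
        new = next(n for n in self.nodes if n is not old)
        old.state = "standby"  # demote first so two nodes are never active
        new.state = "active"
        return new


pair = NameNodePair("nn1", "nn2")  # hypothetical NameNode names
print(pair.active().name)    # nn1
print(pair.failover().name)  # nn2
```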

Page 15: Automatic Failover

• Automatic failover is based on Apache ZooKeeper
– A coordination service also used by HBase
– An open source Apache project
– One of the components in CDH
• A daemon called the ZooKeeper Failover Controller (ZKFC) runs on each NameNode machine
• ZooKeeper needs a quorum of nodes
– Typical installations use three or five nodes
– Low resource usage
– Can install alongside existing master daemons
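Why three or five nodes? A quorum is a strict majority of the ensemble, so the number of node failures a ZooKeeper ensemble can survive follows from simple arithmetic. The short Python sketch below works this out; the function names are made up for the illustration.

```python
def quorum_size(ensemble: int) -> int:
    """Strict majority of the ensemble."""
    return ensemble // 2 + 1


def failures_tolerated(ensemble: int) -> int:
    """Nodes that can be lost while a quorum is still reachable."""
    return ensemble - quorum_size(ensemble)


for nodes in (3, 4, 5):
    print(nodes, quorum_size(nodes), failures_tolerated(nodes))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-node ensemble tolerates no more failures than a three-node one, which is one reason odd sizes are the usual choice.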

Page 16: HDFS HA With Automatic Failover – Deployment

Page 18: hdfs-site.xml

dfs.namenode.handler.count
The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: ln(number of cluster nodes) * 20. Symptoms of this being set too low: ‘connection refused’ messages in DataNode logs as they try to transmit block reports to the NameNode. Used by the NameNode.

dfs.datanode.failed.volumes.tolerated
The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Used by DataNodes.
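To make the two hdfs-site.xml settings concrete, here is a small Python sketch of the arithmetic. It makes two assumptions that the slide does not state: the result of the formula is rounded to the nearest integer, and it is never allowed to drop below the default of 10. The function names are illustrative and are not part of Hadoop.

```python
import math

DEFAULT_HANDLER_COUNT = 10  # default quoted on the slide


def recommended_handler_count(cluster_nodes: int) -> int:
    """Apply the slide's rule of thumb: ln(number of cluster nodes) * 20."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    suggested = round(math.log(cluster_nodes) * 20)  # rounding is an assumption
    # Assumption: never suggest fewer threads than the default.
    return max(DEFAULT_HANDLER_COUNT, suggested)


def datanode_stays_online(failed_volumes: int, tolerated: int = 0) -> bool:
    """Model of dfs.datanode.failed.volumes.tolerated (default 0)."""
    return failed_volumes <= tolerated


print(recommended_handler_count(100))          # 92
print(datanode_stays_online(1))                # False: default tolerates none
print(datanode_stays_online(1, tolerated=2))   # True
```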

Page 19: core-site.xml

fs.trash.interval
When a file is deleted, it is placed in a .Trash directory in the user’s home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Used by clients and the NameNode.
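Hadoop configuration files hold each setting as a `property` element with a `name` and a `value`. The Python sketch below builds such an element for the recommended trash interval with the standard library's `xml.etree.ElementTree`. The helper name `hadoop_property` is made up for this example, and the resulting element would still need to be placed inside the `configuration` element of core-site.xml.

```python
import xml.etree.ElementTree as ET

MINUTES_PER_DAY = 24 * 60  # 1440, the value recommended on the slide


def hadoop_property(name: str, value: object) -> str:
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


print(hadoop_property("fs.trash.interval", MINUTES_PER_DAY))
# <property><name>fs.trash.interval</name><value>1440</value></property>
```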

Page 20: mapred-site.xml

mapred.job.tracker.handler.count
Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: ln(number of cluster nodes) * 20. Used by the JobTracker.

mapred.reduce.parallel.copies
Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: ln(number of cluster nodes) * 4 with a floor of 10. Used by TaskTrackers.

tasktracker.http.threads
The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Used by TaskTrackers.
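The three recommendations on this slide can be worked out for a given cluster size with a short Python sketch. Rounding to the nearest integer is an assumption (the slide does not say how to round), and the helper names `ln_scaled` and `suggested_mapred_settings` are invented for the example.

```python
import math


def ln_scaled(cluster_nodes: int, factor: int, floor: int = 0) -> int:
    """Compute ln(cluster_nodes) * factor, rounded, with an optional floor."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    return max(floor, round(math.log(cluster_nodes) * factor))


def suggested_mapred_settings(cluster_nodes: int) -> dict:
    """Map each property on the slide to its recommended value."""
    return {
        "mapred.job.tracker.handler.count": ln_scaled(cluster_nodes, 20),
        # The slide gives a floor of 10 for this setting only.
        "mapred.reduce.parallel.copies": ln_scaled(cluster_nodes, 4, floor=10),
        "tasktracker.http.threads": 80,  # fixed recommendation
    }


for name, value in suggested_mapred_settings(200).items():
    print(name, value)
# mapred.job.tracker.handler.count 106
# mapred.reduce.parallel.copies 21
# tasktracker.http.threads 80
```

On a five-node cluster the formula for parallel copies gives about 6, so the floor of 10 applies instead.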

Page 21: mapred-site.xml (cont’d)

mapred.reduce.slowstart.completed.maps
The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05 (5 percent). Recommendation: 0.8 (80 percent). Used by the JobTracker.
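To see what the slow-start setting means for a particular job, the Python sketch below converts the fraction into a count of Map tasks. It assumes the threshold is rounded up to a whole task, which the slide does not specify, and the function name is made up. `Fraction` is used so that values such as 0.8 are handled exactly rather than as binary floating point.

```python
import math
from fractions import Fraction


def maps_before_reducers(total_maps: int, slowstart: float) -> int:
    """Map tasks that must finish before Reducers are scheduled.

    Assumption: the threshold is rounded up to a whole task.
    """
    fraction = Fraction(str(slowstart))  # exact decimal, e.g. 4/5 for 0.8
    if not 0 <= fraction <= 1:
        raise ValueError("slowstart must be between 0 and 1")
    return math.ceil(fraction * total_maps)


# A job with 200 Map tasks:
print(maps_before_reducers(200, 0.05))  # 10  (default setting)
print(maps_before_reducers(200, 0.8))   # 160 (recommended setting)
```

With the default, Reducers can occupy slots after only 10 of 200 Map tasks finish; with 0.8 they wait until 160 are done.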

Page 24

• Submit questions in the Q&A panel

• Watch on-demand video of this webinar and many more at http://cloudera.com

• Follow Ian on Twitter @iwrigley

• Follow Cloudera University @ClouderaU

• Learn more at Strata + Hadoop World: http://tinyurl.com/hadoopworld

• Thank you for attending!

Register now for Cloudera training at http://university.cloudera.com

Use discount code Admin_10 to save 10% on new enrollments in Administrator Training classes delivered by Cloudera until December 1, 2013*

Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until December 1, 2013*

* Excludes classes sold or delivered by Cloudera partners