yarn
DESCRIPTION
A brief introduction to YARN: how and why it came into existence and how it fits together with this thing called Hadoop. Focus given to architecture, availability, resource management and scheduling, migration from MR1 to MR2, job history and logging, interfaces, and applications.
TRANSCRIPT
1
YARN Alex Moundalexis
@technmsg
CC BY 2.0 / Richard Bumgardner
Been there, done that.
3
• Solutions Architect • AKA consultant • government • Infrastructure
Alex @ Cloudera
4
• product • distribution of Hadoop components, Apache licensed • enterprise tooling
• support • training • services (aka consulting) • community
What Does Cloudera Do?
5
• Cloudera builds software • most donated to Apache • some closed-source
• Cloudera "products" I reference are open source • Apache Licensed • source code is on GitHub
• https://github.com/cloudera
Disclaimer
6
• deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning • depends heavily on data and workload
• coding • line diagrams don’t count
• algorithms • I suck at math, ask anyone
What This Talk Isn’t About
7
• Why YARN? • Architecture • Availability • Resources & Scheduling • MR1 to MR2 Gotchas • History • Interfaces • Applications • Storytime
So What ARE We Talking About?
9
Why "Ecosystem"?
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
10
Partial Ecosystem
[Diagram: Hadoop at the center. Data flows in via log collection from web servers and device logs, DB table import from an RDBMS/DWH, and API access from external systems. On the cluster: batch processing, machine learning, SQL, and Search. Data flows out via DB table export to an RDBMS/DWH, API access to external systems, and users querying through a BI tool over JDBC/ODBC.]
11
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
12
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON '07 ]
13
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google's paper
• http://research.google.com/archive/mapreduce.html
14
MR1 Components
• JobTracker • accepts jobs from client • schedules jobs on particular nodes • accepts status data from TaskTrackers
• TaskTracker • one per node • manages tasks • crunches data in-place • reports to JobTracker
15
Under the Covers
16
You specify map() and reduce() functions. The framework does the rest.
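To make the division of labor concrete, here is a minimal word-count sketch of the two functions you supply (illustrative only; the job driver, input/output formats, and cluster submission are omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map(): emit (word, 1) for every token in the input line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce(): sum the 1s the framework has grouped under each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}

Everything else (splitting the input, shuffling and sorting the intermediate pairs, retrying failed tasks) is handled by the framework.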
But wait… WHY DO WE NEED THIS?
18
20
YARN Yet Another Ridiculous Name
21
YARN Yet Another Ridiculous Name
22
YARN Yet Another Resource Negotiator
23
Why YARN / MR2?
• Scalability • JT kept track of individual tasks and wouldn’t scale
• Utilization • All slots are equal even if the work is not equal
• Multi-tenancy • Every framework shouldn't need to write its own execution engine
• All frameworks should share the resources on a cluster
24
An Operating System?
• Traditional Operating System
• Storage: File System
• Execution/Scheduling: Processes/Kernel Scheduler
• Hadoop
• Storage: Hadoop Distributed File System (HDFS)
• Execution/Scheduling: Yet Another Resource Negotiator (YARN)
25
Multiple levels of scheduling
• YARN • Which application (framework) to give resources to?
• Application (Framework - MR etc.)
• Which task within the application should use these resources?
27
Architecture
28
Architecture – running multiple applications
29
Control Flow: Submit application
30
Control Flow: Get application updates
31
Control Flow: AM asking for resources
32
Control Flow: AM using containers
33
Execution Modes
• Local mode • Uber mode • Executors
• DefaultContainerExecutor • LinuxContainerExecutor
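As a hedged illustration, two of these modes map to stock configuration properties like the following (values are examples, not recommendations):

<!-- mapred-site.xml: run very small jobs inside the AM's own JVM ("uber" mode) -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>

<!-- yarn-site.xml: launch containers with the Linux container executor -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>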
35
Availability
[Diagram: clients fail over between two ResourceManagers; each RM has an elector backed by a ZooKeeper (ZK) store; NodeManagers and clients all point at the active RM.]
36
Availability – Subtleties
• Embedded leader elector • No need for a separate daemon like ZKFC
• Implicit fencing using ZKRMStateStore
• Active RM claims exclusive access to store through ACL magic
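A hedged yarn-site.xml sketch of RM high availability with the ZK-backed store (IDs, hostnames, and ports are placeholders):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>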
37
Availability – Implications
• Previously submitted applications continue to run • New Application Masters are created
• If the AM checkpoints state, it can continue from where it left off • MR keeps track of completed tasks; they don't have to be re-run
• Future • Work-preserving RM Restart / Failover
38
Availability – Implications
• Transparent to clients • RM unavailable for a small duration • Automatically fail over to the Active RM • Web UI redirects • REST API redirects (starting in 5.1.0)
40
Resource Model and Capacities
• Resource vectors • e.g. 1024 MB, 2 vcores, … • No more task slots!
• Nodes specify the amount of resources they have • yarn.nodemanager.resource.memory-mb • yarn.nodemanager.resource.cpu-vcores
• vcores to cores relation, not really "virtual"
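A hedged yarn-site.xml sketch of a NodeManager advertising its capacity (values are illustrative, not recommendations):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>  <!-- memory this node offers to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>  <!-- vcores this node offers to containers -->
</property>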
41
Resources and Scheduling
• What you request is what you get • No more fixed-size slots • Framework/application requests resources for a task
• The MR AM requests resources for map and reduce tasks; these requests can potentially be for different amounts of resources
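In MapReduce those per-task requests surface as job configuration, roughly like this hedged sketch (values are illustrative):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>  <!-- container requested for each map task -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>  <!-- reduce tasks may ask for a different amount -->
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1536</value>  <!-- the MR AM's own container -->
</property>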
42
YARN Scheduling
[Diagram repeated on slides 42–50: ResourceManager, Application Master 1, Application Master 2, Node 1, Node 2, Node 3]
43
YARN Scheduling
AM1 → RM: "I want 2 containers with 1024 MB and 1 core each"
44
YARN Scheduling
RM → AM1: "Noted"
45
YARN Scheduling
AM1 → RM (heartbeat): "I'm still here"
46
YARN Scheduling
RM: "I'll reserve some space on Node 1 for AM1"
47
YARN Scheduling
AM1 → RM: "Got anything for me?"
48
YARN Scheduling
RM → AM1: "Here's a security token to let you launch a container on Node 1"
49
YARN Scheduling
AM1 → NodeManager on Node 1: "Hey, launch my container with this shell command"
50
YARN Scheduling
A container for AM1 starts on Node 1
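The same dialogue, expressed as a hedged sketch against the YARN client API (in practice this runs inside an AM container the RM has already launched; error handling and unregistration are omitted):

import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TinyAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register with the ResourceManager.
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(conf);
    rm.start();
    rm.registerApplicationMaster("", 0, "");

    NMClient nm = NMClient.createNMClient();
    nm.init(conf);
    nm.start();

    // "I want 2 containers with 1024 MB and 1 core each."
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 2; i++) {
      rm.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat ("I'm still here") until the RM hands back containers with tokens.
    int launched = 0;
    while (launched < 2) {
      AllocateResponse response = rm.allocate(0.0f);
      for (Container container : response.getAllocatedContainers()) {
        // "Hey, launch my container with this shell command."
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            null, null, Collections.singletonList("sleep 30"), null, null, null);
        nm.startContainer(container, ctx);
        launched++;
      }
      Thread.sleep(1000);
    }
  }
}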
51
Resources on a Node (5 GB)
• MR AM: 1024 MB
• Map: 1024 MB
• Map: 512 MB
• Map: 256 MB
• Map: 256 MB
• Reduce: 1536 MB
• Reduce: 512 MB
52
FairScheduler (FS)
• When space becomes available to run a task on the cluster, which application do we give it to?
• Find the job that is using the least space.
53
FS: Apps and Queues
• Apps go in "queues"
• Share fairly between queues
• Share fairly between apps within queues
54
FS: Hierarchical Queues
• Root: Mem Capacity 12 GB, CPU Capacity 24 cores
• Marketing: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• R&D: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Sales: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Jim's Team: Fair Share Mem 2 GB, Fair Share CPU 4 cores
• Bob's Team: Fair Share Mem 2 GB, Fair Share CPU 4 cores
55
FS: Fast and Slow Lanes
• Root: Mem Capacity 12 GB, CPU Capacity 24 cores
• Marketing: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Sales: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Fast Lane: Max Share Mem 1 GB, Max Share CPU 1 core
• Slow Lane: Fair Share Mem 3 GB, Fair Share CPU 7 cores
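A hedged sketch of how such queues might be declared in a FairScheduler allocations file (queue names and the nesting of the team and lane sub-queues are illustrative guesses, not taken from the deck):

<?xml version="1.0"?>
<allocations>
  <queue name="marketing">
    <weight>1.0</weight>
  </queue>
  <queue name="r_and_d">
    <weight>1.0</weight>
    <queue name="jims_team"/>  <!-- nested queues split their parent's fair share -->
    <queue name="bobs_team"/>
  </queue>
  <queue name="sales">
    <weight>1.0</weight>
    <queue name="fast_lane">
      <maxResources>1024 mb, 1 vcores</maxResources>  <!-- capped "fast lane" -->
    </queue>
    <queue name="slow_lane"/>
  </queue>
</allocations>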
56
• Traverse the tree starting at the root queue • Offer resources to subqueues in order of how few resources they're using
FS: Fairness for Hierarchies
57
FS: Hierarchical Queues
[Diagram: queue hierarchy with Root at the top, Marketing, R&D, and Sales below it, and Jim's Team and Bob's Team as sub-queues]
58
FS: Multi-resource scheduling
• Scheduling based on multiple resources • CPU, memory • Future: Disk, Network
• Why multiple resources? • Better utilization • More fair
59
FS: More features
• Preemption • To avoid starvation, preempt tasks using more than their fair share after the preemption timeout
• Warn applications. An application can choose to kill any of its containers
• Locality through delay scheduling • Try to give node-local, rack-local resources by waiting for some time
60
Enforcing resource limits
• Memory • Monitor process usage and kill if it crosses the limit • Disable virtual memory checking • Physical memory checking is being improved
• CPU • cgroups
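A hedged yarn-site.xml sketch of the corresponding switches (defaults vary by version, so treat the values as illustrative):

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>  <!-- disable virtual memory checking -->
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>  <!-- keep physical memory checking on -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
  <!-- enforce CPU limits via cgroups; requires the LinuxContainerExecutor -->
</property>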
61
Microsoft Office EULA. Really.
62
MR1 to MR2 Gotchas
• AMs can take up all resources • Symptom: submitted jobs don't run • Fix in progress to limit the max number of applications • Workaround – use scheduler allocations to limit the number of applications
• How to run 4 maps and 2 reduces per node? • Don't try to tune the number of tasks per node • Set assignMultiple to false to spread allocations
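A hedged sketch of those two workarounds (element and property names come from the stock FairScheduler; values are illustrative):

<!-- fair-scheduler.xml: cap concurrently running applications in a queue -->
<queue name="root">
  <maxRunningApps>20</maxRunningApps>
</queue>

<!-- yarn-site.xml: hand out at most one container per node per heartbeat -->
<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <value>false</value>
</property>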
63
MR1 to MR2 Gotchas
• Comparing MR1 and MR2 benchmarks • TestDFSIO runs best on dedicated CPU/disk, harder to pin • TeraSort changed: less compressible == more network xfer
• Resource Allocation vs Resource Consumption • RM allocates resources, heap specified elsewhere • JVM overhead not included • Mind your mapred.[map|reduce].child.java.opts
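A hedged illustration of allocation versus consumption, using the newer property names (values are illustrative; the heap must leave headroom for JVM overhead inside the container):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>  <!-- what the RM allocates for the map container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx800m</value>  <!-- what the task JVM may actually consume as heap -->
</property>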
64
MR1 to MR2 Gotchas
• Changes in logs make tracing problems harder • MR1: distributed grep on the JobId • YARN logs are more generic, dealing with containers, not apps
66
Job History
• Job History Viewing was moved to its own server: Job History Server
• Helps with load on RM (JT equivalent) • Helps separate MR from YARN
67
How Does History Flow?
• AM • While running, keeps track of all events during execution
• On success, before finishing up • Writes the history information to done_intermediate_dir
• The JHS • periodically scans the done_intermediate dir • moves the files to done_dir • starts showing the history
68
History: Important Configuration Properties
• yarn.app.mapreduce.am.staging-dir • Default (CM): /user ← Want this also for security • Default (CDH): /tmp/hadoop-yarn/staging • Staging directory for MapReduce applications
• mapreduce.jobhistory.done-dir
• Default: ${yarn.app.mapreduce.am.staging-dir}/history/done • Final location in HDFS for history files
• mapreduce.jobhistory.intermediate-done-dir
• Default: ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate • Location in HDFS where AMs dump history files
69
History: Important Configuration Properties
• mapreduce.jobhistory.max-age-ms • Default 604800000 (7 days) • Max age before JHS deletes history
• mapreduce.jobhistory.move.interval-ms
• Default: 180000 (3 min) • Frequency at which JHS scans the intermediate_done dir
70
History: Miscellaneous
• The JHS runs as 'mapred', the AM runs as the user who submitted the job, and the RM runs as 'yarn' • The done-intermediate dir needs to be writable by the user who submitted the job and readable by 'mapred' • The RM, AM, and JHS should have identical versions of the jobhistory-related properties so they all "agree"
71
Application History Server / Timeline Server
• Work in progress to capture history and other information for non-MR YARN applications
72
YARN Container Logs
• While application is running • Local to the NM. yarn.nodemanager.log-dirs
• After application finishes • Logs aggregated to HDFS
• yarn.nodemanager.remote-app-log-dir
• Disable aggregation? • yarn.log-aggregation-enable
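With aggregation enabled, a hedged example of retrieving those logs after the application finishes (the application ID is a placeholder):

# fetch aggregated container logs for a finished application
yarn logs -applicationId application_1400000000000_0001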
75
Interacting with a YARN cluster
• Java API • MR1 – MR2 APIs are compatible
• REST API • RM, NM, JHS – all have REST APIs that are very useful
• Llama (Long-Lived Application Master) • Cloudera Impala can reserve, use, and release resource allocations without using YARN-managed container processes
• CLI • yarn rmadmin, application, etc.
• Web UI • New and "improved" – takes time to get used to
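A hedged sample of the CLI and REST surfaces mentioned above (hostnames, ports, and IDs are placeholders):

# list running YARN applications
yarn application -list

# check which ResourceManager is active in an HA pair
yarn rmadmin -getServiceState rm1

# query the ResourceManager's REST API for cluster applications
curl http://rm-host:8088/ws/v1/cluster/apps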
77
• MR2 • Cloudera Impala • Apache Spark • Others? Custom? • Apache Slider (incubating); not production-ready
• Accumulo • HBase • Storm
YARN ApplicaAons
79
• Shipping • Enabled by default on CDH5+ • Included for past two years, not enabled
• Supported • Recommended
The Cloudera View of YARN
80
• Benchmarking is harder • different utilization paradigm • "whole cluster" benchmarks more important, e.g. SWIM
• Tuning still largely trial/error • MR1 was the same originally • YARN/MR2 will get there eventually
Growing Pains
81
• A few are using it in production • Many are exploring
• Spark • Impala via Llama
• Most are waiting
What Are Customers Doing?
82
• Mesos • designed to be completely general purpose • more burden on app developer (offer model vs app request)
• YARN • designed with Hadoop in mind • supports Kerberos • more robust/familiar scheduling • rack/machine locality out of the box
• Supportability • all commercial Hadoop vendors support YARN • support for Mesos limited to startup Mesosphere
Why not Mesos?
83
Is This the End for MapReduce?
Extra special thanks: ALL OF YOU
85
• CC BY 2.0 flik https://flic.kr/p/4RVoUX • CC BY 2.0 Ian Sane https://flic.kr/p/nRyHxd • CC BY-NC 2.0 lollyknit https://flic.kr/p/49C1Xi • CC BY-ND 2.0 jankunst https://flic.kr/p/deU71s • CC BY-SA 2.0 pierrepocs https://flic.kr/p/9mgdMd • CC BY-SA 2.0 bekathwia https://flic.kr/p/4FpABU • CC BY-NC-ND 2.0 digitalnc https://flic.kr/p/dxyTt1 • CC BY-NC-ND 2.0 arselectronica https://flic.kr/p/7yw8z2 • CC BY-NC-ND 2.0 yum9me https://flic.kr/p/81hQ49 • CC BY-NC-SA 2.0 jimnix https://flic.kr/p/gsqpWC • Microsoft Office EULA (really)
Image Credits
86
Thank You! Alex Moundalexis @technmsg Insert witty tagline here.