yarn
DESCRIPTION
A brief introduction to YARN: how and why it came into existence and how it fits together with this thing called Hadoop. Focus given to architecture, availability, resource management and scheduling, migration from MR1 to MR2, job history and logging, interfaces, and applications.
TRANSCRIPT
1
YARN Alex Moundalexis
@technmsg
CC BY 2.0 / Richard Bumgardner
Been there, done that.
3
• Solutions Architect • AKA consultant • government • Infrastructure
Alex @ Cloudera
4
• product • distribution of Hadoop components, Apache licensed • enterprise tooling
• support • training • services (aka consulting) • community
What Does Cloudera Do?
5
• Cloudera builds software • most donated to Apache • some closed-source
• Cloudera "products" I reference are open source • Apache Licensed • source code is on GitHub
• https://github.com/cloudera
Disclaimer
6
• deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning • depends heavily on data and workload
• coding • line diagrams don’t count
• algorithms • I suck at math, ask anyone
What This Talk Isn’t About
7
• Why YARN? • Architecture • Availability • Resources & Scheduling • MR1 to MR2 Gotchas • History • Interfaces • Applications • Storytime
So What ARE We Talking About?
9
Why "Ecosystem"?
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
10
Partial Ecosystem
[Diagram: Hadoop at the center. Data flows in via log collection from web servers and device logs, DB table import from an RDBMS/DWH, and API access from external systems. On the cluster: batch processing, machine learning, SQL, and Search. Data flows out via DB table export to an RDBMS/DWH, API access to external systems, and users querying through a BI tool over JDBC/ODBC.]
11
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
12
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON '07 ]
13
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google's paper
• http://research.google.com/archive/mapreduce.html
14
MR1 Components
• JobTracker • accepts jobs from client • schedules jobs on particular nodes • accepts status data from TaskTrackers
• TaskTracker • one per node • manages tasks • crunches data in-place • reports to JobTracker
15
Under the Covers
16
You specify map() and reduce() functions. The framework does the rest.
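To make the division of labor concrete, here is a minimal word-count sketch of the two functions you supply (illustrative only; the job driver, input/output formats, and cluster submission are omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map(): emit (word, 1) for every token in the input line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce(): sum the 1s the framework has grouped under each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}

Everything else (splitting the input, shuffling and sorting the intermediate pairs, retrying failed tasks) is handled by the framework.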
But wait… WHY DO WE NEED THIS?
18
20
YARN Yet Another Ridiculous Name
21
YARN Yet Another Ridiculous Name
22
YARN Yet Another Resource Negotiator
23
Why YARN / MR2?
• Scalability • JT kept track of individual tasks and wouldn’t scale
• Utilization • All slots are equal even if the work is not equal
• Multi-tenancy • Every framework shouldn't need to write its own execution engine
• All frameworks should share the resources on a cluster
24
An Operating System?
• Traditional Operating System
• Storage: File System
• Execution/Scheduling: Processes/Kernel Scheduler
• Hadoop
• Storage: Hadoop Distributed File System (HDFS)
• Execution/Scheduling: Yet Another Resource Negotiator (YARN)
25
Multiple levels of scheduling
• YARN • Which application (framework) to give resources to?
• Application (Framework - MR etc.)
• Which task within the application should use these resources?
27
Architecture
28
Architecture – running multiple applications
29
Control Flow: Submit application
30
Control Flow: Get application updates
31
Control Flow: AM asking for resources
32
Control Flow: AM using containers
33
Execution Modes
• Local mode • Uber mode • Executors
• DefaultContainerExecutor • LinuxContainerExecutor
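As a hedged illustration, two of these modes map to stock configuration properties like the following (values are examples, not recommendations):

<!-- mapred-site.xml: run very small jobs inside the AM's own JVM ("uber" mode) -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>

<!-- yarn-site.xml: launch containers with the Linux container executor -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>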
35
Availability
[Diagram: clients fail over between two ResourceManagers; each RM has an elector backed by a ZooKeeper (ZK) store; NodeManagers and clients all point at the active RM.]
36
Availability – Subtleties
• Embedded leader elector • No need for a separate daemon like ZKFC
• Implicit fencing using ZKRMStateStore
• Active RM claims exclusive access to store through ACL magic
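A hedged yarn-site.xml sketch of RM high availability with the ZK-backed store (IDs, hostnames, and ports are placeholders):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>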
37
Availability – Implications
• Previously submitted applications continue to run • New Application Masters are created
• If the AM checkpoints state, it can continue from where it left off • MR keeps track of completed tasks; they don't have to be re-run
• Future • Work-preserving RM Restart / Failover
38
Availability – Implications
• Transparent to clients • RM unavailable for a small duration • Automatically fail over to the Active RM • Web UI redirects • REST API redirects (starting in 5.1.0)
40
Resource Model and Capacities
• Resource vectors • e.g. 1024 MB, 2 vcores, … • No more task slots!
• Nodes specify the amount of resources they have • yarn.nodemanager.resource.memory-mb • yarn.nodemanager.resource.cpu-vcores
• vcores to cores relation, not really "virtual"
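A hedged yarn-site.xml sketch of a NodeManager advertising its capacity (values are illustrative, not recommendations):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>  <!-- memory this node offers to containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>  <!-- vcores this node offers to containers -->
</property>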
41
Resources and Scheduling
• What you request is what you get • No more fixed-size slots • Framework/application requests resources for a task
• The MR AM requests resources for map and reduce tasks; these requests can potentially be for different amounts of resources
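In MapReduce those per-task requests surface as job configuration, roughly like this hedged sketch (values are illustrative):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>  <!-- container requested for each map task -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>  <!-- reduce tasks may ask for a different amount -->
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1536</value>  <!-- the MR AM's own container -->
</property>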
42
YARN Scheduling
[Diagram repeated on slides 42–50: ResourceManager, Application Master 1, Application Master 2, Node 1, Node 2, Node 3]
43
YARN Scheduling
AM1 → RM: "I want 2 containers with 1024 MB and 1 core each"
44
YARN Scheduling
RM → AM1: "Noted"
45
YARN Scheduling
AM1 → RM (heartbeat): "I'm still here"
46
YARN Scheduling
RM: "I'll reserve some space on Node 1 for AM1"
47
YARN Scheduling
AM1 → RM: "Got anything for me?"
48
YARN Scheduling
RM → AM1: "Here's a security token to let you launch a container on Node 1"
49
YARN Scheduling
AM1 → NodeManager on Node 1: "Hey, launch my container with this shell command"
50
YARN Scheduling
A container for AM1 starts on Node 1
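The same dialogue, expressed as a hedged sketch against the YARN client API (in practice this runs inside an AM container the RM has already launched; error handling and unregistration are omitted):

import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TinyAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register with the ResourceManager.
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(conf);
    rm.start();
    rm.registerApplicationMaster("", 0, "");

    NMClient nm = NMClient.createNMClient();
    nm.init(conf);
    nm.start();

    // "I want 2 containers with 1024 MB and 1 core each."
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    for (int i = 0; i < 2; i++) {
      rm.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat ("I'm still here") until the RM hands back containers with tokens.
    int launched = 0;
    while (launched < 2) {
      AllocateResponse response = rm.allocate(0.0f);
      for (Container container : response.getAllocatedContainers()) {
        // "Hey, launch my container with this shell command."
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            null, null, Collections.singletonList("sleep 30"), null, null, null);
        nm.startContainer(container, ctx);
        launched++;
      }
      Thread.sleep(1000);
    }
  }
}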
51
Resources on a Node (5 GB)
• MR AM: 1024 MB
• Map: 1024 MB
• Map: 512 MB
• Map: 256 MB
• Map: 256 MB
• Reduce: 1536 MB
• Reduce: 512 MB
52
FairScheduler (FS)
• When space becomes available to run a task on the cluster, which application do we give it to?
• Find the job that is using the least space.
53
FS: Apps and Queues
• Apps go in "queues"
• Share fairly between queues
• Share fairly between apps within queues
54
FS: Hierarchical Queues
• Root: Mem Capacity 12 GB, CPU Capacity 24 cores
• Marketing: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• R&D: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Sales: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Jim's Team: Fair Share Mem 2 GB, Fair Share CPU 4 cores
• Bob's Team: Fair Share Mem 2 GB, Fair Share CPU 4 cores
55
FS: Fast and Slow Lanes
• Root: Mem Capacity 12 GB, CPU Capacity 24 cores
• Marketing: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Sales: Fair Share Mem 4 GB, Fair Share CPU 8 cores
• Fast Lane: Max Share Mem 1 GB, Max Share CPU 1 core
• Slow Lane: Fair Share Mem 3 GB, Fair Share CPU 7 cores
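A hedged sketch of how such queues might be declared in a FairScheduler allocations file (queue names and the nesting of the team and lane sub-queues are illustrative guesses, not taken from the deck):

<?xml version="1.0"?>
<allocations>
  <queue name="marketing">
    <weight>1.0</weight>
  </queue>
  <queue name="r_and_d">
    <weight>1.0</weight>
    <queue name="jims_team"/>  <!-- nested queues split their parent's fair share -->
    <queue name="bobs_team"/>
  </queue>
  <queue name="sales">
    <weight>1.0</weight>
    <queue name="fast_lane">
      <maxResources>1024 mb, 1 vcores</maxResources>  <!-- capped "fast lane" -->
    </queue>
    <queue name="slow_lane"/>
  </queue>
</allocations>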
56
• Traverse the tree starting at the root queue • Offer resources to subqueues in order of how few resources they're using
FS: Fairness for Hierarchies
57
FS: Hierarchical Queues
[Diagram: queue hierarchy with Root at the top, Marketing, R&D, and Sales below it, and Jim's Team and Bob's Team as sub-queues]
58
FS: Multi-resource scheduling
• Scheduling based on multiple resources • CPU, memory • Future: Disk, Network
• Why multiple resources? • Better utilization • More fair
59
FS: More features
• Preemption • To avoid starvation, preempt tasks using more than their fair share after the preemption timeout
• Warn applications. An application can choose to kill any of its containers
• Locality through delay scheduling • Try to give node-local, rack-local resources by waiting for some time
60
Enforcing resource limits
• Memory • Monitor process usage and kill if it crosses the limit • Disable virtual memory checking • Physical memory checking is being improved
• CPU • cgroups
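A hedged yarn-site.xml sketch of the corresponding switches (defaults vary by version, so treat the values as illustrative):

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>  <!-- disable virtual memory checking -->
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>  <!-- keep physical memory checking on -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
  <!-- enforce CPU limits via cgroups; requires the LinuxContainerExecutor -->
</property>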
61
Microsoft Office EULA. Really.
62
MR1 to MR2 Gotchas
• AMs can take up all resources • Symptom: submitted jobs don't run • Fix in progress to limit the max number of applications • Workaround – use scheduler allocations to limit the number of applications
• How to run 4 maps and 2 reduces per node? • Don't try to tune the number of tasks per node • Set assignMultiple to false to spread allocations
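A hedged sketch of those two workarounds (element and property names come from the stock FairScheduler; values are illustrative):

<!-- fair-scheduler.xml: cap concurrently running applications in a queue -->
<queue name="root">
  <maxRunningApps>20</maxRunningApps>
</queue>

<!-- yarn-site.xml: hand out at most one container per node per heartbeat -->
<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <value>false</value>
</property>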
63
MR1 to MR2 Gotchas
• Comparing MR1 and MR2 benchmarks • TestDFSIO runs best on dedicated CPU/disk, harder to pin • TeraSort changed: less compressible == more network xfer
• Resource Allocation vs Resource Consumption • RM allocates resources, heap specified elsewhere • JVM overhead not included • Mind your mapred.[map|reduce].child.java.opts
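A hedged illustration of allocation versus consumption, using the newer property names (values are illustrative; the heap must leave headroom for JVM overhead inside the container):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>  <!-- what the RM allocates for the map container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx800m</value>  <!-- what the task JVM may actually consume as heap -->
</property>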
64
MR1 to MR2 Gotchas
• Changes in logs make tracing problems harder • MR1: distributed grep on the JobId • YARN logs are more generic, dealing with containers, not apps
66
Job History
• Job History Viewing was moved to its own server: Job History Server
• Helps with load on RM (JT equivalent) • Helps separate MR from YARN
67
How Does History Flow?
• AM • While running, keeps track of all events during execution
• On success, before finishing up • Writes the history information to done_intermediate_dir
• The JHS • periodically scans the done_intermediate dir • moves the files to done_dir • starts showing the history
68
History: Important Configuration Properties
• yarn.app.mapreduce.am.staging-dir • Default (CM): /user ← Want this also for security • Default (CDH): /tmp/hadoop-yarn/staging • Staging directory for MapReduce applications
• mapreduce.jobhistory.done-dir
• Default: ${yarn.app.mapreduce.am.staging-dir}/history/done • Final location in HDFS for history files
• mapreduce.jobhistory.intermediate-done-dir
• Default: ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate • Location in HDFS where AMs dump history files
69
History: Important Configuration Properties
• mapreduce.jobhistory.max-age-ms • Default 604800000 (7 days) • Max age before JHS deletes history
• mapreduce.jobhistory.move.interval-ms
• Default: 180000 (3 min) • Frequency at which JHS scans the intermediate_done dir
70
History: Miscellaneous
• The JHS runs as 'mapred', the AM runs as the user who submitted the job, and the RM runs as 'yarn' • The done-intermediate dir needs to be writable by the user who submitted the job and readable by 'mapred' • The RM, AM, and JHS should have identical versions of the jobhistory-related properties so they all "agree"
71
Application History Server / Timeline Server
• Work in progress to capture history and other information for non-MR YARN applications
72
YARN Container Logs
• While application is running • Local to the NM. yarn.nodemanager.log-dirs
• After application finishes • Logs aggregated to HDFS
• yarn.nodemanager.remote-app-log-dir
• Disable aggregation? • yarn.log-aggregation-enable
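With aggregation enabled, a hedged example of retrieving those logs after the application finishes (the application ID is a placeholder):

# fetch aggregated container logs for a finished application
yarn logs -applicationId application_1400000000000_0001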
75
Interacting with a YARN cluster
• Java API • MR1 – MR2 APIs are compatible
• REST API • RM, NM, JHS – all have REST APIs that are very useful
• Llama (Long-Lived Application Master) • Cloudera Impala can reserve, use, and release resource allocations without using YARN-managed container processes
• CLI • yarn rmadmin, application, etc.
• Web UI • New and "improved" – takes time to get used to
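A hedged sample of the CLI and REST surfaces mentioned above (hostnames, ports, and IDs are placeholders):

# list running YARN applications
yarn application -list

# check which ResourceManager is active in an HA pair
yarn rmadmin -getServiceState rm1

# query the ResourceManager's REST API for cluster applications
curl http://rm-host:8088/ws/v1/cluster/apps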
77
• MR2 • Cloudera Impala • Apache Spark • Others? Custom? • Apache Slider (incubating); not production-ready
• Accumulo • HBase • Storm
YARN ApplicaAons
79
• Shipping • Enabled by default on CDH5+ • Included for past two years, not enabled
• Supported • Recommended
The Cloudera View of YARN
80
• Benchmarking is harder • different utilization paradigm • "whole cluster" benchmarks more important, e.g. SWIM
• Tuning still largely trial/error • MR1 was the same originally • YARN/MR2 will get there eventually
Growing Pains
81
• A few are using it in production • Many are exploring
• Spark • Impala via Llama
• Most are waiting
What Are Customers Doing?
82
• Mesos • designed to be completely general purpose • more burden on app developer (offer model vs app request)
• YARN • designed with Hadoop in mind • supports Kerberos • more robust/familiar scheduling • rack/machine locality out of the box
• Supportability • all commercial Hadoop vendors support YARN • support for Mesos limited to startup Mesosphere
Why not Mesos?
83
Is This the End for MapReduce?
Extra special thanks: ALL OF YOU
85
• CC BY 2.0 flik https://flic.kr/p/4RVoUX • CC BY 2.0 Ian Sane https://flic.kr/p/nRyHxd • CC BY-NC 2.0 lollyknit https://flic.kr/p/49C1Xi • CC BY-ND 2.0 jankunst https://flic.kr/p/deU71s • CC BY-SA 2.0 pierrepocs https://flic.kr/p/9mgdMd • CC BY-SA 2.0 bekathwia https://flic.kr/p/4FpABU • CC BY-NC-ND 2.0 digitalnc https://flic.kr/p/dxyTt1 • CC BY-NC-ND 2.0 arselectronica https://flic.kr/p/7yw8z2 • CC BY-NC-ND 2.0 yum9me https://flic.kr/p/81hQ49 • CC BY-NC-SA 2.0 jimnix https://flic.kr/p/gsqpWC • Microsoft Office EULA (really)
Image Credits
86
Thank You! Alex Moundalexis @technmsg Insert witty tagline here.