BDM37: Hadoop in Production – The War Stories, by Nikolai Grigoriev, Principal Software Engineer, SociableLabs


Page 1: BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs

Hadoop – The War Stories

Running Hadoop in large enterprise environment

Nikolai Grigoriev ([email protected], @nikgrig)Principal Software Engineer, http://sociablelabs.com

Page 2:

Agenda

● Why Hadoop?

● Planning Hadoop deployment

● Hadoop and real hardware

● Understanding the software stack

● Tuning HDFS, MapReduce and HBase

● Troubleshooting examples

● Testing your applications

Disclaimer: this presentation is based on combined work experience from more than one company and represents the author's personal point of view on the problems discussed in it.

Page 3:

Why Hadoop (and why we decided to use it)?

● Need to store hundreds of TB of info

● Need to process it in parallel

● Desire to have both storage and processing horizontally scalable

● Having an open-source platform with commercial support

Page 4:

Our application

(diagram: Application servers (many :)) → Log processors → “ETL process”)

Page 5:

Our application in numbers

● Thousands of user sessions per second

● Average session log size: ~30 KB, 3-7 events per log

● Target retention period – at least ~90 days

● Redundancy and HA everywhere

● Pluggable “ETL” modules for additional data processing

Page 6:

Main problem

The team had no practical knowledge of Hadoop, HDFS and HBase…

...and there was nobody at the company to help

Page 7:

But we did not realize...

It was not THE ONLY problem we were about to face!

Page 8:

First fight – capacity planning

● Tons of articles are written about Hadoop capacity planning

● Architects may be spending months making educated guesses

● Capacity planning is really about finding the amount of $$$ to be spent on your cluster for a target workload
– If we had an infinite amount of $$$, why would we bother at all? ;)

Page 9:

Hadoop performance limiting factors

Page 10:

It is all about the balance

● Your Hadoop cluster and your apps use all these resources at different times

● Over-provisioning one resource usually leads to a shortage of another one - wasted $$$

Page 11:

What can we say about an app?

● It is going to store X TB of data
– Amount of storage (do not forget the RF!)
– Accommodate for growth and failures

● It is going to ingest data at Y MB/s
– Your network speed and number of nodes

● Latency
– More HDDs and faster HDDs
– More RAM
– More nodes

Page 12:

We are a big enterprise... Geeky Hadoop developer vs. Old School Senior IT Guy

Geeky Hadoop developer:
- many “commodity+” hosts
- good but inexpensive networking
- more regular HDDs
- lots of RAM
- I also love cloud…
- my recent OS
- my software configuration
- simple network

Old School Senior IT Guy:
- SANs, RAIDs, SCSI, racks, blades, redundancy, Cisco, HP, fiber optics, 4-year-old rock-solid RHEL, SNMP monitoring…
- what? I am the Boss...

Page 13:

Hadoop cluster vs. old school application servers

● Mostly identical “commodity+” machines
– Probably with the exception of NN, JT

● Better to have more, simpler machines than fewer monster ones

● No RAID, just JBOD!

● Ethernet: depending on the storage density, bonded 1 Gbit may be enough

● Hadoop achieves with software what used to be achievable with [expensive!] hardware

Page 14:

But still, your application is the driver, not the IT guy!

From the Cloudera website – Hadoop machine configuration according to workload

Page 15:

Your job is:

● Educate your IT, get them on your side or at least earn their trust

● Try to build a capacity planning spreadsheet based on what you do know

● Apply common sense to guess what you do not know

● ...and plan a decent buffer

● Set reasonable performance targets for your application
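A capacity spreadsheet can start as a few lines of arithmetic built from the numbers quoted earlier in this deck. A minimal sketch, assuming 2000 sessions/sec as a stand-in for "thousands per second" (that rate, the 25% growth buffer and RF=3 are illustrative assumptions, not figures from the deck):

```python
# Back-of-envelope capacity estimate from the deck's own numbers.
# ASSUMPTION: 2000 sessions/sec stands in for "thousands per second".
SESSIONS_PER_SEC = 2000
LOG_SIZE_BYTES = 30 * 1024          # ~30 KB per session log (from slide 5)
RETENTION_DAYS = 90                 # target retention (from slide 5)
REPLICATION_FACTOR = 3              # HDFS default RF (assumed)
GROWTH_BUFFER = 1.25                # 25% headroom for growth/failures (assumed)

ingest_mb_per_sec = SESSIONS_PER_SEC * LOG_SIZE_BYTES / 1e6
raw_tb = ingest_mb_per_sec * 86400 * RETENTION_DAYS / 1e6
provisioned_tb = raw_tb * REPLICATION_FACTOR * GROWTH_BUFFER

print(f"ingest: {ingest_mb_per_sec:.0f} MB/s")
print(f"raw data over retention: {raw_tb:.0f} TB")
print(f"provisioned (RF x buffer): {provisioned_tb:.0f} TB")
```

Even this crude version lands in the "hundreds of TB" range the deck mentions, and it gives IT a number to argue about instead of a feeling.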

Page 16:

Fight #2 – OMG, our application is slow!!!

● The main part of our application was the MR job merging the logs

● We had committed to deliver X logs/sec on a target test cluster with a sample workload

● We were delivering ~30% of that
● ...weeks before release :)
● ...and we had run out of other excuses :(
● It was clearly our software and/or configuration

Page 17:

Wait a second – we have a support contract with a Hadoop vendor!

● I mean no disrespect to the vendors!

● But they do not know your application

● And they do not know your hardware

● And they do not know exactly your OS

● And they do not know your network equipment

● They can help you with some tuning, they can help you with bugs and crashes – but they won't be able (or sometimes simply won't be qualified) to do your job!

Page 18:

We are on our own :(

● We realized that our testing methods were not adequate for a Hadoop-based ETL process

● Testing the product end-to-end was too difficult, tracking changes was impossible

● Turn-around was too long, we could not try something quickly and revert back

● Observing and monitoring the live system with dummy incoming data was not productive enough

Page 19:

Key to successful testing

● Representative data set

● Ability to repeat the same operation as many times as needed with quick turnaround

● Each engineer had to be able to run the tests and try something

● Establishing the key metrics you monitor and try to improve

● Methodical approach – analyze, change, test, be ready to roll back

Page 20:

Our “reference runner”

Large sample dataset → “Reset” tool → Runner tool → Statistics (driven by a Manager)

“Reset” tool: recreates HBase tables (predefined regions), cleans HDFS etc.

Runner tool: injects the test data, prepares the environment, launches the MR job like the real application, allows us to quickly rebuild and redeploy part of the application

Statistics: any improvements since the last run?

Page 21:

Tuning results

● In two weeks we had a job that worked about 3 times faster

● Tuning was done everywhere – from the OS to Hadoop/HBase and our own code

● We were confident that the software was ready to go to production

● During the following 2 years we realized how bad our design was and how it should have been done ;)

Page 22:

Hadoop MapReduce DOs

● Think processes, not threads

● Reusable objects, lower GC overhead
● Snappy data compression is generally good

● Reasonable use of counters provides important information

● For frequently running jobs, distributed cache helps a lot

● Minimize disk I/O (spills etc), RAM is cheap

● Avoid unnecessary serialization/deserialization
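The "minimize disk I/O (spills)" point can be made concrete with a little arithmetic. A hedged sketch (the sizes are illustrative; the parameter names follow Hadoop's MR2-era mapreduce.task.io.sort.* settings, known as io.sort.* in MR1) estimating how many times a map task will spill to disk:

```python
import math

# Rough spill-count estimate for one map task -- illustrative sizes.
map_output_mb = 600      # total bytes the mapper emits (assumed)
sort_mb = 256            # mapreduce.task.io.sort.mb (sort buffer size)
spill_percent = 0.8      # mapreduce.map.sort.spill.percent (fill threshold)

usable_mb = sort_mb * spill_percent
spills = math.ceil(map_output_mb / usable_mb)
print(f"~{spills} spill(s); anything above 1 means extra merge I/O")
```

Raising the sort buffer (at the cost of task heap) until the estimate reaches a single spill is the usual lever, which is also why the DO above says "RAM is cheap".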

Page 23:

Hadoop MapReduce DONTs

● Small files in HDFS

● Multithreaded programming inside mapper/reducer

● Fat tasks using too much heap

● Any I/O in M-R other than HDFS, ZK or HBase

● Over-complicated code (simple things work better)

Page 24:

Fight #3 – Going Production!

● Remember the slide about engineer vs. IT God preferences ;)

● Production hardware was slightly different from the test cluster

● The cluster was deployed by people who did not know Hadoop

● The first attempt to run the software resulted in a major failure and the cluster was finally handed over to the developers for fixing ;)

Page 25:

Production hardware

● HP blade servers, 32 cores, 128 GB of RAM

● Emulex dual-port 10G Ethernet NICs

● 14 HDDs per machine

● OEL 6.3

● 10G switch modules

● Company hosting center with dedicated networking and operations staff

Page 26:

Step back – a 10,000 ft look at the Hadoop stack

Hardware
BIOS/Firmware(s)
BIOS/Firmware settings
OS (Linux)
Java (JVM)
Hadoop services
Your application(s)
(+ the Network connecting it all)

- Hadoop is not just a bunch of Java apps
- It is a data and application platform
- It can run well, just run, barely run or cause constant headache – depending on how much love it receives :)

Page 27:

Hadoop stack (continued)

● In Hadoop a small problem, sometimes even on a single node, can be a major pain

● Isolating and finding that small problem may be difficult

● Symptoms are often obvious only at a high level (e.g. application)

● Complex hardware (like HP) adds more potential problems

Page 28:

Example of one of the problems we had initially

● Jobs were failing because of timeouts

● Numerous I/O errors observed in job and HDFS logs

● This simple test was failing:

$ dd if=/dev/zero of=test8Gb.bin bs=1M count=8192
$ time hdfs dfs -copyFromLocal test8Gb.bin /
Zzz..zzz...zzz...5min...zzz…
real 4m10.002s
user 0m15.130s
sys 0m4.094s

● IT was clueless but did not really bother
● In fact, 8192 MB / (4 * 60 + 10) s = 32 MB/s (!?!?!)
● A 10Gb network transfers to HDFS at ~160 MB/s

Page 29:

Role of HDFS in Hadoop

● In Hadoop HDFS is the key layer that provides the distributed filesystem services for other components

● Health of HDFS directly (and drastically) affects the health of other components

(diagram: HDFS provides the data layer beneath Map-Reduce and HBase)

Page 30:

So, clearly HDFS was the problem

● But what was the problem with HDFS??

● How exactly does HDFS writing work?

Page 31:

Chasing it down

● Due to node-to-node streaming it was difficult to understand who was responsible

● The theory of “one bad node in the pipeline” was ruled out, as results were consistently bad across the cluster of 14 nodes

● Idea (isolating the problem is good):

$ time hdfs dfs -D dfs.replication=1 -copyFromLocal test8Gb.bin /
real 0m42.002s
$ time hdfs dfs -D dfs.replication=2 -copyFromLocal test8Gb.bin /
real 2m53.184s
$ time hdfs dfs -D dfs.replication=3 -copyFromLocal test8Gb.bin /
real 3m41.072s

● 8192 MB / 42 s = 195 MB/s – hmmm….
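The arithmetic on this slide generalizes to all three runs; a small sketch computing effective client throughput for each replication factor from the timings above (rounded to whole seconds):

```python
# Effective client-side throughput per replication factor, computed
# from the copyFromLocal timings on this slide (8192 MB file).
FILE_MB = 8192
copy_time_sec = {1: 42, 2: 173, 3: 221}   # dfs.replication -> seconds

throughput = {rf: FILE_MB / t for rf, t in copy_time_sec.items()}
for rf in sorted(throughput):
    print(f"RF={rf}: {throughput[rf]:.0f} MB/s")
```

With a healthy pipeline the penalty for RF=2 and RF=3 should be far smaller than the 4-5x collapse seen here, which is what pointed the investigation at the network rather than the disks.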

Page 32:

Discoveries

● To make an even longer story short...
– A bug in the “cubic” TCP congestion protocol in the Linux kernel

– NIC firmware was too old

– The kernel driver for the Emulex 10G NICs was too old

– Only one out of 8 NIC RX queues was enabled on some hosts

– A number of network settings were not appropriate for a 10G network

– The “irqbalance” process (due to a kernel bug) was locking NIC RX queues by “losing” NIC IRQ handlers

– ...

Page 33:

More discoveries

– Nodes were set up multi-homed; at that time even HDFS did not support that

– Misconfigured DNS and reverse DNS

● On the disk I/O side
– Bad filesystem parameters

– Read-ahead settings were wrong

– Disk controller firmware was old

Page 34:

HDFS “litmus” test - TestDFSIO

13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write

13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013

13/03/13 16:30:02 INFO fs.TestDFSIO: Number of files: 16

13/03/13 16:30:02 INFO fs.TestDFSIO: Total MBytes processed: 160000.0

13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779

13/03/13 16:30:02 INFO fs.TestDFSIO: Average IO rate mb/sec: 103.61066436767578

13/03/13 16:30:02 INFO fs.TestDFSIO: IO rate std deviation: 4.513343367320971

13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876

13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read

13/03/13 16:31:31 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:31:31 UTC 2013

13/03/13 16:31:31 INFO fs.TestDFSIO: Number of files: 16

13/03/13 16:31:31 INFO fs.TestDFSIO: Total MBytes processed: 160000.0

13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676

13/03/13 16:31:31 INFO fs.TestDFSIO: Average IO rate mb/sec: 648.8555908203125

13/03/13 16:31:31 INFO fs.TestDFSIO: IO rate std deviation: 267.0954600161208

13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683

13/03/13 16:31:31 INFO fs.TestDFSIO:

Page 35:

Fight #4 – tuning Hadoop

● Why do people tune things (IT was not interested ;) )?

● With your own expensive hardware you want the maximum IOPS and CPU power for $$$ you have paid

● Not to mention that you simply want your apps to run faster

● Tuning is an endless process but 80/20 rule works perfectly

Page 36:

Even before you have something to tune….

● Pick reasonably good hardware but do not go high-end

● Same for network equipment

● Hadoop scales well and the redundancy is achieved by software

● More nodes are almost always better than going for extra per-node power and/or storage space

● Simpler systems are easier to tune, maintain and troubleshoot

● Different machines for master nodes

Page 37:

Tuning the hardware and BIOS

● Updating the BIOS and firmware to recent versions

● Disabling dynamic CPU frequency scaling

● Tuning memory speed, power profile

● Disk controller, tune disk cache

Page 38:

OS Tuning

● Pick the filesystem (ext3, ext4, XFS...), parameters (reserved blocks 0%) and mount options (noatime, nodiratime, barriers etc.)

● I/O scheduler depending on your disks and tasks

● Read-ahead settings

● Disable swap!

● irqbalance for big machines

● Tune other parameters (number of FDs, sockets)

● Install major troubleshooting tools (iostat, iotop, tcpdump, strace…) on every node

Page 39:

Network tuning

● Test your TCP performance with iperf, ttcp or any other tools you like

● Know your NICs well, install right firmware and kernel modules

● Tune your TCP and IP parameters (work harder if you have expensive 10G network)

● If your NIC supports TCP offload and it works – use it

● txqueuelen, MTU 9000 (if appropriate), HDFS is chatty

● Learn ethtool and see what it can do for you

● Basic IP networking set-up (DNS etc) has to be 100% perfect
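One reason the slide says to "work harder" on 10G: the TCP socket buffers must cover the bandwidth-delay product, or the link never fills regardless of NIC quality. A quick sketch (the 1 ms round-trip time is an assumed in-datacenter figure):

```python
# Bandwidth-delay product: the TCP window needed to keep a link busy.
# ASSUMPTION: 10 Gbit/s link and a 1 ms in-datacenter round-trip time.
link_bits_per_sec = 10e9
rtt_sec = 0.001

bdp_bytes = link_bits_per_sec / 8 * rtt_sec
print(f"TCP window to fill the pipe: ~{bdp_bytes / 1e6:.2f} MB")
```

Kernel defaults tuned for 1G networks often cap buffers well below this; net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem are the sysctls to check.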

Page 40:

JVM tuning

● Hadoop allows you to set JVM options for all processes

● Your DataNode, NameNode and HBase RegionServers are going to work hard and you need to help them deal with your workload

● If your MR code is well designed you will most likely NOT need to tune the JVM for MR tasks

● Your main enemy will be GC – until you become at least allies, if not friends :)
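The deck does not give concrete flags, but as a hedged illustration of where such options live: service JVM options are typically set per daemon in hadoop-env.sh, and a CMS-era starting point looked like the fragment below (every flag and heap size here is an assumption to verify against your own Hadoop and Java versions, not a recommendation):

```
# hadoop-env.sh -- illustrative CMS-era settings, verify before use
export HADOOP_NAMENODE_OPTS="-Xmx8g -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps $HADOOP_NAMENODE_OPTS"
```

The GC logging flags matter as much as the collector choice: without a GC log you cannot tell whether the enemy mentioned above is even attacking.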

Page 41:

Tuning Hadoop services

● NameNode deals with many connections and needs ~150 bytes per HDFS block

● NameNode and DataNode are highly concurrent; the latter needs many threads

● Use HDFS short-circuit reads if appropriate

● ZooKeeper needs to handle enough connections

● HBase uses LOTS of heap

● Reuse JVMs for MR jobs if appropriate
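The "~150 bytes per HDFS block" figure above turns into a NameNode heap budget directly. A sketch under assumed cluster numbers (500 TB of data in large files and a 128 MB block size are illustrative, and per-file/directory inode overhead is ignored):

```python
# NameNode metadata footprint: each block object costs ~150 bytes of
# heap, and all of it lives in the NameNode's memory at once.
total_data_tb = 500      # assumed cluster data size
block_size_mb = 128      # assumed dfs block size
bytes_per_block = 150    # rule of thumb from the slide

blocks = total_data_tb * 1e6 / block_size_mb
heap_mb = blocks * bytes_per_block / 1e6
print(f"{blocks:.0f} blocks -> ~{heap_mb:.0f} MB of NameNode heap")
# The same 500 TB stored as 1 MB files would need 128x the block
# objects, plus an inode per file -- which is why small files in
# HDFS are on the DON'Ts slide.
```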

Page 42:

Tuning MapReduce tasks (that means tuning for your code and data)

● If you run different MR jobs, consider tuning parameters for each of them, not once and for all of them

● Configure job scheduler to enforce the SLAs

● Estimate the resource needed for each job

● Plan how you are going to run your jobs

Page 43:

Tuning your own code

● Test and profile your complex MR code outside of Hadoop (your savings will scale too!)

● Check for GC overhead

● Use reusable objects

● Avoid using expensive formats like JSON and XML

● Anything you waste is multiplied by the number of rows and the number of tasks!

● Evaluate the need for intermediate data compression

Page 44:

Tuning HBase

● That requires a separate presentation

● You will need to fight hard for reducing GC pauses and overhead

● Pre-splitting regions may be a good idea to better balance the load

● Understand HBase compactions and deal with major compactions your way

Page 45:

Set up your monitoring (and alarming)

● You cannot improve what you cannot see!

● Monitor OS, Hadoop and your app metrics

● Ganglia, Graphite, LogStash, even Cloudera Manager are your friends

● Set the baseline, track your changes, observe the outcome

Page 46:

Fight #5 - Operations

● A real hand-over to the Operations people actually never happened

● In case of any problem, it was either ignored or escalated to the engineers within about 1 minute

● Neither NOC nor Operations staff wanted to acquire enough knowledge of Hadoop and the apps

● Monitoring was nearly non-existent

● Same for appropriate alarms

Page 47:

If you are serious...

● Send your Ops for Hadoop training (or buy them books and have them read those!)

● Have them automate everything

● Ops have to understand your applications, not just the platform they are running on

● Your Ops need to be decent Linux admins

● ...and it would be great if they are also OK programmers (scripting, Java…)

● Of course, the motivation is the key

Page 48:

Plan and train for disaster

● Train your Ops how to help your system survive till Monday morning

● Decide what sort of loss you will tolerate (BigData is not always so precious)

● Design your system for resilience, async processing, queuing etc

Page 49:

Fight #6 - evolution

● Sooner or later you will need to increase your capacity
– Unless your business is stagnating

● Technically, you will either
– Run out of storage space

– Start hitting the wall on IOPS or CPU and fail to respect your SLAs (even if only internal ones)

– Not be able to deploy new applications

Page 50:

Understand your application - again

● Even if your app runs fine you need to monitor the performance factors

● Build spreadsheets reflecting your current numbers
● Plan for business growth

● Translate this into the number of additional nodes and networking equipment

● Especially important if your hardware purchase cycle takes months
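Translating the forecast above into "number of additional nodes" is a one-liner once per-node figures are fixed. A sketch with assumed, illustrative numbers (14 x 2 TB JBOD per node as on the production cluster, 25% reserved space, 50% annual growth; none of these are the deck's figures):

```python
import math

# Storage forecast -> data node count. All inputs are assumptions.
current_raw_tb = 480       # current raw (pre-replication) data
annual_growth = 0.5        # 50% business growth per year
years = 1
replication = 3
node_usable_tb = 14 * 2 * 0.75   # 14 x 2 TB JBOD, ~25% reserved

future_raw_tb = current_raw_tb * (1 + annual_growth) ** years
nodes = math.ceil(future_raw_tb * replication / node_usable_tb)
print(f"{future_raw_tb:.0f} TB raw -> {nodes} data nodes")
```

Running this ahead of each purchase cycle is exactly the kind of spreadsheet the previous bullets call for.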

Page 51:

Conclusions

● Not all companies are ready for BigData – often because of conservative people in key positions

● Traditional IT/Ops/NOC organizations are often unable to support these platforms

● Engineers have to be given more power to control how the things they build are run (DevOps)

● Hadoop is a complex platform and has to be taken seriously for serious applications

● If you really depend on Hadoop you do need to build in-house expertise

Page 52:

Questions?

Thanks for listening!

Nikolai Grigoriev
[email protected]