How Klout migrated from CDH3 to CDH4

…and survived to tell about it

Large Scale Production Engineering Meetup

September 19, 2013

Ian Kallen

Lead Engineer, Klout

About Klout

● recognizing & rewarding online influence

● major social network activity signals

● Facebook, Twitter, Google+, LinkedIn, 4sq

● billions of data points consumed & processed

● pipelines update scores & topics

● hive & oozie driven jobs & workflow

By The Numbers

● 2 TB data intake, 200 TB processed daily

● jobs clusters x 2 (dev/staging + production)

● hbase x 6 (dev/staging + production x 5)

● hbase: 350M req/day, 17K req/sec peak

● jobs, hbase & zookeeper total ≈ 350 hosts
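A quick back-of-the-envelope check on the figures above, as a minimal sketch. It derives the average HBase request rate from the daily total, compares it with the stated peak, and computes how much processing each ingested terabyte drives. Only the four input numbers come from the slide; the variable names are illustrative.

```python
# Input figures are taken from the slide; everything else is derived arithmetic.
SECONDS_PER_DAY = 24 * 60 * 60

hbase_requests_per_day = 350_000_000
hbase_peak_requests_per_sec = 17_000
intake_tb_per_day = 2
processed_tb_per_day = 200

# Average load implied by the daily request total.
avg_requests_per_sec = hbase_requests_per_day / SECONDS_PER_DAY

# How far the stated peak sits above that average.
peak_to_average = hbase_peak_requests_per_sec / avg_requests_per_sec

# Data processed per unit of data ingested.
processing_amplification = processed_tb_per_day / intake_tb_per_day

print(f"average HBase load: {avg_requests_per_sec:,.0f} req/sec")
print(f"peak is {peak_to_average:.1f}x the average")
print(f"each TB ingested drives {processing_amplification:.0f} TB of processing")
```

The peak-to-average ratio is the number that matters for capacity planning: clusters have to be sized for the peak, not for the daily mean.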

Motivations

● pipelines unstable, slow on cdh3 (Hadoop v0.20.2)

● HBase performance needed to be more predictable

● old hive version limited pipeline developers

● cdh3 EOL’d 6/2013

● cdh4 (Hadoop v2.0.x) supports NN H/A, impala

● more shiny things

The Environment

● data center hosted

● I/O subsystems are under our control

● network latencies are under our control

● FAQ: Why not AWS?

● saved millions of dollars last year

● that's a lot of beer money.

Cloud Envy

● elasticity need is low, but...

● this is super easy on AWS

● bring up a replacement cluster

● double-write or migrate data to replacement

● tear down old cluster

● have a celebratory drink

● if you have any beer money left
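The replacement-cluster steps above (bring up, double-write or migrate, tear down) can be sketched as follows. This is an illustration only, not something Klout ran: the `DoubleWriter` class is hypothetical, and plain dictionaries stand in for the old and replacement clusters.

```python
class DoubleWriter:
    """Hypothetical helper: writes go to both stores until cutover."""

    def __init__(self, old_store, new_store):
        self.old_store = old_store
        self.new_store = new_store
        self.cut_over = False

    def put(self, key, value):
        # During the migration window every write lands on both clusters.
        if not self.cut_over:
            self.old_store[key] = value
        self.new_store[key] = value

    def get(self, key):
        # Reads stay on the old cluster until the cutover is declared.
        store = self.new_store if self.cut_over else self.old_store
        return store[key]

    def backfill(self):
        # Copy data written before double-writing began,
        # without clobbering newer values already on the replacement.
        for key, value in self.old_store.items():
            self.new_store.setdefault(key, value)

    def finish(self):
        # Switch reads to the replacement; the old cluster can now be torn down.
        self.cut_over = True


# Dictionaries stand in for the two clusters (assumption for illustration).
old_cluster = {"user:1": "score=61"}
new_cluster = {}

writer = DoubleWriter(old_cluster, new_cluster)
writer.put("user:2", "score=47")   # lands on both clusters
writer.backfill()                  # copies user:1 to the replacement
writer.finish()                    # reads now come from the replacement
print(writer.get("user:1"), writer.get("user:2"))
```

The approach needs a second full set of hardware for the duration of the migration, which is cheap to rent in a cloud and expensive in a fixed data center footprint.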

Ops Infra

● nagios, pager duty for monitoring

● monit for process watchdogging

● jmx+, graphite, gdash+ for metrics

● ubuntu boot images for provisioning

● puppet for configuration management

● … no Cloudera Manager
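For readers unfamiliar with the metrics stack listed above: Graphite's Carbon daemon accepts a plaintext protocol of one `path value timestamp` line per data point, by default on port 2003. The sketch below shows that wire format. The slides do not say how Klout's JMX metrics reached Graphite, so the metric path, host name, and helper functions here are hypothetical.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(host, port, path, value):
    """Send a single data point to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))


# Hypothetical metric path, shown without opening a network connection.
print(graphite_line("hbase.regionserver01.requests", 17000, 1379548800), end="")

# To actually send it (host name is an assumption):
# send_metric("graphite.example.internal", 2003, "hbase.regionserver01.requests", 17000)
```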

Making Plans

● no replacement infra to migrate to

○ so upgrades must be done in place

● Cloudera prefers Cloudera Manager

○ so we were on our own to devise a plan

● Cloudera helped vet our plan (thanks!)

● confidence building on dev/staging clusters

● lots of rehearsals on VMs, bug reports

Execution

● detailed checklists, kanban board

● small test clusters, the dev clusters

● planned SLA miss for prod cluster upgrade

● lined up phone consult availability w/Cloudera

○ we needed it about 10 hours into the prod jobs cluster upgrade

● nobody died

Aftermath

● jobs run faster (speculative execution?)

● pipelines are faster

● metrics exposed are improved

● HBase clusters lose block locality in transit

○ fixable

● no animals were harmed in this production
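To make "block locality" concrete: it is the fraction of a region's HDFS blocks that have a replica on the same host as the region server serving them, so reads can skip the network. The sketch below computes that fraction for made-up data; the function name and host names are hypothetical. The slides do not say how Klout restored locality. In HBase generally, a major compaction rewrites store files from the hosting region server, which places a replica locally again.

```python
def locality_index(block_replica_hosts, region_server_host):
    """Fraction of blocks with a replica on the region server's own host."""
    if not block_replica_hosts:
        return 1.0  # nothing to read remotely
    local = sum(1 for hosts in block_replica_hosts if region_server_host in hosts)
    return local / len(block_replica_hosts)


# Made-up replica placement for four blocks of one region (3 replicas each).
blocks = [
    {"dn1", "dn2", "dn3"},
    {"dn2", "dn4", "dn5"},
    {"dn1", "dn4", "dn6"},
    {"dn3", "dn5", "dn6"},
]

# A region server on dn1 can read only half of these blocks locally.
print(locality_index(blocks, "dn1"))
```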

Retrospect

● we had many post-mortems along the way

● lots of engineering time & attention

● sweating the details paid off

● mostly because we’re “power users” of hive

● lessons learned:

○ re-align clusters

○ improve use of vendor tools where possible

■ e.g. Cloudera Manager

Onward

● dev/staging + prod clusters x 2

● better use of HDFS paths & job scheduling

● consolidating zookeeper ensembles

● implementing NameNode H/A

● evaluating Cloudera Manager

● evaluating Impala (maybe)

Gratuitous Recruiting Slide

Klout is hiring awesome people passionate about optimizing for innovation & stability, crunching big data & robust systems

If you are a great Hadoop DevOps Engineer

Join Us!

ian@klout.com

Thanks!
