how klout migrated from cdh3 to cdh4 …and survived to tell about it
DESCRIPTION
A short talk on Klout's journey from cdh3 to cdh4
TRANSCRIPT
![Page 1: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/1.jpg)
How Klout migrated from CDH3 to CDH4
…and survived to tell about it
Large Scale Production Engineering Meetup
September 19, 2013
Ian Kallen
Lead Engineer, Klout
© 2013 Klout
![Page 2: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/2.jpg)
About Klout
● recognizing & rewarding online influence
● major social network activity signals
● Facebook, Twitter, Google+, LinkedIn, 4sq
● billions of data points consumed & processed
● pipelines update scores & topics
● hive & oozie driven jobs & workflow
![Page 3: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/3.jpg)
By The Numbers
● 2 TB data intake, 200 TB processed daily
● jobs clusters x 2 (dev/staging + production)
● hbase x 6 (dev/staging + production x 5)
● hbase: 350M req/day, 17K req/sec peak
● jobs, hbase & zookeeper total =~ 350 hosts
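The HBase figures above imply a fairly spiky load. A minimal sketch of the arithmetic, using only the two numbers from the slide (350M req/day and 17K req/sec peak); the function names are made up for illustration:

```python
# Figures taken from the slide; everything else is derived arithmetic.
REQUESTS_PER_DAY = 350_000_000
PEAK_REQUESTS_PER_SEC = 17_000
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(requests_per_day: int) -> float:
    """Mean requests per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_to_average(peak_per_sec: float, requests_per_day: int) -> float:
    """How many times higher the peak rate is than the daily mean."""
    return peak_per_sec / average_rate(requests_per_day)


if __name__ == "__main__":
    avg = average_rate(REQUESTS_PER_DAY)
    ratio = peak_to_average(PEAK_REQUESTS_PER_SEC, REQUESTS_PER_DAY)
    print(f"average: {avg:,.0f} req/sec")
    print(f"peak/average: {ratio:.1f}x")
```

The daily mean works out to roughly 4,051 req/sec, so the quoted peak is about 4.2 times the average.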
![Page 4: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/4.jpg)
Motivations
● pipelines unstable, slow on cdh3 (Hadoop v0.20.2)
● HBase performance predictability
● old hive version limited pipeline developers
● cdh3 EOL’d 6/2013
● cdh4 (Hadoop v2.0.x) supports NN H/A, impala
● more shiny things
![Page 5: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/5.jpg)
The Environment
● data center hosted
● I/O subsystems are under our control
● network latencies are under our control
● FAQ: Why not AWS?
● saved millions of dollars last year
● that's a lot of beer money.
● elasticity need is low, but...
![Page 6: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/6.jpg)
Cloud Envy
● this is super easy on AWS
● bring up a replacement cluster
● double-write or migrate data to replacement
● tear down old cluster
● have a celebratory drink
● if you have any beer money left
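The "double-write" step in the replacement-cluster approach can be sketched as a thin wrapper that sends every write to both clusters and switches reads over once the new one is trusted. This is an illustration only, not something Klout ran: `Store`, `InMemoryStore`, and `DoubleWriter` are hypothetical names, and the in-memory store stands in for a real cluster client.

```python
from typing import Dict, Optional, Protocol


class Store(Protocol):
    """Minimal key/value interface assumed for both clusters (hypothetical)."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in for a real cluster client, for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class DoubleWriter:
    """Writes go to both clusters; reads come from the old one until cut-over."""

    def __init__(self, old: Store, new: Store) -> None:
        self.old = old
        self.new = new
        self.read_from_new = False

    def put(self, key: str, value: bytes) -> None:
        # Write the old cluster first so it stays the source of truth
        # if the write to the replacement fails.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key: str) -> Optional[bytes]:
        source = self.new if self.read_from_new else self.old
        return source.get(key)

    def cut_over(self) -> None:
        """Switch reads to the replacement; the old cluster can then be retired."""
        self.read_from_new = True
```

Data written before double-writing began would still need a separate backfill into the replacement, which is the "migrate data" alternative on the slide.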
![Page 7: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/7.jpg)
Ops Infra
● nagios, pager duty for monitoring
● monit for process watchdogging
● jmx+, graphite, gdash+ for metrics
● ubuntu boot images for provisioning
● puppet for configuration management
● … no Cloudera Manager
![Page 8: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/8.jpg)
Making Plans
● no replacement infra to migrate to
○ so upgrades must be done in place
● Cloudera prefers Cloudera Manager
○ so we were on our own to devise a plan
● Cloudera helped vet our plan (thanks!)
● confidence building on dev/staging clusters
● lots of rehearsals on VMs, bug reports
![Page 9: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/9.jpg)
Execution
● detailed checklists, kanban board
● small test clusters, the dev clusters
● planned SLA miss for prod cluster upgrade
● lined up phone consult availability w/Cloudera
○ we needed it about 10 hours into the prod jobs cluster upgrade
● nobody died
![Page 10: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/10.jpg)
Aftermath
● jobs run faster (speculative execution?)
● pipelines are faster
● metrics exposed are improved
● HBase clusters lose block locality in transit
○ fixable
● no animals were harmed in this production
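Block locality here means the share of a region server's HDFS blocks that have a replica on the same host. The slide does not say how it was fixed; the usual remedy is a major compaction, which rewrites store files through the local DataNode. A minimal sketch of the metric itself, with hypothetical host names and a made-up `locality_index` helper (not an HBase API):

```python
from typing import Iterable, Sequence


def locality_index(region_server: str,
                   block_replicas: Iterable[Sequence[str]]) -> float:
    """Fraction of blocks with at least one replica on the region server's host.

    block_replicas holds one sequence of replica host names per HDFS block.
    """
    blocks = list(block_replicas)
    if not blocks:
        # Convention chosen for this sketch: no blocks means no remote reads.
        return 1.0
    local = sum(1 for hosts in blocks if region_server in hosts)
    return local / len(blocks)


if __name__ == "__main__":
    # Hypothetical placement: one of the two blocks has a local replica.
    replicas = [
        ["rs1", "dn2", "dn3"],
        ["dn2", "dn3", "dn4"],
    ]
    print(locality_index("rs1", replicas))
```

A value well below 1.0 after an upgrade suggests reads are crossing the network until store files are rewritten locally.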
![Page 11: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/11.jpg)
Retrospect
● we had many post-mortems along the way
● lots of engineering time & attention
● sweating the details paid off
● mostly because we’re “power users” of hive
● lessons learned:
○ re-align clusters
○ improve use of vendor tools where possible
■ e.g. Cloudera Manager
![Page 12: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/12.jpg)
Onward
● dev/staging + prod clusters x 2
● better use of HDFS paths & job scheduling
● consolidating zookeeper ensembles
● implementing NameNode H/A
● evaluating Cloudera Manager
● evaluating Impala (maybe)
![Page 13: How Klout migrated from CDH3 to CDH4 …and survived to tell about it](https://reader036.vdocument.in/reader036/viewer/2022081401/5596696e1a28ab72128b4748/html5/thumbnails/13.jpg)
Gratuitous Recruiting Slide
Klout is hiring awesome people passionate about optimizing for innovation & stability, crunching big data & robust systems
If you are a great Hadoop DevOps Engineer
Join Us!
Thanks!