Spotify: from 1 to 100 Hadoop developers
DESCRIPTION
How Spotify scaled its Hadoop cluster and the people working on it, from 1 to over 100 developers and from 1 node to over 690 nodes, making it the largest Hadoop cluster in Europe.
TRANSCRIPT
From 1 to 100 developers
Scaling for developer productivity at Spotify
@dawhiting
HUG UK @ Strata 11/11/2013
How do I scale?
• How many nodes?
• How much data?
• How many records?
How do I scale my development?
• How many developers?
• How many teams?
• How many Hadoop jobs?
• How much code?
Data Infrastructure - July 2013
A brief history of Hadoop development at Spotify
2008 - Spotify launches in Sweden
2009 - First Hadoop cluster for royalties, 2 developers
2010 - Up to 37 nodes, BI team formed, 3 devs/3 analysts
2011 - Move to Elastic MapReduce
2012 - Back to own cluster, 60 -> 190 nodes, Infrastructure/Insights/Tools team split
2013 - 6 teams just for data infrastructure, ~100 developers using Hadoop cluster.
Issues
What could possibly go wrong?
• Contention for resources
• Repetition of code, repetition of data
• Poor code quality / technical debt
• Disorganised HDFS
• Data cataloguing
Contention for resources
Priority and isolation
• What is important?
Hadoop scheduler
• Capacity scheduler
• Queue isolation
YARN
• Resource allocation
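Queue isolation can be pictured as giving each group of jobs a guaranteed share of the cluster. The toy model below is only an illustration of that idea, not the scheduler's own configuration format: the queue names and percentages are hypothetical, and the 190-node figure is borrowed from the 2012 entry in the history above.

```python
# Toy model of queue isolation: every queue is guaranteed a fixed
# percentage of the cluster. Queue names and shares are hypothetical.

def guaranteed_nodes(total_nodes, queue_shares):
    """Return the whole number of nodes guaranteed to each queue.

    queue_shares maps a queue name to its percentage of cluster capacity.
    """
    if sum(queue_shares.values()) != 100:
        raise ValueError("queue shares must add up to 100 percent")
    return {name: total_nodes * pct // 100 for name, pct in queue_shares.items()}


shares = {"production": 60, "analytics": 30, "adhoc": 10}
allocation = guaranteed_nodes(190, shares)
```

With shares like these, an ad hoc query can never take more than its own slice away from the jobs that matter most, which is the point of the "What is important?" question.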
Don’t Repeat Yourself
Refactor data, not just code
• Make popular data available pre-joined
• Analyse code to find jobs with the same dependencies
Work at a higher level
• MapReduce out, (S)Crunch in
• Allow substitution of operations for cached data
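One way to find jobs with the same dependencies is to group them by the set of datasets they read. The sketch below assumes the inputs of each job have already been extracted into a plain mapping; the job and dataset names are hypothetical, and this is not a description of Spotify's actual tooling.

```python
from collections import defaultdict


def group_jobs_by_inputs(job_inputs):
    """Group job names by the exact set of datasets they read.

    Only groups containing more than one job are returned, since those are
    the candidates for a shared, pre-joined dataset.
    """
    groups = defaultdict(list)
    for job, inputs in job_inputs.items():
        # frozenset ignores the order in which a job lists its inputs
        groups[frozenset(inputs)].append(job)
    return {key: sorted(jobs) for key, jobs in groups.items() if len(jobs) > 1}


# Hypothetical jobs and the datasets each one reads
job_inputs = {
    "top_tracks_daily": ["plays", "track_metadata"],
    "top_artists_daily": ["track_metadata", "plays"],
    "signup_funnel": ["users"],
}
shared = group_jobs_by_inputs(job_inputs)
```

Here the two "top" jobs both join the same pair of datasets, so materialising that join once would save each of them the work.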
Code Quality & Technical Debt
Stable platform
• Python -> JVM
Abolish custom infrastructure
• Off-the-shelf is often good enough
• E.g. Sqoop, Kafka, ...
Testing
• Make testing easier than running
• Enforced testing
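"Make testing easier than running" is easiest to achieve when the job logic is a plain function that can be called on a handful of in-memory records, with no cluster involved. A minimal sketch, assuming a hypothetical play-counting job whose input is (user_id, track_id) pairs:

```python
from collections import Counter


def count_plays_per_track(records):
    """Count plays per track from an iterable of (user_id, track_id) pairs."""
    return Counter(track_id for _user_id, track_id in records)


def test_count_plays_per_track():
    # Three tiny records stand in for a day of production data
    records = [("u1", "t1"), ("u2", "t1"), ("u1", "t2")]
    assert count_plays_per_track(records) == {"t1": 2, "t2": 1}


test_count_plays_per_track()
```

The test finishes in milliseconds, whereas submitting the same logic to a cluster means waiting for scheduling and startup, so the quick path is also the right one.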
HDFS
Retention policy
• Automatic deletion of old intermediate data
• Opt-out, not opt-in
Establish convention
• Can you correctly guess the path to the data you need?
Enforce structure
• Path literals are a code smell
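The three HDFS points fit together: if every path is built by one function, the layout is guessable, path literals disappear from job code, and a retention sweep can work out what to delete. The sketch below is an assumption-laden illustration; the `/data` root, the `dataset/date` layout, the 30-day default and the dataset names are all hypothetical.

```python
from datetime import date, timedelta

BASE = "/data"  # hypothetical root directory


def dataset_path(dataset, day):
    """Build the conventional path for one daily partition of a dataset."""
    return f"{BASE}/{dataset}/{day:%Y-%m-%d}"


def expired_partitions(partitions, today, max_age_days=30, keep=frozenset()):
    """Return the paths of partitions older than max_age_days.

    Deletion is opt-out: every dataset is swept unless its name is in keep.
    partitions is an iterable of (dataset, day) pairs.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [
        dataset_path(dataset, day)
        for dataset, day in partitions
        if dataset not in keep and day < cutoff
    ]


partitions = [
    ("tmp_join", date(2013, 9, 1)),
    ("tmp_join", date(2013, 11, 1)),
    ("royalties", date(2012, 1, 1)),
]
to_delete = expired_partitions(partitions, date(2013, 11, 11), keep={"royalties"})
```

Jobs call `dataset_path` instead of hard-coding strings, so changing the layout later means changing one function rather than hunting through every job.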
Data Library
Core datasets
• Identify
• Catalogue
• Document
• Monitor
Data library as code library
• Easy to use
• Synced with release cycles
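Treating the data library as a code library could look like a small catalogue module that jobs import, so that a dataset cannot be registered without an owner and a description. This is a sketch of the idea only; the `Dataset` and `Catalogue` classes and the example entry are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    owner: str
    description: str


class Catalogue:
    """In-memory registry of core datasets."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        # Documentation is enforced at registration time
        if not dataset.description.strip():
            raise ValueError(f"dataset {dataset.name!r} needs a description")
        if dataset.name in self._datasets:
            raise ValueError(f"dataset {dataset.name!r} is already registered")
        self._datasets[dataset.name] = dataset

    def get(self, name):
        return self._datasets[name]

    def names(self):
        return sorted(self._datasets)


catalogue = Catalogue()
catalogue.register(Dataset("plays", "data-infrastructure", "One record per track play"))
```

Because the catalogue ships as code, it can be versioned and released alongside the jobs that depend on it.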
You can have it easier than us
Act now
• Big Data technical debt is worse than normal technical debt
• Rewriting 10 jobs is easier than rewriting 300
Plan to decentralise
• At some point it won’t be enough to trust your developers
• You won’t be able to review every job forever
Make it simpler to do things the right way
• Example: build tools
Want to join the band?
We’re hiring for Stockholm and NYC
Check out http://www.spotify.com/jobs for more information.