hadoop at spotify - meetupfiles.meetup.com/5139282/shug 1 - hadoop at spotify.pdf · spotify is a...

Post on 12-Jan-2020

8 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Januari 2012

Hadoop at Spotify

Friday, January 25, 13

IntroductionUse casesSome statsInfrastructure overviewMap/Reduce in different languagesSchedulingScaling HadoopLessons learned

Agenda

Friday, January 25, 13

3+ years experience with HadoopWorking with a team of 12 engineers to keep our data flowingResponsible for the data architecture

Don’t hesitate to ask questions during this talk!

Wouter de Bie - Team lead for Analytics & Data Infrastructure

Who am I and what am I doing here?!

Introduction

Friday, January 25, 13

Too much data to fit into a RDBMS (or too costly)Hadoop provides good value for moneyRuns on “commodity” hardwareScales pretty well

Millions of daily active users == LOTS OF DATA

Why do we use Hadoop?

Introduction

Friday, January 25, 13

ReportingBusiness analysisOperational analysisIn-client features

Spotify is a Data-Driven company, so data is used practically everywhere..

Where do we use Hadoop?

Use cases

Friday, January 25, 13

Reporting to labels, licensors, partners and advertisers from day 1In 2008 a RDBMS would have worked for a while, although labels wanted very granular data, so we went with Hadoop instead

This is where it all started..

Reporting

Friday, January 25, 13

Analyzing growth, user behavior, sign-up funnels, etcCompany KPIsA/B testingNPS analysisSegmentation analysis

We have so much data, so let’s us it!

Business Analytics

Friday, January 25, 13

An analysis I can share

Friday, January 25, 13

Listening behavior in Sweden

0.00%$

0.20%$

0.40%$

0.60%$

0.80%$

1.00%$

1.20%$

1.40%$

1.60%$

Mon

$-$00$

Mon

$-$02$

Mon

$-$04$

Mon

$-$06$

Mon

$-$08$

Mon

$-$10$

Mon

$-$12$

Mon

$-$14$

Mon

$-$16$

Mon

$-$18$

Mon

$-$20$

Mon

$-$22$

Tue$-$0

0$Tue$-$0

2$Tue$-$0

4$Tue$-$0

6$Tue$-$0

8$Tue$-$1

0$Tue$-$1

2$Tue$-$1

4$Tue$-$1

6$Tue$-$1

8$Tue$-$2

0$Tue$-$2

2$Wed

$-$00$

Wed

$-$02$

Wed

$-$04$

Wed

$-$06$

Wed

$-$08$

Wed

$-$10$

Wed

$-$12$

Wed

$-$14$

Wed

$-$16$

Wed

$-$18$

Wed

$-$20$

Wed

$-$22$

Thu$-$0

0$Thu$-$0

2$Thu$-$0

4$Thu$-$0

6$Thu$-$0

8$Thu$-$1

0$Thu$-$1

2$Thu$-$1

4$Thu$-$1

6$Thu$-$1

8$Thu$-$2

0$Thu$-$2

2$Fri$-$00$

Fri$-$02$

Fri$-$04$

Fri$-$06$

Fri$-$08$

Fri$-$10$

Fri$-$12$

Fri$-$14$

Fri$-$16$

Fri$-$18$

Fri$-$20$

Fri$-$22$

Sat$-$00$

Sat$-$02$

Sat$-$04$

Sat$-$06$

Sat$-$08$

Sat$-$10$

Sat$-$12$

Sat$-$14$

Sat$-$16$

Sat$-$18$

Sat$-$20$

Sat$-$22$

Sun$-$0

0$Sun$-$0

2$Sun$-$0

4$Sun$-$0

6$Sun$-$0

8$Sun$-$1

0$Sun$-$1

2$Sun$-$1

4$Sun$-$1

6$Sun$-$1

8$Sun$-$2

0$Sun$-$2

2$

SE$23-27$

SE$45-59$

SE$$0-17$

SE$35-44$

SE$60-150$

SE$28-34$

SE$18-22$

Friday, January 25, 13

Listening behavior in Spain

0.00%$

0.20%$

0.40%$

0.60%$

0.80%$

1.00%$

1.20%$

1.40%$

Mon

$-$00$

Mon

$-$02$

Mon

$-$04$

Mon

$-$06$

Mon

$-$08$

Mon

$-$10$

Mon

$-$12$

Mon

$-$14$

Mon

$-$16$

Mon

$-$18$

Mon

$-$20$

Mon

$-$22$

Tue$-$0

0$Tue$-$0

2$Tue$-$0

4$Tue$-$0

6$Tue$-$0

8$Tue$-$1

0$Tue$-$1

2$Tue$-$1

4$Tue$-$1

6$Tue$-$1

8$Tue$-$2

0$Tue$-$2

2$Wed

$-$00$

Wed

$-$02$

Wed

$-$04$

Wed

$-$06$

Wed

$-$08$

Wed

$-$10$

Wed

$-$12$

Wed

$-$14$

Wed

$-$16$

Wed

$-$18$

Wed

$-$20$

Wed

$-$22$

Thu$-$0

0$Thu$-$0

2$Thu$-$0

4$Thu$-$0

6$Thu$-$0

8$Thu$-$1

0$Thu$-$1

2$Thu$-$1

4$Thu$-$1

6$Thu$-$1

8$Thu$-$2

0$Thu$-$2

2$Fri$-$00$

Fri$-$02$

Fri$-$04$

Fri$-$06$

Fri$-$08$

Fri$-$10$

Fri$-$12$

Fri$-$14$

Fri$-$16$

Fri$-$18$

Fri$-$20$

Fri$-$22$

Sat$-$00$

Sat$-$02$

Sat$-$04$

Sat$-$06$

Sat$-$08$

Sat$-$10$

Sat$-$12$

Sat$-$14$

Sat$-$16$

Sat$-$18$

Sat$-$20$

Sat$-$22$

Sun$-$0

0$Sun$-$0

2$Sun$-$0

4$Sun$-$0

6$Sun$-$0

8$Sun$-$1

0$Sun$-$1

2$Sun$-$1

4$Sun$-$1

6$Sun$-$1

8$Sun$-$2

0$Sun$-$2

2$

ES$$0-17$

ES$35-44$

ES$23-27$

ES$45-59$

ES$60-150$

ES$28-34$

ES$18-22$

Friday, January 25, 13

Root cause analysisLatency analysisBetter capacity planning (servers, people, bandwidth)

Things get interesting when you combine business metrics with operational data

Operational Metrics

Friday, January 25, 13

Recommendations (better then external parties, because of the amount of data)RadioTop lists

Leverage data to create better product features

Product features

Friday, January 25, 13

200 GB of compressed data from users per day100 GB of data from services per day60+ GB of data generated in Hadoop each day190 node Hadoop cluster (24 core CPUs, 32 Gb RAM, 20 TB disk space)4 PB of storage capacity

When I mean “A LOT”, I mean:

Some geeky numbers

Friday, January 25, 13

Our datainfrastructure

Friday, January 25, 13

Spotify’s data infrastructureBackend services

HDFS

Map/Reduce

LuigiScheduler

OperationalDatabases

Reporting

Analytical databases

Productfeatures

Dashboards

Map/ReduceJobs

Hive

Friday, January 25, 13

Python with Hadoop Streaming Pros: fast development, many Spotify libraries available Cons: slower then Java, no access to Hadoop API

Java Pros: fast, access to Hadoop API Cons: verbose language, not many Spotify libs available

PIG Pros: very small scripts, faster then streaming Cons: yet another language to learn, not many Spotify libs available

Hive Pros: SQL like syntax (easy for non-programmers) and relational data model Cons: more moving parts (not well suited for a whole pipe line)

We use multiple languages to run Map/Reduce jobs

Map/Reduce languages

Friday, January 25, 13

Java, PIG, Hive and Python for writing map reduce 1.0Apache Kafka for log collection (in the process of replacing a batched transfer system)Apache Cassandra (no HBase ;))PostgreSQLSqoop for database import/exportAvro as storage format

Technology we use in combination with Hadoop:

Hadoop feels lonely without friends..

Some technical details

Friday, January 25, 13

Nothing suitable out there..https://github.com/spotify/luigiWritten in pythonGeneric scheduler and dependency system that supports Python M/R, Pig and Hive

We wrote and open-sourced our own scheduler: Luigi

Map/Reduce is simple, but a single map/reduce job is limited. You need to chain jobs.

Scheduling

Friday, January 25, 13

Scaling Hadoop

Friday, January 25, 13

Started with a small (scrap metal) cluster of 37 serversMoved to Amazon Elastic Map/Reduce (EMR) and S3 to quickly scaleBuilt an in-house cluster of 60 nodes because of EMR costsCapacity planning every 6 months, grown to 190 nodes todayPut in place data-retention policy and data archive

Our journey:

Scaling Hadoop

Friday, January 25, 13

Hadoop has brought us very far. We would never be able to handle the current volume with a “cheap” RDBMS“Commodity hardware” doesn’t mean cheap hardwareHadoop isn’t a silver bulletHadoop is a complex system that needs love and careYou will have to extend Hadoop (and eco-system components) to tailor it to your needsHadoop is fun!

4+ years of Hadoop at Spotify taught us:

Lessons learned

Friday, January 25, 13

January 2013

Any questions?wouter@spotify.com / @xinit

Shameless plug: Spotify is hiring! Talk to me or someone with a Spotify badge if you’re interested!

Thank you!

Friday, January 25, 13

top related