the evolution of big data at spotify

40
The Evolution of Big Data at Spotify Josh Baer ([email protected])

Upload: josh-baer

Post on 07-Aug-2015

468 views

Category:

Technology


4 download

TRANSCRIPT

Page 1: The Evolution of Big Data at Spotify

The Evolution of Big Data at Spotify

Josh Baer ([email protected])

Page 2: The Evolution of Big Data at Spotify

Who Am I?‣ Technical Product Owner at

Spotify, responsible for Hadoop

@l_phant

Page 3: The Evolution of Big Data at Spotify

Overview

• Creating Music Charts in three parts:

• Playing Music

• Collecting Data

• Processing Data

• The Future

Page 4: The Evolution of Big Data at Spotify

Building Music Charts

Page 5: The Evolution of Big Data at Spotify
Page 6: The Evolution of Big Data at Spotify

Building Music Charts

Play Music Collect Data Process

Page 7: The Evolution of Big Data at Spotify

Playing Music

Page 8: The Evolution of Big Data at Spotify

What is Spotify?

• Music Streaming Service

• Browse and Discover Millions of Songs, Artists and Albums

• By the end of 2014

• 60 Million Monthly Users

• 15 Million Paid Subscribers

Page 9: The Evolution of Big Data at Spotify

What is Spotify?

• Data Infrastructure

• 1300 Hadoop Nodes

• 42 PB Storage

• 30 TB data ingested via Kafka/day

• 400 TB generated by Hadoop/day

Page 10: The Evolution of Big Data at Spotify

Powered by Data

• Running App

• Matches music to running tempo

• Personalized running playlists in multiple tempos for millions of active users

http://www.theverge.com/2015/6/1/8696659/spotify-running-is-great-for-discovery

Page 11: The Evolution of Big Data at Spotify

Powered by Data

• Now Page

• Shows, podcasts and playlists based on day-parts

• Personalized layout so you always have the right music for the right moment

Page 12: The Evolution of Big Data at Spotify
Page 13: The Evolution of Big Data at Spotify

Building Music Charts10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36”

10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36”

10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

• Raw data is complicated

• Often dirty

• Evolving structure

• Duplication all over

• Getting data to a central processing point is HARD

Page 14: The Evolution of Big Data at Spotify

Collecting Data

Page 15: The Evolution of Big Data at Spotify

“It’s simple, we just throw the data into

Hadoop”A naive data engineer

Page 16: The Evolution of Big Data at Spotify

LogArchiver

• Original method to transport logs from APs to HDFS

• Lasted from 2009 - 2013

• Relies on rsynch/scp to move files around

• Regularly scheduled via cron

Page 17: The Evolution of Big Data at Spotify
Page 18: The Evolution of Big Data at Spotify

LogArchiver Fails

• Worked well with small number of APs

• Issues with scale

• Manual Processes of adding new hosts

• Frequent dying of hosts or network issues caused massive congestion

• Manual process of overrides

Page 19: The Evolution of Big Data at Spotify

Apache Kafka to the rescue!

• A publish-subscribe messaging system open sourced by LinkedIn in 2011

• High level overview:

• Topic: Feeds of messages

• Producer: A message publisher

• Consumer : A subscriber of topics

Page 20: The Evolution of Big Data at Spotify

Apache Kafka to the rescue!

• Log -> HDFS latency reduced from hours to seconds!

• Benefits:

• Community supported

• Division of responsibilities

• Allowed for enhanced streaming use-cases

Page 21: The Evolution of Big Data at Spotify
Page 22: The Evolution of Big Data at Spotify

Processing Data

Page 23: The Evolution of Big Data at Spotify

Workflow Management Fail!

0  *  *  *  *        spotify-­‐core            hadoop  jar  merge_hourly.jar  15  *  *  *  *      spotify-­‐core            hadoop  jar  aggregate_song_plays.jar  30  *  *  *  *      spotify-­‐analytics  hadoop  jar  merge_artist_song.jar  *  1  *  *  *        spotify-­‐core            hadoop  jar  daily_aggregate.jar  *  2  *  *  *        spotify-­‐core            hadoop  jar  calculate_royalties.jar  */2  22  *  *  *  spotify-­‐radio          hadoop  jar  generate_radio.jar  

Page 24: The Evolution of Big Data at Spotify

Handles the ‘plumbing’ for Hadoop jobs

https://github.com/spotify/luigi

Luigi - Python Workflow Manager

Easy to get started, no xml like Oozie

Page 25: The Evolution of Big Data at Spotify

Hadoop Availability

• In 2013:

• Hadoop expanded to 200 nodes

• It was business critical

• It was not very reliable :-(

• Created a ‘squad’ with two missions:

• Migrate to a new distribution with Yarn

• Make Hadoop reliable

Page 26: The Evolution of Big Data at Spotify

How did we do?

Had

oop

Uptim

e

90%

92%

94%

96%

98%

100%

Q3-2012 Q4-2012 Q1-2013 Q2-2013 Q3-2013 Q4-2013 Q1-2014 Q2-2014 Q3-2014 Q4-2014 Q1-2015 Q2-2015

Hadoop ownerless Dedicated squad launches

Upgrade instability

Continually improving

Page 27: The Evolution of Big Data at Spotify

Going from Python to Crunch

• Most of our jobs were Hadoop (python) streaming

• Lots of failures, slow performance

• Had to find a better way

Page 28: The Evolution of Big Data at Spotify

Moving from Python to Crunch

• Investigated several frameworks*

• Selected Crunch:

• Real types - compile time error detection, better testability

• Higher level API - let the framework optimize for you

• Better performance #JVM_FTW

*thewit.ch/scalding_crunchy_pig

Page 29: The Evolution of Big Data at Spotify
Page 30: The Evolution of Big Data at Spotify

Play Music Collect Data Process

Data driven features that allows for new ways to

play and discover music

Lower latency and enhanced reliability for

passing data from Access Points to HDFS via Kafka

Increased Hadoop reliability, Luigi scheduling

and better performance with Crunch

Improving Charts!

Page 31: The Evolution of Big Data at Spotify

The Future

Page 32: The Evolution of Big Data at Spotify

Gro

wth

%

0

500

1000

1500

2000

2500

3000

3500

2012 2013 2014 2015

Hadoop Usage Spotify Users

Growth of Hadoop vs. Spotify Users

Page 33: The Evolution of Big Data at Spotify

Explosive Growth

• Increased Spotify Users

• More users listening to more music -> more data -> longer running jobs

• Increased Use Cases

• Beyond simple analytics into Machine Learning, advanced processing

• Increased Engineers

• In 2014, growth of data and machine learning engineers grew rapidly

Page 34: The Evolution of Big Data at Spotify

Scaling Machines: Easy Scaling People: Hard

Page 35: The Evolution of Big Data at Spotify

User Feedback: Automate it!

Page 36: The Evolution of Big Data at Spotify

Inviso

Developed by Netflix: https://github.com/Netflix/inviso

Page 37: The Evolution of Big Data at Spotify

Hadoop Report Card

• Contains Statistics • Guidelines and Best

Practices • Sent Quarterly

Page 38: The Evolution of Big Data at Spotify

Real Time Use Cases

• Expanding our use of Storm for:

• Targeting Ads based on genres

• Visualizing Data

• Quicker recommendations

• More information:

• https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/

Page 39: The Evolution of Big Data at Spotify

Two takeaways

• Getting data into Hadoop is half the challenge. Think early and often about scale.

• Increasing infrastructure reliability and performance leads to expanded use. This adds challenges but it’s a good problem to have.

Page 40: The Evolution of Big Data at Spotify

Join The Band!

Engineers needed in NYC, Stockholm http://spotify.com/jobs