How Apache Drives Music Recommendations At Spotify
Josh Baer ([email protected])
Note: The views expressed are my own and do not necessarily represent those of Spotify
Who Am I?
• Technical Product Owner at Spotify
• Working with batch and fast processing infrastructure
• @l_phant
Music Discovery in the 90s
What is Spotify?
• Music Streaming Service
• Launched in 2008
• Free and Premium Tiers
• Available in 58 Countries
75+ Million Active Users
30+ Million Songs
1+ Billion Plays/Day
Music Recommendations with Apache
How do we recommend a personalized playlist of new music to 75+ million users?
10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
It begins with a log
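As a rough illustration of what "beginning with a log" can look like in practice, the sketch below turns one of the sample lines above into structured fields. It assumes the lines follow the combined-style layout shown on the slide; the names `LOG_PATTERN`, `parse_log_line`, and `SAMPLE` are made up for this example and do not come from the talk. The timestamp is kept as a raw string.

```python
import re

# Pattern for the combined-style access log lines shown on the slide.
# Assumption: every field is present and quoted exactly as in the samples.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size (as on the 304 responses) is treated as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# First sample line from the slide, split only to keep the source readable.
SAMPLE = (
    '10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] '
    '"GET /api/admin/job/aggregator/status HTTP/1.1" 200 1847 '
    '"https://my.analytics.app/admin" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"'
)

record = parse_log_line(SAMPLE)
print(record["method"], record["path"], record["status"], record["size"])
```

Once lines are parsed into records like this, they can be shipped and aggregated downstream; how Spotify actually structures its log events is not stated in the slides.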
Apache Kafka at Spotify
• 340 Kafka-related nodes
• 30 TB/day from logs
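To put the Kafka figures in perspective, here is a back-of-the-envelope calculation of the average ingest rate. It assumes decimal terabytes and a perfectly even load across the day and across all 340 nodes, neither of which is stated in the slides (not every "Kafka-related" node is necessarily a broker), so treat the result as an order-of-magnitude figure of roughly 347 MB/s overall.

```python
# Figures taken from the slide.
LOG_VOLUME_TB_PER_DAY = 30
KAFKA_RELATED_NODES = 340

SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes

avg_bytes_per_second = LOG_VOLUME_TB_PER_DAY * BYTES_PER_TB / SECONDS_PER_DAY
avg_mb_per_second = avg_bytes_per_second / 10**6

# Assumption: load spread evenly over every Kafka-related node.
avg_mb_per_second_per_node = avg_mb_per_second / KAFKA_RELATED_NODES

print(f"Average ingest: {avg_mb_per_second:.0f} MB/s")
print(f"Average per Kafka-related node: {avg_mb_per_second_per_node:.2f} MB/s")
```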
How do we store TBs of new data every day?
Apache Hadoop at Spotify
• 1700 Nodes
• 60 PB of Data
• 70 TB of Memory
• Over 1 Million jobs run in Q3, 2015
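The cluster totals above can be broken down into per-node and per-day averages. The sketch below assumes decimal units, identical nodes, and exactly 1 million jobs (the slide says "over"), so the job rate is a lower bound and the per-node numbers are averages only.

```python
# Figures taken from the slide.
NODES = 1700
STORAGE_PB = 60
MEMORY_TB = 70
JOBS_IN_Q3_2015 = 1_000_000  # the slide says "over", so this is a lower bound

DAYS_IN_Q3 = 31 + 31 + 30  # July, August, September

# Assumption: decimal units (1 PB = 1000 TB, 1 TB = 1000 GB) and uniform nodes.
storage_tb_per_node = STORAGE_PB * 1000 / NODES
memory_gb_per_node = MEMORY_TB * 1000 / NODES
jobs_per_day = JOBS_IN_Q3_2015 / DAYS_IN_Q3

print(f"Storage per node: {storage_tb_per_node:.1f} TB")
print(f"Memory per node: {memory_gb_per_node:.1f} GB")
print(f"Jobs per day (at least): {jobs_per_day:.0f}")
```

Under these assumptions each node averages about 35 TB of data and 41 GB of memory, and the cluster runs more than 10,000 jobs a day.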
[Chart: Processing Growth, by quarter from Q4-2013 to Q3-2015; y-axis from 150% to 550%]
Hadoop at Spotify
Processing Toolbox
• Apache Crunch
• Scalding
• Apache Hive
• Apache Spark
• Apache Storm
• Hadoop Streaming
• Apache Pig
Storage Formats
• Apache Avro
• Apache Parquet
How do we personalize the playlists?
Collaborative Filtering
Artists: Justin Bieber, Drake, Avicii, Major Lazer
Anna: Listened, Listened
Gustav: Listened, Listened, Listened
Mary: Listened, Listened, Listened, Listened
Michael: Listened, Listened, Suggest
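The idea behind the table above can be sketched as a small user-based collaborative filter: find users whose listening overlaps with Michael's and suggest what they listened to that he has not. The transcript does not preserve which artists Anna, Gustav, and Michael listened to, so the sets below are hypothetical and only match the counts per user. Jaccard similarity and the names `jaccard` and `recommend` are choices made for this example; the slides do not say which variant Spotify runs in production.

```python
from collections import defaultdict

# Hypothetical listening history. Only the number of artists per user
# matches the slide; the specific artists per user are assumed.
listens = {
    "Anna": {"Justin Bieber", "Drake"},
    "Gustav": {"Drake", "Avicii", "Major Lazer"},
    "Mary": {"Justin Bieber", "Drake", "Avicii", "Major Lazer"},
    "Michael": {"Drake", "Avicii"},
}


def jaccard(a, b):
    """Overlap between two sets of artists, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, listens, top_n=1):
    """Rank artists the user has not heard by the similarity of users who have."""
    own = listens[user]
    scores = defaultdict(float)
    for other, artists in listens.items():
        if other == user:
            continue
        similarity = jaccard(own, artists)
        # Each unheard artist gets credit from every similar listener.
        for artist in artists - own:
            scores[artist] += similarity
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [artist for artist, _ in ranked[:top_n]]


print(recommend("Michael", listens))  # ['Major Lazer']
```

With this assumed data, Gustav and Mary are Michael's closest neighbours and both listened to Major Lazer, so it outranks Justin Bieber. At the scale of 75+ million users this pairwise comparison would not be practical as written; it is only meant to show the principle.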
How do we serve new playlists to all our users every week?
Apache Cassandra at Spotify
• Number of Clusters: 113
• Number of Machines: 1155
• Largest Cluster: 60 Nodes
Driven By Data
Driven By Apache
Thank YOU for your contributions to Apache products!
One Last Thing…
Spotify Luigi
• Workflow Manager
• Over 150 contributors
• Used by 10s, possibly 100s of companies
Maybe… Apache Luigi?
Sponsors/mentors/contributors wanted!
Think this stuff is interesting?
We have a great time building it!
spotify.com/jobs
Better Spotify ML Presentations
• Algorithmic Music Recommendations at Spotify (Chris Johnson)
• Interactive Recommender Systems with Netflix and Spotify (Chris Johnson)
• Music recommendations @ MLConf 2014 (Erik Bernhardsson)
• Machine learning @ Spotify (Andy Sloane)
• Recommending music on Spotify with deep learning (Sander Dieleman)
• Scala Data Pipelines @ Spotify (Neville Li)
• Spotify's Music Recommendations Lambda Architecture (Esh Kumar and Emily Samuels)