Machine Learning @ Spotify - Madison Big Data Meetup

TRANSCRIPT

Page 1: Machine learning @ Spotify - Madison Big Data Meetup

Machine Learning & Big Data @ Spotify

Andy Sloane (@a1k0n)
http://a1k0n.net

Madison Big Data Meetup, Jan 27, 2015

Page 2: Machine learning @ Spotify - Madison Big Data Meetup

Big data?
- 60M Monthly Active Users (MAU)
- 50M tracks in our catalog
- ...but many are identical copies from different releases (e.g. US and UK releases of the same album)
- ...and only 4M unique songs have been listened to >500 times

Page 3: Machine learning @ Spotify - Madison Big Data Meetup

Big data?
- Raw material: application logs, delivered via Apache Kafka
- "Wake Me Up" by Avicii has been played 330M times, by ~6M different users
- "EndSong": 500GB / day
- ...but aggregated per-user play counts for a whole year fit in ~60GB ("medium data")

Page 4: Machine learning @ Spotify - Madison Big Data Meetup

Hadoop @ Spotify
- 900 nodes (all in London datacenter)
- 34 TB RAM total
- ~16000 typical concurrent tasks (mappers/reducers)
- 2GB RAM per mapper/reducer slot

Page 5: Machine learning @ Spotify - Madison Big Data Meetup

What do we need ML for?
- Recommendations
- Related Artists
- Radio

Page 6: Machine learning @ Spotify - Madison Big Data Meetup

Recommendations

Page 7: Machine learning @ Spotify - Madison Big Data Meetup

The Discover page
4M tracks x 60M active users, rebuilt daily

Page 8: Machine learning @ Spotify - Madison Big Data Meetup

The Discover page

Okay, but how do we come up with recommendations?
Collaborative filtering!

Page 9: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering

Page 10: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering
Great, but how does that actually work?

- Each time a user plays something, add it to a matrix
- Compute similarity, somehow, between items based on who played what (toy sketch below)
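As a toy illustration (not Spotify's pipeline, and the play counts are made up), here is what that matrix and a brute-force item-item similarity might look like in numpy:

# Toy user x item play-count matrix and brute-force item-item cosine similarity.
# (Made-up numbers; not Spotify's pipeline.)
import numpy as np

# Rows = users, columns = items, values = play counts.
M = np.array([
    [3, 0, 1, 0],
    [0, 5, 0, 2],
    [4, 0, 2, 0],
], dtype=float)

# Cosine similarity between every pair of item columns.
norms = np.linalg.norm(M, axis=0)
sim = (M.T @ M) / np.outer(norms, norms)
print(np.round(sim, 2))

Items 0 and 2 are played by the same users, so they come out highly similar; comparing every pair like this is exactly the all-pairs computation that stops scaling on the next page.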

Page 11: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering
So compute some distance between every pair of rows and columns

That's just O(60M² / 2) = O(1.8 × 10^15) operations... O_O
We need a better way...

(BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)

I've tried it but don't have results to report here yet :(

Page 12: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering
Latent factor models

Instead, we use a "small" representation for each user & item: f-dimensional vectors

(here, f = 2)

and approximate the big matrix with it.

Page 13: Machine learning @ Spotify - Madison Big Data Meetup

Why vectors?

- Very compact representation of musical style or user's taste
- Only like 40-200 elements (2 shown above for illustration)

Page 14: Machine learning @ Spotify - Madison Big Data Meetup

Why vectors?
- Dot product between items = similarity between items
- Dot product between a user vector and an item vector = good/bad recommendation

Example (user · item):
   2 × 4  =  8
  -4 × 0  =  0
   2 × -2 = -4
  -1 × 5  = -5
  ------------
  sum     = -1
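The same arithmetic in numpy, just to make the example concrete:

# Dot product of the example user and item vectors above.
import numpy as np

user = np.array([2, -4, 2, -1])
item = np.array([4, 0, -2, 5])

# 2*4 + (-4)*0 + 2*(-2) + (-1)*5 = 8 + 0 - 4 - 5 = -1
print(user @ item)  # -1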

Page 15: Machine learning @ Spotify - Madison Big Data Meetup

Recommendations via dot products

Page 16: Machine learning @ Spotify - Madison Big Data Meetup

Another example of tracks in two dimensions

Page 17: Machine learning @ Spotify - Madison Big Data Meetup

Implicit Matrix Factorization
Hu, Koren, Volinsky - Collaborative Filtering for Implicit Feedback Datasets

Tries to predict whether user u listens to item i:

$$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X\,Y^T$$

Y is all item vectors, X is all user vectors

"implicit" because users don't tell us what they like, weonly observe what they do/don't listen to

Page 18: Machine learning @ Spotify - Madison Big Data Meetup

Implicit Matrix Factorization

Goal: make x_u^T y_i close to 1 for things each user has listened to, 0 for everything else.

x_u = user u's vector
y_i = item i's vector
p_ui = 1 if user u played item i, 0 otherwise
c_ui = "confidence", ad-hoc weight based on number of times user u played item i; e.g., 1 + α · count
λ = regularization penalty to avoid overfitting

Minimize:

$$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$
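For concreteness, a small numpy sketch of this objective on toy data (random values; α and λ are arbitrary here, with the confidence following the 1 + α·count example above):

# Sketch of the implicit-MF objective on toy data (not production code).
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, f = 5, 6, 2
alpha, lam = 40.0, 0.1

counts = rng.integers(0, 4, size=(n_users, n_items))  # toy play counts
P = (counts > 0).astype(float)                         # p_ui: played or not
C = 1.0 + alpha * counts                               # c_ui: confidence

X = rng.normal(scale=0.1, size=(n_users, f))           # user vectors
Y = rng.normal(scale=0.1, size=(n_items, f))           # item vectors

def loss(X, Y):
    err = P - X @ Y.T                                  # p_ui - x_u^T y_i
    return (C * err ** 2).sum() + lam * ((X ** 2).sum() + (Y ** 2).sum())

print(loss(X, Y))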

Page 19: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares

Solution: alternate solving for all users x_u:

$$x_u = \left( Y^T Y + Y^T (C^u - I) Y + \lambda I \right)^{-1} Y^T C^u p_u$$

and all items y_i:

$$y_i = \left( X^T X + X^T (C^i - I) X + \lambda I \right)^{-1} X^T C^i p_i$$

Y^T Y = f x f matrix, sum of outer products of all items
Y^T (C^u - I) Y = same, except only over items the user played
Y^T C^u p_u = weighted f-dimensional sum of the items the user played
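A dense numpy sketch of one such update, solving for a single user's vector with the item vectors held fixed (toy data; this is not the Hadoop job described on the following pages):

# Solve for one user's vector x_u given fixed item vectors Y (toy, dense version).
import numpy as np

rng = np.random.default_rng(2)
n_items, f = 6, 2
alpha, lam = 40.0, 0.1

Y = rng.normal(scale=0.1, size=(n_items, f))     # item vectors
counts_u = np.array([0, 3, 0, 1, 0, 2])          # this user's play counts
p_u = (counts_u > 0).astype(float)               # p_ui
c_u = 1.0 + alpha * counts_u                     # diagonal of C^u

YtY = Y.T @ Y                                    # f x f, shared by every user
# Y^T (C^u - I) Y: only items the user played contribute (c_ui - 1 = 0 otherwise)
A = YtY + (Y.T * (c_u - 1.0)) @ Y + lam * np.eye(f)
b = Y.T @ (c_u * p_u)                            # Y^T C^u p_u
x_u = np.linalg.solve(A, b)
print(x_u)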

Page 20: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares
- Key point: each iteration is linear in the size of the input, even though we are solving for all users x all items, and needs only f² memory to solve
- No learning rates, just a few tunable parameters (f, λ, α)
- All you do is add stuff up, solve an f x f matrix problem, and repeat!
- We use f = 40 dimensional vectors for recommendations
- Matrix/vector math using numpy in Python, Breeze in Scala

Page 21: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares
Adding lots of stuff up

- Problem: any user (60M) can play any item (4M), thus we may need to add any user's vector to any item's vector
- If we put user vectors in memory, it takes a lot of RAM! Worst case: 60M users * 40 dimensions * sizeof(float) = 9.6GB of user vectors... too big to fit in a mapper slot on our cluster

Page 22: Machine learning @ Spotify - Madison Big Data Meetup

Adding lots of stuff up

Solution: split the data into a K x L grid of blocks
Most recent run made a 14 x 112 grid

Page 23: Machine learning @ Spotify - Madison Big Data Meetup

One map shard

- Input is a bunch of (user, item, count) tuples
- user is the same modulo K for all of them
- item is the same modulo L for all of them
- e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ... (sketch below)
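A minimal sketch of this sharding scheme (K, L, and the shard_key helper are illustrative choices here; Spotify's run used the 14 x 112 grid mentioned above):

# Route (user, item, count) tuples to cells of a K x L grid by modulo.
from collections import defaultdict

K, L = 4, 3

def shard_key(user, item):
    # Every tuple in a cell shares user % K and item % L.
    return (user % K, item % L)

plays = [(1, 7, 3), (5, 4, 1), (9, 10, 2), (2, 7, 5)]

shards = defaultdict(list)
for user, item, count in plays:
    shards[shard_key(user, item)].append((user, item, count))

for key, rows in sorted(shards.items()):
    print(key, rows)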

Page 24: Machine learning @ Spotify - Madison Big Data Meetup

Adding stuff up

- Input: (user, item, count) tuples
- Add up vectors from every data point
- Then flip users ↔ items and repeat!

import numpy as np

def mapper(self, input):  # Luigi-style Python job
    user, item, count = parse(input)
    conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha * count
    # Add up user vectors from the previous iteration: a confidence-weighted
    # sum of user vectors (term1) and (conf - 1)-weighted outer products (term2).
    term1 = conf * self.user_vectors[user]
    term2 = np.outer(self.user_vectors[user], self.user_vectors[user]) * (conf - 1)
    yield item, (term1, term2)

def reducer(self, item, terms):
    # Sum the two terms separately across all data points for this item.
    term1 = sum(t1 for t1, t2 in terms)
    term2 = sum(t2 for t1, t2 in terms)
    item_vector = np.linalg.solve(
        self.YTY + term2 + self.l2penalty * np.identity(self.dim),
        term1)
    yield item, item_vector

Page 25: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares
- Implemented in a Java map-reduce framework which runs other models, too
- After about 20 iterations, we converge
- Each iteration takes about 20 minutes, so about 7-8 hours total
- Recomputed from scratch weekly
- User vectors recomputed daily, keeping items fixed

So we have vectors, now what?

Page 26: Machine learning @ Spotify - Madison Big Data Meetup

60M users x 4M recommendable items

Finding Recommendations

- For each user, how do we find the best items given their vector?
- Brute force is O(60M x 4M x 40) = O(9 peta-operations)!
- Instead, use an approximation based on locality-sensitive hashing (LSH)

Page 27: Machine learning @ Spotify - Madison Big Data Meetup

Approximate Nearest Neighbors / Locality-Sensitive Hashing

Annoy - github.com/spotify/annoy

Page 28: Machine learning @ Spotify - Madison Big Data Meetup

Annoy - github.com/spotify/annoy
- Pre-built read-only database of item vectors
- Internally, recursively splits the space with random hyperplanes
- Nearby points likely end up on the same side of a random split
- Builds several random trees (a forest) for better approximation
- Given an f-dimensional query vector, finds similar items in the database
- Index loads via mmap, so all processes on the same machine share RAM
- Queries are very, very fast, but approximate
- Python implementation available, Java forthcoming (usage sketch below)
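A small usage sketch of the Annoy Python API (random toy vectors; the tree count and vector values are illustrative, not Spotify's settings):

# Build a small Annoy index of item vectors and query it with a user vector.
import random
from annoy import AnnoyIndex

f = 40                               # dimensionality of the vectors
index = AnnoyIndex(f, 'angular')     # angular ~ cosine distance

random.seed(0)
for item_id in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    index.add_item(item_id, v)

index.build(10)                      # 10 trees; more trees = better recall
index.save('items.ann')              # read-only, mmap-able index file

# 10 approximate nearest neighbours of a (toy) user vector.
user_vector = [random.gauss(0, 1) for _ in range(f)]
print(index.get_nns_by_vector(user_vector, 10))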

Page 29: Machine learning @ Spotify - Madison Big Data Meetup

Generating recommendations
- Annoy index for all items is only 1.2GB
- I have one on my laptop... Live demo!
- Could serve up nearest neighbors at load time, but we precompute Discover on Hadoop

Page 30: Machine learning @ Spotify - Madison Big Data Meetup

Generating recommendations in parallel

- Send the Annoy index in the distributed cache, load it via mmap in the map-reduce process
- Reducer loads vectors + user stats, looks up ANN, generates recommendations.

Page 31: Machine learning @ Spotify - Madison Big Data Meetup

Related Artists

Page 32: Machine learning @ Spotify - Madison Big Data Meetup

Related Artists
- Great for music discovery
- Essential for finding believable reasons for latent-factor-based recommendations
- When generating recommendations, run through a list of related artists to find potential reasons

Page 33: Machine learning @ Spotify - Madison Big Data Meetup

Similar items use cosine distance
- Cosine is similar to dot product; just add a normalization step
- Helps "factor out" popularity from similarity (see the snippet below)
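A quick numpy illustration of that normalization step (toy vectors): dividing by the norms means a "big" (popular) vector no longer dominates the score.

# Cosine similarity = dot product plus a normalization step (toy vectors).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([2.0, 1.0])
b = np.array([20.0, 10.0])   # same direction as a, 10x the magnitude

print(a @ b)         # dot product grows with magnitude: 50.0
print(cosine(a, b))  # cosine ignores magnitude: 1.0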

Page 34: Machine learning @ Spotify - Madison Big Data Meetup

Related Artists
How we build it

- Similar to user recommendations, but with more models, not necessarily collaborative-filtering based:
  - Implicit Matrix Factorization (shown previously)
  - "Vector-Exp", a similar model but probabilistic in nature, trained with gradient descent
  - Google word2vec on playlists
  - Echo Nest "cultural similarity", based on scraping web pages about music!
- Query ANNs to generate candidates
- Score candidates from all models, combine and rank
- Pre-build a table of the 20 nearest artists to each artist

Page 35: Machine learning @ Spotify - Madison Big Data Meetup

Radio

Page 36: Machine learning @ Spotify - Madison Big Data Meetup

Radio
ML-wise, exactly the same as Related Artists!

- For each track, generate candidates with ANN from each model
- Score w/ all models, rank with ensemble
- Store top 250 nearest neighbors in a database (Cassandra)
- User plays radio → load 250 tracks and shuffle
- Thumbs up → load more tracks from the thumbed-up song
- Thumbs down → remove that song / re-weight tracks

Page 37: Machine learning @ Spotify - Madison Big Data Meetup

Upcoming work
Deep learning based item similarity

http://benanne.github.io/2014/08/05/spotify-cnns.html

Page 38: Machine learning @ Spotify - Madison Big Data Meetup

Upcoming work
Audio-fingerprint-based content deduplication

- ~1500 Echo Nest Musical Fingerprints per track
- Min-Hash based matching to accelerate all-pairs similarity (generic sketch below)
- Fast connected components using the Hash-to-Min algorithm - O(log d) mapreduce steps (http://arxiv.org/pdf/1203.5387.pdf)
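As a rough illustration of the Min-Hash idea only (a generic sketch; nothing here reflects the actual Echo Nest fingerprint format or Spotify's pipeline): tracks whose fingerprint sets overlap heavily agree on most min-hash values, so near-duplicate candidates can be found by bucketing on those values instead of comparing all pairs.

# Generic Min-Hash sketch over toy integer "fingerprints" (illustrative only).
import random

NUM_HASHES = 32
random.seed(0)
P = 2 ** 61 - 1  # a large prime for the hash family h(x) = (a*x + b) mod P
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(NUM_HASHES)]

def minhash(fingerprints):
    # Signature = elementwise minimum of each hash over the whole set.
    return tuple(min((a * fp + b) % P for fp in fingerprints) for a, b in params)

track_a = set(range(0, 1500))           # toy fingerprint set
track_b = set(range(0, 1500)) - {3, 7}  # near-identical copy
track_c = set(range(5000, 6500))        # unrelated track

sig_a, sig_b, sig_c = minhash(track_a), minhash(track_b), minhash(track_c)
agree = lambda s, t: sum(x == y for x, y in zip(s, t)) / NUM_HASHES
print(agree(sig_a, sig_b))  # high: approximates the Jaccard similarity of a and b
print(agree(sig_a, sig_c))  # near zero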

Page 39: Machine learning @ Spotify - Madison Big Data Meetup

Thanks!
I can be reached here:

Andy Sloane
Email: [email protected]
Twitter: @a1k0n
http://a1k0n.net

Special thanks to Erik Bernhardsson, whose slides I plagiarized mercilessly