Machine Learning @ Spotify - Madison Big Data Meetup

TRANSCRIPT

Page 1: Machine learning @ Spotify - Madison Big Data Meetup

Machine Learning & Big Data @ Spotify

Andy Sloane (@a1k0n)
http://a1k0n.net

Madison Big Data Meetup, Jan 27, 2015

Page 2: Machine learning @ Spotify - Madison Big Data Meetup

Big data?
- 60M Monthly Active Users (MAU)
- 50M tracks in our catalog
- ...but many are identical copies from different releases (e.g. US and UK releases of the same album)
- ...and only 4M unique songs have been listened to >500 times

Page 3: Machine learning @ Spotify - Madison Big Data Meetup

Big data?
- Raw material: application logs, delivered via Apache Kafka
- "Wake Me Up" by Avicii has been played 330M times, by ~6M different users
- "EndSong": 500GB / day
- ...but aggregated per-user play counts for a whole year fit in ~60GB ("medium data")

Page 4: Machine learning @ Spotify - Madison Big Data Meetup

Hadoop @ Spotify
- 900 nodes (all in London datacenter)
- 34 TB RAM total
- ~16000 typical concurrent tasks (mappers/reducers)
- 2GB RAM per mapper/reducer slot

Page 5: Machine learning @ Spotify - Madison Big Data Meetup

What do we need ML for?
- Recommendations
- Related Artists
- Radio

Page 6: Machine learning @ Spotify - Madison Big Data Meetup

Recommendations

Page 7: Machine learning @ Spotify - Madison Big Data Meetup

The Discover page
4M tracks x 60M active users, rebuilt daily

Page 8: Machine learning @ Spotify - Madison Big Data Meetup

The Discover page

Okay, but how do we come up with recommendations?
Collaborative filtering!

Page 9: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering

Page 10: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering
Great, but how does that actually work?

- Each time a user plays something, add it to a matrix
- Compute similarity, somehow, between items based on who played what (toy sketch below)
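As a toy illustration (not Spotify's pipeline, and the play counts are made up), here is what that matrix and a brute-force item-item similarity might look like in numpy:

# Toy user x item play-count matrix and brute-force item-item cosine similarity.
# (Made-up numbers; not Spotify's pipeline.)
import numpy as np

# Rows = users, columns = items, values = play counts.
M = np.array([
    [3, 0, 1, 0],
    [0, 5, 0, 2],
    [4, 0, 2, 0],
], dtype=float)

# Cosine similarity between every pair of item columns.
norms = np.linalg.norm(M, axis=0)
sim = (M.T @ M) / np.outer(norms, norms)
print(np.round(sim, 2))

Items 0 and 2 are played by the same users, so they come out highly similar; comparing every pair like this is exactly the all-pairs computation that stops scaling on the next page.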

Page 11: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering
So compute some distance between every pair of rows and columns

That's just O(60M² / 2) = O(1.8 × 10^15) operations... O_O
We need a better way...

(BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)

I've tried it but don't have results to report here yet :(

Page 12: Machine learning @ Spotify - Madison Big Data Meetup

Collaborative filtering
Latent factor models

Instead, we use a "small" representation for each user & item: f-dimensional vectors

(here, f = 2)

and approximate the big matrix with it.

Page 13: Machine learning @ Spotify - Madison Big Data Meetup

Why vectors?

- Very compact representation of musical style or user's taste
- Only like 40-200 elements (2 shown above for illustration)

Page 14: Machine learning @ Spotify - Madison Big Data Meetup

Why vectors?
- Dot product between items = similarity between items
- Dot product between a user vector and an item vector = good/bad recommendation

Example (user · item):
   2 × 4  =  8
  -4 × 0  =  0
   2 × -2 = -4
  -1 × 5  = -5
  ------------
  sum     = -1
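The same arithmetic in numpy, just to make the example concrete:

# Dot product of the example user and item vectors above.
import numpy as np

user = np.array([2, -4, 2, -1])
item = np.array([4, 0, -2, 5])

# 2*4 + (-4)*0 + 2*(-2) + (-1)*5 = 8 + 0 - 4 - 5 = -1
print(user @ item)  # -1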

Page 15: Machine learning @ Spotify - Madison Big Data Meetup

Recommendations via dot products

Page 16: Machine learning @ Spotify - Madison Big Data Meetup

Another example of tracks in two dimensions

Page 17: Machine learning @ Spotify - Madison Big Data Meetup

Implicit Matrix Factorization
Hu, Koren, Volinsky - Collaborative Filtering for Implicit Feedback Datasets

Tries to predict whether user u listens to item i:

$$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X\,Y^T$$

Y is all item vectors, X is all user vectors

"implicit" because users don't tell us what they like, weonly observe what they do/don't listen to

Page 18: Machine learning @ Spotify - Madison Big Data Meetup

Implicit Matrix Factorization

Goal: make x_u^T y_i close to 1 for things each user has listened to, 0 for everything else.

x_u = user u's vector
y_i = item i's vector
p_ui = 1 if user u played item i, 0 otherwise
c_ui = "confidence", ad-hoc weight based on number of times user u played item i; e.g., 1 + α · count
λ = regularization penalty to avoid overfitting

Minimize:

$$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$
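For concreteness, a small numpy sketch of this objective on toy data (random values; α and λ are arbitrary here, with the confidence following the 1 + α·count example above):

# Sketch of the implicit-MF objective on toy data (not production code).
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, f = 5, 6, 2
alpha, lam = 40.0, 0.1

counts = rng.integers(0, 4, size=(n_users, n_items))  # toy play counts
P = (counts > 0).astype(float)                         # p_ui: played or not
C = 1.0 + alpha * counts                               # c_ui: confidence

X = rng.normal(scale=0.1, size=(n_users, f))           # user vectors
Y = rng.normal(scale=0.1, size=(n_items, f))           # item vectors

def loss(X, Y):
    err = P - X @ Y.T                                  # p_ui - x_u^T y_i
    return (C * err ** 2).sum() + lam * ((X ** 2).sum() + (Y ** 2).sum())

print(loss(X, Y))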

Page 19: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares

Solution: alternate solving for all users x_u:

$$x_u = \left( Y^T Y + Y^T (C^u - I) Y + \lambda I \right)^{-1} Y^T C^u p_u$$

and all items y_i:

$$y_i = \left( X^T X + X^T (C^i - I) X + \lambda I \right)^{-1} X^T C^i p_i$$

Y^T Y = f x f matrix, sum of outer products of all items
Y^T (C^u - I) Y = same, except only over items the user played
Y^T C^u p_u = weighted f-dimensional sum of the items the user played
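A dense numpy sketch of one such update, solving for a single user's vector with the item vectors held fixed (toy data; this is not the Hadoop job described on the following pages):

# Solve for one user's vector x_u given fixed item vectors Y (toy, dense version).
import numpy as np

rng = np.random.default_rng(2)
n_items, f = 6, 2
alpha, lam = 40.0, 0.1

Y = rng.normal(scale=0.1, size=(n_items, f))     # item vectors
counts_u = np.array([0, 3, 0, 1, 0, 2])          # this user's play counts
p_u = (counts_u > 0).astype(float)               # p_ui
c_u = 1.0 + alpha * counts_u                     # diagonal of C^u

YtY = Y.T @ Y                                    # f x f, shared by every user
# Y^T (C^u - I) Y: only items the user played contribute (c_ui - 1 = 0 otherwise)
A = YtY + (Y.T * (c_u - 1.0)) @ Y + lam * np.eye(f)
b = Y.T @ (c_u * p_u)                            # Y^T C^u p_u
x_u = np.linalg.solve(A, b)
print(x_u)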

Page 20: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares
- Key point: each iteration is linear in the size of the input, even though we are solving for all users x all items, and needs only f² memory to solve
- No learning rates, just a few tunable parameters (f, λ, α)
- All you do is add stuff up, solve an f x f matrix problem, and repeat!
- We use f = 40 dimensional vectors for recommendations
- Matrix/vector math using numpy in Python, Breeze in Scala

Page 21: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares
Adding lots of stuff up

- Problem: any user (60M) can play any item (4M), thus we may need to add any user's vector to any item's vector
- If we put user vectors in memory, it takes a lot of RAM! Worst case: 60M users * 40 dimensions * sizeof(float) = 9.6GB of user vectors... too big to fit in a mapper slot on our cluster

Page 22: Machine learning @ Spotify - Madison Big Data Meetup

Adding lots of stuff up

Solution: split the data into a K x L grid of blocks
Most recent run made a 14 x 112 grid

Page 23: Machine learning @ Spotify - Madison Big Data Meetup

One map shard

- Input is a bunch of (user, item, count) tuples
- user is the same modulo K for all of them
- item is the same modulo L for all of them
- e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ... (sketch below)
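A minimal sketch of this sharding scheme (K, L, and the shard_key helper are illustrative choices here; Spotify's run used the 14 x 112 grid mentioned above):

# Route (user, item, count) tuples to cells of a K x L grid by modulo.
from collections import defaultdict

K, L = 4, 3

def shard_key(user, item):
    # Every tuple in a cell shares user % K and item % L.
    return (user % K, item % L)

plays = [(1, 7, 3), (5, 4, 1), (9, 10, 2), (2, 7, 5)]

shards = defaultdict(list)
for user, item, count in plays:
    shards[shard_key(user, item)].append((user, item, count))

for key, rows in sorted(shards.items()):
    print(key, rows)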

Page 24: Machine learning @ Spotify - Madison Big Data Meetup

Adding stuff up

- Input: (user, item, count) tuples
- Add up vectors from every data point
- Then flip users ↔ items and repeat!

import numpy as np

def mapper(self, input):  # Luigi-style Python job
    user, item, count = parse(input)
    conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha * count
    # Add up user vectors from the previous iteration: a confidence-weighted
    # sum of user vectors (term1) and (conf - 1)-weighted outer products (term2).
    term1 = conf * self.user_vectors[user]
    term2 = np.outer(self.user_vectors[user], self.user_vectors[user]) * (conf - 1)
    yield item, (term1, term2)

def reducer(self, item, terms):
    # Sum the two terms separately across all data points for this item.
    term1 = sum(t1 for t1, t2 in terms)
    term2 = sum(t2 for t1, t2 in terms)
    item_vector = np.linalg.solve(
        self.YTY + term2 + self.l2penalty * np.identity(self.dim),
        term1)
    yield item, item_vector

Page 25: Machine learning @ Spotify - Madison Big Data Meetup

Alternating Least Squares
- Implemented in a Java map-reduce framework which runs other models, too
- After about 20 iterations, we converge
- Each iteration takes about 20 minutes, so about 7-8 hours total
- Recomputed from scratch weekly
- User vectors recomputed daily, keeping items fixed

So we have vectors, now what?

Page 26: Machine learning @ Spotify - Madison Big Data Meetup

60M users x 4M recommendable items

Finding Recommendations

- For each user, how do we find the best items given their vector?
- Brute force is O(60M x 4M x 40) = O(9 peta-operations)!
- Instead, use an approximation based on locality-sensitive hashing (LSH)

Page 27: Machine learning @ Spotify - Madison Big Data Meetup

Approximate Nearest Neighbors / Locality-Sensitive Hashing

Annoy - github.com/spotify/annoy

Page 28: Machine learning @ Spotify - Madison Big Data Meetup

Annoy - github.com/spotify/annoy
- Pre-built read-only database of item vectors
- Internally, recursively splits the space with random hyperplanes
- Nearby points likely end up on the same side of a random split
- Builds several random trees (a forest) for better approximation
- Given an f-dimensional query vector, finds similar items in the database
- Index loads via mmap, so all processes on the same machine share RAM
- Queries are very, very fast, but approximate
- Python implementation available, Java forthcoming (usage sketch below)
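A small usage sketch of the Annoy Python API (random toy vectors; the tree count and vector values are illustrative, not Spotify's settings):

# Build a small Annoy index of item vectors and query it with a user vector.
import random
from annoy import AnnoyIndex

f = 40                               # dimensionality of the vectors
index = AnnoyIndex(f, 'angular')     # angular ~ cosine distance

random.seed(0)
for item_id in range(1000):
    v = [random.gauss(0, 1) for _ in range(f)]
    index.add_item(item_id, v)

index.build(10)                      # 10 trees; more trees = better recall
index.save('items.ann')              # read-only, mmap-able index file

# 10 approximate nearest neighbours of a (toy) user vector.
user_vector = [random.gauss(0, 1) for _ in range(f)]
print(index.get_nns_by_vector(user_vector, 10))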

Page 29: Machine learning @ Spotify - Madison Big Data Meetup

Generating recommendations
- Annoy index for all items is only 1.2GB
- I have one on my laptop... Live demo!
- Could serve up nearest neighbors at load time, but we precompute Discover on Hadoop

Page 30: Machine learning @ Spotify - Madison Big Data Meetup

Generating recommendations in parallel

- Send the Annoy index in the distributed cache, load it via mmap in the map-reduce process
- Reducer loads vectors + user stats, looks up ANN, generates recommendations.

Page 31: Machine learning @ Spotify - Madison Big Data Meetup

Related Artists

Page 32: Machine learning @ Spotify - Madison Big Data Meetup

Related Artists
- Great for music discovery
- Essential for finding believable reasons for latent-factor-based recommendations
- When generating recommendations, run through a list of related artists to find potential reasons

Page 33: Machine learning @ Spotify - Madison Big Data Meetup

Similar items use cosine distance
- Cosine is similar to dot product; just add a normalization step
- Helps "factor out" popularity from similarity (see the snippet below)
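A quick numpy illustration of that normalization step (toy vectors): dividing by the norms means a "big" (popular) vector no longer dominates the score.

# Cosine similarity = dot product plus a normalization step (toy vectors).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([2.0, 1.0])
b = np.array([20.0, 10.0])   # same direction as a, 10x the magnitude

print(a @ b)         # dot product grows with magnitude: 50.0
print(cosine(a, b))  # cosine ignores magnitude: 1.0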

Page 34: Machine learning @ Spotify - Madison Big Data Meetup

Related Artists
How we build it

- Similar to user recommendations, but with more models, not necessarily collaborative-filtering based:
  - Implicit Matrix Factorization (shown previously)
  - "Vector-Exp", a similar model but probabilistic in nature, trained with gradient descent
  - Google word2vec on playlists
  - Echo Nest "cultural similarity", based on scraping web pages about music!
- Query ANNs to generate candidates
- Score candidates from all models, combine and rank
- Pre-build a table of the 20 nearest artists to each artist

Page 35: Machine learning @ Spotify - Madison Big Data Meetup

Radio

Page 36: Machine learning @ Spotify - Madison Big Data Meetup

Radio
ML-wise, exactly the same as Related Artists!

- For each track, generate candidates with ANN from each model
- Score w/ all models, rank with ensemble
- Store top 250 nearest neighbors in a database (Cassandra)
- User plays radio → load 250 tracks and shuffle
- Thumbs up → load more tracks from the thumbed-up song
- Thumbs down → remove that song / re-weight tracks

Page 37: Machine learning @ Spotify - Madison Big Data Meetup

Upcoming work
Deep learning based item similarity

http://benanne.github.io/2014/08/05/spotify-cnns.html

Page 38: Machine learning @ Spotify - Madison Big Data Meetup

Upcoming work
Audio-fingerprint-based content deduplication

- ~1500 Echo Nest Musical Fingerprints per track
- Min-Hash based matching to accelerate all-pairs similarity (generic sketch below)
- Fast connected components using the Hash-to-Min algorithm - O(log d) mapreduce steps (http://arxiv.org/pdf/1203.5387.pdf)
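As a rough illustration of the Min-Hash idea only (a generic sketch; nothing here reflects the actual Echo Nest fingerprint format or Spotify's pipeline): tracks whose fingerprint sets overlap heavily agree on most min-hash values, so near-duplicate candidates can be found by bucketing on those values instead of comparing all pairs.

# Generic Min-Hash sketch over toy integer "fingerprints" (illustrative only).
import random

NUM_HASHES = 32
random.seed(0)
P = 2 ** 61 - 1  # a large prime for the hash family h(x) = (a*x + b) mod P
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(NUM_HASHES)]

def minhash(fingerprints):
    # Signature = elementwise minimum of each hash over the whole set.
    return tuple(min((a * fp + b) % P for fp in fingerprints) for a, b in params)

track_a = set(range(0, 1500))           # toy fingerprint set
track_b = set(range(0, 1500)) - {3, 7}  # near-identical copy
track_c = set(range(5000, 6500))        # unrelated track

sig_a, sig_b, sig_c = minhash(track_a), minhash(track_b), minhash(track_c)
agree = lambda s, t: sum(x == y for x, y in zip(s, t)) / NUM_HASHES
print(agree(sig_a, sig_b))  # high: approximates the Jaccard similarity of a and b
print(agree(sig_a, sig_c))  # near zero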

Page 39: Machine learning @ Spotify - Madison Big Data Meetup

Thanks!
I can be reached here:

Andy Sloane
Email: [email protected]
Twitter: @a1k0n
http://a1k0n.net

Special thanks to Erik Bernhardsson, whose slides I plagiarized mercilessly