Scala Data Pipelines @ Spotify
Neville Li @sinisa_lyh

Upload: neville-li

Posted on 21-Aug-2015


Page 1: Scala Data Pipelines @ Spotify

Scala Data Pipelines @ Spotify
Neville Li @sinisa_lyh

Page 2: Scala Data Pipelines @ Spotify

Who am I?

‣ Spotify NYC since 2011

‣ Formerly Yahoo! Search

‣ Music recommendations

‣ Data infrastructure

‣ Scala since 2013

Page 3: Scala Data Pipelines @ Spotify

Spotify in numbers

• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day

Page 4: Scala Data Pipelines @ Spotify

Music recommendation @ Spotify

• Discover Weekly • Radio • Related Artists • Discover Page

Page 5: Scala Data Pipelines @ Spotify

Recommendation systems

Page 6: Scala Data Pipelines @ Spotify

A little teaser

Crunch: PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn)
CombineFns are used to represent the associative operations…

Scalding: Grouped[K, +V]::reduce[U >: V](fn: (U, U) ⇒ U)
reduce with fn which must be associative and commutative…

Spark: PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Merge the values for each key using an associative reduce function…

Page 7: Scala Data Pipelines @ Spotify

Monoid! enables map side reduce

Actually it’s a semigroup
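To make the teaser concrete, here is a minimal sketch in plain Scala collections of why associativity is enough for a map-side reduce. The `Semigroup` trait, `combine` and `reduceByKey` are hypothetical names written for this illustration; they are not the Crunch, Scalding, Spark or Algebird APIs, and partitions are simulated with nested `Seq`s.

```scala
object MapSideReduce {
  // Hypothetical minimal semigroup: a type with an associative `plus`.
  trait Semigroup[T] { def plus(a: T, b: T): T }
  object Semigroup {
    implicit val intAdd: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
  }

  // Combine values per key within one partition (the "map side").
  def combine[K, V](part: Seq[(K, V)])(implicit sg: Semigroup[V]): Map[K, V] =
    part.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(sg.plus(_, v)).getOrElse(v))
    }

  // Simulated job: pre-aggregate each partition, then merge the partial
  // results (the "reduce side"). Associativity makes both steps agree.
  def reduceByKey[K, V: Semigroup](partitions: Seq[Seq[(K, V)]]): Map[K, V] =
    combine(partitions.flatMap(p => combine(p).toSeq))
}
```

Because `plus` is associative, pre-aggregating inside each partition gives the same result as sending every raw value to the reducer, while shuffling far fewer records. No identity element is needed, which is why a semigroup suffices.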

Page 8: Scala Data Pipelines @ Spotify

One more teaser

Linear equation in Alternating Least Squares (ALS) matrix factorization:

$$x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} Y^T C^u p(u)$$

vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _)  // YtY

ratings.keyBy(fixedKey).join(outerProducts)  // YtCuIY
  .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }
  .reduceByKey(_ + _)

ratings.keyBy(fixedKey).join(vectors)  // YtCupu
  .map { case (_, (r, v)) =>
    val cui = r.rating * alpha + 1         // confidence
    val pui = if (cui > 0.0) 1.0 else 0.0  // preference
    (solveKey(r), v * (cui * pui))
  }.reduceByKey(_ + _)

http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
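As a reading aid, the three reductions in the snippet line up with the three terms of the ALS equation once the matrix products are written as sums over item vectors. This is a sketch under two assumptions: $y_i$ denotes the latent vector of item $i$ (a row of $Y$), and $C^u$ is diagonal with entries $c_{ui} = 1 + \alpha r_{ui}$, as the code's `r.rating * alpha + 1` suggests.

```latex
\begin{aligned}
x_u &= \left(Y^T Y + Y^T (C^u - I) Y\right)^{-1} Y^T C^u p(u) \\
Y^T Y &= \sum_i y_i y_i^T
  && \text{(YtY: sum of outer products)} \\
Y^T (C^u - I) Y &= \sum_i (c_{ui} - 1)\, y_i y_i^T = \sum_i \alpha r_{ui}\, y_i y_i^T
  && \text{(YtCuIY)} \\
Y^T C^u p(u) &= \sum_i c_{ui}\, p_{ui}\, y_i
  && \text{(YtCupu)}
\end{aligned}
```

Each sum is associative, so every term can be computed with a `reduce` or `reduceByKey` over per-item contributions.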

Page 9: Scala Data Pipelines @ Spotify

Success story

• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off

Page 10: Scala Data Pipelines @ Spotify

First 10 months

……

Page 11: Scala Data Pipelines @ Spotify

Activity over time

Page 12: Scala Data Pipelines @ Spotify

Guess how many jobs written by yours truly?

Page 13: Scala Data Pipelines @ Spotify

Performance vs. Agility

https://nicholassterling.wordpress.com/2012/11/16/scala-performance/

Page 14: Scala Data Pipelines @ Spotify

Let’s dive into something technical

Page 15: Scala Data Pipelines @ Spotify

To join or not to join?

val streams: TypedPipe[(String, String)] = ???  // (track, user)
val tgp: TypedPipe[(String, String)] = ???      // (track, genre)

streams
  .join(tgp)
  .values                                    // (user, genre)
  .group
  .mapValueStream(vs => Iterator(vs.toSet))  // reducer-only

Page 16: Scala Data Pipelines @ Spotify

Hash join

val streams: TypedPipe[(String, String)] = ???  // (track, user)
val tgp: TypedPipe[(String, String)] = ???      // (track, genre)

streams
  .hashJoin(tgp.forceToDisk)                 // tgp replicated to all mappers
  .values                                    // (user, genre)
  .group
  .mapValueStream(vs => Iterator(vs.toSet))  // reducer-only
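The idea behind the replicated join can be pictured with plain Scala collections. This is only an in-memory sketch: `hashJoin` here is a hypothetical helper that mimics inner-join semantics, not Scalding's implementation, and it assumes the smaller side fits in memory.

```scala
object HashJoinSketch {
  // Replicated (map-side) join: the small side becomes an in-memory lookup
  // table, so the big side is streamed once and never shuffled.
  def hashJoin[K, V, W](big: Seq[(K, V)], small: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val table: Map[K, Seq[W]] =
      small.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
    for {
      (k, v) <- big
      w <- table.getOrElse(k, Seq.empty)
    } yield (k, (v, w))
  }
}
```

Keys missing from the small side are dropped, and a key with several values on the small side yields one output row per value.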

Page 17: Scala Data Pipelines @ Spotify

CoGroup

val streams: TypedPipe[(String, String)] = ???  // (track, user)
val tgp: TypedPipe[(String, String)] = ???      // (track, genre)

streams
  .cogroup(tgp) { case (_, users, genres) =>
    users.map((_, genres.toSet))
  }                 // (track, (user, genres))
  .values           // (user, genres)
  .group
  .reduce(_ ++ _)   // map-side reduce!

Page 18: Scala Data Pipelines @ Spotify

CoGroup

val streams: TypedPipe[(String, String)] = ???  // (track, user)
val tgp: TypedPipe[(String, String)] = ???      // (track, genre)

streams
  .cogroup(tgp) { case (_, users, genres) =>
    users.map((_, genres.toSet))
  }         // (track, (user, genres))
  .values   // (user, genres)
  .group
  .sum      // SetMonoid[Set[T]] from Algebird

* sum[U >: V](implicit sg: Semigroup[U])
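The CoGroup pipeline can be mimicked on plain collections to show what each step produces. In this sketch `cogroup` and `userGenres` are hypothetical in-memory helpers, not Scalding's API, and set union stands in for the Algebird set semigroup behind `.sum`.

```scala
object CoGroupSketch {
  // For each key, hand the values from both sides to a user function.
  def cogroup[K, A, B, R](left: Seq[(K, A)], right: Seq[(K, B)])(
      fn: (K, Seq[A], Seq[B]) => Seq[R]): Seq[(K, R)] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).toSeq.flatMap { k =>
      val ls = l.getOrElse(k, Seq.empty).map(_._2)
      val rs = r.getOrElse(k, Seq.empty).map(_._2)
      fn(k, ls, rs).map(k -> _)
    }
  }

  // (track, user) and (track, genre) => user -> set of genres.
  def userGenres(streams: Seq[(String, String)],
                 tgp: Seq[(String, String)]): Map[String, Set[String]] =
    cogroup(streams, tgp) { (_, users, genres) => users.map(u => (u, genres.toSet)) }
      .map(_._2)                // (user, genres)
      .groupBy(_._1)
      .map { case (user, pairs) => user -> pairs.map(_._2).reduce(_ ++ _) } // set union
}
```

Emitting one genre set per (track, user) pair lets the final step be an associative union, which is what allows the reduction to start on the map side.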

Page 19: Scala Data Pipelines @ Spotify

Key-value file as distributed cache

val streams: TypedPipe[(String, String)] = ???  // (gid, user)
val tgp: SparkeyManager = ???                   // tgp replicated to all mappers

streams
  .map { case (track, user) => (user, tgp.get(track).split(",").toSet) }
  .group
  .sum

https://github.com/spotify/sparkey

SparkeyManager wraps DistributedCacheFile

Page 20: Scala Data Pipelines @ Spotify

Joins and CoGroups

• Require shuffle and reduce step
• Some ops force everything to reducers, e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …)

Page 21: Scala Data Pipelines @ Spotify

Distributed cache

• Faster with off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere with YARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fall back to joins…

Page 22: Scala Data Pipelines @ Spotify

Analyze your jobs

• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew

Page 23: Scala Data Pipelines @ Spotify

Not enough math?

Page 24: Scala Data Pipelines @ Spotify

Recommending tracks

• User listened to Rammstein - Du Hast
• Recommend 10 similar tracks
• 40 dimension feature vectors for tracks
• Compute cosine similarity between all pairs
• O(n) lookup per user where n ≈ 30m
• Try that with 50m users * 10 seed tracks each
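The exhaustive approach described above can be sketched as follows. The object and function names are hypothetical, and the code assumes non-zero vectors of equal length; it exists only to show where the O(n) cost per seed track comes from.

```scala
object CosineSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Cosine similarity; assumes neither vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Exhaustive scan: one similarity per candidate track, i.e. O(n) per seed.
  def topK(seed: String, tracks: Map[String, Array[Double]], k: Int): Seq[String] =
    tracks.toSeq
      .collect { case (id, v) if id != seed => (id, cosine(tracks(seed), v)) }
      .sortBy(-_._2)
      .take(k)
      .map(_._1)
}
```

With n ≈ 30m tracks, every call to `topK` touches every vector, and that scan is repeated for each seed track of each user.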

Page 25: Scala Data Pipelines @ Spotify

ANNOY - cheat by approximation

• Approximate Nearest Neighbor Oh Yeah
• Random projections and binary tree search
• Build index on single machine
• Load in mappers via distributed cache
• O(log n) lookup

https://github.com/spotify/annoy

https://github.com/spotify/annoy-java

Page 26: Scala Data Pipelines @ Spotify

ANN Benchmark

https://github.com/erikbern/ann-benchmarks

Page 27: Scala Data Pipelines @ Spotify

Filtering candidates

• Users don’t like seeing artists/albums/tracks they already know
• But may forget what they listened to long ago
• 50m * thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user

Page 28: Scala Data Pipelines @ Spotify

Options

• Aggregate all logs daily
• Aggregate last x days daily
• CSV of artist/album/track ids
• Bloom filters

Page 29: Scala Data Pipelines @ Spotify

Decayed value with cutoff

• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score’ = score + previous * 0.99
• half life = $\log_{0.99} 0.5 \approx 69$ days
• Cut off at top 2000
• Items that users might remember seeing recently
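A minimal sketch of the daily update, assuming scores are kept as a per-user map from item id to score. `DecaySketch`, `update` and the `cutoff` parameter name are hypothetical; the decay factor 0.99 and the top-N cutoff come from the slide.

```scala
object DecaySketch {
  val Decay = 0.99

  // Days until an un-refreshed score halves: log base 0.99 of 0.5.
  val halfLifeDays: Double = math.log(0.5) / math.log(Decay)

  // One daily update: decay yesterday's totals, add today's scores,
  // then keep only the `cutoff` highest-scoring items.
  def update(previous: Map[String, Double],
             today: Map[String, Double],
             cutoff: Int): Map[String, Double] = {
    val merged = (previous.keySet ++ today.keySet).map { item =>
      item -> (today.getOrElse(item, 0.0) + previous.getOrElse(item, 0.0) * Decay)
    }
    merged.toSeq.sortBy(-_._2).take(cutoff).toMap
  }
}
```

Only the previous day's aggregate and the current day's logs are needed, so there is no re-aggregation over years of history, and items that are never replayed decay until they fall below the cutoff.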

Page 30: Scala Data Pipelines @ Spotify

Bloom filters

• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negative
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let’s build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
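The monoid structure can be illustrated with a small immutable filter. This is a sketch written for this transcript, not Algebird's `BloomFilterMonoid`; the `BF` case class, the use of `MurmurHash3` with the hash index as seed, and string items are all assumptions.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // m bits, k hash functions; `bits` holds the indices that are set.
  final case class BF(m: Int, k: Int, bits: BitSet) {
    private def indices(item: String): Seq[Int] =
      (0 until k).map(seed => java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), m))

    def +(item: String): BF = copy(bits = bits ++ indices(item))

    // True for every inserted item (no false negatives); may also be true
    // for items never inserted (false positives).
    def contains(item: String): Boolean = indices(item).forall(bits)

    // Monoid plus: bitwise OR of two filters that share m and k.
    def ++(that: BF): BF = {
      require(m == that.m && k == that.k, "filters must share m and k")
      copy(bits = bits | that.bits)
    }
  }

  // Monoid zero: no bits set.
  def zero(m: Int, k: Int): BF = BF(m, k, BitSet.empty)
}
```

Since OR is associative and commutative with the all-zero filter as identity, per-user filters can be built with the same map-side aggregation as any other sum.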

Page 31: Scala Data Pipelines @ Spotify

Size versus max items & FP prob

• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
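For reference, the standard sizing formulas for a Bloom filter make the trade-off explicit. Here $n$ is the capacity (maximum number of items), $p$ the target false positive probability, $m$ the number of bits and $k$ the number of hash functions; the worked numbers are an illustration, not figures from the talk.

```latex
\begin{aligned}
m &= -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2 \\[4pt]
n = 1000,\ p = 0.01 &\;\Rightarrow\; m \approx 9585 \text{ bits} \approx 1.2\text{ KB}, \quad k \approx 7
\end{aligned}
```

The size grows linearly with capacity, so a single setting large enough for power users wastes space on everyone else, while a small setting pushes heavy users past capacity.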

Page 32: Scala Data Pipelines @ Spotify

Scalable Bloom Filter

• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
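A rough sketch of the growing sequence, under stated assumptions: the inner `BF` class is a simple mutable filter written for this example, and the growth factor of 10 and tightening ratio of 0.5 are illustrative defaults rather than values from the talk.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object ScalableBloomSketch {
  // Plain Bloom filter sized for `capacity` items at FP probability `fpProb`.
  final class BF(val capacity: Int, fpProb: Double) {
    private val m = math.max(1,
      math.ceil(-capacity * math.log(fpProb) / (math.log(2) * math.log(2))).toInt)
    private val k = math.max(1, math.round(m.toDouble / capacity * math.log(2)).toInt)
    private val bits = mutable.BitSet.empty
    private var count = 0
    private def indices(item: String) =
      (0 until k).map(seed => java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), m))
    def add(item: String): Unit = { bits ++= indices(item); count += 1 }
    def contains(item: String): Boolean = indices(item).forall(bits)
    def isFull: Boolean = count >= capacity
  }

  // Chain of filters: when the newest one is full, append another with
  // `growth` times the capacity and `tightening` times the FP probability.
  final class ScalableBF(initialCapacity: Int, fpProb: Double,
                         growth: Int = 10, tightening: Double = 0.5) {
    private val filters = mutable.ArrayBuffer(new BF(initialCapacity, fpProb))

    def add(item: String): Unit = if (!contains(item)) {
      if (filters.last.isFull) {
        val i = filters.size
        filters += new BF(initialCapacity * math.pow(growth, i).toInt,
                          fpProb * math.pow(tightening, i))
      }
      filters.last.add(item)
    }

    // Lookup has to consult every filter in the chain.
    def contains(item: String): Boolean = filters.exists(_.contains(item))
    def numFilters: Int = filters.size
  }
}
```

The `contains` method shows the overhead the slide mentions: a power user with many filters pays for one lookup per filter, and all of them have to be serialized.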

Page 33: Scala Data Pipelines @ Spotify

Scalable Bloom Filter (build animation, pages 33-36)

Filters are appended as earlier ones fill up; each new item goes into the newest filter:

page 33: n=1k
page 34: n=1k (full), n=10k
page 35: n=1k (full), n=10k (full), n=100k
page 36: n=1k (full), n=10k (full), n=100k (full), n=1m

Page 37: Scala Data Pipelines @ Spotify

Opportunistic Bloom Filter

• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
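The build-then-keep strategy might look like the sketch below. The `BF` class and the `build` function are hypothetical and written for this example; insertions are counted including duplicates, and `None` is returned when even the largest candidate overflows.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object OpportunisticBloomSketch {
  // Plain Bloom filter sized for `capacity` items at FP probability `fpProb`.
  final class BF(val capacity: Int, fpProb: Double) {
    private val m = math.max(1,
      math.ceil(-capacity * math.log(fpProb) / (math.log(2) * math.log(2))).toInt)
    private val k = math.max(1, math.round(m.toDouble / capacity * math.log(2)).toInt)
    private val bits = mutable.BitSet.empty
    private var inserted = 0
    private def indices(item: String) =
      (0 until k).map(seed => java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), m))
    def add(item: String): Unit = { bits ++= indices(item); inserted += 1 }
    def contains(item: String): Boolean = indices(item).forall(bits)
    def count: Int = inserted
  }

  // Insert every item into all candidates, then keep the smallest filter
  // whose capacity exceeds the number of items inserted.
  def build(items: Iterable[String], capacities: Seq[Int], fpProb: Double): Option[BF] = {
    val candidates = capacities.sorted.map(c => new BF(c, fpProb))
    items.foreach(item => candidates.foreach(_.add(item)))
    candidates.find(bf => bf.capacity > bf.count)
  }
}
```

Every item is hashed into every candidate, which is the build cost the slide refers to, but only one right-sized filter per user is stored and queried.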

Page 38: Scala Data Pipelines @ Spotify

Opportunistic Bloom Filter (build animation, pages 38-41)

Fill level of each candidate filter as items are inserted:

capacity   page 38   page 39       page 40       page 41
n=1k       80%       100% (full)   100% (full)   100% (full)
n=10k      8%        70%           100% (full)   100% (full)
n=100k     0.8%      7%            60%           60% (keep)
n=1m       0.08%     0.7%          6%            6% (under-utilized)

Page 42: Scala Data Pipelines @ Spotify

Want more scala.language.experimental?

Page 43: Scala Data Pipelines @ Spotify

Track metadata

• Label dump → content ingestion
• Third party track genres, e.g. Gracenote
• Audio attributes, e.g. tempo, key, time signature
• Cultural data, e.g. popularity, tags
• Latent vectors from collaborative filtering
• Many sources for album, artist, user metadata too

Page 44: Scala Data Pipelines @ Spotify

Multiple data sources

• Big joins
• Complex dependencies
• Wide rows with few columns accessed
• Wasting I/O

Page 45: Scala Data Pipelines @ Spotify

Apache Parquet

• Pre-join sources into mega-datasets
• Store as Parquet columnar storage
• Column projection
• Predicate pushdown
• Avro within Scalding pipelines

Page 46: Scala Data Pipelines @ Spotify

Projection

pipe.map(a => (a.getName, a.getAmount))

versus

Parquet.project[Account]("name", "amount")

• Strings → unsafe and error prone
• No IDE auto-completion → finger injury
• my_fancy_field_name → .getMyFancyFieldName
• Hard to migrate existing code

Page 47: Scala Data Pipelines @ Spotify

Predicate

pipe.filter(a => a.getName == "Neville" && a.getAmount > 100)

versus

FilterApi.and(
  FilterApi.eq(FilterApi.binaryColumn("name"), Binary.fromString("Neville")),
  FilterApi.gt(FilterApi.floatColumn("amount"), 100f.asInstanceOf[java.lang.Float]))

Page 48: Scala Data Pipelines @ Spotify

Macro to the rescue

Code → AST → (pattern matching) → (recursion) → (quasi-quotes) → Code

Projection[Account](_.getName, _.getAmount)
Predicate[Account](x => x.getName == "Neville" && x.getAmount > 100)

https://github.com/nevillelyh/parquet-avro-extra
http://www.lyh.me/slides/macros.html

Page 49: Scala Data Pipelines @ Spotify

What else?

‣ Analytics

‣ Ads targeting, prediction

‣ Metadata quality

‣ Zeppelin

‣ More cool stuff in the works

Page 50: Scala Data Pipelines @ Spotify

And we’re hiring

Page 51: Scala Data Pipelines @ Spotify

The End

Thank You