Beyond Shuffling - Strata London 2016
TRANSCRIPT
Beyond Shuffling: tips & tricks for scaling Apache Spark
Revised May 2016
Who am I?
My name is Holden Karau
Preferred pronouns are she/her
I’m a Principal Software Engineer at IBM’s Spark Technology Center
previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data processing with Spark
co-author of a new book focused on Spark performance coming out next year*
@holdenkarau
Slideshare http://www.slideshare.net/hkarau
LinkedIn https://www.linkedin.com/in/holdenkarau
Github https://github.com/holdenk
Spark Videos http://bit.ly/holdenSparkVideos
What is going to be covered:
What I think I might know about you
RDD re-use (caching, persistence levels, and checkpointing)
Working with key/value data
Why groupByKey is evil and what we can do about it
When Spark SQL can be amazing and wonderful
A brief introduction to Datasets (new in Spark 1.6)
Iterator-to-Iterator transformations (or yet another way to go OOM in the night)
How to test your Spark code :)
Torsten Reuschling
Or….
Huang Yun Chung
The different pieces of Spark
Apache Spark (core)
SQL & DataFrames
Streaming
Language APIs: Scala, Java, Python, & R
Graph tools: Bagel & GraphX
MLlib & Spark ML
Community packages
Jon Ross
Who I think you wonderful humans are?
Nice* people
Don’t mind pictures of cats
Know some Apache Spark
Want to scale your Apache Spark jobs
Don’t overly mind a grab-bag of topics
Lori Erickson
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
Let's look at some old standbys:
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
val warnings = rdd.filter(_.toLower.contains("error")).count()
Tomomi
RDD re-use - sadly not magic
If we know we are going to re-use the RDD what should we do?
If it fits nicely in memory: caching in memory
persisting at another level
MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
checkpointing
Noisy clusters
_2 & checkpointing can help
persist first for checkpointing
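A minimal sketch of these options (the parsed RDD and the checkpoint directory are illustrative, not from the talk):
import org.apache.spark.storage.StorageLevel

// `parsed` stands in for whatever RDD we plan to re-use
val parsed = rdd.map(_.split(" "))

// If it fits nicely in memory:
parsed.cache() // shorthand for persist(StorageLevel.MEMORY_ONLY)
// Or pick a single other level instead (an RDD can only have one level assigned):
// parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)
// The _2 variants keep a second replica - handy on noisy clusters:
// parsed.persist(StorageLevel.MEMORY_AND_DISK_2)

// Checkpointing: persist first so the RDD isn't recomputed just to write it out
sc.setCheckpointDir("/tmp/checkpoints") // illustrative path
parsed.checkpoint()
parsed.count() // the checkpoint gets written on the first action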
Richard Gillin
Considerations for Key/Value Data
What does the distribution of keys look like?
What type of aggregations do we need to do?
Do we want our data in any particular order?
Are we joining with another RDD?
What's our partitioner?
If we don’t have an explicit one: what is the partition structure?
eleda 1
What is key skew and why do we care?
Keys aren’t evenly distributed
Sales by zip code, or records by city, etc.
groupByKey will explode (it's pretty easy to break)
We can have really unbalanced partitions
If we have enough key skew sortByKey could even fail
Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
groupByKey - just how evil is it?
Pretty evil
Groups all of the records with the same key into a single record
Even if we immediately reduce it (e.g. sum it or similar)
This can be too big to fit in memory, then our job fails
Unless we are in SQL then happy pandas
geckoam
So what does that look like?
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
After grouping, all of the records for one key become a single record:
(67843, T, R) (10003, A, R) (94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
Tomomi
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
Let’s see what it looks like when we run the two
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
// Evil group by key version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)
// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)
GroupByKey
reduceByKey
So what did we do instead?
reduceByKey
Works when the types are the same (e.g. in our summing version)
aggregateByKey
Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
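A small sketch of aggregateByKey where the accumulator type differs from the value type (the data and names here are illustrative, not from the talk):
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
// Accumulate a (sum, count) pair per key - a different type than the Double values
val sumCounts = pairs.aggregateByKey((0.0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))   // merge accumulators across partitions
val means = sumCounts.mapValues{case (sum, count) => sum / count}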
So why did we read in python/*.py?
If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes
Can just the shuffle cause problems?
Sorting by key can put all of the records in the same partition
We can run into partition size limits (around 2GB)
Or just get bad performance
So that we can handle data like the above, we can add some “junk” to our key (see the sketch after the examples below)
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
Todd Klassy
Shuffle explosions :(
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
After the shuffle, every 94110 record lands in the same partition:
(94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R)
(67843, T, R) (10003, A, R) (10003, D, E)
javier_artiles
100% less explosions
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
With salted keys, the 94110 records spread across partitions:
(94110_A, A, B) (94110_A, A, C) (94110_A, A, R) (94110_D, D, R)
(94110_T, T, R) (10003_A, A, R) (10003_D, D, E) (67843_T, T, R)
(94110_E, E, R) (94110_E, E, R) (94110_E, E, F) (94110_T, T, R)
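A hedged sketch of that salting trick (the Record class and the underscore separator are illustrative; `records: RDD[Record]` is assumed):
case class Record(zip: String, a: String, b: String)
// Salt the hot key with another field so 94110 no longer maps to one partition
val salted = records.map(r => (s"${r.zip}_${r.a}", r))
val sorted = salted.sortByKey() // shuffles roughly evenly now
// Strip the salt afterwards if downstream code needs the original key:
val unsalted = sorted.map{case (_, r) => (r.zip, r)}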
Well there is a bit of magic in the shuffle….
We can reuse shuffle files
But it can (and does) explode*
Sculpture by Flaming Lotus Girls, photo by Zaskoda
Where can Spark SQL benefit perf?
Structured or semi-structured data
OK with having less* complex operations available to us
We may only need to operate on a subset of the data
The fastest data to process isn’t even read
Remember that non-magic cat? It's got some magic** now
In part from peeking inside of boxes
non-JVM (aka Python & R) users: saved from double serialization cost! :)
** Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic.
Matti Mattila
Why is Spark SQL good for those things?
Space efficient columnar cached representation
Able to push down operations to the data store
Optimizer is able to look inside of our operations
Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
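As a rough illustration of push down (the path and column names are made up, not from the talk):
val df = sqlContext.read.parquet("hdfs:///data/pandas.parquet") // illustrative path
// The filter & column pruning can be pushed down into the Parquet reader,
// so data that can't match is never read off disk:
val happyAttributes = df.filter(df("happy") === true).select("attributes")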
Matti Mattila
How much faster can it be?
Andrew Skudder
But where will it explode?
Iterative algorithms - large plans
Some push downs are sad pandas :(
Default shuffle size is too small for big data (200 partitions WTF?)
Default partition size when reading in is also WTF
How to avoid lineage explosions:
/**
 * Cut the lineage of a DataFrame which has too long a query plan.
 */
def cutLineage(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  val rdd = df.rdd
  rdd.cache()
  sqlCtx.createDataFrame(rdd, df.schema)
}
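And for the too-small shuffle default, a one-line sketch (400 is an arbitrary illustrative value - tune it for your data):
// spark.sql.shuffle.partitions defaults to 200; raise it for bigger shuffles
sqlContext.setConf("spark.sql.shuffle.partitions", "400")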
karmablue
Introducing Datasets
New in Spark 1.6 - coming to more languages in 2.0
Provide a templated, compile-time strongly typed version of DataFrames
Make it easier to intermix functional & relational code
Do you hate writing UDFS? So do I!
Still an experimental component (API will change in future versions)
Although the next major version seems likely to be 2.0 anyways so lots of things may change regardless
Houser Wolf
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
Convert the Dataset to a DataFrame to access more DataFrame functions (shouldn't be needed in 2.0+, and is almost free anyway)
Convert the DataFrame back to a Dataset
A typed query (specifies the return type)
A traditional functional reduction: arbitrary Scala code :)
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Photo by Christian Heilmann
Iterator to Iterator transformations
Iterator to Iterator transformations are super useful
They allow Spark to spill to disk if reading an entire partition is too much
Not to mention better pipelining when we put multiple transformations together
Most of the default transformations are already set up for this
map, filter, flatMap, etc.
But when we start working directly with the iterators
Sometimes to save setup time on expensive objects
e.g. mapPartitions, mapPartitionsWithIndex etc.
Reading into a list or other structure (implicitly or otherwise) can even cause OOMs :(
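A minimal sketch of the difference (SimpleDateFormat stands in for any expensive, non-thread-safe per-partition object; an input RDD of date strings is assumed):
val parsed = rdd.mapPartitions{iter =>
  // Built once per partition - the expensive setup we want to amortize
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
  iter.map(s => format.parse(s)) // iterator-to-iterator: stays lazy, Spark can still spill
  // iter.toList.map(s => format.parse(s)).iterator // would materialize the partition - OOM risk
}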
Christian Heilmann
Making our code testable
Before we can refactor our code we need to have good tests
Testing individual functions is easy if we factor them out - fewer lambdas :(
Testing our Spark transformations is a bit trickier, but there are a variety of tools
Property checking can save us from having to come up with lots of test cases
A simple unit test with spark-testing-base
class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
Ok but what about problems @ scale
Maybe our program works fine on our local sized input
If we are using Spark our actual workload is probably huge
How do we test workloads too large for a single machine?
we can’t just use parallelize and collect
Qfamily
Distributed “set” operations to the rescue*
Pretty close - already built into Spark
Doesn’t do so well with floating points :(
damn floating points keep showing up everywhere :p
Doesn’t really handle duplicates very well
{“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations...
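A tiny sketch of that duplicates problem using Spark's built-in set operations:
val expected = sc.parallelize(Seq("coffee", "coffee", "panda"))
val result = sc.parallelize(Seq("panda", "coffee"))
// subtract removes all matching elements, so both differences come back empty
// and the two RDDs look "equal" even though the duplicate counts differ:
val looksEqual = expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()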
Matti Mattila
Or use RDDComparisons:
def compareWithOrderSamePartitioner[T: ClassTag](
    expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}
Matti Mattila
Or use RDDComparisons:
def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
    i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption.
    map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
Matti Mattila
But where do we get the data for those tests?
If you have production data you can sample, you are lucky!
If possible, try to save it in the same format
If our data is a bunch of Vectors or Doubles Spark’s got tools :)
Coming up with good test data can take a long time
Lori Rielly
QuickCheck / ScalaCheck
QuickCheck generates test data under a set of constraints
Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
sscheck
Awesome people*, supports generating DStreams too!
spark-testing-base
Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
tara hunt
With spark-testing-base
test("map should not change number of elements") { val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => rdd.map(_.length).count() == rdd.count() } check(property) }
With spark-testing-base & a million entries
test("map should not change number of elements") { implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000) val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => rdd.map(_.length).count() == rdd.count() } check(property) }
Additional Spark Testing Resources
Libraries
Scala: spark-testing-base (scalacheck & unit), sscheck (scalacheck), example-spark (unit)
Java: spark-testing-base (unit)
Python: spark-testing-base (unittest2), pyspark.test (pytest)
Strata San Jose Talk (up on YouTube)
Blog posts
Unit Testing Spark with Java by Jesse Anderson
Making Apache Spark Testing Easy with Spark Testing Base
Unit testing Apache Spark with py.test
raider of gin
Additional Spark Resources
Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
http://spark.apache.org/docs/latest/
Kay Ousterhout's work: http://www.eecs.berkeley.edu/~keo/
Books
Videos
Spark Office Hours
Normally in the bay area - will do Google Hangouts ones soon
Follow me on twitter for future ones - https://twitter.com/holdenkarau
raider of gin
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
And the next book…..
First four chapters are available in “Early Release”*:
Buy from O'Reilly - http://bit.ly/highPerfSpark
Book signing Friday June 3rd 10:45** @ O'Reilly booth
Get notified when updated & finished:
http://www.highperformancespark.com
https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome book.
** Normally there are about 25 printed copies, first come first serve, and when we run out it's just me, a stuffed animal, and a bud light lime (or local equivalent).
Spark Videos
Apache Spark YouTube Channel
My Spark videos on YouTube - http://bit.ly/holdenSparkVideos
Spark Summit 2014 training
Paco's Introduction to Apache Spark
k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
PySpark users: have some simple UDFs you wish ran faster that you are willing to share? http://bit.ly/pySparkUDF