Beyond Shuffling: tips & tricks for scaling Apache Spark (Strata London 2016, revised May 2016)


TRANSCRIPT

Page 1

Beyond Shuffling: tips & tricks for scaling Apache Spark

Revised May 2016

Page 2

Who am I?

My name is Holden Karau

Preferred pronouns are she/her

I’m a Principal Software Engineer at IBM’s Spark Technology Center

previously Alpine, Databricks, Google, Foursquare & Amazon

co-author of Learning Spark & Fast Data Processing with Spark

co-author of a new book focused on Spark performance coming out next year*

@holdenkarau

Slide share http://www.slideshare.net/hkarau

Linkedin https://www.linkedin.com/in/holdenkarau

Github https://github.com/holdenk

Spark Videos http://bit.ly/holdenSparkVideos

Page 3

What is going to be covered:

What I think I might know about you

RDD re-use (caching, persistence levels, and checkpointing)

Working with key/value data

Why groupByKey is evil and what we can do about it

When Spark SQL can be amazing and wonderful

A brief introduction to Datasets (new in Spark 1.6)

Iterator-to-Iterator transformations (or yet another way to go OOM in the night)

How to test your Spark code :)

Torsten Reuschling

Page 4

Or….

Huang Yun Chung

Page 5

The different pieces of Spark

Apache Spark core
SQL & DataFrames
Streaming
Language APIs: Scala, Java, Python, & R
Graph tools: Bagel & GraphX
Spark ML & MLlib
Community packages

Jon Ross

Page 6

Who I think you wonderful humans are?

Nice* people

Don’t mind pictures of cats

Know some Apache Spark

Want to scale your Apache Spark jobs

Don’t overly mind a grab-bag of topics

Lori Erickson

Page 7

Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

Photo from Cocoa Dream

Page 8

Let's look at some old standbys:

val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)

val warnings = rdd.filter(_.toLowerCase.contains("error")).count()

Tomomi

Page 9

RDD re-use - sadly not magic

If we know we are going to re-use the RDD, what should we do?

If it fits nicely in memory: cache it in memory

persisting at another level

MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER

checkpointing

Noisy clusters

_2 & checkpointing can help

persist first for checkpointing
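As a rough sketch of those options (assuming an existing SparkContext sc and an rdd we plan to re-use; the checkpoint directory is illustrative):

import org.apache.spark.storage.StorageLevel

// Pick one storage level per RDD - Spark won't let us change it once assigned:
rdd.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized, spills to disk when memory is tight
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2)    // _2 keeps a second replica - useful on noisy clusters

// Checkpointing truncates the lineage; persist first so the data isn't computed twice.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path - any reliable directory works
rdd.checkpoint()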

Richard Gillin

Page 10

Considerations for Key/Value Data

What does the distribution of keys look like?

What type of aggregations do we need to do?

Do we want our data in any particular order?

Are we joining with another RDD?

What's our partitioner?

If we don’t have an explicit one: what is the partition structure?
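If we do control the partitioner, a small sketch of pre-partitioning before a join (salesByZip and otherByZip are hypothetical pair RDDs keyed by zip code):

import org.apache.spark.HashPartitioner

// Give one side an explicit partitioner and cache it; the join can then reuse
// that layout instead of shuffling that side again.
val partitioned = salesByZip.partitionBy(new HashPartitioner(100)).cache()
val joined = partitioned.join(otherByZip)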

eleda 1

Page 11

What is key skew and why do we care?

Keys aren’t evenly distributed

Sales by zip code, or records by city, etc.

groupByKey will explode (but it's pretty easy to break)

We can have really unbalanced partitions

If we have enough key skew sortByKey could even fail

Stragglers (uneven sharding can make some tasks take much longer)

Mitchell Joyce

Page 12

groupByKey - just how evil is it?

Pretty evil

Groups all of the records with the same key into a single record

Even if we immediately reduce it (e.g. sum it or similar)

This can be too big to fit in memory, then our job fails

Unless we are in SQL then happy pandas

geckoam

Page 13

So what does that look like?

Input partitions:
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)

After groupByKey, all of the 94110 values end up in one giant record:
(67843, T, R) (10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])

Tomomi

Page 14

Let’s revisit wordcount with groupByKey

val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)

Tomomi

Page 15

And now back to the “normal” version

val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts

Page 16

Let’s see what it looks like when we run the two

Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp

val rdd = sc.textFile("python/pyspark/*.py", 20)  // Make sure we have many partitions

// Evil group by key version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)

// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)

Page 17

GroupByKey (Spark UI screenshot of the job)

Page 18

reduceByKey (Spark UI screenshot of the job)

Page 19

So what did we do instead?

reduceByKey

Works when the types are the same (e.g. in our summing version)

aggregateByKey

Doesn’t require the types to be the same (e.g. computing stats model or similar)

Allows Spark to pipeline the reduction & skip making the list

We also got a map-side reduction (note the difference in shuffled read)
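A hedged sketch of aggregateByKey where the result type differs from the value type - a per-key (sum, count) for computing means, assuming a hypothetical salesByZip: RDD[(Int, Double)]:

val sumCounts = salesByZip.aggregateByKey((0.0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into a partial (sum, count)
  (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge partials from different partitions
val meanByZip = sumCounts.mapValues { case (sum, count) => sum / count }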

Page 20

So why did we read in python/*.py?

If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent

Which is why groupByKey can be safe sometimes

Page 21

Can just the shuffle cause problems?

Sorting by key can put all of the records in the same partition

We can run into partition size limits (around 2GB)

Or just get bad performance

So that we can handle data like the above, we can add some “junk” to our key

(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)

(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)

(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
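One way to add that “junk”: a salting sketch for associative aggregations like sums. The slide's 94110_A style suffix comes from a data field; this version uses a random salt instead, and salesByZip is again a hypothetical pair RDD:

import scala.util.Random

val saltFactor = 10
// Spread each hot key over up to saltFactor sub-keys.
val salted = salesByZip.map { case (zip, v) => ((zip, Random.nextInt(saltFactor)), v) }
// Reduce on the salted keys first, then strip the salt and finish the reduction.
val partial = salted.reduceByKey(_ + _)
val totals = partial.map { case ((zip, _), v) => (zip, v) }.reduceByKey(_ + _)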

Todd Klassy

Page 22

Shuffle explosions :(

Input partitions:
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)

After sorting by key, every 94110 record lands in the same partition:
(94110, A, B) (94110, A, C) (94110, E, F) (94110, A, R) (94110, D, R) (94110, E, R) (94110, E, R) (94110, T, R) (94110, T, R)
(67843, T, R) (10003, A, R) (10003, D, E)

javier_artiles

Page 23

100% less explosions

Input partitions:
(94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
(94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
(94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)

With salted keys, the 94110 records spread across partitions:
(94110_A, A, B) (94110_A, A, C) (94110_A, A, R) (94110_D, D, R)
(94110_T, T, R) (10003_A, A, R) (10003_D, D, E) (67843_T, T, R)
(94110_E, E, R) (94110_E, E, R) (94110_E, E, F) (94110_T, T, R)

Page 24

Well there is a bit of magic in the shuffle….

We can reuse shuffle files

But it can (and does) explode*

Sculpture by Flaming Lotus Girls. Photo by Zaskoda.

Page 25

Where can Spark SQL benefit perf?

Structured or semi-structured data

OK with having less* complex operations available to us

We may only need to operate on a subset of the data

The fastest data to process isn’t even read

Remember that non-magic cat? It's got some magic** now

In part from peeking inside of boxes

non-JVM (aka Python & R) users: saved from double serialization cost! :)

**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic.

Matti Mattila

Page 26

Why is Spark SQL good for those things?

Space efficient columnar cached representation

Able to push down operations to the data store

Optimizer is able to look inside of our operations

Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
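A small sketch of the “fastest data to process isn’t even read” point - with DataFrames, Catalyst can prune columns and push filters down to a Parquet scan. The path and column names are illustrative, borrowed from the RawPanda examples later on:

val df = sqlContext.read.parquet("hdfs:///pandas.parquet")
val result = df.filter(df("happy") === true)  // filter may be pushed down to the scan
  .select(df("attributes")(0))                // column pruning: only this column is read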

Matti Mattila

Page 27

How much faster can it be?

Andrew Skudder

Page 28

But where will it explode?

Iterative algorithms - large plans

Some push downs are sad pandas :(

Default shuffle size is too small for big data (200 partitions WTF?)

Default partition size when reading in is also WTF
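A hedged sketch of nudging those defaults (the values and path are illustrative - tune for your data sizes):

sqlContext.setConf("spark.sql.shuffle.partitions", "800")  // the default of 200 mentioned above
val rdd = sc.textFile("hdfs:///big/input", 400)            // ask for more input partitions up front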

Page 29

How to avoid lineage explosions:

/**
 * Cut the lineage of a DataFrame which has too long a query plan.
 */
def cutLineage(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  //tag::cutLineage[]
  val rdd = df.rdd
  rdd.cache()
  sqlCtx.createDataFrame(rdd, df.schema)
  //end::cutLineage[]
}

karmablue

Page 30

Introducing Datasets

New in Spark 1.6 - coming to more languages in 2.0

Provide a templated, compile-time strongly typed version of DataFrames

Make it easier to intermix functional & relational code

Do you hate writing UDFS? So do I!

Still an experimental component (API will change in future versions)

Although the next major version seems likely to be 2.0 anyways so lots of things may change regardless

Houser Wolf

Page 31

Using Datasets to mix functional & relational style:

val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda]
  .select($"attributes"(0).as[Double])
  .reduce((x, y) => x + y)

Page 32

So what was that?

ds.toDF().filter($"happy" === true).as[RawPanda]
  .select($"attributes"(0).as[Double])
  .reduce((x, y) => x + y)

convert a Dataset to a DataFrame to access more DataFrame functions. Shouldn’t be needed in 2.0+ & almost free anyways

Convert the DataFrame back to a Dataset

A typed query (specifies the return type)

Traditional functional reduction: arbitrary Scala code :)

Page 33

And functional style maps:

/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}

Page 34

Photo by Christian Heilmann

Page 35

Iterator to Iterator transformations

Iterator to Iterator transformations are super useful

They allow Spark to spill to disk if reading an entire partition is too much

Not to mention better pipelining when we put multiple transformations together

Most of the default transformations are already set up for this

map, filter, flatMap, etc.

But when we start working directly with the iterators

Sometimes to save setup time on expensive objects

e.g. mapPartitions, mapPartitionsWithIndex etc.

Reading into a list or other structure (implicitly or otherwise) can even cause OOMs :(
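A minimal sketch of an iterator-to-iterator mapPartitions - ExpensiveParser, Record, and parse are hypothetical stand-ins for costly per-partition setup:

import org.apache.spark.rdd.RDD

def parsePartitions(rdd: RDD[String]): RDD[Record] =
  rdd.mapPartitions { iter =>
    val parser = new ExpensiveParser()  // expensive setup happens once per partition
    iter.map(parser.parse)              // stays an iterator: Spark can pipeline & spill
    // iter.toList.map(parser.parse).iterator  // DON'T: materializes the whole partition
  }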

Christian Heilmann

Page 36

Making our code testable

Before we can refactor our code we need to have good tests

Testing individual functions is easy if we factor them out - less lambdas :(

Testing our Spark transformations is a bit trickier, but there are a variety of tools

Property checking can save us from having to come up with lots of test cases

Page 37

A simple unit test with spark-testing-base

class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}

Page 38

Ok but what about problems @ scale

Maybe our program works fine on our local sized input

If we are using Spark our actual workload is probably huge

How do we test workloads too large for a single machine?

we can’t just use parallelize and collect

Qfamily

Page 39

Distributed “set” operations to the rescue*

Pretty close - already built into Spark

Doesn’t do so well with floating points :(

damn floating points keep showing up everywhere :p

Doesn’t really handle duplicates very well

{“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations...
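A minimal sketch of that kind of set-style check, with the caveats above still applying (duplicates collapse and floating point equality bites):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Empty differences in both directions means "equal as sets".
def compareAsSets[T: ClassTag](expected: RDD[T], result: RDD[T]): Boolean =
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()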

Matti Mattila

Page 40

Or use RDDComparisions:

def compareWithOrderSamePartitioner[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}

Matti Mattila

Page 41

Or use RDDComparisions:

def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
      i1.isEmpty || i2.isEmpty || i1.head != i2.head}
    .take(1).headOption
    .map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}

Matti Mattila

Page 42

But where do we get the data for those tests?

If you have production data you can sample, you are lucky!

If possible you can try and save in the same format

If our data is a bunch of Vectors or Doubles Spark’s got tools :) (see the sketch after this list)

Coming up with good test data can take a long time
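For the Vectors & Doubles point above, a hedged sketch using MLlib's RandomRDDs generators:

import org.apache.spark.mllib.random.RandomRDDs

val doubles = RandomRDDs.normalRDD(sc, 1000000L)           // 1M standard-normal Doubles
val vectors = RandomRDDs.normalVectorRDD(sc, 100000L, 20)  // 100K Vectors with 20 features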

Lori Rielly

Page 43

QuickCheck / ScalaCheck

QuickCheck generates test data under a set of constraints

Scala version is ScalaCheck - supported by the two unit testing libraries for Spark

sscheck

Awesome people*, supports generating DStreams too!

spark-testing-base

Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs

*I assume

tara hunt

Page 44

With spark-testing-base

test("map should not change number of elements") { val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => rdd.map(_.length).count() == rdd.count() } check(property) }

Page 45

With spark-testing-base & a million entries

test("map should not change number of elements") { implicit val generatorDrivenConfig = PropertyCheckConfig(minSize = 0, maxSize = 1000000) val property = forAll(RDDGenerator.genRDD[String](sc) (Arbitrary.arbitrary[String])) { rdd => rdd.map(_.length).count() == rdd.count() } check(property) }

Page 46

Additional Spark Testing Resources

Libraries
Scala: spark-testing-base (scalacheck & unit), sscheck (scalacheck), example-spark (unit)
Java: spark-testing-base (unit)
Python: spark-testing-base (unittest2), pyspark.test (pytest)

Strata San Jose Talk (up on YouTube)

Blog posts
Unit Testing Spark with Java by Jesse Anderson
Making Apache Spark Testing Easy with Spark Testing Base
Unit testing Apache Spark with py.test

raider of gin

Page 47

Additional Spark Resources

Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.) - http://spark.apache.org/docs/latest/

Kay Ousterhout’s work - http://www.eecs.berkeley.edu/~keo/

Books

Videos

Spark Office Hours
Normally in the bay area - will do Google Hangouts ones soon
Follow me on twitter for future ones - https://twitter.com/holdenkarau

raider of gin

Page 48

Learning Spark

Fast Data Processing with Spark (Out of Date)

Fast Data Processing with Spark (2nd edition)

Advanced Analytics with Spark

Page 49

Learning Spark

Fast Data Processing with Spark (Out of Date)

Fast Data Processing with Spark (2nd edition)

Advanced Analytics with Spark

Coming soon: Spark in Action

Coming soon: High Performance Spark

Page 50

And the next book…..

First four chapters are available in “Early Release”*:
Buy from O’Reilly - http://bit.ly/highPerfSpark
Book signing Friday June 3rd 10:45** @ O’Reilly booth

Get notified when updated & finished:
http://www.highperformancespark.com
https://twitter.com/highperfspark

* Early Release means extra mistakes, but also a chance to help us make a more awesome book.

** Normally there are about 25 printed copies, first come first served, and when we run out it’s just me, a stuffed animal, and a bud light lime (or local equivalent).

Page 51

Spark Videos

Apache Spark YouTube Channel

My Spark videos on YouTube - http://bit.ly/holdenSparkVideos

Spark Summit 2014 training

Paco’s Introduction to Apache Spark

Page 52

k thnx bye!

If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark

Will tweet results “eventually” @holdenkarau

PySpark users: have some simple UDFs you wish ran faster that you are willing to share? http://bit.ly/pySparkUDF