Beyond Parallelize and Collect by Holden Karau
TRANSCRIPT
[Slide 1]
Beyond Parallelize & Collect
(Effective testing of Spark Programs)
Now mostly “works”*
*See developer for details. Does not imply warranty. :p
[Slide 2]
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Software Engineer
● Currently at IBM; previously Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of Learning Spark & Fast Data Processing with Spark
● @holdenkarau
● Slideshare: http://www.slideshare.net/hkarau
● LinkedIn: https://www.linkedin.com/in/holdenkarau
● Spark videos: http://bit.ly/holdenSparkVideos
[Slide 3]
What is going to be covered:
● What I think I might know about you
● A bit about why you should test your programs
● Using parallelize & collect for unit testing (quick skim)
● Comparing datasets too large to fit in memory
● Considerations for Streaming & SQL (DataFrames & Datasets)
● Cute & scary pictures
  ○ I promise at least one panda and one cat
● “Future Work”
  ○ Integration testing lives here for now (sorry)
  ○ Some of this future work might even get done!
[Slide 4]
Who I think you wonderful humans are?
● Nice* people
● Like silly pictures
● Familiar with Apache Spark
  ○ If not, buy one of my books or watch Paco’s awesome video
● Familiar with one of Scala, Java, or Python
  ○ If you know R well I’d love to chat though
● Want to make better software
  ○ (or models, or w/e)
[Slide 5]
So why should you test?
● Makes you a better person
● Saves $s
  ○ May help you avoid losing your employer all of their money
    ■ Or “users” if we were in the bay
  ○ AWS is expensive
● Waiting for our jobs to fail is a pretty long dev cycle
● This is really just to guilt trip you & give you flashbacks to your QA internships
[Slide 6]
So why should you test - continued
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
[Slide 7]
So why should you test - continued
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
[Slide 8]
Why don’t we test?
● It’s hard
  ○ Faking data, setting up integration tests, urgh, w/e
● Our tests can get too slow
● It takes a lot of time
  ○ and people always want everything done yesterday
  ○ or I just want to go home and see my partner
  ○ etc.
[Slide 9]
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
[Slide 10]
An artisanal Spark unit test:

```scala
@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc

override def beforeAll() {
  _sc = new SparkContext("local[4]", "test")
  super.beforeAll()
}

override def afterAll() {
  if (sc != null) sc.stop()
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
```
Photo by morinesque
[Slide 11]
And on to the actual test...

```scala
test("really simple transformation") {
  val input = List("hi", "hi holden", "bye")
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}

def tokenize(f: RDD[String]) = {
  f.map(_.split(" ").toList)
}
```
Photo by morinesque
[Slide 12]
Wait, where were the batteries?
Photo by Jim Bauer
[Slide 13]
Let’s get batteries!
● Spark unit testing
  ○ spark-testing-base - https://github.com/holdenk/spark-testing-base
  ○ sscheck - https://github.com/juanrh/sscheck
● Integration testing
  ○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
● Performance
  ○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
● Spark job validation
  ○ spark-validator - https://github.com/holdenk/spark-validator
Photo by Mike Mozart
[Slide 14]
A simple unit test re-visited (Scala):

```scala
class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
```
[Slide 15]
Ok but what about problems @ scale
● Maybe our program works fine on our local sized input
● If we are using Spark our actual workload is probably huge
● How do we test workloads too large for a single machine?
  ○ we can’t just use parallelize and collect
Photo by Qfamily
[Slide 16]
Distributed “set” operations to the rescue*
● Pretty close - already built into Spark
● Doesn’t do so well with floating points :(
  ○ damn floating points keep showing up everywhere :p
● Doesn’t really handle duplicates very well (see the sketch below)
  ○ {“coffee”, “coffee”, “panda”} != {“panda”, “coffee”}, but with set operations...
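To make the duplicates problem concrete, here is a minimal sketch of a subtract-based comparison (`setEquals` is a made-up helper for this example, not a Spark or spark-testing-base API), assuming a SparkContext `sc` is in scope:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// "Equal" if neither side contains elements the other lacks. Since
// subtract() removes *every* occurrence of a matching element, duplicate
// counts are silently ignored.
def setEquals[T: ClassTag](expected: RDD[T], result: RDD[T]): Boolean =
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()

val expected = sc.parallelize(Seq("coffee", "coffee", "panda"))
val result = sc.parallelize(Seq("panda", "coffee"))
setEquals(expected, result) // true, even though the multisets differ
```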
Photo by Matti Mattila
[Slide 17]
Or use RDDComparisons:

```scala
def compareWithOrderSamePartitioner[T: ClassTag](
    expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}
```
Photo by Matti Mattila
[Slide 18]
Or use RDDComparisons:

```scala
def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed)
    .filter{case (_, (i1, i2)) =>
      i1.isEmpty || i2.isEmpty || i1.head != i2.head}
    .take(1).headOption
    .map{case (v, (i1, i2)) =>
      (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
```
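Assuming these live on an RDDComparisons object (as in spark-testing-base), a test can then look roughly like:

```scala
// None means no differing (element, expected count, result count) was found,
// so the two RDDs hold the same elements with the same multiplicities,
// regardless of ordering or partitioning.
val expectedRDD = sc.parallelize(Seq(1, 2, 2, 3))
val resultRDD = sc.parallelize(Seq(3, 2, 1, 2))
assert(RDDComparisons.compare(expectedRDD, resultRDD).isEmpty)
```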
Photo by Matti Mattila
[Slide 19]
But where do we get the data for those tests?
● If you have production data you can sample, you are lucky!
  ○ If possible, try to save it in the same format
● If our data is a bunch of Vectors or Doubles, Spark’s got tools :) (see the sketch below)
● Coming up with good test data can take a long time
Photo by Lori Rielly
[Slide 20]
QuickCheck / ScalaCheck
● QuickCheck generates test data under a set of constraints
● The Scala version is ScalaCheck - supported by the two unit testing libraries for Spark:
● sscheck
  ○ Awesome people*, supports generating DStreams too!
● spark-testing-base
  ○ Also awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs

*I assume
Photo by tara hunt
[Slide 21]
With spark-testing-base:

```scala
test("map should not change number of elements") {
  forAll(RDDGenerator.genRDD[String](sc)){ rdd =>
    rdd.map(_.length).count() == rdd.count()
  }
}
```
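For that property to actually run under ScalaTest you need the ScalaCheck bridge mixed in; a minimal full suite, assuming spark-testing-base's RDDGenerator API looks as shown above:

```scala
import org.apache.spark.rdd.RDD
import org.scalacheck.Prop.forAll
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers
import com.holdenkarau.spark.testing.{RDDGenerator, SharedSparkContext}

class MapPropertyTest extends FunSuite with SharedSparkContext with Checkers {
  test("map should not change number of elements") {
    // check() evaluates the ScalaCheck property inside a ScalaTest test.
    check(forAll(RDDGenerator.genRDD[String](sc)) { rdd: RDD[String] =>
      rdd.map(_.length).count() == rdd.count()
    })
  }
}
```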
[Slide 22]
Testing streaming….
Photo by Steve Jurvetson
[Slide 23]
Artisanal Stream Testing Code:

```scala
// Setup our Stream:
class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
    ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {

  def start() {}

  def stop() {}

  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()

    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }

    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}
```

```scala
trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
    with SharedSparkContext {

  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName

  // Master for Spark context
  def master: String = "local[4]"

  // Batch duration
  def batchDuration: Duration = Seconds(1)

  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }

  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }

  // (trait continues on the next slide)
```
Photo by Steve Jurvetson
[Slide 24]
and continued….
```scala
  /**
   * Create an input stream for the provided input sequence. This is done using
   * TestInputStream as queueStream's are not checkpointable.
   */
  def createTestInputStream[T: ClassTag](sc: SparkContext,
      ssc_ : TestStreamingContext, input: Seq[Seq[T]]): TestInputStream[T] = {
    new TestInputStream(sc, ssc_, input, numInputPartitions)
  }

  // Default before function for any streaming test suite. Override this
  // if you want to add your stuff to "before" (i.e., don't call before { } )
  override def beforeAll() {
    if (useManualClock) {
      logInfo("Using manual clock")
      // We can specify our own clock
      conf.set("spark.streaming.clock",
        "org.apache.spark.streaming.util.TestManualClock")
    } else {
      logInfo("Using real clock")
      conf.set("spark.streaming.clock",
        "org.apache.spark.streaming.util.SystemClock")
    }
    super.beforeAll()
  }

  /**
   * Run a block of code with the given StreamingContext and automatically
   * stop the context when the block completes or when an exception is thrown.
   */
  def withOutputAndStreamingContext[R](
      outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
      (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
    val outputStream = outputStreamSSC._1
    val ssc = outputStreamSSC._2
    try {
      block(outputStream, ssc)
    } finally {
      try {
        ssc.stop(stopSparkContext = false)
      } catch {
        case e: Exception =>
          logError("Error stopping StreamingContext", e)
      }
    }
  }
}
```
[Slide 25]
and now for the clock:

```scala
/*
 * Allows us access to a manual clock. Note that the manual clock changed
 * between 1.1.1 and 1.3.
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)

  def getTime(): Long = getTimeMillis()     // Compat
  def currentTime(): Long = getTimeMillis() // Compat

  def getTimeMillis(): Long = synchronized {
    time
  }

  def setTime(timeToSet: Long): Unit = synchronized {
    time = timeToSet
    notifyAll()
  }

  def advance(timeToAdd: Long): Unit = synchronized {
    time += timeToAdd
    notifyAll()
  }

  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

  /**
   * @param targetTime block until the clock time is set or advanced to at least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long = synchronized {
    while (time < targetTime) {
      wait(100)
    }
    getTimeMillis()
  }
}
```
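With the manual clock wired in via spark.streaming.clock, a test can drive batches deterministically; a rough sketch, where `clock` stands for the TestManualClock instance the StreamingContext was configured with:

```scala
// Process exactly numBatches batches by advancing the clock, instead of
// sleeping and hoping wall-clock time lines up with batch boundaries.
val numBatches = 3
ssc.start()
clock.advance(batchDuration.milliseconds * numBatches)
// ...then wait (with a timeout) for numBatches outputs before asserting.
```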
[Slide 26]
Testing streaming the happy panda way
● Creating test data is hard
  ○ ssc.queueStream works - unless you need checkpoints (1.4.1+)
● Collecting the data locally is hard
  ○ foreachRDD & a var (see the sketch below)
● Figuring out when your test is “done”
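The “foreachRDD & a var” trick looks roughly like this sketch (`stream` and `ssc` are assumed to exist; none of these names are library APIs):

```scala
import scala.collection.mutable.ArrayBuffer

// foreachRDD's body runs on the driver, so we can collect each batch into a
// driver-side buffer. Watch out for thread-safety, and for deciding when all
// the expected batches have actually arrived.
val collected = ArrayBuffer[Seq[String]]()
stream.foreachRDD { rdd =>
  collected.synchronized { collected += rdd.collect().toSeq }
}
ssc.start()
// ...advance the clock / wait, then assert on collected...
```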
Let’s abstract all that away into testOperation
[Slide 27]
We can hide all of that:

```scala
test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}
```
Photo by An eye for my mind
[Slide 28]
What about DataFrames?
● We can do the same as we did for RDDs (via .rdd)
● Inside of Spark validation looks like:

```scala
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
```

● Sadly it’s not in a published package & is local only
● Instead we expose:

```scala
def equalDataFrames(expected: DataFrame, result: DataFrame)
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double)
```
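A sketch of the idea behind such a check (an illustration only, not spark-testing-base's actual implementation): compare the schemas first, then reuse the keyed-count comparison from the RDD section on `.rdd`, so it still works for data that won't fit in memory.

```scala
import org.apache.spark.sql.DataFrame

def assertDataFramesEqual(expected: DataFrame, result: DataFrame): Unit = {
  assert(expected.schema == result.schema, "schemas differ")
  // Count occurrences of each row on both sides and look for any mismatch.
  val expectedCounts = expected.rdd.map(row => (row, 1)).reduceByKey(_ + _)
  val resultCounts = result.rdd.map(row => (row, 1)).reduceByKey(_ + _)
  val mismatch = expectedCounts.fullOuterJoin(resultCounts)
    .filter { case (_, (c1, c2)) => c1 != c2 }
    .take(1)
  assert(mismatch.isEmpty, s"first differing row: ${mismatch.headOption}")
}
```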
[Slide 29]
…. and Datasets
● We can do the same as we did for RDDs (via .rdd)
● Inside of Spark validation looks like:

```scala
def checkAnswer(df: Dataset[T], expectedAnswer: T*)
```

● Sadly it’s not in a published package & is local only
● Instead we expose:

```scala
def equalDatasets(expected: Dataset[U], result: Dataset[V])
def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double)
```
[Slide 30]
This is what it looks like:

```scala
test("dataframe should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}
```
*This may or may not be easier.
[Slide 31]
Which has “built-in” large support :)
[Slide 32]
Photo by allison
[Slide 33]
Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
● Especially since, if we were just in local mode, parallelize and collect might be fine
Photo by: Bev Sykes
[Slide 34]
Options beyond local mode:
● Just point at your existing cluster (set master) - see the sketch below
● Start one with your shell scripts & change the master
  ○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
  ○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
  ○ In Spark Testing Base extend SharedMiniCluster
    ■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
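Pointing at an existing cluster can be as simple as making the master configurable; a minimal sketch (SPARK_TEST_MASTER is an environment variable name invented for this example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Default to local[4] for laptop runs; set SPARK_TEST_MASTER in CI to run
// the same tests against a real standalone/YARN/Mesos master URL.
val master = sys.env.getOrElse("SPARK_TEST_MASTER", "local[4]")
val conf = new SparkConf().setMaster(master).setAppName("integration-tests")
val sc = new SparkContext(conf)
```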
Photo by Richard Masoner
[Slide 35]
Validation
● Validation can be really useful for catching errors before deploying a model
  ○ Our tests can’t catch everything
● For now, checking file sizes & execution time seems like the most common best practice (from the survey)
● Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option - see the sketch below
● spark-validator is still in early stages and not ready for production use, but an interesting proof of concept
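A minimal sketch of the accumulator option (the names, the parse function, and the 1% threshold are all invented for illustration):

```scala
// Spark 1.x accumulator API (Spark 2.x would use sc.longAccumulator).
val invalidRecords = sc.accumulator(0L, "invalid records")

val parsed = rawRecords.flatMap { line =>
  parse(line) match {
    case Some(record) => Some(record)
    case None => invalidRecords += 1L; None
  }
}
parsed.saveAsTextFile(outputPath)

// Accumulator values are only meaningful after an action has run, and
// task retries can over-count (part of what SPARK-12469 tracks).
require(invalidRecords.value < 0.01 * expectedRecordCount,
  s"too many invalid records: ${invalidRecords.value}")
```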
Photo by Paul Schadler
[Slide 36]
Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
[Slide 37]
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
[Slide 38]
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
[Slide 39]
And the next book…..
Still being written - sign up to be notified when it is available:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
[Slide 40]
Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *ALPHA*
● spark-perf: https://github.com/databricks/spark-perf
● spark-integration-tests: https://github.com/databricks/spark-integration-tests
● scalacheck: https://www.scalacheck.org/
[Slide 41]
And including spark-testing-base:

sbt:

```scala
"com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.1"
```

maven:

```xml
<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base</artifactId>
  <version>${spark.version}_0.3.1</version>
  <scope>test</scope>
</dependency>
```
[Slide 42]
“Future Work”
● Better ScalaCheck integration (a la sscheck)
● Testing details in my next Spark book
● Whatever* you all want
  ○ Testing with Spark survey: http://bit.ly/holdenTestingSpark

Semi-likely:
● Integration testing (for now, see @cfriegly’s Spark + Docker setup):
  ○ https://github.com/fluxcapacitor/pipeline

Pretty unlikely:
● Integrating into Apache Spark (SPARK-12433)

*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101
[Slide 43]
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to, fill out the survey: http://bit.ly/holdenTestingSpark
I’ll use the updated results in a Strata presentation & will eventually tweet them at @holdenkarau