
Page 1: Real-Life Apache Spark: Tips and Tricks from the Trenches

Real-Life Apache Spark: Tips and Tricks from the Trenches Noah Bieler

Wealthport AG

Zurich Spark Meetup, March 2016

#ZurichSparkUsers

Page 2: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

2

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 3: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

3

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 4: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark and the MapReduce Model

4

Map Reduce

Express your computations in terms of map (embarrassingly parallel) and reduce operations.

[Diagram: making a sandwich. The ingredients (bread, tomato, cheese) are mapped in parallel and then reduced into a sandwich.]
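To make the model concrete, here is a minimal sketch of a map and a reduce over an RDD (my addition, not from the deck; it assumes an existing SparkContext sc, and the ingredient strings are purely illustrative):

// map: prepare each ingredient independently (embarrassingly parallel)
val prepared = sc.parallelize(Seq("bread", "tomato", "cheese"))
  .map(ingredient => s"sliced $ingredient")

// reduce: combine the prepared pieces into one result
val sandwich = prepared.reduce((a, b) => s"$a + $b")

println(sandwich) // e.g. "sliced bread + sliced tomato + sliced cheese"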

Page 5: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark RDD

5

• RDDs (Resilient Distributed Datasets) are the abstraction Spark uses to model parallelism.

• They follow the MapReduce model (map, reduce, immutability).

• RDDs in the code are only instructions to compute something. What Spark actually does is not obvious (optimisations, predicate pushdown, etc.).

• Since the actual computation is "delayed", you cannot use RDDs within RDDs.

[Diagram]
In the code: rdd1 →map f→ rdd2 →map g→ rdd3 →map h→ rdd4 →count→ c: Long
In the VM:   rdd1 →map f . g . h→ rdd2 →count→ c: Long
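A minimal illustration of this laziness (my addition; assumes an existing SparkContext sc):

// Nothing is computed here, these are only instructions
val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = rdd1.map(_ + 1)   // map f
val rdd3 = rdd2.map(_ * 2)   // map g
val rdd4 = rdd3.map(_ - 3)   // map h

// Only the action triggers the (fused) computation on the cluster
val c: Long = rdd4.count()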

Page 6: Real-Life Apache Spark: Tips and Tricks from the Trenches

RDD: Mental Model

6

[Diagram]
On the driver: rdd1 →map f→ rdd2 →action (the driver only holds the plan and the results of actions; parallelize ships driver-side data out to the nodes).
On the nodes: each partition (Partition 1 … Partition N) of rdd1 applies map f to its own slice independently.

object Main {
  def test = {
    val rdd: RDD[Int] = ...

    val a = 1 + 2 + 3   // happens on the driver

    rdd.map { i =>
      i + a             // happens on the nodes
    }
  }
}


Page 7: Real-Life Apache Spark: Tips and Tricks from the Trenches

RDDs vs. DataFrames (since 1.3) vs. DataSets (since 1.6)

7

RDDs are the most basic building block in Spark. Limited API, but full control and type safety.

DataFrames are RDDs of Rows (= Seq[Any], so no type safety!) with a schema; basically a table. More methods but less control: for example, you cannot control the partitioning. SQL statements can be used.

The new Datasets (since Spark 1.6) are like RDDs (type safety) but with the (optimised) methods known from DataFrames (e.g. count).
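For orientation, a small sketch (my addition, assuming the Spark 1.6-era SQLContext API with its implicits imported and an existing SparkContext sc) that builds all three abstractions from the same data:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val people = Seq(Person("Ada", 36), Person("Grace", 45))

val rdd = sc.parallelize(people)   // RDD[Person]: typed, full control
val df  = rdd.toDF()               // DataFrame: Rows + schema, optimiser, SQL
val ds  = people.toDS()            // Dataset[Person] (1.6): typed, with optimised methods

df.registerTempTable("people")     // enables SQL on the DataFrame
sqlContext.sql("SELECT name FROM people WHERE age > 40").show()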

Page 8: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

8

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 9: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfall: Join

9

[Diagram]
join: RDD[(K, V)] join RDD[(K, W)] → RDD[(K, (V, W))]
Partition 1: rdd1 (1 -> "abc", 2 -> "dfg", …) join rdd2 (1 -> 3.142, 2 -> 2.718, …) → result (1 -> ("abc", 3.142), 2 -> ("dfg", 2.718), …)
Partition 2: rdd1 (3 -> "hij", 4 -> "xzy", …) join rdd2 (3 -> 1.618, 4 -> 8.314, …) → result (3 -> ("hij", 1.618), 4 -> ("xzy", 8.314), …)
With both RDDs partitioned the same way, the join is done partition by partition: no network traffic.

Before you join, make sure that the two RDDs are partitioned with the same partitioner.

rdd.partitionBy(new HashPartitioner(4 * nodeCount))
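A slightly fuller sketch of the pattern (my addition; rdd1: RDD[(K, V)], rdd2: RDD[(K, W)] and nodeCount are assumed to be in scope):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4 * nodeCount)

// Partition both sides with the *same* partitioner and persist them,
// so the shuffle done by partitionBy is not repeated.
val left  = rdd1.partitionBy(partitioner).persist()
val right = rdd2.partitionBy(partitioner).persist()

// Co-partitioned join: matching keys already live in the same partition,
// so the join itself needs no further shuffle.
val joined = left.join(right)   // RDD[(K, (V, W))]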

Page 10: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Join

10

Don't use map on a partitioned pair RDD; use mapValues where possible. Otherwise the partitioning is destroyed.

[Diagram]
rdd1 (1 -> "abc", 2 -> "dfgh", …)
  map { case (k, v) => (k, v.size) }  → rdd2 (1 -> 3, 2 -> 4, …)
    Spark cannot know whether the key was changed. → Partitioning is erased.
  mapValues(_.size)                   → rdd2 (1 -> 3, 2 -> 4, …)
    Spark knows that the key was not changed. → Partitioning is kept.
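A quick way to see the difference (my addition; assumes an existing SparkContext sc): inspect the partitioner before and after each operation.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(1 -> "abc", 2 -> "dfgh"))
  .partitionBy(new HashPartitioner(4))

val mapped    = pairs.map { case (k, v) => (k, v.size) }
val mapValued = pairs.mapValues(_.size)

println(pairs.partitioner.isDefined)      // true
println(mapped.partitioner.isDefined)     // false: map may change the key, so the partitioning is dropped
println(mapValued.partitioner.isDefined)  // true:  mapValues keeps the partitioner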

Page 11: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Joining a large and a small RDD

11

[Diagram]
rdd1, not partitioned (1 -> "a", 1 -> "b", 2 -> "c", 1 -> "c", 3 -> "c", …), is joined against a broadcast copy of the small rdd2 (1 -> 3.142, …) via sc.broadcast(rdd2.collect()).

When joining a large RDD with a small one, it can be better to broadcast the small one, especially if the RDDs would otherwise have to be repartitioned first.
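A minimal sketch of such a map-side join (my addition; large: RDD[(Int, String)] and small: RDD[(Int, Double)] are assumed, and the small side must fit in memory on every executor):

// Pull the small RDD to the driver and ship one read-only copy to each executor
val smallAsMap   = small.collect().toMap
val broadcastMap = sc.broadcast(smallAsMap)

// Join on the large side without shuffling it at all
val joined = large.flatMap { case (k, v) =>
  broadcastMap.value.get(k).map(w => k -> (v, w))
}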

Page 12: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Persistence

12

Persist DataFrames/RDDs from which more than one branch of transformations is computed.

[Diagram] rdd1 →map f→ rdd2 (.persist()), which then branches: →map g→ rdd3 → … and →map h→ rdd4 → …

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object RDDs {

  /** Automatically persist and unpersist an RDD
    * before and after the calculation. */
  def withPersistedRDD[A, B](
      rdd: RDD[A],
      storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY   // default added so the call below compiles
  )(f: RDD[A] => B): B = {
    val result = Try(f(rdd.persist(storageLevel)))
    rdd.unpersist()
    result.get
  }
}

// Usage:
withPersistedRDD(rdd1.map(f)) { rdd2 =>
  val rdd3 = rdd2.map(g)
  val rdd4 = rdd2.map(h)
  /* ... */
  result
}

Page 13: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Serialisation

13

class Algorithm1(val primeNumber: Int) extends Serializable {

  def run(rdd: RDD[String]): RDD[Int] = {
    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}

You actually use this.primeNumber and therefore serialise the whole instance (including veryLargeTable).

class Algorithm2(val primeNumber: Int) {

  def run(rdd: RDD[String]): RDD[Int] = {
    val _primeNumber = primeNumber   // local copy, only this value is captured by the closure

    rdd.map { s =>
      s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
    }
  }

  val veryLargeTable = Seq(/* ... */)
}

A local copy of this.primeNumber avoids serialising the whole instance.

Page 14: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Serialisation

14

class Algorithm(val primeNumber: Int) extends Serializable {

  def hash(s: String) =
    s.foldLeft(0) { case (hash, c) => hash + primeNumber * c.toInt }

  def run(rdd: RDD[String]): RDD[Int] =
    rdd.map { s => hash(s) }

  val veryLargeTable = Seq(/* ... */)
}

You actually use this.hash and therefore serialise the whole instance (including veryLargeTable).

class Algorithm(val primeNumber: Int) {

  def hashFunction(): String => Int = {
    val _primeNumber = primeNumber   // local copy, so the returned closure does not capture `this`
    (s: String) => s.foldLeft(0) { case (hash, c) => hash + _primeNumber * c.toInt }
  }

  def run(rdd: RDD[String]): RDD[Int] = {
    val hash = hashFunction()
    rdd.map { s => hash(s) }
  }

  val veryLargeTable = Seq(/* ... */)
}

A function factory for hash avoids serialising the whole instance.

Page 15: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: MapLike not Serializable

15

object Main {
  val myMap = Map(1 -> "a", 2 -> "bc", 3 -> "def")
    .mapValues(_.size) // produces a MapLike view, which is not Serializable
    .map(identity)     // produces a proper Map again

  val myOtherMap = /* ... */

  val totalSize = sc.parallelize(Seq(myMap, myOtherMap))
    .map(_.size)
    .reduce(_ + _)     // would fail with a NotSerializableException without map(identity), see SI-7005
}

After calling mapValues on a Map, call map(identity) on the result to avoid a NotSerializableException.

Page 16: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Avoid groupByKey followed by mapValues

16

The collection produced by groupByKey can be very large; avoid it where possible.

val rdd = sc.parallelize(Seq(
  "Hello", "World", "Bonjour", "Monde", "Guten Tag", "Welt"
))

val histogram1 = rdd
  .map(_.size -> null)
  .groupByKey()        // : RDD[(Int, Iterable[Null])], materialises all values per key
  .mapValues(_.size)

val histogram2 = rdd
  .map(_.size -> 1)
  .reduceByKey(_ + _)  // combines locally before shuffling, no large collections

Page 17: Real-Life Apache Spark: Tips and Tricks from the Trenches

Spark Pitfalls: Row’s and null’s

17

• Spark's Row is nothing but a wrapper around Seq[Any]: no type safety!
• A Row will return null if no value is present:

  row.getAs[String](index) == null // no exception!
  row(index) == null               // no exception either

• A Row can lose its schema.

A proper type hierarchy would not even define getAs(fieldName: String) for schema-less Rows!

dataFrame
  .map { row =>
    val newRow = Row.fromSeq(row.toSeq.updated(timeIndex, timeStamp))
    row.getAs[Int]("ID") -> newRow // access element by field name
  }
  .map { case (id, row) =>
    id -> row.getAs[String]("First Name") // fails: newRow was built with Row.fromSeq and has no schema!
  }

Page 18: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

18

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 19: Real-Life Apache Spark: Tips and Tricks from the Trenches

Pimp my Spark: The “Pimp my Library” Pattern

19

object RowImplicits {

  implicit class RowImplicit(row: Row) {

    def updated[T](attributeId: AttributeId, value: T): Row = {
      val newRow = Row.fromSeq(row.toSeq.updated(row.fieldIndex(attributeId), value))
      Option(row.schema).map(newRow.withSchema).getOrElse(newRow)
    }

    def withSchema(schema: StructType): Row =
      new GenericRowWithSchema(row.toSeq.toArray, schema)

    def getStringOption(attributeIndex: Int): Option[String] =
      if (row.isNullAt(attributeIndex)) None
      else Some(row.getString(attributeIndex))
  }
}

Add functionality to classes from any library.
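A small usage sketch (my addition; AttributeId is assumed here to be a String-like column identifier, and row is a Row that carries a schema):

import RowImplicits._

val renamed   = row.updated("First Name", "Alice")   // keeps the schema if one is present
val maybeCity = row.getStringOption(3)               // None instead of null or an exception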

Page 20: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

20

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 21: Real-Life Apache Spark: Tips and Tricks from the Trenches

User Defined Types

21

Functional programming stands on three pillars:
• Variables are immutable (no side effects)
• Functions are first-class citizens (higher-order functions)
• Algebraic data types (strongly typed)

A good type hierarchy ensures that each function accepts only valid input and produces sane output.

Thus, it is essential that Spark supports custom data types.

http://pt.slideshare.net/ScottWlaschin/fp-patterns-buildstufflt

def div(nominator: Int, denominator: NonZeroInteger) =
  nominator / denominator.value

def div(nominator: Int, denominator: Int) =
  denominator match {
    case 0 => None
    case _ => Some(nominator / denominator)
  }
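NonZeroInteger is not defined on the slide; a minimal sketch of such a type (my assumption) is a class with a private constructor and a smart constructor, so that invalid values are unrepresentable:

final class NonZeroInteger private (val value: Int)

object NonZeroInteger {
  /** Smart constructor: the only way to obtain a NonZeroInteger. */
  def fromInt(value: Int): Option[NonZeroInteger] =
    if (value == 0) None else Some(new NonZeroInteger(value))
}

// The zero check happens exactly once, at construction time
NonZeroInteger.fromInt(2).map(d => div(10, d))   // Some(5)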

Page 22: Real-Life Apache Spark: Tips and Tricks from the Trenches

User Defined Types

22

@SQLUserDefinedType(udt = classOf[EntityIdType])
case class EntityId(uuid: UUID) extends Serializable

object EntityId {
  def generate(): EntityId = EntityId(UUID.randomUUID())
}

case object EntityIdType extends EntityIdType

If you want to identify your rows with UUIDs, you need a user-defined type, since Spark does not support UUIDs out of the box.

Page 23: Real-Life Apache Spark: Tips and Tricks from the Trenches

User Defined Types

23

class EntityIdType private () extends UserDefinedType[EntityId] {

  override def sqlType: DataType = StringType

  override def serialize(obj: Any): UTF8String = obj match {
    case null        => null.asInstanceOf[UTF8String]
    case t: EntityId => UTF8String.fromString(t.uuid.toString)
    case _           => throw new IllegalArgumentException(/* ... */)
  }

  override def deserialize(datum: Any): EntityId = datum match {
    case s: UTF8String => new EntityId(UUID.fromString(s.toString))
    case s: String     => new EntityId(UUID.fromString(s))
    case _             => throw new IllegalArgumentException(/* ... */)
  }

  override def userClass: Class[EntityId] = classOf[EntityId]
}

Sometimes Spark serialises using normal Strings, sometimes using UTF8Strings.

Page 24: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

24

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 25: Real-Life Apache Spark: Tips and Tricks from the Trenches

Running Spark in the Cloud (AWS)

25

Three cluster managers:
• Standalone
• Apache Mesos
• Hadoop's YARN

Two possibilities:
• Create EC2 instances and use the spark-ec2 scripts to manage them. Time-consuming, and not everything works out of the box (e.g. the encoding has to be set manually).
• Use Amazon EMR for a managed environment. Pricier, and releases lag a bit behind. Uses YARN.

Both methods let you access data on S3 (AWS storage).

$ cat spark/conf/spark-defaults.conf
spark.akka.frameSize              1000
spark.driver.memory               11g
spark.driver.extraJavaOptions     -XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory             55g
spark.executor.extraJavaOptions   -XX:+HeapDumpOnOutOfMemoryError
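For completeness, a small sketch (my addition; the bucket and path are placeholders, and whether you use the s3n:// or s3a:// scheme depends on your Hadoop version) of reading S3 data from a job on either setup:

// Spark reads S3 objects through the Hadoop filesystem layer
val logs = sc.textFile("s3n://my-bucket/logs/2016/03/*")   // hypothetical bucket and path
println(logs.count())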

Page 26: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

26

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 27: Real-Life Apache Spark: Tips and Tricks from the Trenches

Running Spark and Cassandra

27

Cassandra is a distributed NoSQL database optimised for fault tolerance. Initially developed at Facebook and now used world-wide (Twitter, Reddit, …).

We have just started to experiment with it and fixed some bugs related to user-defined types and Scala 2.10 reflection.

We use DataStax's connector to bridge Spark and Cassandra; it recently gained support for Spark 1.6 (previously 1.5).

We use cassandra-unit to write our unit tests.
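A minimal sketch of how the connector is typically used (my addition; keyspace, table and column names are placeholders):

import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra

// Read a table into an RDD of CassandraRow
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// Write a pair RDD back to a table
sc.parallelize(Seq((1, "Alice"), (2, "Bob")))
  .saveToCassandra("my_keyspace", "users_by_id", SomeColumns("id", "name"))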

Page 28: Real-Life Apache Spark: Tips and Tricks from the Trenches

Overview

28

• Spark Intro
• Spark Pitfalls: Joining, Persistence, Serialisation and more
• Pimp my Library Pattern
• User Defined Types
• Spark on AWS
• Spark & Cassandra
• Testing Spark

Page 29: Real-Life Apache Spark: Tips and Tricks from the Trenches

Testing Spark

29

import java.net.ServerSocket

import scala.collection.mutable

import com.datastax.driver.core.Session
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, BeforeAndAfterEach, FunSuite, Matchers}

abstract class TestBase extends FunSuite
  with BeforeAndAfterAll with BeforeAndAfterEach with Matchers {

  protected val sparkConfigProperties = mutable.Map[String, String]()
  protected implicit var sparkContext: SparkContext = _
  protected implicit var sqlContext: SQLContext = _
  protected implicit var cassandraSession: Session = _

  override def beforeAll(): Unit = {
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")

    val conf = new SparkConf()
      .setMaster("local")
      .set("spark.testing", "true")
      .set("spark.ui.enabled", "false")
      // Bind to free ports to avoid clashes with tests running in parallel
      .set("spark.master.ui.port", String.valueOf(new ServerSocket(0).getLocalPort))
      .set("spark.worker.ui.port", String.valueOf(new ServerSocket(0).getLocalPort))
      .setAll(sparkConfigProperties)

    sparkContext = new SparkContext(conf)
    sqlContext = new SQLContext(sparkContext)
  }

  override def afterAll(): Unit = {
    sparkContext.stop()
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.hostPort")
  }
}
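A tiny usage sketch (my addition) of a test built on this base class:

class WordCountTest extends TestBase {

  test("counts words per key") {
    val counts = sparkContext
      .parallelize(Seq("a" -> 1, "a" -> 1, "b" -> 1))
      .reduceByKey(_ + _)
      .collectAsMap()

    counts("a") shouldBe 2
    counts("b") shouldBe 1
  }
}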

Page 30: Real-Life Apache Spark: Tips and Tricks from the Trenches

Testing Spark and Cassandra

30

class CassandraTest extends TestBase {

  sparkConfigProperties("spark.cassandra.connection.host") = "127.0.0.1"

  override def beforeAll(): Unit = {
    EmbeddedCassandraServerHelper.startEmbeddedCassandra("cassandra.yaml", 300000)

    sparkConfigProperties("spark.cassandra.connection.port") =
      EmbeddedCassandraServerHelper.getNativeTransportPort.toString

    super.beforeAll()

    cassandraSession = CassandraConnector(sparkContext.getConf).openSession()

    cassandraSession.execute("DROP KEYSPACE IF EXISTS test_keyspace")
    val dataLoader = new CQLDataLoader(cassandraSession)
    dataLoader.load(new ClassPathCQLDataSet("cassandra/create_schema.cql", true, "test_keyspace"))
  }

  override def afterAll(): Unit = {
    cassandraSession.close()
    EmbeddedCassandraServerHelper.cleanEmbeddedCassandra()
    super.afterAll()
  }
}

Page 31: Real-Life Apache Spark: Tips and Tricks from the Trenches

Wealthport AG, Rütistrasse 16, CH-8952 Schlieren, +41 43 508 50 96, [email protected], www.wealthport.com

Getting your data back into shape.