Spark as a Distributed Scala
TRANSCRIPT
Spark As A Distributed Scala
Who is who? Alexey Zvolinskiy
~4 years of Scala experience
Writes a lot on his blog www.Fruzenshtein.com
Currently interested in Scala, Akka, Spark…
Passing through the Functional Programming in Scala Specialization on Coursera
@Fruzenshtein
Why Scala?
What makes Scala so great?
1. Functional programming language*
2. Immutability
3. Type system
4. Collections API
5. Pattern matching
6. Implicits
Functional programming language
1. Functions are first-class citizens
2. Totality
3. Determinism
4. Purity
[Diagram: a function A => B maps inputs A1…An to outputs B1…Bn, each Ai to exactly one Bi]
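A minimal Scala sketch of these four properties (the function names are illustrative, not from the talk):

// Pure, total and deterministic: defined for every Int, no side effects,
// and the same input always gives the same output.
def double(x: Int): Int = x * 2

// Not total: for x == 0 it throws instead of returning a Double.
def reciprocal(x: Int): Double =
  if (x == 0) throw new ArithmeticException("undefined for 0") else 1.0 / x

// Neither deterministic nor pure: the result depends on a random source
// and the call prints to the console (a side effect).
def greet(name: String): String = {
  println(s"greeting $name")
  if (scala.util.Random.nextBoolean()) s"Hi, $name" else s"Hello, $name"
}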
Immutability
1. Makes code more predictable
2. Reduces the effort needed to understand code
3. Key to thread-safety
Books: Java Concurrency in Practice, Effective Java (2nd Edition)
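A small sketch of what immutability buys in practice (Account is a made-up example, not from the slides):

// An immutable collection can be shared between threads without locks.
val scores = List(65, 87, 98)
val curved = scores.map(_ + 5)   // new list; scores itself never changes

// Case classes are immutable by default; "updates" produce modified copies.
case class Account(owner: String, balance: Int)
val a       = Account("Alex", 100)
val updated = a.copy(balance = a.balance + 50)   // a is still Account("Alex", 100)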
Type system
1. Static typing
2. Type inference
3. Bounds and variance: Map[K, V], List[T1 <: T2], Set[+T]
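A short sketch of type inference and an upper bound (the max helper is illustrative, not from the talk):

// Type inference: the compiler infers Map[String, Int] without an annotation.
val ages = Map("Alex" -> 30, "Kate" -> 25)

// Upper bound: T must be a subtype of Comparable[T], so compareTo is available.
def max[T <: Comparable[T]](a: T, b: T): T =
  if (a.compareTo(b) >= 0) a else b

max("scala", "spark")                         // "spark"
max(Integer.valueOf(3), Integer.valueOf(7))   // 7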
Collections API
val numbers = List(1,2,3,4,5,6,7,8,9,10)
numbers.filter(_ % 2 == 0)   // filter takes an Int => Boolean, here (n => n % 2 == 0)
       .map(_ * 10)          // map takes an Int => Int, here (n => n * 10)
// List(20, 40, 60, 80, 100)
Collections API
val groupsOfStudents = List(
  List(("Alex", 65), ("Kate", 87), ("Sam", 98)),
  List(("Peter", 84), ("Bob", 79), ("Samanta", 71)),
  List(("Rob", 82), ("Jack", 55), ("Ann", 90))
)
groupsOfStudents.flatMap(students => students)
  .groupBy(student => student._2 > 75)
  .get(true).get
// List((Kate,87), (Sam,98), (Peter,84), (Bob,79), (Rob,82), (Ann,90))
And what?!
Parallelism
Idea of parallelism
How to divide a problem into subproblems?
How to use the hardware optimally?
Parallelism background
Scala parallel collections
val from0to100000: Range = 0 until 100000
val list = from0to100000.toList
val parList = list.par   // scala.collection.parallel.immutable.ParSeq[Int]
Some benchmarks

def isPrime(n: Int): Boolean = !((2 until n - 1).exists(n % _ == 0))

val list = from0to100000.toList
for (i <- 1 to 10) {
  val t0 = System.currentTimeMillis()
  list.filter(isPrime(_))
  println(System.currentTimeMillis - t0)
}

val parList = list.par
for (i <- 1 to 10) {
  val t1 = System.currentTimeMillis()
  parList.filter(isPrime(_))
  println(System.currentTimeMillis - t1)
}
list (sequential), ms:  7106, 6467, 6315, 6275, 6478, 8732, 6543, 6296, 6299, 6286
parList (parallel), ms: 5130, 5106, 4649, 4568, 4580, 4446, 4447, 4437, 4290, 4476
Ok, but what about Spark?!
Why distributed computations?
Parallel collections (Scala): single machine (shared memory)
RDDs (Spark): multiple nodes (network)
Almost the same API
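A hedged sketch of what "almost the same API" looks like side by side (sc is an assumed SparkContext, the word list is made up, and parallel collections are used as in Scala 2.12):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def sameApiDemo(sc: SparkContext): Unit = {
  val words = List("spark", "scala", "akka", "spark", "scala", "spark")

  // Scala parallel collection: one machine, threads over shared memory.
  val localCount = words.par.count(_ == "spark")        // 3

  // Spark RDD: the same style of call, distributed over the cluster.
  val rdd: RDD[String] = sc.parallelize(words)
  val clusterCount = rdd.filter(_ == "spark").count()   // 3

  println(s"local: $localCount, distributed: $clusterCount")
}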
RDD example
[Diagram: the RDD is partitioned across several Spark worker nodes]
val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
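The slide leaves out the Tweet type and the data source; a minimal hedged sketch of how such an RDD might be built (the case class and sample data are assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Tweet(author: String, body: String) {
  def contains(word: String): Boolean = body.contains(word)
}

def bigDataTweets(sc: SparkContext): RDD[Tweet] = {
  val tweets: RDD[Tweet] = sc.parallelize(Seq(
    Tweet("alex", "learning #bigdata with Spark"),
    Tweet("kate", "cats and coffee")
  ))
  tweets.filter(_.contains("bigdata"))   // still lazy; evaluated when an action runs
}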
Latency
Numbers from Jeff Dean (http://research.google.com/people/jeff/, https://gist.github.com/2841832); graph and scale by Thomas Lee
Computation model
memory: seconds - days
disk: weeks - months
network: weeks - years
Spark transformations & actions
1. Transformations are lazy: map, filter, flatMap, …
2. Actions are eager: reduce, collect, count, …

val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
      .map(t => (t.author, t.body))   // lazy: nothing is computed yet

val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
      .map(t => (t.author, t.body))
      .collect()                      // eager: the action triggers the computation
Rules of thumb
1. Cache
2. Apply transformations efficiently (e.g. filter before map)
3. Avoid shuffling

val tweets: RDD[Tweet] = …
val cachedTweets = tweets.cache()

// filter first, then map: fewer elements reach the map step
cachedTweets.filter(_.contains("USA"))
            .map(t => (t.author, t.body))

// map first, then filter: every tweet is mapped before any is discarded
cachedTweets.map(t => (t.author, t.body))
            .filter(_.contains("USA"))
Shuffling
Transaction(id: Int, amount: Int)
We want to know how much money each client spent.

Partitions before the shuffle:
(1, 240) (2, 500) (2, 105)
(3, 100) (1, 200) (1, 500)
(1, 450) (3, 100) (3, 100)

groupByKey() moves every pair with the same key to the same node:
(2, [500, 105])
(1, [240, 200, 500, 450])
(3, [100, 100, 100])
Reduce before group

The same partitions as above:
(1, 240) (2, 500) (2, 105)
(3, 100) (1, 200) (1, 500)
(1, 450) (3, 100) (3, 100)

reduceByKey(…) first combines values with the same key inside each partition:
(1, 240) (2, 605)
(3, 100) (1, 700)
(1, 450) (3, 200)

Only the partially reduced pairs are then shuffled and grouped by key:
(2, [605])
(1, [240, 700, 450])
(3, [100, 200])
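A hedged code sketch of the transactions example (sc and the sample data are assumptions; the slides only show the diagrams):

import org.apache.spark.SparkContext

def totalsPerClient(sc: SparkContext): Unit = {
  val transactions = sc.parallelize(Seq(
    (1, 240), (2, 500), (2, 105),
    (3, 100), (1, 200), (1, 500),
    (1, 450), (3, 100), (3, 100)
  ))

  // groupByKey shuffles every (id, amount) pair before summing.
  val viaGroup = transactions.groupByKey().mapValues(_.sum)

  // reduceByKey sums inside each partition first, so far less data is shuffled.
  val viaReduce = transactions.reduceByKey(_ + _)

  viaReduce.collect().foreach(println)   // (1,1390), (2,605), (3,300) in some order
}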
Thanks :)
@Fruzenshtein