Spark as a Distributed Scala
TRANSCRIPT
Spark As A Distributed Scala
Who is who? Alexey Zvolinskiy
~4 years of Scala experience
Writes a lot on his blog www.Fruzenshtein.com
Currently interested in Scala, Akka, Spark…
Passing through the Functional Programming in Scala Specialization on Coursera
@Fruzenshtein
Why Scala?
What makes Scala so great?
1. Functional programming language*
2. Immutability
3. Type system
4. Collections API
5. Pattern matching
6. Implicits
Functional programming language
1. Functions are first-class citizens
2. Totality
3. Determinism
4. Purity
[Diagram: a function A => B maps inputs A1…An to outputs B1…Bn, each Ai to exactly one Bi]
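A minimal Scala sketch of these four properties (the function names are illustrative, not from the talk):

// Pure, total and deterministic: defined for every Int, no side effects,
// and the same input always gives the same output.
def double(x: Int): Int = x * 2

// Not total: for x == 0 it throws instead of returning a Double.
def reciprocal(x: Int): Double =
  if (x == 0) throw new ArithmeticException("undefined for 0") else 1.0 / x

// Neither deterministic nor pure: the result depends on a random source
// and the call prints to the console (a side effect).
def greet(name: String): String = {
  println(s"greeting $name")
  if (scala.util.Random.nextBoolean()) s"Hi, $name" else s"Hello, $name"
}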
Immutability
1. Makes code more predictable
2. Reduces the effort needed to understand code
3. Key to thread-safety
Books: Java Concurrency in Practice, Effective Java (2nd Edition)
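A small sketch of what immutability buys in practice (Account is a made-up example, not from the slides):

// An immutable collection can be shared between threads without locks.
val scores = List(65, 87, 98)
val curved = scores.map(_ + 5)   // new list; scores itself never changes

// Case classes are immutable by default; "updates" produce modified copies.
case class Account(owner: String, balance: Int)
val a       = Account("Alex", 100)
val updated = a.copy(balance = a.balance + 50)   // a is still Account("Alex", 100)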
Type system
1. Static typing
2. Type inference
3. Bounds and variance: Map[K, V], List[T1 <: T2], Set[+T]
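A short sketch of type inference and an upper bound (the max helper is illustrative, not from the talk):

// Type inference: the compiler infers Map[String, Int] without an annotation.
val ages = Map("Alex" -> 30, "Kate" -> 25)

// Upper bound: T must be a subtype of Comparable[T], so compareTo is available.
def max[T <: Comparable[T]](a: T, b: T): T =
  if (a.compareTo(b) >= 0) a else b

max("scala", "spark")                         // "spark"
max(Integer.valueOf(3), Integer.valueOf(7))   // 7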
Collections API
val numbers = List(1,2,3,4,5,6,7,8,9,10)
numbers.filter(_ % 2 == 0)   // filter takes an Int => Boolean, here (n => n % 2 == 0)
       .map(_ * 10)          // map takes an Int => Int, here (n => n * 10)
// List(20, 40, 60, 80, 100)
Collections API
val groupsOfStudents = List(
  List(("Alex", 65), ("Kate", 87), ("Sam", 98)),
  List(("Peter", 84), ("Bob", 79), ("Samanta", 71)),
  List(("Rob", 82), ("Jack", 55), ("Ann", 90))
)
groupsOfStudents.flatMap(students => students)
  .groupBy(student => student._2 > 75)
  .get(true).get
// List((Kate,87), (Sam,98), (Peter,84), (Bob,79), (Rob,82), (Ann,90))
And what?!
Parallelism
Idea of parallelism
How to divide a problem into subproblems?
How to use the hardware optimally?
Parallelism background
Scala parallel collections
val from0to100000: Range = 0 until 100000
val list = from0to100000.toList
val parList = list.par   // scala.collection.parallel.immutable.ParSeq[Int]
Some benchmarks

def isPrime(n: Int): Boolean = !((2 until n - 1).exists(n % _ == 0))

val list = from0to100000.toList
for (i <- 1 to 10) {
  val t0 = System.currentTimeMillis()
  list.filter(isPrime(_))
  println(System.currentTimeMillis - t0)
}

val parList = list.par
for (i <- 1 to 10) {
  val t1 = System.currentTimeMillis()
  parList.filter(isPrime(_))
  println(System.currentTimeMillis - t1)
}
list (sequential), ms:  7106, 6467, 6315, 6275, 6478, 8732, 6543, 6296, 6299, 6286
parList (parallel), ms: 5130, 5106, 4649, 4568, 4580, 4446, 4447, 4437, 4290, 4476
Ok, but what about Spark?!
Why distributed computations?
Parallel collections (Scala): single machine (shared memory)
RDDs (Spark): multiple nodes (network)
Almost the same API
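A hedged sketch of what "almost the same API" looks like side by side (sc is an assumed SparkContext, the word list is made up, and parallel collections are used as in Scala 2.12):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def sameApiDemo(sc: SparkContext): Unit = {
  val words = List("spark", "scala", "akka", "spark", "scala", "spark")

  // Scala parallel collection: one machine, threads over shared memory.
  val localCount = words.par.count(_ == "spark")        // 3

  // Spark RDD: the same style of call, distributed over the cluster.
  val rdd: RDD[String] = sc.parallelize(words)
  val clusterCount = rdd.filter(_ == "spark").count()   // 3

  println(s"local: $localCount, distributed: $clusterCount")
}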
RDD example
[Diagram: the RDD is partitioned across several Spark worker nodes]
val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
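The slide leaves out the Tweet type and the data source; a minimal hedged sketch of how such an RDD might be built (the case class and sample data are assumptions):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Tweet(author: String, body: String) {
  def contains(word: String): Boolean = body.contains(word)
}

def bigDataTweets(sc: SparkContext): RDD[Tweet] = {
  val tweets: RDD[Tweet] = sc.parallelize(Seq(
    Tweet("alex", "learning #bigdata with Spark"),
    Tweet("kate", "cats and coffee")
  ))
  tweets.filter(_.contains("bigdata"))   // still lazy; evaluated when an action runs
}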
Latency
Numbers from Jeff Dean (http://research.google.com/people/jeff/, https://gist.github.com/2841832); graph and scale by Thomas Lee
Computation model
memory: seconds - days
disk: weeks - months
network: weeks - years
Spark transformations & actions
1. Transformations are lazy: map, filter, flatMap, …
2. Actions are eager: reduce, collect, count, …

val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
      .map(t => (t.author, t.body))   // lazy: nothing is computed yet

val tweets: RDD[Tweet] = …
tweets.filter(_.contains("bigdata"))
      .map(t => (t.author, t.body))
      .collect()                      // eager: the action triggers the computation
Rules of thumb
1. Cache
2. Apply transformations efficiently (e.g. filter before map)
3. Avoid shuffling

val tweets: RDD[Tweet] = …
val cachedTweets = tweets.cache()

// filter first, then map: fewer elements reach the map step
cachedTweets.filter(_.contains("USA"))
            .map(t => (t.author, t.body))

// map first, then filter: every tweet is mapped before any is discarded
cachedTweets.map(t => (t.author, t.body))
            .filter(_.contains("USA"))
Shuffling
Transaction(id: Int, amount: Int)
We want to know how much money each client spent.

Partitions before the shuffle:
(1, 240) (2, 500) (2, 105)
(3, 100) (1, 200) (1, 500)
(1, 450) (3, 100) (3, 100)

groupByKey() moves every pair with the same key to the same node:
(2, [500, 105])
(1, [240, 200, 500, 450])
(3, [100, 100, 100])
Reduce before group

The same partitions as above:
(1, 240) (2, 500) (2, 105)
(3, 100) (1, 200) (1, 500)
(1, 450) (3, 100) (3, 100)

reduceByKey(…) first combines values with the same key inside each partition:
(1, 240) (2, 605)
(3, 100) (1, 700)
(1, 450) (3, 200)

Only the partially reduced pairs are then shuffled and grouped by key:
(2, [605])
(1, [240, 700, 450])
(3, [100, 200])
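A hedged code sketch of the transactions example (sc and the sample data are assumptions; the slides only show the diagrams):

import org.apache.spark.SparkContext

def totalsPerClient(sc: SparkContext): Unit = {
  val transactions = sc.parallelize(Seq(
    (1, 240), (2, 500), (2, 105),
    (3, 100), (1, 200), (1, 500),
    (1, 450), (3, 100), (3, 100)
  ))

  // groupByKey shuffles every (id, amount) pair before summing.
  val viaGroup = transactions.groupByKey().mapValues(_.sum)

  // reduceByKey sums inside each partition first, so far less data is shuffled.
  val viaReduce = transactions.reduceByKey(_ + _)

  viaReduce.collect().foreach(println)   // (1,1390), (2,605), (3,300) in some order
}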
Thanks :)
@Fruzenshtein