spark workshop
![Page 1: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/1.jpg)
Spark Workshop
Basics and Streaming
Wojciech Pituła
June 29, 2015
Grupa Wirtualna Polska
![Page 2: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/2.jpg)
Agenda
Scala
Spark
Development
Architecture
Spark SQL
Spark Streaming
![Page 3: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/3.jpg)
Scala
![Page 4: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/4.jpg)
Vals, vars and defs
[wpitula@wpitula-e540 tmp]$ sbt console
...
Welcome to Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.8.0_45).
Type in expressions to have them evaluated.
Type :help for more information.

scala> var foo = 1
foo: Int = 1

scala> def fooMultipliedBy(x: Double) = foo*x
fooMultipliedBy: (x: Double)Double

scala> val result = fooMultipliedBy(2)
result: Double = 2.0

scala> result = fooMultipliedBy(3)
<console>:10: error: reassignment to val

scala> foo = 2
foo: Int = 2

scala> fooMultipliedBy(2)
res1: Double = 4.0
![Page 5: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/5.jpg)
> pl.wp.sparkworkshop.scala.exercise1
![Page 6: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/6.jpg)
Classes and Objects
scala> class Person(age: Int = 22) {
     |   def canDrink(limit: Int = 18) = age >= limit // public by default
     | }
defined class Person

scala> (new Person).canDrink()
res2: Boolean = true

scala> (new Person(18)).canDrink(21)
res3: Boolean = false

scala> import scala.util.Random
import scala.util.Random

scala> object Person {
     |   def inAgeRange(from: Int, to: Int) = new Person(from + Random.nextInt(to - from))
     | }
defined object Person

scala> Person.inAgeRange(15, 17).canDrink()
res4: Boolean = false
![Page 7: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/7.jpg)
Classes and Objects 2
∙ Case classes can be seen as plain, immutable data-holding objects that should depend exclusively on their constructor arguments.
∙ case class = class + factory method + pattern matching + equals/hashCode + toString + copy

scala> case class Rational(n: Int, d: Int = 1)
defined class Rational

scala> val (a, b, c) = (Rational(1,2), Rational(3,4), Rational(1,2))
a: Rational = Rational(1,2)
b: Rational = Rational(3,4)
c: Rational = Rational(1,2)

scala> a == c
res0: Boolean = true

scala> a.copy(d = 3)
res1: Rational = Rational(1,3)
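The ingredients listed above can be seen one at a time in a short sketch. `Point` and `describe` are hypothetical names used only for illustration; they are not part of the workshop code.

```scala
// Hypothetical case class, used only to illustrate the generated members.
case class Point(x: Int, y: Int = 0)

val p = Point(1, 2)          // factory method: no `new` needed
val q = p.copy(y = 5)        // copy with one field changed

// Pattern matching through the generated extractor
def describe(pt: Point): String = pt match {
  case Point(0, 0) => "origin"
  case Point(x, 0) => s"on the x axis at $x"
  case Point(x, y) => s"at ($x, $y)"
}

val sameValue = p == Point(1, 2)                    // structural equality
val sameHash  = p.hashCode == Point(1, 2).hashCode  // consistent with equals
val text      = p.toString                          // readable toString
```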
![Page 8: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/8.jpg)
> pl.wp.sparkworkshop.scala.exercise2
![Page 9: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/9.jpg)
Higher order functions
scala> def add1(x:Int, y:Int) = x+y // "method"
add1: (x: Int, y: Int)Int

scala> val add2 = add1 _ // converted method
add2: (Int, Int) => Int = <function2>

scala> val add3 = (x:Int, y:Int) => x+y // function literal
add3: (Int, Int) => Int = <function2>

scala> def magic(func: (Int, Int) => Int) = func(4,3)
magic: (func: (Int, Int) => Int)Int

scala> magic(add1)
res0: Int = 7
![Page 10: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/10.jpg)
Higher order functions 2
scala> def transformer(x:Int)(func: ((Int,Int) => Int)) = (y:Int) => func(x, y)
transformer: (x: Int)(func: (Int, Int) => Int)Int => Int

scala> transformer _
res0: Int => (((Int, Int) => Int) => (Int => Int)) = <function1>
![Page 11: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/11.jpg)
Higher order functions 3
scala> def transformer(x:Int)(func: ((Int,Int) => Int)) = (y:Int) => func(x, y)
transformer: (x: Int)(func: (Int, Int) => Int)Int => Int

scala> transformer _
res0: Int => (((Int, Int) => Int) => (Int => Int)) = <function1>

scala> val five = transformer(5) _
five: ((Int, Int) => Int) => (Int => Int) = <function1>

scala> val fivePlus = five(_+_)
fivePlus: Int => Int = <function1>

scala> val fivePlusThree = fivePlus(3)
fivePlusThree: Int = 8

scala> transformer(5)(_+_)(3)
res1: Int = 8
![Page 12: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/12.jpg)
Type params
class Value[T](x: T) {
  def map[V](func: T => V): Value[V] = new Value(func(x))
}

case class Vector2[T : Numeric](x: T, y: T) {
  val num = implicitly[Numeric[T]]
  import num._

  def transform[V : Numeric](func: T => V) = Vector2(func(x), func(y))

  def join(other: Vector2[T], joinFunc: (T, T) => T) = ???

  def +(other: Vector2[T]) = join(other, _+_)
  def -(other: Vector2[T]) = join(other, _-_)

  // [2,3] ^ 2 = [4, 9]
  def ^(exp: Int): Vector2[T] = ???
}

> pl.wp.sparkworkshop.scala.exercise3
![Page 13: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/13.jpg)
Collections
1. scala.collection, scala.collection.immutable, and scala.collection.mutable
2. immutable imported by default
3. (List, ListBuffer), (Array, ArrayBuffer), (String, StringBuffer), Set, Map
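A minimal sketch of the immutable/mutable pairing from the list above, using List and ListBuffer. The value names are illustrative only.

```scala
import scala.collection.mutable.ListBuffer

val fixed = List(1, 2, 3)         // immutable: available without an import
val extended = 0 :: fixed         // builds a new list; `fixed` is unchanged

val buffer = ListBuffer(1, 2, 3)  // mutable counterpart, imported explicitly
buffer += 4                       // modifies the buffer in place
val frozen = buffer.toList        // immutable snapshot of the current contents
```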
![Page 14: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/14.jpg)
Collections 2
scala> List(1,2,3,4,5,6) // alternatively (1 to 6).toList
res0: List[Int] = List(1, 2, 3, 4, 5, 6)

scala> res0.map(_*3)
res1: List[Int] = List(3, 6, 9, 12, 15, 18)

scala> res1.filter(_%2 == 0)
res3: List[Int] = List(6, 12, 18)

scala> res3.foldLeft(0)(_+_)
res4: Int = 36

scala> res3.foreach(println)
6
12
18

scala> for(x <- res3; y <- res1 if y%2==1) yield (x,y)
res7: List[(Int, Int)] = List((6,3), (6,9), (6,15), (12,3), (12,9), (12,15), (18,3), (18,9), (18,15))
![Page 15: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/15.jpg)
> pl.wp.sparkworkshop.scala.exercise4
![Page 16: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/16.jpg)
Pattern Matching
scala> case class Foo(foo: Any, bar: Any)
scala> def recognize(obj: Any) = {
     |   obj match {
     |     case str: String => s"string $str"
     |     case Foo(Some(1), Foo(_, _)) => "some very complicated case"
     |     case (x, y) => s"tuple of $x and $y"
     |     case _ => "Boring"
     |   }
     | }

scala> recognize(1)
res0: String = Boring

scala> recognize("something")
res1: String = string something

scala> recognize(Foo(Some(1), Foo("","")))
res3: String = some very complicated case

scala> recognize((1,2))
res4: String = tuple of 1 and 2
![Page 17: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/17.jpg)
> pl.wp.sparkworkshop.scala.exercise5
![Page 18: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/18.jpg)
Sbt
val sparkVersion = "1.2.1"

lazy val root = (project in file("."))
  .settings(
    name := "spark-streaming-app",
    organization := "pl.wp.sparkworkshop",
    version := "1.0-SNAPSHOT",
    scalaVersion := "2.11.5",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
      "org.scalatest" %% "scalatest" % "2.2.1" % "test",
      "org.mockito" % "mockito-core" % "1.10.19" % "test"
    ),
    resolvers ++= Seq(
      "My Repo" at "http://repo/url"
    ))
  .settings(
    publishMavenStyle := true,
    publishArtifact in Test := false,
    pomIncludeRepository := { _ => false },
    publishTo := {
      // base URL of the repository manager; the paths below are appended to it
      val nexus = "http://repo/url/"
      if (isSnapshot.value)
        Some("snapshots" at nexus + "content/repositories/snapshots")
      else
        Some("releases" at nexus + "content/repositories/releases")
    })
![Page 19: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/19.jpg)
Exercise
"A prime number (or a prime) is a natural number which has exactly two distinct natural number divisors: 1 and itself. Your task is to test whether the given number is a prime number."

def isPrime(x: Int): Boolean

> pl.wp.sparkworkshop.scala.exercise6
![Page 20: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/20.jpg)
Exercise - Solution
implicit class PotentiallyPrime(x: Int) {
  def isPrime(): Boolean = {
    (1 to x).filter(x % _ == 0) == List(1, x)
  }
}

val is5Prime = 5.isPrime
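The solution above tries every candidate divisor from 1 to x. As an alternative that is not from the slides, trial division can stop at the square root of x, since any larger divisor has a smaller partner. `isPrimeFast` is a hypothetical name chosen for this sketch.

```scala
// Trial division up to sqrt(x); an alternative to the slide's solution, not a replacement for it.
def isPrimeFast(x: Int): Boolean =
  x >= 2 &&
    Iterator.from(2)
      .takeWhile(d => d.toLong * d <= x)  // Long arithmetic avoids overflow near Int.MaxValue
      .forall(d => x % d != 0)

val smallPrimes = (1 to 20).filter(isPrimeFast)
```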
![Page 21: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/21.jpg)
Spark
![Page 22: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/22.jpg)
Development
![Page 23: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/23.jpg)
RDD
An RDD is an immutable, deterministically re-computable, distributed dataset.
Each RDD remembers the lineage of deterministic operations that were used on a fault-tolerant input dataset to create it.
Each RDD can be operated on in parallel.
![Page 24: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/24.jpg)
Sources
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)

∙ Parallelized Collections
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

∙ External Datasets: Any storage source supported by Hadoop: local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
![Page 25: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/25.jpg)
Transformations and Actions
RDDs support two types of operations:
∙ transformations, which create a new dataset from an existing one
∙ actions, which return a value to the driver program after running a computation on the dataset.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
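The laziness described above can be observed in a small local experiment. This is a sketch that assumes Spark is on the classpath; the local master, the application name and the value names are choices made here for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local master and app name are assumptions for a quick experiment.
val conf = new SparkConf().setAppName("LazyDemo").setMaster("local[2]")
val sc = new SparkContext(conf)

val lines    = sc.parallelize(Seq("a", "bb", "ccc"))
val lengths  = lines.map(_.length)     // transformation: only recorded, nothing runs yet
val longOnes = lengths.filter(_ > 1)   // still nothing computed

println(longOnes.toDebugString)        // prints the lineage remembered so far

val total = longOnes.reduce(_ + _)     // action: triggers the computation (2 + 3)
sc.stop()
```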
![Page 26: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/26.jpg)
Transformations
map[U](f: (T) => U): RDD[U]
Return a new distributed dataset formed by passing each element of the source through the function f.

filter(f: (T) => Boolean): RDD[T]
Return a new dataset formed by selecting those elements of the source on which f returns true.

union(other: RDD[T]): RDD[T]
Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(other: RDD[T]): RDD[T]
Return a new RDD that contains the intersection of elements in the source dataset and the argument.

groupByKey(): RDD[(K, Iterable[V])]
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs.
and much more
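The transformations listed above, applied to two small RDDs. This is a sketch assuming Spark is on the classpath; the local master and all value names are illustrative. `collect()` and `count()` are actions, used here only to bring results back to the driver for inspection.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed before Spark 1.3

val sc = new SparkContext(
  new SparkConf().setAppName("TransformationsDemo").setMaster("local[2]"))

val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5))

val squares  = a.map(x => x * x)                     // one output element per input element
val evens    = a.filter(_ % 2 == 0)                  // keeps elements matching the predicate
val both     = a.union(b)                            // keeps duplicates
val common   = a.intersection(b)                     // element order is not guaranteed
val byParity = a.map(x => (x % 2, x)).groupByKey()   // RDD[(Int, Iterable[Int])]

val squaresLocal  = squares.collect().toList
val evensLocal    = evens.collect().toList
val unionCount    = both.count()
val commonLocal   = common.collect().toSet
val byParityLocal = byParity.mapValues(_.toSet).collect().toMap
sc.stop()
```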
![Page 27: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/27.jpg)
Actions
reduce(f: (T, T) => T): T
Aggregate the elements of the dataset using the function f (which takes two arguments and returns one).

collect(): Array[T]
Return all the elements of the dataset as an array at the driver program.

count(): Long
Return the number of elements in the dataset.

foreach(f: (T) => Unit): Unit
Run the function f on each element of the dataset.
and much more
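A short sketch of the actions listed above on a single RDD, assuming Spark is on the classpath. The local master and value names are illustrative choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("ActionsDemo").setMaster("local[2]"))

val numbers = sc.parallelize(1 to 10)

val sum     = numbers.reduce(_ + _)     // combines all elements into one value
val asArray = numbers.collect()         // copies every element to the driver
val howMany = numbers.count()           // number of elements, as a Long
numbers.foreach(n => println(n))        // runs on the executors; output order may vary
sc.stop()
```

Because `collect()` copies the whole dataset to the driver, it is best kept for small results.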
![Page 28: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/28.jpg)
spark-shell
Just like Scala REPL but with SparkContext

> ./bin/spark-shell --master "local[4]"
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> sc.parallelize(List("Hello world")).foreach(println)
Hello world
![Page 29: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/29.jpg)
> pl.wp.sparkworkshop.spark.core.exercise1.FirstCharsCount
![Page 30: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/30.jpg)
spark-submit
Application jar
A jar containing the user’s Spark application. Users should create an "uber jar" containing their application along with its dependencies. The user’s jar should never include Hadoop or Spark libraries; these will be added at runtime.

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://10.0.0.1:7077,10.0.0.2:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
![Page 31: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/31.jpg)
> pl.wp.sparkworkshop.spark.core.exercise2.LettersCount
![Page 32: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/32.jpg)
Shared variables
∙ Broadcast Variables
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

∙ Accumulators
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Int = 10
![Page 33: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/33.jpg)
Underlying Akka
"Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM."

import akka.actor.{Actor, ActorLogging, ActorSystem, Props}

case class Greeting(who: String)

class GreetingActor extends Actor with ActorLogging {
  def receive = {
    case Greeting(who) => log.info("Hello " + who)
  }
}

val system = ActorSystem("MySystem")
val greeter = system.actorOf(Props[GreetingActor], name = "greeter")
greeter ! Greeting("Charlie Parker")
![Page 34: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/34.jpg)
Architecture
![Page 35: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/35.jpg)
Clusters
∙ Standalone
∙ Apache Mesos
∙ Hadoop YARN
∙ local[*]
![Page 36: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/36.jpg)
Master, Worker, Executor and Driver
Driver program
The process running the main() function of the application and creating the SparkContext.

Cluster manager
An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).

Worker node
Any node that can run application code in the cluster.

Executor
A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
![Page 37: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/37.jpg)
Running a standalone cluster

Master
./sbin/start-master.sh
# OR
./bin/spark-class org.apache.spark.deploy.master.Master --ip `hostname` --port 7077 --webui-port 8080

Worker
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.1:7077,10.0.0.2:7077
![Page 38: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/38.jpg)
Job, Stage, Task
Job
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).

Stage
Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.

Task
A unit of work that will be sent to one executor.
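To connect these terms to code, here is a small word count sketched under the assumption that Spark is on the classpath; the local master, input data and names are illustrative. The single `collect()` call submits one job. The shuffle introduced by `reduceByKey` splits that job into two stages, and each stage runs as a set of tasks, typically one per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed before Spark 1.3

val sc = new SparkContext(
  new SparkConf().setAppName("StagesDemo").setMaster("local[2]"))

val pairs = sc.parallelize(Seq("a b", "b c", "c c"), 2)   // 2 partitions
  .flatMap(_.split(" "))
  .map(word => (word, 1))                 // narrow transformations: stay in the first stage

val counts = pairs.reduceByKey(_ + _)     // needs a shuffle: a second stage begins here

println(counts.toDebugString)             // lineage; the shuffle appears as a ShuffledRDD
val result = counts.collect().toMap       // the action that submits the job
sc.stop()
```

While the application is running, the same job and its stages can be inspected in the Spark UI.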
![Page 39: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/39.jpg)
SparkUI
![Page 40: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/40.jpg)
> src/main/scala/pl/wp/sparkworkshop/spark/core/exercise3/submit.sh
![Page 41: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/41.jpg)
Configuration - spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://some/path/on/hdfs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.rdd.compress true

spark.executor.extraJavaOptions -Dlog4j.loghost.Prefix=hadoop-spark-poc-display-executor -Dlog4j.localRollingFile.FileName=spark-poc-display-executor.log
spark.driver.extraJavaOptions -Dlog4j.loghost.Prefix=hadoop-spark-poc-display-driver -Dlog4j.localRollingFile.FileName=spark-poc-display-driver.log

spark.streaming.unpersist true
spark.task.maxFailures 8
spark.executor.logs.rolling.strategy time
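The same kind of properties can also be set programmatically on a SparkConf. The sketch below assumes Spark is on the classpath and repeats three of the keys shown above; the application name is a hypothetical placeholder.

```scala
import org.apache.spark.SparkConf

// `false` skips loading spark.* JVM system properties,
// so only the values set explicitly below are present.
val conf = new SparkConf(false)
  .setAppName("ConfiguredApp")   // hypothetical application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
  .set("spark.task.maxFailures", "8")

val serializer  = conf.get("spark.serializer")
val maxFailures = conf.getInt("spark.task.maxFailures", 4)   // 4 is only the fallback value
```

Values set directly on a SparkConf take precedence over spark-submit flags, which in turn take precedence over spark-defaults.conf.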
![Page 42: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/42.jpg)
Configuration - spark-env.sh
HADOOP_CONF_DIR=/etc/hadoop
SPARK_SUBMIT_CLASSPATH="/some/libs/to/put/on/classpath/"
SPARK_LOCAL_DIRS=/tmp/dir
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=3g
SPARK_WORKER_OPTS="-Dlog4j.loghost.Prefix=node-spark-worker -Dlog4j.localRollingFile.FileName=spark-worker.log"
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181"
![Page 43: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/43.jpg)
Spark SQL
![Page 44: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/44.jpg)
DataFrame
A DataFrame is a distributed collection of data organized into named columns.
DataFrame ≈ RDD[Row] ≈ RDD[String] + schema
![Page 45: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/45.jpg)
DataFrame Operations
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Show the content of the DataFrame
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin
![Page 46: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/46.jpg)
DataFrame Operations 2
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1
![Page 47: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/47.jpg)
SQL Queries
case class Person(name: String, age: Int)

// Needed to implicitly convert an RDD to a DataFrame with toDF()
import sqlContext.implicits._

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers =
  sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

val hc = new org.apache.spark.sql.hive.HiveContext(sc)

val negativesQuery = s"""select event
  |from scoring.display_balanced_events lateral view explode(events) e as event
  |where event.label=0""".stripMargin

val negatives = hc.sql(negativesQuery).limit(maxCount)
![Page 48: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/48.jpg)
Spark Streaming
![Page 49: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/49.jpg)
Overview
![Page 50: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/50.jpg)
DStream
DStream
discretized stream
sequence of RDDs
![Page 51: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/51.jpg)
Receivers
∙ Directory
∙ Actors
∙ Custom
∙ Kafka
∙ Flume
∙ Kinesis
∙ Twitter
![Page 52: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/52.jpg)
Transformations
All are lazy!
map, flatMap, filter, count
updateStateByKey(func), reduceByKey, join
window(windowLength, slideInterval), countByWindow, reduceByWindow
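A sketch of two of the stateful and windowed operations named above, assuming Spark Streaming is on the classpath. The function names and the 30 and 10 second durations are illustrative assumptions. The functions only describe the pipeline; nothing is computed until the StreamingContext is started.

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits, needed before Spark 1.3
import org.apache.spark.streaming.dstream.DStream

// Merge the counts seen in the current batch into the running total for one key.
def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
  Some(running.getOrElse(0) + newValues.sum)

// Running count per word since the stream started.
// updateStateByKey requires a checkpoint directory to be set on the context.
def runningCounts(words: DStream[String]): DStream[(String, Int)] =
  words.map(w => (w, 1)).updateStateByKey(updateCount _)

// Count per word over the last 30 seconds, recomputed every 10 seconds.
// Both durations must be multiples of the batch interval.
def windowedCounts(words: DStream[String]): DStream[(String, Int)] =
  words.map(w => (w, 1))
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```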
![Page 53: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/53.jpg)
Outputs
∙ print
∙ saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
∙ foreachRDD
![Page 54: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/54.jpg)
Example
> pl.wp.sparkworkshop.spark.streaming.exercise1.SocketWordsCount
val conf = new SparkConf().setAppName("Example")
val ssc = new StreamingContext(conf, Seconds(10))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

// Start the computation
ssc.start()
ssc.awaitTermination() // Wait for the computation to terminate
![Page 55: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/55.jpg)
ForeachRDD
import org.apache.spark.streaming.dstream.DStream
val dstream : DStream[(String, String)] = ???
// we're at the driver
dstream.foreachRDD(rdd =>
  // still at the driver
  rdd.foreachPartition(partition =>
    // now we're at the worker
    // anything has to be serialized or static to get here
    partition.foreach(elem =>
      // still at the worker
      println(elem)
    )
  )
)
![Page 56: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/56.jpg)
Checkpoints
∙ Metadata checkpointing
  ∙ Configuration
  ∙ DStream operations
  ∙ Incomplete batches
∙ Data checkpointing - Saving of the generated RDDs to reliable storage. In stateful transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time.
![Page 57: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/57.jpg)
Checkpoints - example
def createStreamingContext(): StreamingContext = {
  val ssc = new StreamingContext(...) // new context
  ssc.checkpoint(checkpointDirectory) // set checkpoint directory

  val lines = ssc.socketTextStream(...) // create DStreams
  lines.checkpoint(Seconds(120))
  ...
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, createStreamingContext _)

// Start the context
context.start()
context.awaitTermination()
![Page 58: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/58.jpg)
> pl.wp.sparkworkshop.spark.streaming.exercise2.StreamLettersCount
![Page 59: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/59.jpg)
Tuning
∙ Reducing the processing time of each batch of data by efficiently using cluster resources.
  ∙ Level of Parallelism in Data Receiving
  ∙ Level of Parallelism in Data Processing
  ∙ Data Serialization
∙ Setting the right batch size such that the batches of data can be processed as fast as they are received (that is, data processing keeps up with the data ingestion).
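One way to raise the level of parallelism in receiving and processing is sketched below, assuming Spark Streaming is on the classpath. `parallelLines` and its parameters are hypothetical names: it opens one socket receiver per port, merges the streams, and repartitions the result before further processing.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: several receivers in parallel, then a repartition for processing.
def parallelLines(ssc: StreamingContext,
                  host: String,
                  ports: Seq[Int],
                  partitions: Int): DStream[String] = {
  // Parallelism in receiving: one receiver per port.
  val streams: Seq[DStream[String]] = ports.map(port => ssc.socketTextStream(host, port))
  // Parallelism in processing: spread the merged data over `partitions` partitions.
  ssc.union(streams).repartition(partitions)
}
```

Each receiver occupies one core for as long as it runs, so the application needs more cores than receivers to leave room for processing.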
![Page 60: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/60.jpg)
Further reading
∙ Programming guides (core, sql, streaming)
∙ Integration guides (kafka, flume, etc.)
∙ API Docs
∙ Mailing list
![Page 61: Spark workshop](https://reader031.vdocument.in/reader031/viewer/2022030304/587893aa1a28ab375f8b6299/html5/thumbnails/61.jpg)
Questions?