spark 2013-04-17
The Spark Ecosystem
Michael Malak
technicaltidbit.com
Agenda
• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
– Berkeley Team
– BDAS (Berkeley Data Analytics Stack)
– RDDs (Resilient Distributed Datasets)
– Shark
– Spark Streaming
– Other Spark subsystems
Global Big Data Apr 23, 2013 technicaltidbit.com
What Hadoop Gives Us
• HDFS
• Map/Reduce
Hadoop: HDFS
Image from mark.chmarny.com
Hadoop: Map/Reduce
Image from people.apache.org/~rdonkin
Image from blog.octo.com
Map/Reduce Tools
[Diagram: software stack with the labels Linux, Hadoop, HBase, Pig, Hive, App, Pig Script, HiveQL]
Hadoop Distribution Dogs in the Race
[Table columns: Hadoop Distribution, Query Tool. Query tools listed: Stinger, Apache Drill]
Other Open Source Solutions
• Druid
• Spark
Not just caching, but streaming
• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming
Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program
Image from Ion Stoica’s slides from Strata 2013 presentation
BDAS (Berkeley Data Analytics Stack)
[Diagram: software stack with the labels Linux, Mesos, Hadoop/HDFS, Spark, Bagel, Shark, Spark Streaming, Spark App, Bagel App, Shark App, Spark Streaming App]
RDDs (Resilient Distributed Datasets)
Image from Matei Zaharia’s paper
RDDs: Laziness
// _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
val lines = spark.textFile("hdfs://...")          // lazy
val errors = lines.filter(_.startsWith("ERROR"))  // lazy
                  .map(_.split('\t')(2))          // lazy
                  .filter(_.contains("foo"))      // lazy
val cnt = errors.count                            // Action! Only this line triggers the computation
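Speaker note: the same lazy-then-action behavior can be imitated without a cluster. The sketch below is an analogue built on a plain Scala collection view, not the Spark API; the object name `LazinessSketch`, the sample log lines, and the `examined` counter are all assumptions made up for illustration.

```scala
object LazinessSketch {
  // Hypothetical sample log lines standing in for the HDFS file on the slide.
  val sampleLines: Seq[String] = Seq(
    "ERROR\tdb\tfoo timeout",
    "INFO\tweb\tfoo ok",
    "ERROR\tweb\tbar failed",
    "ERROR\tdb\tfoo again"
  )

  /** Returns (count produced by the "action", lines examined before the action ran). */
  def run(lines: Seq[String]): (Int, Int) = {
    var examined = 0
    // "Transformations": building the view evaluates nothing yet.
    val errors = lines.view
      .filter { l => examined += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))
    val examinedBeforeAction = examined
    // "Action": asking for the size forces the whole pipeline.
    val cnt = errors.size
    (cnt, examinedBeforeAction)
  }

  def main(args: Array[String]): Unit = {
    val (cnt, before) = run(sampleLines)
    println(s"examined before action: $before, count: $cnt")
  }
}
```

A view only mimics the deferral; it does not give the lineage tracking, partitioning, or caching that RDDs provide.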
RDDs: Transformations vs. Actions
Transformations
• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, frac, seed)
• union(otherDataset)
• groupByKey[K,V]
• reduceByKey[K,V](func)
• join[K,V,W](otherDataset)
• cogroup[K,V,W1,W2](other1, other2)
• cartesian[U](otherDataset)
• sortByKey[K,V]
Actions
• reduce(func)
• collect()
• count()
• take(n)
• first()
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• foreach(func)
[K,V] in Scala is the same as <K,V> in C++ templates or Java generics
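Speaker note: to make the [K,V] notation and the key-based transformations concrete, here is a minimal sketch of what groupByKey and reduceByKey compute, written against plain Scala collections rather than RDDs. The object name `PairOpsSketch` and the sample pairs are hypothetical; the real operations run distributed and return RDDs, not Maps.

```scala
object PairOpsSketch {
  // Analogue of groupByKey: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Analogue of reduceByKey: combine each key's values with func.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => k -> vs.reduce(func) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3)
    println(groupByKey(pairs))
    println(reduceByKey(pairs)(_ + _))
  }
}
```

The type parameters K and V are inferred from the pairs, so callers rarely have to write them out.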
Hive vs. Shark
[Diagram: Hive side: HiveQL over HDFS files. Shark side: HiveQL over HDFS files + RDDs]
Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Creates a table that is stored in a cluster’s memory using RDD.cache().
Shark: Just a Shim
Images from Reynold Xin’s presentation
What about “Big Data”?
[Chart: data-size scale with the labels KB, MB, GB, TB, PB, annotated “Shark Effectiveness”]
Median Hadoop job input size
Image from Reynold Xin’s presentation
Spark Streaming: Motivation
[Diagram labels: x1,000,000 clients, HDFS]
Spark Streaming: DStream
• “A series of small batches”
[Diagram: a DStream drawn as a sequence of RDDs, one per 2 sec interval, holding events such as:]
{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
. . .
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}
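Speaker note: "a series of small batches" can be sketched by bucketing timestamped events into fixed intervals. This is a plain-Scala illustration, not Spark Streaming code; the `Event` case class, its `timeMs` field, and the name `MicroBatchSketch` are assumptions introduced for the example.

```scala
object MicroBatchSketch {
  // Hypothetical event type; the timestamp field is an assumption for illustration.
  final case class Event(timeMs: Long, id: String, eventType: String)

  // Assign each event to a batch by integer division of its timestamp,
  // then return the non-empty batches in time order.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100L, "hercman", "buyGoods"),
      Event(1500L, "shewolf", "error"),
      Event(2100L, "catlover", "buyGoods"),
      Event(4500L, "hercman", "logOff")
    )
    toBatches(events, 2000L).zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.map(_.id).mkString(", ")}")
    }
  }
}
```

One simplification to keep in mind: this sketch skips intervals with no events, whereas a streaming system still ticks once per interval.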
Spark Streaming: DAG
The DAG
Kafka DStream[String] (JSON)
  → DStream.transform
    → DStream[EvObj] → DStream.filter(_.eventType == "error") → DStream.foreach(println)
    → DStream[EvObj] → DStream.filter(_.eventType == "buyGoods") → DStream.map(e => (e.id, 1)) → DStream.groupByKey → DStream.foreach(println)
Spark Streaming: Example Code
// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val messages = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON string into an event object
val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: number of error events per batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: number of distinct users buying per batch
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1)).groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start
Stateful Spark Streaming
class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None  // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _ =>
    })
    Option(s)
  }
}

// DAG: keep error and logOff events so the update function can see both
val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
val relevant = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = relevant.map(e => (e.id, e)).updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
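Speaker note: the per-key state update can be traced by hand with a fold over batches. The sketch below is a plain-Scala analogue of the idea behind updateStateByKey, not the Spark Streaming implementation; `StatefulSketch`, its `Event` case class, and the use of a bare Int as the state are assumptions chosen to keep the example short.

```scala
object StatefulSketch {
  // Hypothetical event type for illustration.
  final case class Event(id: String, eventType: String)

  // Same shape as the slide's update function: the new values for one key
  // plus its previous state. Returning None drops the key.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply one batch to the whole state map. Keys that already have state
  // are updated even when the batch holds no new events for them.
  def step(state: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val relevant = batch.filter(e => e.eventType == "error" || e.eventType == "logOff")
    val byId = relevant.groupBy(_.id)
    val keys = state.keySet ++ byId.keySet
    keys.flatMap { id =>
      update(byId.getOrElse(id, Seq.empty), state.get(id)).map(n => id -> n)
    }.toMap
  }

  // Fold the batches in order, carrying the state map forward.
  def run(batches: Seq[Seq[Event]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(step)

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(Event("shewolf", "error"), Event("hercman", "buyGoods")),
      Seq(Event("shewolf", "error"), Event("catlover", "error")),
      Seq(Event("shewolf", "logOff"))
    )
    println("Num users experiencing errors: " + run(batches).size)
  }
}
```

In this trace shewolf accumulates two errors and is then removed on logOff, so only catlover remains in the final state.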
Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)
Teaser
• Future Meetup: Machine learning from real-time data streams