spark 2013-04-17

The Spark Ecosystem

Michael Malak

technicaltidbit.com

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
– Berkeley Team
– BDAS (Berkeley Data Analytics Stack)
– RDDs (Resilient Distributed Datasets)
– Shark
– Spark Streaming
– Other Spark subsystems

Global Big Data Apr 23, 2013 technicaltidbit.com

What Hadoop Gives Us

• HDFS
• Map/Reduce


Hadoop: HDFS

Image from mark.chmarny.com


Hadoop: Map/Reduce


Image from people.apache.org/~rdonkin

Image from blog.octo.com

Map/Reduce Tools

[Stack diagram, bottom to top]
Linux
Hadoop
HBase | Pig | Hive
App | Pig Script | HiveQL

Hadoop Distribution Dogs in the Race


Hadoop Distribution Query Tool

Stinger

Apache Drill

Other Open Source Solutions

• Druid
• Spark


Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & "Push" Map/Reduce
• 3rd generation: Streaming


Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program


Image from Ion Stoica’s slides from Strata 2013 presentation

Spark

BDAS (Berkeley Data Analytics Stack)

[Stack diagram, bottom to top]
Linux
Mesos
Hadoop/HDFS
Bagel | Shark | Spark Streaming
Bagel App | Shark App | Spark Streaming App | Spark App

RDDs (Resilient Distributed Datasets)


Image from Matei Zaharia’s paper

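RDDs: Laziness

// _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
val lines = spark.textFile("hdfs://...")            // lazy
val errors = lines.filter(_.startsWith("ERROR"))    // lazy
  .map(_.split('\t')(2))                            // lazy
  .filter(_.contains("foo"))                        // lazy
val cnt = errors.count                              // Action! This triggers the computation

Everything up to count is lazy; only the action runs the pipeline.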
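The laziness shown above can be tried without a cluster. The following is a minimal sketch that uses a lazy view from the Scala standard library as a stand-in for an RDD; it is not Spark code, and the object name, the `sample` data, and the `examined` counter are assumptions made up for illustration.

```scala
// A local stand-in for the RDD pipeline, using a lazy collection view.
// Assumption: `sample` is hypothetical data, not taken from the slides.
object LazinessSketch {
  val sample: Seq[String] = Seq(
    "ERROR\t2013-04-17\tfoo failed",
    "INFO\t2013-04-17\tall good",
    "ERROR\t2013-04-17\tbar failed"
  )

  // Returns (lines examined before the action, count, lines examined after the action).
  def run(): (Int, Int, Int) = {
    var examined = 0 // how many lines the first filter has looked at

    // Transformations: building the view does no work yet.
    val errors = sample.view
      .filter { line => examined += 1; line.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

    val before = examined   // no line has been examined so far
    val cnt = errors.size   // the "action": forces the whole pipeline
    (before, cnt, examined)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

As with an uncached RDD, asking the view for its size a second time would re-run the filters from scratch.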

RDDs: Transformations vs. Actions

Transformations

map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]()
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions

reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala is the same as <K,V> in C++ templates and Java generics
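To make the [K,V] notation and the key-based transformations concrete, here is a minimal sketch of what groupByKey and reduceByKey compute, written against plain Scala collections rather than RDDs. The helper names `groupByKeyLocal` and `reduceByKeyLocal` and the `pairs` data are hypothetical and are not part of the Spark API.

```scala
// Local analogues of two pair transformations, on ordinary Scala collections.
// Assumption: these helpers are illustrative only; they are not Spark functions.
object PairOpsSketch {
  // Hypothetical (key, value) pairs; in Spark these would be an RDD[(String, Int)].
  val pairs: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))

  // Like groupByKey: collect all values that share a key.
  def groupByKeyLocal[K, V](kvs: Seq[(K, V)]): Map[K, Seq[V]] =
    kvs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // Like reduceByKey(func): combine the values for each key with func.
  def reduceByKeyLocal[K, V](kvs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKeyLocal(kvs).map { case (k, vs) => (k, vs.reduce(func)) }

  def main(args: Array[String]): Unit = {
    println(groupByKeyLocal(pairs))
    println(reduceByKeyLocal(pairs)(_ + _))
  }
}
```

Here K and V are type parameters, so the same helpers work for any key and value types.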

Hive vs. Shark

[Diagram: Hive runs HiveQL against HDFS files; Shark runs HiveQL against HDFS files + RDDs]

Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Creates a table that is stored in a cluster’s memory using RDD.cache().


Shark: Just a Shim


Shark

Images from Reynold Xin’s presentation

What about "Big Data"?

[Chart: data-size scale from KB through MB, GB, and TB to PB, marking the range of Shark effectiveness and the median Hadoop job input size]


Image from Reynold Xin’s presentation

Spark Streaming: Motivation

[Diagram labels: x1,000,000 clients; DStream; RDD; RDD; HDFS]

Spark Streaming: DStream

• "A series of small batches"

[Diagram: a DStream as a sequence of RDDs, one per 2 sec batch]
RDD (2 sec): {"id": "hercman", "eventType": "buyGoods"} {"id": "shewolf", "eventType": "error"}
. . .
RDD (2 sec): {"id": "catlover", "eventType": "buyGoods"} {"id": "hercman", "eventType": "logOff"}
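The "series of small batches" idea can be sketched locally by cutting a list of timestamped events into fixed 2-second windows. This is only an illustration on plain Scala collections, not Spark Streaming code; the `Event` case class, its `timeMs` field, and the timestamps are assumptions, since the slide shows only `id` and `eventType`.

```scala
// Splitting a stream of events into fixed-length batches, one Seq per interval.
// Assumption: Event and its timestamps are hypothetical; the slide has no times.
object BatchingSketch {
  case class Event(timeMs: Long, id: String, eventType: String)

  // Group events by the interval they fall into and return the batches in time order.
  // Intervals that contain no events produce no batch in this simplified version.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  val events: Seq[Event] = Seq(
    Event(100, "hercman", "buyGoods"),
    Event(1500, "shewolf", "error"),
    Event(2200, "catlover", "buyGoods"),
    Event(3900, "hercman", "logOff")
  )

  def main(args: Array[String]): Unit =
    toBatches(events, 2000).foreach(batch => println(batch.map(_.id)))
}
```

Each inner Seq plays the role of one RDD in the DStream.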

Spark Streaming: DAG

[Diagram: the DAG]
Kafka DStream[String] (JSON)
  DStream.transform, producing DStream[EvObj]
    DStream.filter(_.eventType == "error")
      DStream.foreach(println)
    DStream.filter(_.eventType == "buyGoods")
      DStream.map((_.id, 1))
        DStream.groupByKey
          DStream.foreach(println)

Spark Streaming: Example Code

// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an event object
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: print the number of error events in each batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: print the number of distinct users buying in each batch
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1)).groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start

Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None   // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)        // first event for this user: start at 0
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _ =>                                       // ignore other event types
    })
    Some(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
// Key every event by user id so that updateFunc sees both "error" and "logOff" events
val states = events.map(e => (e.id, e)).updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.filter(_._2.numErrors > 0).count))
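The per-key state handling can be simulated on plain Scala collections to see what happens batch by batch. This is a minimal sketch, not Spark Streaming code: the update function is called once per batch for every user with new events or existing state, and returning None removes that user. The `Event` case class, `step`, and `run` are hypothetical names, and the state is simplified to a plain error count.

```scala
// A local simulation of keyed state carried across batches.
// Assumption: Event is a hypothetical stand-in for evObj; state is just an Int.
object StatefulSketch {
  case class Event(id: String, eventType: String)

  // Per-user update: drop the state on "logOff", otherwise add this batch's errors.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply update to every user that has new events or existing state.
  def step(state: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    (state.keySet ++ byUser.keySet).flatMap { id =>
      update(byUser.getOrElse(id, Seq.empty), state.get(id)).map(n => id -> n)
    }.toMap
  }

  // Fold the batches in order, starting from empty state.
  def run(batches: Seq[Seq[Event]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(step)

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(Event("shewolf", "error"), Event("hercman", "buyGoods")),
      Seq(Event("shewolf", "error"), Event("hercman", "logOff"))
    )
    println(run(batches))
  }
}
```

In this run the user who logs off in the second batch disappears from the state, while the other user's error count accumulates across both batches.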

Other Spark Subsystems

• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)


Teaser

• Future Meetup: Machine learning from real-time data streams

