spark 2013-04-17

The Spark Ecosystem

Michael Malak

technicaltidbit.com

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  – Berkeley Team
  – BDAS (Berkeley Data Analytics Stack)
  – RDDs (Resilient Distributed Datasets)
  – Shark
  – Spark Streaming
  – Other Spark subsystems

Global Big Data, Apr 23, 2013, technicaltidbit.com

What Hadoop Gives Us

• HDFS
• Map/Reduce

Hadoop: HDFS

Image from mark.chmarny.com

Hadoop: Map/Reduce

Image from people.apache.org/~rdonkin

Image from blog.octo.com

Map/Reduce Tools

[Diagram: Map/Reduce tools on Hadoop. Labels: Hadoop; HBase, Pig, Hive; App, Pig Script, HiveQL]

Hadoop Distribution Dogs in the Race

Hadoop Distribution Query Tool

Stinger

Apache Drill

Other Open Source Solutions

• Druid
• Spark

Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming

Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation

BDAS (Berkeley Data Analytics Stack)

[Diagram: the BDAS layers. Labels: Hadoop/HDFS; Bagel, Shark, Spark Streaming; Bagel App, Shark App, Spark Streaming App, Spark App]

RDDs (Resilient Distributed Datasets)

Image from Matei Zaharia’s paper

RDDs: Laziness

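val lines = spark.textFile("hdfs://...")

val errors = lines.filter(_.startsWith("ERROR"))   // _.startsWith("ERROR") is short for x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))
                  .filter(_.contains("foo"))       // All lazy: nothing has been computed yet

val cnt = errors.count                             // Action! This triggers the computation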
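The laziness shown above can be tried without a cluster. What follows is only a local analogy, not Spark itself: a plain Scala `Iterator` stands in for the RDD, `sampleLines` is made-up data standing in for the HDFS file, and the `inspected` counter is a hypothetical probe added to show when the filter actually runs.

```scala
object LazyPipelineSketch {
  // Hypothetical sample log lines standing in for the HDFS file.
  val sampleLines: Seq[String] = Seq(
    "ERROR\tdisk\tfoo failed",
    "INFO\tnet\tall good",
    "ERROR\tnet\tbar failed"
  )

  /** Returns (lines inspected before the action, lines inspected after it, count). */
  def run(): (Int, Int, Int) = {
    var inspected = 0
    // Building the chain does no work: Iterator.filter and Iterator.map are lazy.
    val errors = sampleLines.iterator
      .filter { line => inspected += 1; line.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))
    val before = inspected   // no line has been pulled through the chain yet
    val cnt = errors.size    // the "action": forces the whole chain to run
    (before, inspected, cnt)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The analogy is loose: an `Iterator` can be consumed only once and keeps no lineage, whereas an RDD can be recomputed or cached.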

RDDs: Transformations vs. Actions

Transformations

map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions

reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala is the same as <K,V> in C++ templates and Java generics
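To make the split concrete, here is a word count written against local Scala collections rather than RDDs. It is a sketch under stated assumptions: the `reduceByKey` helper below is a hypothetical local stand-in that only mimics the shape of the RDD operation of the same name, and its `[K, V]` type parameters are the Scala generics mentioned in the note above.

```scala
object WordCountSketch {
  // Hypothetical local stand-in for reduceByKey(func):
  // merge all values that share a key using func.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(_.split(" "))   // plays the role of flatMap(func)
    val pairs = words.map(w => (w, 1))        // plays the role of map(func)
    reduceByKey(pairs)(_ + _)                 // plays the role of reduceByKey[K,V](func)
  }

  def main(args: Array[String]): Unit =
    println(wordCounts(Seq("to be or", "not to be")))
}
```

On an RDD the first three steps would be transformations that only describe the result, and nothing would run until an action such as collect() or count() asked for it. The local version evaluates eagerly at each step.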

Hive vs. Shark

Hive: HDFS files

Shark: HDFS files + RDDs

Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Creates a table that is stored in a cluster’s memory using RDD.cache().

Shark: Just a Shim

Images from Reynold Xin’s presentation

What about “Big Data”?

Median Hadoop job input size

Image from Reynold Xin’s presentation

Spark Streaming: Motivation

x1,000,000 clients HDFS

DStream

Spark Streaming: DStream

• “A series of small batches”

RDD
  {"id": "hercman", "eventType": "buyGoods"}
  {"id": "shewolf", "eventType": "error"}

. . .

RDD
  {"id": "catlover", "eventType": "buyGoods"}
  {"id": "hercman", "eventType": "logOff"}
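The "series of small batches" idea can be sketched locally by cutting a sequence of events into fixed-length time windows. This is an illustration only, not how Spark Streaming is implemented: the `Event` class, the `arrivalMs` timestamp field, and the `toBatches` helper are all hypothetical names introduced here.

```scala
object MicroBatchSketch {
  // Hypothetical event record; the arrival time in milliseconds is an assumption.
  case class Event(id: String, eventType: String, arrivalMs: Long)

  // Cut events into consecutive batches of batchMs milliseconds each.
  // Windows that received no events are simply absent from the result.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.arrivalMs / batchMs)   // window index for each event
      .toSeq
      .sortBy(_._1)                         // put the windows back in time order
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("hercman", "buyGoods", 100L),
      Event("shewolf", "error", 1500L),
      Event("catlover", "buyGoods", 2100L),
      Event("hercman", "logOff", 3900L)
    )
    toBatches(events, 2000L).foreach(batch => println(batch.map(_.id)))
  }
}
```

Each inner sequence plays the role of one RDD in the DStream, so every batch can be processed with the ordinary RDD operations.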

Spark Streaming: DAG

[Diagram: the DAG]

Kafka -> DStream[String] (JSON) -> DStream.transform -> DStream[EvObj]
DStream[EvObj] -> DStream.filter(_.eventType == "error")
DStream[EvObj] -> DStream.filter(_.eventType == "buyGoods") -> DStream.map(e => (e.id, 1)) -> DStream.groupByKey
Output: DStream.foreach(println)

Spark Streaming: Example Code

// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG
// Parse each JSON string into an evObj
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: print the number of error events in each batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: print the number of distinct users buying in each batch
val usersBuying = events.filter(_.eventType == "buyGoods")
                        .map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start
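What the two branches compute for a single batch can be mimicked on a local collection. This is a sketch, not the streaming code itself: `Event` is a hypothetical stand-in for the `evObj` class (which the slides do not show), and `groupBy` on a local `Seq` stands in for `groupByKey`.

```scala
object BatchDagSketch {
  // Hypothetical stand-in for the evObj class used on the slide.
  case class Event(id: String, eventType: String)

  /** Returns (number of error events, number of distinct users buying) for one batch. */
  def processBatch(batch: Seq[Event]): (Int, Int) = {
    // Branch 1: filter to errors, then count.
    val errorCount = batch.count(_.eventType == "error")
    // Branch 2: filter to purchases, key by user id, group by key.
    val usersBuying = batch.filter(_.eventType == "buyGoods")
      .map(e => (e.id, 1))
      .groupBy(_._1)        // local stand-in for groupByKey: one entry per user
    (errorCount, usersBuying.size)
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(
      Event("hercman", "buyGoods"),
      Event("shewolf", "error"),
      Event("catlover", "buyGoods"),
      Event("hercman", "buyGoods")
    )
    println(processBatch(batch))
  }
}
```

Grouping by key before counting is what turns "number of purchase events" into "number of users who bought something": a user with several purchases in the batch is counted once.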

Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None   // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)       // first error for this user: start at 0
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _       =>
    })
    Option(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
// Keep logOff events as well as errors so that updateFunc can see them
val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = errorsAndLogOffs.map(e => (e.id, e))    // key each event by user id
                             .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
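The way per-key state is carried from one batch to the next can be simulated with a fold over local batches. This is a hedged sketch of the idea behind `updateStateByKey`, not its implementation: `Event`, `update`, `step`, and `run` are hypothetical names, and a plain `Int` replaces the `ErrorsPerUser` class to keep the example short.

```scala
object StatefulSketch {
  // Hypothetical stand-in for the evObj class used on the slide.
  case class Event(id: String, eventType: String)

  // Same shape as the slide's update function: new values + old state => new state.
  // Returning None removes the key from the state.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else {
      val total = state.getOrElse(0) + values.count(_.eventType == "error")
      if (total > 0) Some(total) else None
    }

  // Local stand-in for one updateStateByKey step: apply `update` to every key
  // that has new events in this batch or already has state.
  def step(state: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    val keys = state.keySet ++ byUser.keySet
    keys.flatMap { k =>
      update(byUser.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  // Thread the state through all batches in order.
  def run(batches: Seq[Seq[Event]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(step)

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(Event("shewolf", "error"), Event("hercman", "buyGoods")),
      Seq(Event("shewolf", "error"), Event("catlover", "error")),
      Seq(Event("shewolf", "logOff"))
    )
    println(run(batches))
  }
}
```

Note that a key with existing state is updated even in a batch where it received no events (with an empty `values` sequence), which is why catlover's count survives the third batch in the example.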

Other Spark Subsystems

• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)

Teaser

• Future Meetup: Machine learning from real-time data streams
