spark 2013-04-17

Post on 26-Jan-2015

The Spark Ecosystem

Michael Malak

technicaltidbit.com

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  – Berkeley Team
  – BDAS (Berkeley Data Analytics Stack)
  – RDDs (Resilient Distributed Datasets)
  – Shark
  – Spark Streaming
  – Other Spark subsystems

Global Big Data, Apr 23, 2013, technicaltidbit.com

What Hadoop Gives Us

• HDFS
• Map/Reduce

Hadoop: HDFS

Image from mark.chmarny.com

Hadoop: Map/Reduce

Image from people.apache.org/~rdonkin

Image from blog.octo.com

Map/Reduce Tools

[Stack diagram, bottom to top: Linux; Hadoop; HBase, Pig, Hive; App, Pig Script, HiveQL]

Hadoop Distribution Dogs in the Race

[Table with columns "Hadoop Distribution" and "Query Tool"; surviving Query Tool entries: Stinger, Apache Drill]

Other Open Source Solutions

• Druid
• Spark

Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming

Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation

BDAS (Berkeley Data Analytics Stack)

[Stack diagram, bottom to top: Linux; Mesos; Hadoop/HDFS; Spark; Bagel, Shark, Spark Streaming; Bagel App, Shark App, Spark Streaming App, Spark App]

RDDs (Resilient Distributed Datasets)

Image from Matei Zaharia’s paper

RDDs: Laziness

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))  // same as x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))
                  .filter(_.contains("foo"))      // everything up to here is lazy
val cnt = errors.count                            // Action! This triggers the computation
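The lazy-until-action behaviour can be tried without a cluster. The following is only an analogy using a plain Scala collection view, not Spark itself; the log lines and the `evaluated` counter are made up for illustration.

```scala
// Plain-Scala analogy for RDD laziness (no Spark needed).
// The log lines below are invented sample data.
object LazinessSketch {
  val lines = Seq(
    "ERROR\tdb\tfoo timeout",
    "INFO\tweb\tstarted",
    "ERROR\tweb\tbar failed",
    "ERROR\tdb\tfoo refused"
  )

  var evaluated = 0 // how many times the first filter has actually run

  // Like RDD transformations, operations on a view only record what to do.
  val errors = lines.view
    .filter { l => evaluated += 1; l.startsWith("ERROR") }
    .map(_.split('\t')(2))
    .filter(_.contains("foo"))

  def main(args: Array[String]): Unit = {
    println(s"before the action: evaluated = $evaluated")
    val cnt = errors.size // plays the role of an RDD action: forces the pipeline
    println(s"count = $cnt, evaluated = $evaluated")
  }
}
```

Nothing is filtered or split until `size` is called, which mirrors how `count` is the step that makes Spark do the work.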

RDDs: Transformations vs. Actions

Transformations

• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, frac, seed)
• union(otherDataset)
• groupByKey[K,V]
• reduceByKey[K,V](func)
• join[K,V,W](otherDataset)
• cogroup[K,V,W1,W2](other1, other2)
• cartesian[U](otherDataset)
• sortByKey[K,V]

Actions

• reduce(func)
• collect()
• count()
• take(n)
• first()
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• foreach(func)

[K,V] in Scala is the same as <K,V> in C++ templates or Java generics
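To make the [K,V] notation and the by-key operations concrete, here is a small plain-Scala sketch of what reduceByKey computes. The `reduceByKey` helper below is a hypothetical local function written for illustration, not the Spark API, and the sample pairs are invented.

```scala
// Plain-Scala sketch of the result reduceByKey produces (no Spark needed).
object ByKeySketch {
  // Hypothetical local helper, not the Spark method of the same name.
  // K and V are type parameters, written [K, V] in Scala.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("hercman", 1), ("shewolf", 1), ("hercman", 1))
    println(reduceByKey(pairs)(_ + _)) // one entry per user id, with the 1s summed
  }
}
```

Grouping first and reducing second is the simplest way to express the result; Spark itself distributes this work across the cluster.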

Hive vs. Shark

[Diagram: HiveQL in and out over HDFS files (Hive) versus HiveQL in and out over HDFS files + RDDs (Shark)]

Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Either statement creates a table that is stored in the cluster’s memory using RDD.cache(). In the second form, the _cached suffix on the table name is what requests caching.

Shark: Just a Shim

Images from Reynold Xin’s presentation

What about “Big Data”?

[Diagram: data-size scale from KB through MB, GB and TB to PB, with the range of “Shark Effectiveness” marked]

Median Hadoop job input size

Image from Reynold Xin’s presentation

Spark Streaming: Motivation

[Diagram: clients (x1,000,000) and HDFS]

Spark Streaming: DStream

• “A series of small batches”

[Diagram: a DStream drawn as a sequence of RDDs, one per 2 sec interval, holding JSON events such as:]

{"id": "hercman", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
{"id": "shewolf", "eventType": "error"}
. . .
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}
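The "series of small batches" idea can be sketched in plain Scala by cutting a timeline of events into fixed 2-second windows. This is an illustration only, not how Spark Streaming is implemented; the `Event` class, the `batches` helper and the `timeMs` arrival times are all assumptions introduced for the example.

```scala
// Plain-Scala sketch of "a series of small batches" (no Spark needed).
object MicroBatchSketch {
  // timeMs is an invented arrival time; the slide only shows id and eventType.
  case class Event(timeMs: Long, id: String, eventType: String)

  // Cut a stream of events into consecutive fixed-width batches.
  // Returns (batch index, events in that batch), ordered by time.
  // Intervals with no events simply do not appear in the result.
  def batches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100, "hercman", "buyGoods"),
      Event(1900, "shewolf", "error"),
      Event(2100, "catlover", "buyGoods"),
      Event(4500, "hercman", "logOff")
    )
    for ((i, evs) <- batches(events, 2000))
      println(s"batch $i: ${evs.map(_.eventType).mkString(", ")}")
  }
}
```

Each batch here plays the role of one RDD in the DStream, so the usual RDD operations can be applied to it.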

Spark Streaming: DAG

The DAG

Kafka
→ DStream[String] (JSON)
→ DStream.transform
→ DStream[EvObj], which feeds two branches:
   • DStream.filter(_.eventType == "error") → DStream[EvObj] → DStream.foreach(println)
   • DStream.filter(_.eventType == "buyGoods") → DStream.map(e => (e.id, 1)) → DStream.groupByKey → DStream.foreach(println)

Spark Streaming: Example Code

// Initialize (prm, topic and the evObj class are defined elsewhere)
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON string into an event object
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: number of error events per batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: number of distinct users buying per batch
val usersBuying = events.filter(_.eventType == "buyGoods")
                        .map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start

Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None  // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    s.numErrors += values.count(_.eventType == "error")
    Option(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
val states = events.map(e => (e.id, e))  // key by user; keep the event so updateFunc can see logOff
                   .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.filter(_._2.numErrors > 0).count))
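The per-key state update can be simulated in plain Scala to see how state carries across batches. This is a sketch, not Spark: it assumes the intent is to count errors per user and forget a user on logOff, it stores the count as a bare Int rather than an ErrorsPerUser object, and `step` is a hypothetical stand-in for one updateStateByKey pass over a single batch.

```scala
// Plain-Scala simulation of a per-key state update across batches (no Spark needed).
object StateSketch {
  case class Event(id: String, eventType: String)

  // Same shape as the update function on the slide: the new values for one key
  // plus the previous state go in, the new state comes out; None forgets the key.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Hypothetical stand-in for one state-update pass over a single batch.
  // Keys with existing state are visited even if the batch has no events for them.
  def step(state: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    val keys = state.keySet ++ byUser.keySet
    keys.flatMap { k =>
      update(byUser.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq(Event("shewolf", "error"), Event("hercman", "buyGoods"))
    val batch2 = Seq(Event("shewolf", "error"), Event("hercman", "logOff"))
    val finalState = Seq(batch1, batch2).foldLeft(Map.empty[String, Int])(step)
    println(finalState) // per-user error counts for users who have not logged off
  }
}
```

Folding `step` over the batches mirrors how the streaming job threads the previous state into every new 2-second batch.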

Other Spark Subsystems

• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)

Teaser

• Future Meetup: Machine learning from real-time data streams
