
In-Memory Processing with Apache Spark

Vincent Leroy

Sources

•  Resilient Distributed Datasets, Henggang Cui
•  Coursera Introduction to Apache Spark, University of California, Databricks

Datacenter Organization

(figure: storage and network characteristics of a datacenter node)
•  CPUs: 10 GB/s memory bandwidth
•  Solid-state disk: 600 MB/s, 0.1 ms random access, $0.45 per GB
•  Hard disk: 100 MB/s, 3-12 ms random access, $0.05 per GB
•  Network: 1 Gb/s or 125 MB/s to nodes in the same rack, 0.1 Gb/s to nodes in another rack

MapReduce: Distributed Execution

(figure: MAP and REDUCE stages distributed across nodes)
Each stage passes through the hard drives.

MapReduce: Iterative Jobs

•  Iterative jobs involve a lot of disk I/O for each repetition → slow when executing many small iterations

(figure: Stage 1 → Stage 2 → Stage 3, each stage reading from and writing to disk)
Disk I/O is very slow!

Tech Trend: Cost of Memory

(figure: price per unit of storage over the years for memory, disk, and flash; source: http://www.jcmit.com/mem2014.htm)

2010: memory at 1 ¢/MB
Lower cost means more memory can be put in each server.

In-Memory Processing

•  Many datasets fit in memory (of a cluster)
•  Memory is fast and avoids disk I/O → Spark distributed execution engine (see the sketch below)
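A minimal sketch of starting the Spark execution engine from a Python script, assuming a local PySpark installation; the application name and the local[*] master are illustrative choices, and the interactive shells used later already provide a SparkContext named sc:

# Create a local SparkContext (the pyspark shell provides one as `sc`)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("in-memory-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.version)   # confirm the distributed execution engine is running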

Use Memory Instead of Disk

(figure: with MapReduce, each iteration reads its input from HDFS and writes its result back to HDFS before the next iteration; likewise, every query (query 1, query 2, query 3, …) repeats an HDFS read of the input before producing its result)

In-Memory Data Sharing

(figure: the input is read from HDFS once, as one-time processing; iterations and queries then share the data through distributed memory instead of HDFS)

Distributed memory is 10-100x faster than network and disk.
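A minimal sketch of this sharing pattern in PySpark: the input is read once, cached in cluster memory, and several queries reuse it without further HDFS reads (the file path and the queries are illustrative):

# One-time processing: read the input and keep it in memory
logs = sc.textFile("/path/to/logs.txt").cache()

# Several queries reuse the in-memory data
result1 = logs.filter(lambda line: "Error" in line).count()
result2 = logs.filter(lambda line: "Warn" in line).count()
result3 = logs.map(lambda line: len(line)).sum()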

Apache Spark Components

(figure: libraries built on top of the Apache Spark core)
•  Spark Streaming
•  Spark SQL
•  MLlib & ML (machine learning)
•  GraphX (graph)

Resilient Distributed Datasets (RDDs)

•  Data collection
– Distributed
– Read-only
– In-memory
– Built from stable storage or other RDDs

RDD Creation

Parallelize: take an existing in-memory collection and pass it to SparkContext's parallelize method.

# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

Read from a text file to create a base RDD:

# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

There are other methods to read data from HDFS, C*, S3, HBase, etc.
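As a hedged illustration of reading from distributed storage rather than a local file, textFile also accepts HDFS and S3 URIs; the host, port, bucket, and paths below are placeholders, and S3 access additionally requires the Hadoop S3 connector and credentials to be configured:

# Read from HDFS (placeholder namenode address and path)
hdfsRDD = sc.textFile("hdfs://namenode:9000/data/README.md")

# Read from S3 via the s3a filesystem (placeholder bucket and key)
s3RDD = sc.textFile("s3a://my-bucket/data/README.md")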

Operations on RDDs

•  Transformations: lazy execution
– map, filter, intersection, groupByKey, zipWithIndex…
•  Actions: trigger execution of transformations
– collect, count, reduce, saveAsTextFile…
•  Caching
– Avoid multiple executions of transformations when re-using an RDD (see the sketch below)
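A minimal sketch of these three kinds of operations in PySpark; the data and the lambdas are illustrative:

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: nothing is executed yet
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Caching: keep the transformed data in memory for re-use
doubled.cache()

# Actions trigger execution of the transformations
print(doubled.collect())   # [4, 8]
print(doubled.count())     # 2, served from the cached RDD without re-computation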

RDD Example

(figure: logLinesRDD, the input/base RDD, holds partitions of log lines such as "Error, ts, msg1", "Warn, ts, msg2", "Info, ts, msg8"; applying .filter(λ) keeps only the Error lines and produces errorsRDD)

RDD Example

(figure: errorsRDD.coalesce(2) merges the error partitions into two partitions, producing cleanedRDD; .collect() brings the results back to the Driver)
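A minimal PySpark sketch of this pipeline, with illustrative log lines standing in for the partitions drawn on the slides:

# Illustrative log lines (the slide's input/base RDD), spread over 4 partitions
logLinesRDD = sc.parallelize([
    "Error, ts, msg1", "Warn, ts, msg2", "Info, ts, msg8",
    "Error, ts, msg3", "Info, ts, msg5", "Error, ts, msg4",
], numSlices=4)

# Keep only the error lines
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))

# Reduce the number of partitions from 4 to 2
cleanedRDD = errorsRDD.coalesce(2)

# Bring the results back to the driver
print(cleanedRDD.collect())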

RDD Lineage

(figure: the lineage DAG logLinesRDD → .filter(λ) → errorsRDD → .coalesce(2) → cleanedRDD, with .collect() returning results to the Driver)
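A hedged way to look at a lineage graph from code: PySpark's toDebugString describes the chain of RDDs an action would execute, shown here on the cleanedRDD built in the sketch above:

# Print the lineage (DAG) recorded for cleanedRDD
lineage = cleanedRDD.toDebugString()
# toDebugString may return bytes depending on the PySpark version
print(lineage.decode() if isinstance(lineage, bytes) else lineage)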

Execution

(figure: the Driver launches execution across the cluster, materializing logLinesRDD, errorsRDD, and cleanedRDD)

RDD Fault Tolerance

•  Hadoop: conservative/pessimistic approach → go to disk/stable storage (HDFS)
•  Spark: optimistic approach → don't "waste" time writing to disk, re-compute in case of crash using the lineage (see the checkpointing sketch below)
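Lineage-based recovery happens automatically, but Spark also exposes the pessimistic alternative explicitly: an RDD can be checkpointed to stable storage so that recovery does not require recomputation. A minimal sketch, with placeholder paths:

# Optimistic default: only the lineage is kept; lost partitions are recomputed
errors = sc.textFile("/path/to/logs.txt").filter(lambda l: "Error" in l)

# Pessimistic option: persist to stable storage, truncating the lineage
sc.setCheckpointDir("/path/to/checkpoints")   # placeholder directory, e.g. on HDFS
errors.checkpoint()
errors.count()   # the first action materializes the RDD and writes the checkpoint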

RDD Dependencies

(figure: examples of narrow and wide dependencies between parent and child RDD partitions)

RDD Dependencies

•  Narrow dependencies
– allow for pipelined execution on one cluster node
– easy fault recovery
•  Wide dependencies
– require data from all parent partitions to be available and to be shuffled across the nodes
– a single failed node might cause a complete re-execution (see the sketch below)
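A minimal sketch distinguishing the two kinds of dependencies in PySpark: map creates a narrow dependency, while reduceByKey introduces a wide (shuffle) dependency, which toDebugString reports as a ShuffledRDD stage boundary; the data is illustrative:

pairs = sc.parallelize(["a", "b", "a", "c", "b", "a"]) \
          .map(lambda w: (w, 1))                 # narrow: computed within each partition
counts = pairs.reduceByKey(lambda x, y: x + y)   # wide: requires a shuffle

lineage = counts.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)
# the output shows a ShuffledRDD, i.e. the stage boundary introduced by reduceByKey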

Job Scheduling

•  To execute an action on an RDD
– the scheduler decides the stages from the RDD's lineage graph
– each stage contains as many pipelined transformations with narrow dependencies as possible

Job Scheduling

(figure: a lineage graph split into stages at wide-dependency boundaries)

Spark Interactive Shell: Word Count in Spark

scala> val wc = lesMiserables.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:14

scala> wc.take(5).foreach(println)
(créanciers;,1)
(abondent.,1)
(plaisir,,5)
(déplaçaient,1)
(sociale,,7)

scala> val cw = wc.map(p => (p._2, p._1))
cw: org.apache.spark.rdd.RDD[(Int, String)] = MappedRDD[5] at map at <console>:16

scala> val sortedCW = cw.sortByKey(false)
sortedCW: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[11] at sortByKey at <console>:18

scala> sortedCW.take(5).foreach(println)
(16757,de)
(14683,)
(11025,la)
(9794,et)
(8471,le)

scala> sortedCW.filter(x => "Cosette".equals(x._2)).collect.foreach(println)
(353,Cosette)
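A hedged PySpark equivalent of this word count; lesMiserables stands for an RDD of text lines, created here with a placeholder path to a local copy of the novel:

# Word count on an RDD of text lines (placeholder input path)
lesMiserables = sc.textFile("/path/to/lesMiserables.txt")

wc = lesMiserables.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# Swap (word, count) into (count, word) and sort by descending count
sortedCW = wc.map(lambda p: (p[1], p[0])).sortByKey(ascending=False)

print(sortedCW.take(5))
print(sortedCW.filter(lambda x: x[1] == "Cosette").collect())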
