Big Data Processing with MapReduce and Spark
Matei Zaharia, UC Berkeley AMPLab, spark-project.org
Outline: the big data problem, MapReduce model, limitations of MapReduce, Spark model, future directions.
![Page 1: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/1.jpg)
Matei Zaharia
UC Berkeley AMPLab
spark-project.org
UC BERKELEY
Big Data Processing with MapReduce and Spark
![Page 2: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/2.jpg)
Outline
The big data problem
MapReduce model
Limitations of MapReduce
Spark model
Future directions
![Page 3: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/3.jpg)
The Big Data Problem
Data is growing faster than computation speeds
Growing data sources
»Web, mobile, scientific, …
Cheap storage
»Doubling every 18 months
Stalling CPU speeds
»Even multicores not enough
![Page 4: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/4.jpg)
Examples
Facebook’s daily logs: 60 TB
1000 genomes project: 200 TB
Google web index: 10+ PB
Cost of 1 TB of disk: $50
Time to read 1 TB from disk: 6 hours (50 MB/s)
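The read-time figure can be checked with a little arithmetic. The sketch below assumes decimal units and, as a purely hypothetical comparison, a cluster of 1000 disks scanning in parallel; neither assumption comes from the slides.

```python
# Back-of-the-envelope check of the disk read time quoted on the slide.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6


def read_time_hours(size_bytes, bytes_per_second):
    """Time to scan size_bytes sequentially at a fixed throughput, in hours."""
    return size_bytes / bytes_per_second / 3600


# One disk reading 1 TB at 50 MB/s: a bit under 6 hours.
one_disk = read_time_hours(1 * TB, 50 * MB)

# Hypothetical cluster: the same 1 TB spread evenly over 1000 disks read in parallel.
per_disk_seconds = read_time_hours(1 * TB / 1000, 50 * MB) * 3600

print(f"one disk: {one_disk:.1f} h, 1000 disks: {per_disk_seconds:.0f} s")
```

The slide rounds the single-disk result up to 6 hours; the parallel case is what motivates distributing the data.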
![Page 5: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/5.jpg)
The Big Data Problem
Single machine can no longer process or even store all the data!
Only solution is to distribute over large clusters
![Page 6: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/6.jpg)
Google Datacenter
How do we program this thing?
![Page 7: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/7.jpg)
Traditional Network Programming
Message-passing between nodes
Really hard to do at scale:
»How to split problem across nodes?
• Must consider network, data locality
»How to deal with failures?
• 1 server fails every 3 years => 10K nodes see 10 faults/day
»Even worse: stragglers (node is not failed, but slow)
Almost nobody does message passing!
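The failure-rate estimate follows from dividing the cluster size by the mean time between failures. A minimal sketch, assuming failures are independent and spread evenly over a 365-day year (the function name is illustrative):

```python
def expected_faults_per_day(nodes, mean_years_between_failures):
    """Expected node failures per day if each node fails once per given interval."""
    return nodes / (mean_years_between_failures * 365)


# 10K nodes, each failing about once every 3 years.
faults = expected_faults_per_day(10_000, 3)
print(f"{faults:.1f} faults/day")
```

The result is a little over 9 per day, which the slide rounds to 10.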
![Page 8: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/8.jpg)
Data-Parallel Models
Restrict the programming interface so that the system can do more automatically
“Here’s an operation, run it on all of the data”
»I don’t care where it runs (you schedule that)
»In fact, feel free to run it twice on different nodes
Biggest example: MapReduce
![Page 9: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/9.jpg)
MapReduce
First widely popular programming model for data-intensive apps on clusters
Published by Google in 2004
»Processes 20 PB of data / day
Popularized by open-source Hadoop project
»40,000 nodes at Yahoo!, 70 PB at Facebook
![Page 10: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/10.jpg)
MapReduce Programming Model
Data type: key-value records
Map function: $(K_{in}, V_{in}) \to \mathrm{list}(K_{inter}, V_{inter})$
Reduce function: $(K_{inter}, \mathrm{list}(V_{inter})) \to \mathrm{list}(K_{out}, V_{out})$
![Page 11: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/11.jpg)
Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
![Page 12: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/12.jpg)
Word Count Execution
Input → Map → Shuffle & Sort → Reduce → Output
Input splits:
»the quick brown fox
»the fox ate the mouse
»how now brown cow
Map outputs:
»Map 1: (the, 1), (quick, 1), (brown, 1), (fox, 1)
»Map 2: (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
»Map 3: (how, 1), (now, 1), (brown, 1), (cow, 1)
Reduce outputs:
»Reduce 1: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3)
»Reduce 2: (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
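The whole pipeline can be imitated on one machine. The following is a minimal single-process sketch, not how Hadoop or Google's implementation works internally: `run_mapreduce` is a hypothetical helper that stands in for the framework, and the mapper and reducer yield pairs instead of calling a framework-provided `output`.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1


def reducer(key, values):
    # Sum all counts collected for one word.
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver: map, then group by key (shuffle & sort), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = dict(run_mapreduce(lines, mapper, reducer))
print(counts)
```

A real framework runs the map and reduce calls as separate tasks on different nodes; only the grouping contract is the same.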
![Page 13: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/13.jpg)
MapReduce Execution
Automatically split work into many small tasks
Send map tasks to nodes based on data locality
Load-balance dynamically as tasks finish
![Page 14: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/14.jpg)
Fault Recovery
1. If a task crashes:
»Retry on another node
• OK for a map because it had no dependencies
• OK for reduce because map outputs are on disk
»If the same task repeatedly fails, end the job
Requires user code to be deterministic
![Page 15: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/15.jpg)
Fault Recovery
2. If a node crashes:
»Relaunch its current tasks on other nodes
»Relaunch any maps the node previously ran
• Necessary because their output files were lost along with the crashed node
![Page 16: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/16.jpg)
Fault Recovery
3. If a task is going slowly (straggler):
»Launch second copy of task on another node
»Take the output of whichever copy finishes first, and kill the other one
![Page 17: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/17.jpg)
Example Applications
![Page 18: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/18.jpg)
1. Search
Input: (lineNumber, line) records
Output: lines matching a given pattern
Map:
    if (line matches pattern): output(line)
Reduce: identity function
–Alternative: no reducer (map-only job)
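As a rough illustration of the map-only variant, the sketch below filters records locally with a regular expression. The sample lines and the pattern are made up for the example, and the list comprehension stands in for the framework running mappers in parallel.

```python
import re


def mapper(line_number, line, pattern):
    # Emit the line unchanged if it matches; emit nothing otherwise.
    if pattern.search(line):
        yield line


# Hypothetical input: (lineNumber, line) records.
records = list(enumerate(
    ["the quick brown fox", "the fox ate the mouse", "how now brown cow"],
    start=1,
))
pattern = re.compile(r"\bfox\b")

# Map-only job: the concatenated mapper outputs are the final result.
matches = [out for number, line in records for out in mapper(number, line, pattern)]
print(matches)
```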
![Page 19: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/19.jpg)
2. Sort
Input: (key, value) records
Output: same records, sorted by key
Map: identity function
Reduce: identity function
Trick: Pick partitioning function p so that k1 < k2 => p(k1) <= p(k2)
Example with two reducers, [A-M] and [N-Z]:
»Records sent to reducer [A-M]: (ant, bee), (aardvark, elephant), (cow)
»Records sent to reducer [N-Z]: (zebra), (pig), (sheep, yak)
»Reduce [A-M] output: aardvark, ant, bee, cow, elephant
»Reduce [N-Z] output: pig, sheep, yak, zebra
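One way to build an order-preserving partitioning function is to binary-search a sorted list of split keys. This is a sketch under that assumption; `make_range_partitioner` and the single boundary `"n"` are illustrative choices, not something the slides specify.

```python
import bisect


def make_range_partitioner(boundaries):
    """Return p(key) that is monotone: k1 <= k2 implies p(k1) <= p(k2).

    boundaries must be sorted; keys below boundaries[0] go to partition 0,
    keys at or above boundaries[i] go to partition i + 1 or later.
    """
    def p(key):
        return bisect.bisect_right(boundaries, key)
    return p


keys = ["zebra", "cow", "pig", "aardvark", "elephant", "ant", "bee", "sheep", "yak"]
p = make_range_partitioner(["n"])  # partition 0 ~ [a-m], partition 1 ~ [n-z]

# Shuffle: route every key to its reducer.
partitions = [[], []]
for key in keys:
    partitions[p(key)].append(key)

# Each reducer sorts only its own partition; concatenating partitions in
# order yields a globally sorted result because p preserves key order.
sorted_output = [key for part in partitions for key in sorted(part)]
print(sorted_output)
```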
![Page 20: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/20.jpg)
3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word
Map:
    for word in text.split():
        output(word, filename)
Reduce:
    def reduce(word, filenames):
        output(word, unique(filenames))
![Page 21: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/21.jpg)
Inverted Index Example
Input:
»hamlet.txt: to be or not to be
»12th.txt: be not afraid of greatness
Map output:
»to, hamlet.txt
»be, hamlet.txt
»or, hamlet.txt
»not, hamlet.txt
»be, 12th.txt
»not, 12th.txt
»afraid, 12th.txt
»of, 12th.txt
»greatness, 12th.txt
Reduce output:
»afraid, (12th.txt)
»be, (12th.txt, hamlet.txt)
»greatness, (12th.txt)
»not, (12th.txt, hamlet.txt)
»of, (12th.txt)
»or, (hamlet.txt)
»to, (hamlet.txt)
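The example above can be reproduced locally. This is a minimal in-memory sketch: the grouping loop plays the role of the shuffle, and `sorted(set(...))` is one possible reading of the slide's `unique`.

```python
from collections import defaultdict


def mapper(filename, text):
    # Emit (word, filename) for every word occurrence.
    for word in text.split():
        yield word, filename


def reducer(word, filenames):
    # Deduplicate the file list for one word.
    return word, sorted(set(filenames))


files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Shuffle: group mapper output by word.
grouped = defaultdict(list)
for filename, text in files.items():
    for word, fname in mapper(filename, text):
        grouped[word].append(fname)

index = dict(reducer(word, fnames) for word, fnames in grouped.items())
print(index)
```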
![Page 22: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/22.jpg)
4. Most Popular Words
Input: (filename, text) records
Output: the 100 words occurring in most files
Two-stage solution:
–Job 1:
• Create inverted index, giving (word, list(file)) records
–Job 2:
• Map each (word, list(file)) to (count, word)
• Sort these records by count as in sort job
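Job 2 can be sketched on top of an already-built index. The small `index` below is a hypothetical Job 1 result, and `heapq.nlargest` replaces the distributed sort for brevity, so this shows the data flow rather than the cluster execution.

```python
import heapq

# Hypothetical output of Job 1: word -> list of files containing it.
index = {
    "be": ["12th.txt", "hamlet.txt"],
    "not": ["12th.txt", "hamlet.txt"],
    "to": ["hamlet.txt"],
    "afraid": ["12th.txt"],
}


def job2_map(word, files):
    # Re-key each record by the number of files the word appears in.
    return len(files), word


def most_popular(index, n):
    """Return the n (count, word) pairs with the highest file counts."""
    pairs = [job2_map(word, files) for word, files in index.items()]
    return heapq.nlargest(n, pairs)


top = most_popular(index, 2)
print(top)
```

Ties are broken by comparing the words themselves, which is an artifact of tuple ordering rather than anything the slides prescribe.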
![Page 23: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/23.jpg)
Summary
By providing a data-parallel model, MapReduce greatly simplified cluster programming:
»Automatic division of job into tasks
»Locality-aware scheduling
»Load balancing
»Recovery from failures & stragglers
But… the story doesn’t end here!
![Page 24: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/24.jpg)
Outline
The big data problem
MapReduce model
Limitations of MapReduce
Spark model
Future directions
![Page 25: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/25.jpg)
When an Abstraction is Useful…
People want to compose it!
Most real applications require multiple MR steps
»Google indexing pipeline: 21 steps
»Analytics queries (e.g. sessions, top K): 2-5 steps
»Iterative algorithms (e.g. PageRank): 10’s of steps
Problems: programmability & performance
![Page 26: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/26.jpg)
Programmability
Multi-step jobs create spaghetti code
»21 MR steps -> 21 mapper and reducer classes
Lots of boilerplate wrapper code per step
API doesn’t provide type safety
![Page 27: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/27.jpg)
Performance
MR only provides one pass of computation
»Must write out data to file system in-between
Expensive for apps that need to reuse data
»Multi-step algorithms (e.g. PageRank)
»Interactive data mining (many queries on same data)
Users often hand-optimize by merging steps
![Page 28: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/28.jpg)
Spark
Aims to address both problems
Programmability: clean, functional API
»Parallel transformations on collections
»5-10x less code than MR
»Available in Scala, Java and Python
Performance:
»In-memory computing primitives
»Automatic optimization across operators
![Page 29: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/29.jpg)
Spark Programmability
Google MapReduce WordCount:

#include "mapreduce/mapreduce.h"

// User’s map function
class SplitWords: public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while (i < n && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while (i < n && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};

REGISTER_MAPPER(SplitWords);

// User’s reduce function
class Sum: public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};

REGISTER_REDUCER(Sum);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  for (int i = 1; i < argc; i++) {
    MapReduceInput* in = spec.add_input();
    in->set_format("text");
    in->set_filepattern(argv[i]);
    in->set_mapper_class("SplitWords");
  }

  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Sum");

  // Do partial sums within map
  out->set_combiner_class("Sum");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}
![Page 30: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/30.jpg)
Spark Programmability
Spark WordCount:
val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.saveAsTextFile("out.txt")
![Page 31: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/31.jpg)
Spark Performance
Iterative algorithms:
[Figure: running time in seconds, Hadoop MR vs. Spark]
»K-means Clustering: Hadoop MR 121, Spark 4.1
»Logistic Regression: Hadoop MR 80, Spark 0.96
![Page 32: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/32.jpg)
Spark Concepts
Resilient distributed datasets (RDDs)
»Immutable, partitioned collections of objects
»May be cached in memory for fast reuse
Operations on RDDs
»Transformations (build RDDs), actions (compute results)
Restricted shared variables
»Broadcast, accumulators
![Page 33: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/33.jpg)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // Base RDD
errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count        // Action
messages.filter(_.contains("bar")).count
. . .

[Diagram: the Driver sends tasks to three Workers and collects results; each Worker reads one block (Block 1-3) and keeps its part of messages in memory (Cache 1-3)]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: search 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
![Page 34: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/34.jpg)
Fault Recovery
RDDs track lineage information that can be used to efficiently reconstruct lost partitions
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

Lineage: HDFS File -> filter (func = _.startsWith(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD
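The idea of rebuilding a lost partition from lineage can be mimicked in a few lines. `ToyRDD` below is a hypothetical teaching class, not Spark's implementation: each dataset only remembers its parent and the function applied to it, so any partition can be recomputed from the source on demand.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD that stores lineage instead of data."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # list of partitions, only set on the base dataset
        self.parent = parent  # upstream ToyRDD, if any
        self.fn = fn          # per-partition function applied to the parent

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i):
        # Walk the lineage chain back to the source for partition i only.
        if self.parent is None:
            return list(self.source[i])
        return self.fn(self.parent.compute(i))


# Made-up log blocks standing in for an HDFS file.
blocks = [["ERROR\tdb\ttimeout", "INFO\tweb\tok"], ["ERROR\tweb\t500"]]
messages = (ToyRDD(source=blocks)
            .filter(lambda line: line.startswith("ERROR"))
            .map(lambda line: line.split("\t")[2]))

cache = {i: messages.compute(i) for i in range(len(blocks))}
lost = cache.pop(0)              # simulate a worker losing its cached partition
cache[0] = messages.compute(0)   # rebuild just that partition from lineage
print(cache)
```

Only the lost partition is recomputed; partition 1 is never touched during recovery.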
![Page 35: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/35.jpg)
Demo
![Page 36: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/36.jpg)
Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: scatter plot of + and – points, showing a random initial line and the target separating line]
![Page 37: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/37.jpg)
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  // w automatically shipped to cluster
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
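The same update rule can be tried without a cluster. The sketch below is a single-machine rendering of the loop with an inner Python loop in place of `map` and `reduce`; the six labelled points, the seed, and the iteration count are made up for illustration, and labels are assumed to be +1 or -1.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical, linearly separable data: (features, label) with label in {-1, +1}.
points = [
    ((2.0, 1.0), 1), ((1.0, 2.0), 1), ((3.0, 3.0), 1),
    ((-1.0, -2.0), -1), ((-2.0, -1.0), -1), ((-3.0, -2.0), -1),
]
D = 2
ITERATIONS = 10

rng = random.Random(0)  # fixed seed so the run is repeatable
w = [rng.uniform(-1.0, 1.0) for _ in range(D)]

for _ in range(ITERATIONS):
    # "map": per-point gradient term; "reduce": element-wise sum.
    gradient = [0.0] * D
    for x, y in points:
        scale = (1 / (1 + math.exp(-y * dot(w, x))) - 1) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wj - gj for wj, gj in zip(w, gradient)]

print("Final w:", w)
```

In Spark the inner loop runs in parallel over cached partitions, which is why later iterations are cheap.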
![Page 38: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/38.jpg)
Logistic Regression Performance
[Figure: running time (min) vs. number of iterations (1 to 30), Hadoop vs. Spark]
»Hadoop: 110 s / iteration
»Spark: first iteration 80 s, further iterations 1 s
![Page 39: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/39.jpg)
Other RDD Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, cogroup, flatMap, union, join, cross, mapValues, ...
Actions (output a result): collect, reduce, take, fold, count, saveAsTextFile, saveAsHadoopFile, ...
![Page 40: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/40.jpg)
Spark in Java and Python
Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("error");
  }
}).count();

Python:
lines = sc.textFile(...)
lines.filter(lambda x: "error" in x).count()
![Page 41: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/41.jpg)
Shared Variables
So far we’ve seen that RDD operations can use variables from outside their scope
By default, each task gets a read-only copy of each variable (no sharing)
Good place to enable other sharing patterns!
![Page 42: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/42.jpg)
Example: Collaborative Filtering
Goal: predict users’ movie ratings based on past ratings of other movies
[Figure: sparse ratings matrix R over Movies and Users, with known ratings from 1 to 5 and unknown entries shown as ?]
![Page 43: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/43.jpg)
Model and Algorithm
Model R as product of user and movie feature matrices A and B of size U×K and M×K
$R = A B^{T}$
Alternating Least Squares (ALS)
»Start with random A & B
»Optimize user vectors (A) based on movies
»Optimize movie vectors (B) based on users
»Repeat until converged
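The slides do not spell out the objective or what each optimization step computes. One common formulation, given here as an assumption, is unregularized least squares over the set $\Omega$ of observed ratings, where $a_u$ and $b_m$ are the rows of A and B and $\Omega_u$ is the set of movies rated by user $u$:

```latex
R \approx A B^{T}, \qquad A \in \mathbb{R}^{U \times K}, \quad B \in \mathbb{R}^{M \times K}

\min_{A, B} \sum_{(u, m) \in \Omega} \left( R_{um} - a_u^{T} b_m \right)^2

a_u \leftarrow \Bigl( \sum_{m \in \Omega_u} b_m b_m^{T} \Bigr)^{-1} \sum_{m \in \Omega_u} R_{um} \, b_m
```

The last line is the closed-form solution for one user vector with B held fixed (it needs the K×K matrix to be invertible); the movie update is symmetric with the roles of A and B swapped. Each user's update is independent of the others, which is what makes the step easy to parallelize.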
![Page 44: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/44.jpg)
Serial ALS
var R = readRatingsMatrix(...)

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  // (0 until U) and (0 until M) are Range objects
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}
![Page 45: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/45.jpg)
Naïve Spark ALS
var R = readRatingsMatrix(...)

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R re-sent to all nodes in each iteration
![Page 46: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/46.jpg)
Efficient Spark ALS
var R = spark.broadcast(readRatingsMatrix(...))

var A = // array of U random vectors
var B = // array of M random vectors

for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as broadcast variable
Result: 3× performance improvement
![Page 47: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/47.jpg)
Accumulators
Apart from broadcast, another common sharing pattern is aggregation
»Add up multiple statistics about data
»Count various events for debugging
Spark’s reduce operation does aggregation, but accumulators are another nice way to express it
![Page 48: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/48.jpg)
Usage
val badRecords = sc.accumulator(0)
val badBytes = sc.accumulator(0.0)

records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).saveAsTextFile(...)

printf("Total bad records: %d, avg size: %f\n",
       badRecords.value, badBytes.value / badRecords.value)
![Page 49: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/49.jpg)
Accumulator Rules
Create with SparkContext.accumulator(initialVal)
“Add” to the value with += inside tasks
»Each task’s effect only counted once
Access with .value, but only on master
»Exception if you try it on workers
Retains efficiency and fault tolerance!
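The "counted once" rule can be illustrated with a toy model. `ToyAccumulator` below is hypothetical and much simpler than Spark's mechanism: it assumes the master keys contributions by task id, so a re-executed or speculative copy of a task cannot be added twice.

```python
class ToyAccumulator:
    """Hypothetical master-side accumulator that counts each task at most once."""

    def __init__(self, initial):
        self.initial = initial
        self._per_task = {}

    def record(self, task_id, delta):
        # Keep the first result reported for a task and ignore duplicates
        # from retries or speculative copies.
        self._per_task.setdefault(task_id, delta)

    @property
    def value(self):
        return self.initial + sum(self._per_task.values())


def run_task(records):
    # Per-task local count of "bad" records (negative numbers in this toy data).
    return sum(1 for r in records if r < 0)


partitions = {0: [1, -1, 2], 1: [-5, -6], 2: [3]}
bad_records = ToyAccumulator(0)

for task_id, records in partitions.items():
    bad_records.record(task_id, run_task(records))

# A straggler copy of task 1 finishes later; its update is dropped.
bad_records.record(1, run_task(partitions[1]))

print(bad_records.value)
```

Because tasks are deterministic, discarding the duplicate loses no information, which is how the aggregate stays correct under retries.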
![Page 50: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/50.jpg)
Job Scheduler
Captures RDD dependency graph
Pipelines functions into “stages”
Cache-aware for data reuse & locality
Partitioning-aware to avoid shuffles
[Diagram: RDDs A through G connected by groupBy, map, union and join, grouped into Stage 1, Stage 2 and Stage 3; shaded boxes mark cached partitions]
![Page 51: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/51.jpg)
User Community
3000 people attended online training
600 meetup members
15 companies contributing
![Page 52: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/52.jpg)
Outline
The big data problem
MapReduce model
Limitations of MapReduce
Spark model
Future directions
![Page 53: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/53.jpg)
Future Directions
As “big data” starts to be used for more apps, users’ demands are also growing:
»Latency: instead of training a model every night, can you train in real-time?
»High-level abstractions: matrices, graphs, etc – what is the equivalent of Matlab or R for clusters?
![Page 54: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/54.jpg)
Spark Streaming
Extends Spark API to do stream processing
»Run as a series of small, deterministic batches
Intermix with batch and ad-hoc queries

sc.twitterStream(...)
  .filter(_.contains("spark"))
  .map(t => (t.user, 1))
  .runningReduce(_ + _)

[Diagram: at each time step (t = 1, t = 2, . . .) a batch of tweets is mapped to pairs and reduced into counts; each box is an RDD made of partitions]
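The micro-batch idea can be sketched without Spark. In the toy version below, each batch is a plain list of hypothetical `(user, text)` tweets, and `running_reduce` is an illustrative helper that carries the counts from one batch to the next; it is not the Spark Streaming API.

```python
from collections import Counter


def running_reduce(batches):
    """Process batches in order, yielding the cumulative per-user count after each."""
    state = Counter()
    for batch in batches:
        # filter + map: keep tweets mentioning "spark", emit (user, 1).
        pairs = [(user, 1) for user, text in batch if "spark" in text]
        # reduce: fold this batch's pairs into the running state.
        for user, n in pairs:
            state[user] += n
        yield dict(state)


# Made-up input: one list of tweets per time step.
batches = [
    [("ann", "spark rocks"), ("bob", "hello world")],   # t = 1
    [("ann", "more spark"), ("cy", "trying spark")],    # t = 2
]
snapshots = list(running_reduce(batches))
print(snapshots)
```

Since every batch is a deterministic function of its input and the previous state, a lost result can be recomputed the same way as any other RDD.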
![Page 55: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/55.jpg)
Streaming Results
Better performance than other models, while providing fault recovery properties they lack
[Figure, Scalability: records/s (millions) vs. nodes in cluster (0 to 100) for Sliding WordCount and Top K]
[Figure, Fault Recovery: processing time (s) with 30s checkpoints on 20 nodes and on 40 nodes]
![Page 56: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/56.jpg)
Higher-Level Abstractions
SparkGraph: graph processing model
MLbase: declarative machine learning library
Shark: SQL queries
[Figure: Shark vs. Hadoop on Selection, Aggregation and Join queries; the labeled bars read 1.1 for Selection and 32 for Aggregation]
![Page 57: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/57.jpg)
Conclusion
Commodity clusters are needed to handle big data, but pose key challenges (faults, stragglers)
Data-parallel models like MapReduce and Spark handle these automatically
Look for similar models for new problems
www.spark-project.org
![Page 58: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/58.jpg)
Other Resources
Hadoop MapReduce: http://hadoop.apache.org/
Spark: http://spark-project.org
Hadoop video tutorials: www.cloudera.com/hadoop-training
Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
![Page 59: Big Data Processing with MapReduce and Spark](https://reader036.vdocument.in/reader036/viewer/2022062302/568166ff550346895ddb65c7/html5/thumbnails/59.jpg)
Behavior with Not Enough RAM
[Figure: iteration time (s) vs. % of working set in memory]
»Cache disabled: 68.8
»25%: 58.1
»50%: 40.7
»75%: 29.7
»Fully cached: 11.5