UC BERKELEY

Introduction to Big Data with Apache Spark
This Lecture
• The Big Data Problem
• Hardware for Big Data
• Distributing Work
• Handling Failures and Slow Machines
• Map Reduce and Complex Jobs
• Apache Spark
Some Traditional Analysis Tools
• Unix shell commands, Pandas, R
• All run on a single machine!
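As an illustration (a sketch with made-up sample text, not code from the lecture), counting word frequencies on a single machine takes only a few lines of Python's standard library, the same idea as `sort | uniq -c` in the shell:

```python
# Single-machine word counting: easy while the data fits on one box.
# The sample text is a made-up fragment of the Dr. Seuss lines used later.
from collections import Counter

text = "I am Sam I am Sam Sam I am Do you like Green eggs and ham"
counts = Counter(text.split())
print(counts["I"], counts["am"], counts["Sam"])  # -> 3 3 3
```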
The Big Data Problem
• Data growing faster than computation speeds
• Growing data sources
  » Web, mobile, scientific, …
• Storage getting cheaper
  » Size doubling every 18 months
• But CPU speeds are stalling, and storage is a bottleneck
Big Data Examples
• Facebook's daily logs: 60 TB
• 1,000 Genomes Project: 200 TB
• Google web index: 10+ PB
• Cost of 1 TB of disk: ~$35
• Time to read 1 TB from disk: ~3 hours (at 100 MB/s)
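The read-time figure follows directly from the bandwidth (a quick sanity check):

```python
# 1 TB = 10**6 MB; at a sustained 100 MB/s, a full sequential scan takes:
seconds = 10**6 / 100      # 10,000 seconds
hours = seconds / 3600
print(round(hours, 1))     # -> 2.8, i.e. roughly 3 hours
```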
The Big Data Problem
• A single machine can no longer process, or even store, all the data!
• The only solution is to distribute the data over large clusters
Google Datacenter

How do we program this thing?!
Hardware for Big Data

Lots of hard drives … and CPUs
Hardware for Big Data
One big box? (the 1990s solution)
• But, expensive
  » Low volume
  » All "premium" hardware
• And still not big enough!
(Image: Wikimedia Commons / User:Tonusamuel)
Hardware for Big Data
Consumer-grade hardware
• Not "gold plated"
• Many desktop-like servers
• Easy to add capacity
• Cheaper per CPU/disk
• Complexity moves into the software
(Image: Steve Jurvetson / Flickr)
Problems with Cheap Hardware
• Failures (Google's numbers):
  » 1-5% of hard drives fail per year
  » 0.2% of DIMMs fail per year
• Network speeds versus shared memory:
  » Much higher latency
  » Network slower than storage
• Uneven performance
What's Hard About Cluster Computing?
• How do we split work across machines?
How do you count the number of occurrences of each word in a document?

"I am Sam / I am Sam / Sam I am / Do you like / Green eggs and ham?"

I: 3, am: 3, Sam: 3, do: 1, you: 1, like: 1, …
One Approach: Use a Hash Table

"I am Sam / I am Sam / Sam I am / Do you like / Green eggs and ham?"

Scan the words one at a time, incrementing a hash-table entry for each:
{}
{I: 1}
{I: 1, am: 1}
{I: 1, am: 1, Sam: 1}
{I: 2, am: 1, Sam: 1}
…
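A minimal sketch of the hash-table approach in Python, using a dict (the punctuation-free sample text is an assumption for brevity):

```python
# Scan the words one at a time, incrementing the hash-table entry for each.
text = "I am Sam I am Sam Sam I am Do you like Green eggs and ham"
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1
print(counts["Sam"])  # -> 3
```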
What if the Document is Really Big?

"I am Sam / I am Sam / Sam I am / Do you like / Green eggs and ham? / I do not like them / Sam I am / I do not like / Green eggs and ham / Would you like them / Here or there? / …"
What if the Document is Really Big?
• Split the document across Machines 1-4; each builds a partial hash table over its chunk:
  {I: 3, am: 3, Sam: 3, …}  {do: 2, …}  {Sam: 1, …}  {Would: 1, …}
• Machine 5 merges the partial tables into the final counts:
  {I: 6, am: 4, Sam: 4, do: 3, …}

What's the problem with this approach? All the results have to fit on one machine!
What if the Document is Really Big?
• We can add aggregation layers: intermediate machines first merge pairs of partial tables (e.g., {I: 4, am: 3, …} and {I: 2, do: 1, …}) before a final merge produces {I: 6, am: 3, you: 2, not: 1, …}
• But the final results still must fit on one machine!
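The aggregation layers can be sketched as a reduction tree: each layer merges pairs of partial tables in parallel, but the root still ends up holding the complete result (a toy illustration; the partial tables are abbreviated from the slide):

```python
# Merge per-machine partial counts pairwise, layer by layer (a reduction tree).
def merge(a, b):
    out = dict(a)
    for word, n in b.items():
        out[word] = out.get(word, 0) + n
    return out

partials = [{"I": 3, "am": 3}, {"do": 2, "I": 1}, {"Sam": 1, "I": 1}, {"Would": 1, "I": 1}]
while len(partials) > 1:  # each pass is one aggregation layer
    partials = [merge(partials[i], partials[i + 1]) if i + 1 < len(partials) else partials[i]
                for i in range(0, len(partials), 2)]
total = partials[0]       # the full result still lives on one machine
print(total["I"])  # -> 6
```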
What if the Document is Really Big?

Use Divide and Conquer!
• MAP: Machines 1-4 each count the words in their own partition of the document in parallel, producing local tables such as {I: 1, am: 1, …}, {do: 1, you: 1, …}, {Would: 1, you: 1, …}
• The words are then repartitioned by key across the same four machines, and each machine sums the counts for its assigned words: {I: 6, do: 3, …}, {am: 5, Sam: 4, …}, {you: 2, …}, {Would: 1, …}
The second, merging phase is called REDUCE. This two-phase pattern is Google's Map Reduce (2004):
http://research.google.com/archive/mapreduce.html
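The whole pattern can be simulated in plain Python (a sketch of the model, not Google's implementation; the four "machines" are just list entries here):

```python
# Simulate Map Reduce word counting over 4 "machines".
from collections import defaultdict

partitions = [
    "I am Sam I am Sam Sam I am",
    "Do you like Green eggs and ham",
    "I do not like them Sam I am",
    "Would you like them Here or there",
]

# MAP: each machine counts its own partition, emitting (word, count) pairs.
def map_partition(text):
    local = defaultdict(int)
    for word in text.split():
        local[word] += 1
    return local.items()

# SHUFFLE: route each word to a reducer by hashing the key.
num_reducers = 4
shuffled = [defaultdict(int) for _ in range(num_reducers)]
for part in partitions:
    for word, n in map_partition(part):
        shuffled[hash(word) % num_reducers][word] += n

# REDUCE: each reducer now holds the final counts for its words.
final = {}
for reducer in shuffled:
    final.update(reducer)
print(final["I"], final["am"], final["Sam"])  # -> 5 4 4
```

Because all occurrences of a word hash to the same reducer, no single machine ever needs the full table, only its own slice of the keys.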
Map Reduce for Sorting
"What word is used most?"
• MAP: build the per-partition word counts as before ({I: 1, am: 1, …}, {do: 1, you: 1, …}, …)
• REDUCE: partition the words by count range (≤ 2; > 2 and ≤ 4; > 4 and ≤ 5; > 5), so each machine's output is sorted: {1: would, 2: you, …}, {3: do, 4: Sam, …}, {5: am, …}, {6: I, …}
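Sorting by count works like word counting, except the reduce phase partitions by count ranges instead of hashing words, so concatenating the machines' outputs gives a globally sorted list (a sketch; the boundary values mirror the ≤ 2, ≤ 4, ≤ 5 splits above):

```python
# Range-partition (count, word) pairs so each machine's output is a sorted run.
counts = {"I": 6, "am": 5, "Sam": 4, "do": 3, "you": 2, "Would": 1}
boundaries = [2, 4, 5]  # machine i takes counts in (boundaries[i-1], boundaries[i]]

def machine_for(count):
    for i, upper in enumerate(boundaries):
        if count <= upper:
            return i
    return len(boundaries)  # counts above the last boundary

runs = [[] for _ in range(len(boundaries) + 1)]
for word, n in counts.items():
    runs[machine_for(n)].append((n, word))

# Each machine sorts locally; concatenating the runs is globally sorted.
globally_sorted = [pair for run in runs for pair in sorted(run)]
print(globally_sorted[-1])  # the most-used word comes last: (6, 'I')
```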
What's Hard About Cluster Computing?
• How to divide work across machines?
  » Must consider network and data locality
  » Moving data may be very expensive
• How to deal with failures?
  » If 1 server fails every 3 years, then with 10,000 nodes you see ~10 faults/day
  » Even worse: stragglers (nodes that have not failed, but are slow)
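The faults-per-day figure is straightforward expected-value arithmetic:

```python
# One failure per server every 3 years, across 10,000 servers:
failures_per_node_per_day = 1 / (3 * 365)
faults_per_day = 10_000 * failures_per_node_per_day
print(round(faults_per_day, 1))  # -> 9.1, i.e. roughly 10 faults/day
```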
How Do We Deal with Failures?

Suppose the machine computing {I: 1, am: 1, …} fails partway through the job: its partition's output is lost while the other three machines finish theirs ({Would: 1, you: 1, …}, {do: 1, you: 1, …}, …).
Launch another task! Re-run the failed MAP task on a different machine to recompute the lost partition's counts ({I: 1, am: 1, …}).
How Do We Deal with Slow Tasks?

Suppose one machine is a straggler: it has not failed, but it is producing its partial counts far more slowly than the others.
The same trick works: launch another copy of the task on a different machine, and use whichever copy finishes first.
Map Reduce: Distributed Execution

MAP, then REDUCE: each stage passes its output through the hard drives.
Map Reduce: Iterative Jobs
• Iterative jobs involve a lot of disk I/O on each repetition (Stage 1, Stage 2, Stage 3, …)
• Disk I/O is very slow!
Apache Spark Motivation
• Using Map Reduce for complex jobs, interactive queries, and online processing involves lots of disk I/O:
  » Interactive mining: Query 1, Query 2, Query 3, …
  » Stream processing: Job 1, Job 2, …
  » Also, iterative jobs
• Disk I/O is very slow
Tech Trend: Cost of Memory
[Chart: price per MB of memory, disk, and flash by year; http://www.jcmit.com/mem2014.htm]
• Memory in 2010: 1 ¢/MB
• Lower cost means we can put more memory in each server
Hardware for Big Data

Lots of hard drives … and CPUs … and memory!
Opportunity
• Keep more data in memory!
• Create a new distributed execution engine:
  http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Use Memory Instead of Disk

With Map Reduce, data passes through HDFS on disk at every step:
• Iterative: Input → HDFS read → iteration 1 → HDFS write → HDFS read → iteration 2 → HDFS write → …
• Interactive: each of query 1, query 2, query 3, … re-reads the input from HDFS to produce result 1, result 2, result 3, …
In-Memory Data Sharing
• Iterative: Input → HDFS read → iteration 1 → iteration 2 → …, with intermediate data kept in memory
• Interactive: one-time processing loads the input into distributed memory, then query 1, query 2, query 3, … run against it to produce result 1, result 2, result 3, …
• Memory is 10-100x faster than network and disk
Resilient Distributed Datasets (RDDs)
• Write programs in terms of operations on distributed datasets
• Partitioned collections of objects spread across a cluster, stored in memory or on disk
• RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save)
• RDDs are automatically rebuilt on machine failure
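A toy sketch of these ideas in plain Python (the class below is illustrative, not Spark's API): transformations only record lineage, actions force computation, and a lost result can always be recomputed from the lineage.

```python
# A toy "RDD": it remembers how it was derived (its lineage) rather than
# materialized data, so the result can be recomputed after a failure.
# Illustrative only; real Spark RDDs are partitioned across machines.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute      # lineage: a recipe, not data

    @staticmethod
    def parallelize(data):
        return ToyRDD(lambda: list(data))

    def map(self, f):                # transformation: lazy, extends lineage
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):          # transformation: lazy
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def count(self):                 # action: forces computation
        return len(self._compute())

    def collect(self):               # action: forces computation
        return self._compute()

rdd = ToyRDD.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.count())    # nothing ran until this action -> 5
print(rdd.collect())  # recomputed from lineage -> [0, 4, 16, 36, 64]
```

In real Spark the collection is split into partitions across machines, results can be cached in memory, and only the lost partitions are recomputed on failure.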
The Spark Computing Framework
• Provides a programming abstraction and parallel runtime to hide the complexities of fault tolerance and slow machines
• "Here's an operation, run it on all of the data"
  » I don't care where it runs (you schedule that)
  » In fact, feel free to run it twice on different nodes
Spark Tools

Built on Apache Spark: Spark Streaming, Spark SQL, MLlib (machine learning), and GraphX (graph processing)
Spark and Map Reduce Differences

                          Hadoop Map Reduce   Spark
Storage                   Disk only           In-memory or on disk
Operations                Map and Reduce      Map, Reduce, Join, Sample, etc.
Execution model           Batch               Batch, interactive, streaming
Programming environments  Java                Scala, Java, R, and Python
Other Spark and Map Reduce Differences
• Generalized patterns: a unified engine for many use cases
• Lazy evaluation of the lineage graph: reduces wait states, better pipelining
• Lower overhead for starting jobs
• Less expensive shuffles
In-Memory Can Make a Big Difference

Two iterative machine learning algorithms:
• K-means clustering: Hadoop MR 121 sec, Spark 4.1 sec
• Logistic regression: Hadoop MR 80 sec, Spark 0.96 sec
First Public Cloud Petabyte Sort
• Daytona Gray 100 TB sort benchmark record (tied for 1st place)
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark Expertise Tops Big Data Median Salaries
• Over 800 respondents across 53 countries and 41 U.S. states
http://www.oreilly.com/data/free/2014-data-science-salary-survey.csp
History Review
Historical References
• circa 1979 – Stanford, MIT, CMU, etc.: set/list operations in LISP, Prolog, etc., for parallel processing
  http://www-formal.stanford.edu/jmc/history/lisp/lisp.htm
• circa 2004 – Google: "MapReduce: Simplified Data Processing on Large Clusters," Jeffrey Dean and Sanjay Ghemawat
  http://research.google.com/archive/mapreduce.html
• circa 2006 – Apache Hadoop, originating from Yahoo!'s Nutch Project (Doug Cutting)
  http://research.yahoo.com/files/cutting.pdf
• circa 2008 – Yahoo!: web-scale search indexing (Hadoop Summit, HUG, etc.)
  http://developer.yahoo.com/hadoop/
• circa 2009 – Amazon AWS: Elastic MapReduce (Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.)
  http://aws.amazon.com/elasticmapreduce/
Spark Research Papers
• "Spark: Cluster Computing with Working Sets"
  Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. USENIX HotCloud (2010).
  people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
• "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI (2012).
  usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf