TRANSCRIPT
An Introduction to Big Data Analysis using Spark
Mohamad Jaber
American University of Beirut
Faculty of Arts & Sciences - Department of Computer Science
May 17, 2017
Mohamad Jaber (AUB) Spark May 17, 2017 1 / 43
Big Data
1 Big Data
2 Apache Spark
3 Distributed File System
Big Data
We live in the data age
Big Data - Some Numbers (2013-14)
Storage and Processing
Facebook hosts more than 240 billion photos, growing at 7 petabytes per month
New York Stock Exchange generates about 4 − 5 terabytes of data per day
Google processes 20 petabytes of information per day
. . .
Estimation
The size of the digital universe was 4.4 zettabytes (1 ZB = 10^21 bytes) in 2013
Projected for 2020: 44 zettabytes
Simple Java Program to Analyze Data
```java
public static long analyze(String fileName, Analyzer analyzer) throws IOException {
    // Read input
    BufferedReader reader = new BufferedReader(new FileReader(fileName));
    long score = 0;
    String line = null;
    // Processing
    while ((line = reader.readLine()) != null) {
        score += analyzer.analyze(line);
    }
    return score;
}
```
Throughput: 1 GB per hour
10 GB data set ⇒ 10 hours
How can we Improve the Performance?
Faster CPU – Scale up (vertically)
More/Faster memory – Scale up (vertically)
Increase the number of cores
Increase the number of threads
Increase the number of threads and cores
Shared Memory (pthreads)
Message Passing (MPI)
Multi-threaded
Throughput: 10 GB per hour
What about 1 PB?
What about faults?
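The multi-threaded variant can be sketched as follows - a minimal illustration, not the slide's actual code, assuming a simple `Analyzer` interface like the one in the sequential program and lines already loaded into memory. Each worker scores one chunk of the data; the partial scores are combined at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelAnalyze {
    // Hypothetical per-line scorer, mirroring the Analyzer of the sequential example
    interface Analyzer { long analyze(String line); }

    static long analyze(List<String> lines, Analyzer analyzer, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            int chunk = (lines.size() + nThreads - 1) / nThreads;
            List<Future<Long>> parts = new ArrayList<>();
            // Split the data: each worker scores one contiguous chunk of lines
            for (int i = 0; i < lines.size(); i += chunk) {
                List<String> slice = lines.subList(i, Math.min(i + chunk, lines.size()));
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (String line : slice) sum += analyzer.analyze(line);
                    return sum;
                }));
            }
            long score = 0;
            for (Future<Long> part : parts) score += part.get(); // combine partial scores
            return score;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of("aa", "bbb", "c");
        // Score each line by its length; the total is the same for any thread count
        System.out.println(analyze(lines, line -> line.length(), 2)); // prints 6
    }
}
```

Note what this sketch does *not* handle: a 1 PB input does not fit on one machine, and a crashed thread or machine loses the whole computation - exactly the two questions above.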
What do we Need?
We need a framework that abstracts away / hides:
Scale Out (horizontally)
Parallelization
Data distribution
Fault-tolerance
Load Balancing
Apache Spark
Why Spark?
Normally, data science and analytics is done "in the small", in R/Python/MATLAB, etc.
If your dataset ever gets too large to fit into memory, these languages/frameworks won't allow you to scale
You have to re-implement everything in some other language or system
Moreover, there is a massive shift in industry to data-oriented decision making too! ⇒ data science in the large
According to the popular IT job portal Dice.com, a keyword search for the term "Spark Developer" returned 34,617 listings as of 16 December 2015.
Why Spark?
Spark is
More expressive. APIs modeled after Scala collections - they look like functional lists! Richer, more composable operations are possible than in MapReduce
Efficient. Not only performant in terms of running time... but also in terms of developer productivity! Interactive!
Good for data science. Not just because of performance, but because it enables iteration, which is required by most algorithms in a data scientist's toolbox (e.g., machine learning, graph analytics)
Scala Quick Tour
Scala is a high-level language for the Java VM (object-oriented + functional programming) - it supports an interactive shell
```scala
// declare variables
var x: Int = 7
var x = 7                  // type inferred
val y = "hi"               // read-only

// Functions
def square(x: Int): Int = x * x
def square(x: Int): Int = {
  x * x
}
def announce(text: String) {
  println(text)
}

// Generic Types
var arr = new Array[Int](8)
var lst = List(1, 2, 3)
arr(5) = 7

// processing collections
val list = List(1, 2, 3)
list.foreach(x => println(x))
list.foreach(println)          // shortcut

val incMap = list.map(x => x + 2)
val incMap = list.map(_ + 2)   // same with placeholder notation

val f = list.filter(x => x % 2 == 1)
val f = list.filter(_ % 2 == 1)

val n = list.reduce((x, y) => x + y)
val n = list.reduce(_ + _)
// List is immutable
```
Visualizing Shared Memory Data Parallelism
```scala
val res = jar.map(jellyBean => doSomething(jellyBean))
```
Shared Memory Data Parallelism
Split the data
Workers/threads independently operate on their share of the data in parallel
Combine when done (if necessary)
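The same jelly-bean picture can be sketched with Java parallel streams, which split a collection across the cores of one machine; `doSomething` here is a stand-in for any per-element function, not code from the slides.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SharedMemoryMap {
    // Stand-in for the per-element work (doSomething in the slide)
    static int doSomething(int jellyBean) { return jellyBean * jellyBean; }

    public static void main(String[] args) {
        List<Integer> jar = List.of(1, 2, 3, 4);
        // parallelStream splits the data, worker threads map elements
        // independently, and collect combines the results when done
        List<Integer> res = jar.parallelStream()
                               .map(SharedMemoryMap::doSomething)
                               .collect(Collectors.toList());
        System.out.println(res); // encounter order is preserved: [1, 4, 9, 16]
    }
}
```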
Visualizing Distributed Data Parallelism
```scala
val res = jar.map(jellyBean => doSomething(jellyBean))
```
Distributed Data Parallelism
Split the data over several nodes (machines)
Workers/threads independently operate on their share of the data in parallel
Combine when done (if necessary)
New concern: now we need to worry about network latency (combining)!
Apache Spark
Apache Spark is a framework for distributed data processing!
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)
Distributed Data Parallel: High-Level
Given some large dataset that cannot fit into memory on a single node
Chunk up (partition) the data
Distribute it over a cluster of machines
From there think of your distributed data like a single collection!
Example (transform the text of all wiki articles to lowercase)
```scala
val wiki: RDD[WikiArticle] = ...
val lowerWiki = wiki.map(article => article.text.toLowerCase)
```
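The point of the RDD API is that it reads like an ordinary collection operation. As a point of comparison, the same lowercase transformation on a local Java list (a sketch with made-up sample data, not Spark itself) looks like:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LowercaseLocal {
    public static void main(String[] args) {
        // Local stand-in for the distributed collection of article texts
        List<String> wiki = List.of("Big Data", "Apache Spark");
        List<String> lowerWiki = wiki.stream()
                                     .map(String::toLowerCase)
                                     .collect(Collectors.toList());
        System.out.println(lowerWiki); // [big data, apache spark]
    }
}
```

With an RDD, the same `map` call runs on partitions spread over the cluster instead of a single in-memory list.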
Distribution
Distribution introduces important concerns beyond parallelism in the shared memory case (single node/machine)
Partial failure: crash failure of a subset of machines
Latency: certain operations (combining) have a much higher latency than other operations due to network communication
Important Latency Numbers
| Operation | Latency |
|---|---|
| Main memory reference | 100 ns |
| Send 2K bytes over 1 Gbps network | 20,000 ns |
| SSD random read | 150,000 ns |
| Read 1 MB sequentially from memory | 250,000 ns |
| Read 1 MB sequentially from SSD | 1,000,000 ns |
| Read 1 MB sequentially from disk | 20,000,000 ns |
| Send packet US → Europe → US | 150,000,000 ns |
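A back-of-the-envelope calculation makes these numbers concrete: from the table, a sequential 1 MB disk read is 80x slower than the same read from memory, and one transatlantic round trip costs as much time as reading 600 MB from memory.

```java
public class LatencyRatios {
    public static void main(String[] args) {
        long memRead1MB  = 250_000L;     // ns, read 1 MB sequentially from memory
        long diskRead1MB = 20_000_000L;  // ns, read 1 MB sequentially from disk
        long networkRtt  = 150_000_000L; // ns, packet US -> Europe -> US

        // Disk is 80x slower than memory for a sequential 1 MB read
        System.out.println("disk vs memory: " + (diskRead1MB / memRead1MB) + "x");
        // One round trip = the time to read 600 MB sequentially from memory
        System.out.println("RTT in memory-MB: " + (networkRtt / memRead1MB) + " MB");
    }
}
```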
Big Data Processing and Latency
Network communication and disk operations can be very expensive!
How do these latency numbers relate to big data processing?
To answer this question, let us discuss Spark's predecessor, Hadoop
Hadoop is a widely-used large-scale batch data processing framework
It is an open-source implementation of Google's MapReduce (2004)
Groundbreaking because of: (1) its simplicity (map and reduce); and (2) fault tolerance
Fault tolerance is what made it possible for Hadoop MapReduce to scale up to 1000 nodes (recovering from node failure)
Hadoop MapReduce
MapReduce works by breaking the processing into two phases
Each phase has key-value pairs as input and output
Map: grab the relevant data from the source and output intermediate (key, value) pairs (local file system - disk)
Reduce: aggregate the results for each unique key of the generated intermediate (key, value) pairs (HDFS)
Mohamad Jaber (AUB) Spark May 17, 2017 19 / 43
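The two phases above can be sketched with plain Scala collections. This is a hypothetical single-machine simulation of word count (the shuffle between the phases is simulated with groupBy), not the Hadoop API; the input lines are made up:

```scala
// Single-machine sketch of the two MapReduce phases for word count.
object MapReduceSketch {
  def main(args: Array[String]): Unit = {
    val input = Seq("the cat", "the dog", "the cat sat")

    // Map phase: emit intermediate (key, value) pairs
    val intermediate: Seq[(String, Int)] =
      input.flatMap(line => line.split("\\s+").map(word => (word, 1)))

    // Shuffle: group intermediate pairs by key (done by the framework)
    val grouped: Map[String, Seq[Int]] =
      intermediate.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

    // Reduce phase: aggregate the values for each unique key
    val counts: Map[String, Int] = grouped.map { case (k, vs) => (k, vs.sum) }

    println(counts) // e.g. Map(the -> 3, cat -> 2, dog -> 1, sat -> 1)
  }
}
```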
Apache Spark
Why Spark?
Fault tolerance in Hadoop MapReduce comes at a cost
Between each map and reduce step, in order to recover from potential failures, Hadoop MapReduce shuffles its data and writes intermediate data to disk
Cons of Hadoop MapReduce
Inefficient when the same data is used multiple times (iterative and interactive workloads)
Intermediate results are written to stable storage: the output of reducers is written to HDFS, incurring disk I/O, network I/O, and [de]serialization
Mohamad Jaber (AUB) Spark May 17, 2017 20 / 43
Apache Spark
Why Spark?
Spark retains fault tolerance
It uses a different strategy for handling latency
It achieves this using ideas from functional programming
Keep all data immutable and in memory. All operations on data are just functional transformations. Fault tolerance is achieved by replaying functional transformations over the original dataset
Spark has been shown to be up to 100x faster than Hadoop MapReduce while adding even more expressive APIs!
Mohamad Jaber (AUB) Spark May 17, 2017 21 / 43
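The replay idea can be illustrated with a small, hypothetical Scala sketch (not Spark internals): the original data stays immutable, the applied transformations are recorded, and a lost result is recovered by re-applying them:

```scala
// Sketch of fault tolerance by replay (illustrative only, not Spark internals).
object ReplaySketch {
  def main(args: Array[String]): Unit = {
    val original = Vector(1, 2, 3, 4, 5) // immutable source data

    // The recorded chain of functional transformations
    val lineage: List[Vector[Int] => Vector[Int]] =
      List(_.map(_ * 2), _.filter(_ > 4))

    def compute(data: Vector[Int]): Vector[Int] =
      lineage.foldLeft(data)((d, f) => f(d))

    val result = compute(original)
    // If the node holding `result` fails, simply replay the lineage:
    val recovered = compute(original)
    println(result == recovered) // true
  }
}
```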
Apache Spark
Spark vs Hadoop Performance
Mohamad Jaber (AUB) Spark May 17, 2017 22 / 43
Apache Spark
Spark vs Hadoop Popularity
According to Google trends, Spark has surpassed Hadoop in popularity!
Mohamad Jaber (AUB) Spark May 17, 2017 23 / 43
Apache Spark
Spark - RDD
Spark extends the MapReduce model to better support two common classes of analytics applications
Iterative algorithms (e.g., machine learning, graph processing)
Interactive: efficiently analyze data sets interactively
Spark implements a distributed data-parallel model called Resilient Distributed Datasets (RDDs)
RDDs look just like immutable sequential or parallel Scala collections. An RDD is a big parallel collection that is distributed (in memory or on disk) across the cluster
Spark provides high-level APIs in Java, Scala, Python and R
Mohamad Jaber (AUB) Spark May 17, 2017 24 / 43
Apache Spark
RDD
An RDD can be created either from stable storage (e.g., local, HDFS) or through a parallel transformation of another RDD (e.g., map, filter)
It is also possible to execute actions on an RDD
An action returns single values (not collections) as results (e.g., reduce, count, first)
An RDD can be cached for efficient (later) use
Mohamad Jaber (AUB) Spark May 17, 2017 25 / 43
Apache Spark
Programming with RDDs - Spark Context
Spark Context
Main entry point to Spark functionality
Available in shell as variable sc
scala> val rdd = sc.textFile("input.txt")
Standalone application
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
Mohamad Jaber (AUB) Spark May 17, 2017 26 / 43
Apache Spark
Create RDDs
An RDD can be created either from stable storage (e.g., local, HDFS):
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val logData = sc.textFile("hdfs://hadoop-master/a.txt", 32) // 32 partitions
Or, through a parallel transformation of another RDD:
// remove lines containing the word error
val logDataFilter = logData.filter(x => !x.contains("error"))
// count the number of words per line
val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))
// combined into one expression
val countData = logData.filter(x => !x.contains("error")).map(x => x.split("\\s+").count(x => true))
Or, you can turn a Scala collection into an RDD:
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
Mohamad Jaber (AUB) Spark May 17, 2017 27 / 43
Apache Spark
Actions on RDDs
val nums = sc.parallelize(List(1, 2, 3))
// Retrieve RDD contents as a local collection
nums.collect() // => Array(1, 2, 3)
// Return first K elements
nums.take(2) // => Array(1, 2)
// Count number of elements
nums.count() // => 3
// Merge elements with an associative function
nums.reduce(_ + _) // => 6 -- equivalent to nums.reduce((x, y) => x + y)
// Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
// Loop over all elements
nums.foreach(println)
Mohamad Jaber (AUB) Spark May 17, 2017 28 / 43
Apache Spark
Lazy Operations and Caching
All transformations in Spark are lazy
They do not compute their results right away. They just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program
Upon executing an action, a result will be computed and intermediate RDDs are stored in RAM (if possible)
Executing another action would repeat the reconstruction from the beginning
However, you can cache some RDDs!
val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val logDataFilter = logData.filter(x => !x.contains("error"))
logDataFilter.cache()
val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))
println(countData.count())
val countData2 = logDataFilter.map(x => x.split("\\s+").count(x => x != "mohamad"))
println(countData2.count()) // without the cache() above, this would recompute from the beginning
Mohamad Jaber (AUB) Spark May 17, 2017 29 / 43
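cache() stores the RDD in memory only; when the data may not fit in RAM, persist accepts other storage levels. A minimal sketch, assuming an existing SparkContext `sc` and a hypothetical HDFS path:

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val filtered = logData.filter(x => !x.contains("error"))

// Spill partitions that do not fit in memory to local disk
filtered.persist(StorageLevel.MEMORY_AND_DISK)

// Release the cached partitions when no longer needed
filtered.unpersist()
```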
Apache Spark
Pair RDDs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs
// Scala pair
val pair = (a, b)
pair._1 // => a
pair._2 // => b
// pets is a Pair RDD
val pets = sc.parallelize(Array(("cat", 1), ("dog", 1), ("cat", 2)))
Some transformations: reduceByKey, join, sortByKey, mapValues
val data = sc.textFile("input.txt")
val pairData = data.map(v => {
  val split = v.split("\\s+")
  (split(0), split(1).toInt)
}).cache()
// automatically implements combiners
val rdd1 = pairData.reduceByKey((x, y) => x + y) // ("cat", 3), ("dog", 1)
val rdd2 = pairData.groupByKey() // ("cat", [1, 2]), ("dog", [1])
val rdd3 = pairData.sortByKey() // ("cat", 1), ("cat", 2), ("dog", 1)
Mohamad Jaber (AUB) Spark May 17, 2017 30 / 43
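The slide also lists join and mapValues; a minimal sketch of both, assuming an existing SparkContext `sc` and made-up data:

```scala
// Two small pair RDDs sharing keys
val ages   = sc.parallelize(Array(("cat", 3), ("dog", 5)))
val owners = sc.parallelize(Array(("cat", "Ann"), ("dog", "Bob"), ("cat", "Eve")))

// Inner join on the key:
// ("cat", (3, "Ann")), ("cat", (3, "Eve")), ("dog", (5, "Bob"))
val joined = ages.join(owners)

// Transform only the values, keeping keys (and partitioning) intact
val nextYear = ages.mapValues(a => a + 1) // ("cat", 4), ("dog", 6)
```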
Apache Spark
Example: Word Count
val lines = sc.textFile("input.txt")
val counts = lines.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Mohamad Jaber (AUB) Spark May 17, 2017 31 / 43
Apache Spark
Example: Simple Linear Regression
val rddData = sc.textFile("input")
var theta = Math.random()
val learningRate = 0.0001
val iterations = 100
val rddDataXY = rddData.map(item => {
  val itemSplit = item.split(" ")
  (itemSplit(0).toDouble, itemSplit(1).toDouble)
}).cache()
for (i <- 1 to iterations) {
  val rddInnerGradient = rddDataXY.map(item => 2 * (theta * item._1 - item._2) * item._1)
  val gradient = rddInnerGradient.reduce((v1, v2) => v1 + v2)
  theta = theta - learningRate * gradient
}
Mohamad Jaber (AUB) Spark May 17, 2017 32 / 43
Apache Spark
Fault Tolerance
One option for fault tolerance is to replicate the data (onto multiple nodes)
However, this may drastically affect performance (disk and network I/O)
Spark uses a method called lineage
Remember how an RDD was built from a given source. Automatically rebuild it on failure. Recompute only the lost partitions on failure, that is, no cost if nothing fails!
Mohamad Jaber (AUB) Spark May 17, 2017 33 / 43
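The lineage an RDD would replay can be inspected with toDebugString. A sketch, assuming an existing SparkContext `sc` and a hypothetical HDFS path:

```scala
// toDebugString prints the recorded chain of transformations that Spark
// would replay to rebuild lost partitions.
val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val filtered = logData.filter(x => !x.contains("error"))
val counts = filtered.map(x => x.split("\\s+").length)

println(counts.toDebugString)
// Prints a lineage tree, e.g. MapPartitionsRDDs chained back to a HadoopRDD
```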
Apache Spark
Spark Execution Engine - Stages and Tasks
sc.textFile("hdfs://master-node/input/data")
  .map(x => (x(0), x))
  .groupByKey()
  .mapValues(f => f.count(x => true))
Example trace (stage 1 runs up to the groupByKey shuffle; stage 2 follows):
Input: Mohamad, Jad, Mary
After map: (M, Mohamad), (J, Jad), (M, Mary)
After groupByKey: (M, [Mohamad, Mary]), (J, [Jad])
After mapValues: (M, 2), (J, 1)
Mohamad Jaber (AUB) Spark May 17, 2017 34 / 43
Apache Spark
Spark Execution Engine - Stages and Tasks
[Figure: a DAG of RDDs (RDD1 to RDD7) connected by map, filter, and join transformations, divided into stage 1, stage 2, and stage 3]
DAGScheduler is the scheduling layer of Spark that implements stage-oriented scheduling
It transforms a logical execution plan into a physical execution plan (using stages)
Stages are submitted as sets of tasks
When the result generated is independent of any other data, we can pipeline!
Mohamad Jaber (AUB) Spark May 17, 2017 35 / 43
Apache Spark
Spark Execution Flow
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (the driver)
SparkContext can connect to several types of cluster managers: Spark's own standalone cluster manager, Mesos, or YARN
> spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster \
    [options] <app jar> [app options]
Once connected, Spark acquires executors on nodes in the cluster
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors
Finally, SparkContext sends tasks to the executors to run
Mohamad Jaber (AUB) Spark May 17, 2017 36 / 43
![Page 82: An Introduction to Big Data Analysis using Sparklia.deis.unibo.it/Courses/CompNetworksM/1617/... · An Introduction to Big Data Analysis using Spark Mohamad Jaber American University](https://reader033.vdocument.in/reader033/viewer/2022050203/5f56e21970dd101ac02ea8d2/html5/thumbnails/82.jpg)
Apache Spark
More about Spark
Modules on top of Spark
Spark also supports a rich set of higher-level tools
GraphX for graph processing

MLlib for machine learning
Spark SQL for structured data processing
Spark Streaming for processing live data streams
Geo and Spatial Spark: geographical and spatial data
Distributed File System
1 Big Data
2 Apache Spark
3 Distributed File System
Distributed File System
Distributed File Systems
Client/Server-based Distributed File Systems
The actual file service is offered by a single machine, which stores all the files
Network File System (NFS)
Andrew File System (AFS)
[Diagram: Client 1, Client 2, and Client 3 connect to a single Server with its HDD]
Cluster-based Distributed File Systems
Divide files among tens, hundreds, thousands or tens of thousands of machines
Google File System (GFS, appeared in SOSP 2003)
Hadoop Distributed File System (HDFS)
[Diagram: Client 1, Client 2, and Client 3 connect to a cluster of Servers, each with its own HDD]
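The cluster-based idea above can be made concrete with a toy sketch: split a file into fixed-size blocks and replicate each block on several servers. Everything here (the tiny block size, the round-robin placement, the server names) is illustrative; real systems such as HDFS use large blocks (e.g. 128 MB) and rack-aware placement policies.

```python
# Toy sketch of a cluster-based distributed file system: a file is split
# into fixed-size blocks and each block is replicated on several servers.

BLOCK_SIZE = 4    # bytes per block (real systems use e.g. 128 MB)
REPLICATION = 2   # copies kept of each block

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    # Cut the file into consecutive fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, servers, replication=REPLICATION):
    # Spread replicas across servers round-robin; real systems also
    # consider racks, load, and data locality.
    placement = {}
    for idx, _block in enumerate(blocks):
        placement[idx] = [servers[(idx + r) % len(servers)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello distributed file systems!")
layout = place_blocks(blocks, ["server1", "server2", "server3", "server4"])
print(len(blocks), layout[0])  # 8 blocks; block 0 on ['server1', 'server2']
```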
Distributed File System
Hadoop Distributed File System (HDFS)
HDFS (open source) is inspired by GFS
[Diagram: a NameNode holding metadata (FsImage and EditLog) coordinates a row of DataNodes; a file is split into blocks stored across the DataNodes]
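A toy sketch of the NameNode metadata design above: namespace changes are appended to an edit log, and a checkpoint merges them into the FsImage. Class and method names here are illustrative, not Hadoop's actual API.

```python
# Toy model of NameNode metadata: the FsImage is a checkpoint of the
# namespace, and the EditLog records changes made since that checkpoint.

class ToyNameNode:
    def __init__(self):
        self.fsimage = {}   # checkpointed namespace: path -> list of block ids
        self.editlog = []   # operations since the last checkpoint

    def create_file(self, path, blocks):
        # Namespace changes are only appended to the edit log
        self.editlog.append(("create", path, blocks))

    def delete_file(self, path):
        self.editlog.append(("delete", path, None))

    def namespace(self):
        # Current view = FsImage with the edit log replayed on top
        ns = dict(self.fsimage)
        for op, path, blocks in self.editlog:
            if op == "create":
                ns[path] = blocks
            elif op == "delete":
                ns.pop(path, None)
        return ns

    def checkpoint(self):
        # Merge the edit log into the FsImage and truncate the log
        self.fsimage = self.namespace()
        self.editlog = []

nn = ToyNameNode()
nn.create_file("/input/file.txt", ["blk_1", "blk_2"])
nn.checkpoint()
nn.delete_file("/input/file.txt")
print(nn.namespace())  # {}
```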
![Page 87: An Introduction to Big Data Analysis using Sparklia.deis.unibo.it/Courses/CompNetworksM/1617/... · An Introduction to Big Data Analysis using Spark Mohamad Jaber American University](https://reader033.vdocument.in/reader033/viewer/2022050203/5f56e21970dd101ac02ea8d2/html5/thumbnails/87.jpg)
Distributed File System
HDFS Components in Cluster
Distributed File System
HDFS Commands
HDFS provides a shell-like interface, with a set of commands available to interact with the file system

# format the file system
hdfs namenode -format

# start the namenode and datanode daemons
start-dfs.sh

# operations on HDFS
hdfs dfs <args>

# create a directory
hdfs dfs -mkdir /input

# list files
hdfs dfs -ls /

# transfer a data file from the local file system to HDFS
hdfs dfs -put /home/jaber/file.txt /input

# view the data in HDFS using the cat command
hdfs dfs -cat /input/file.txt

# get a file from HDFS to the local file system
hdfs dfs -get /input/file.txt /home/jaber/Desktop
Distributed File System
Hadoop - Yarn Cluster