An Introduction to Big Data Analysis using Spark
Mohamad Jaber
American University of Beirut
Faculty of Arts & Sciences - Department of Computer Science
May 17, 2017
1 Big Data
2 Apache Spark
3 Distributed File System
Big Data
We live in the data age
Big Data - Some Numbers (2013-14)
Storage and Processing
Facebook hosts more than 240 billion photos, growing at 7 petabytes per month
New York Stock Exchange generates about 4-5 terabytes of data per day
Google processes 20 petabytes of information per day
...

Estimation
The size of the digital universe was 4.4 zettabytes (1 ZB = 10^21 bytes) in 2013
Projected for 2020: 44 zettabytes
Simple Java Program to Analyze Data
// requires java.io.{BufferedReader, FileReader, IOException}
public static long analyze(String fileName, Analyzer analyzer) throws IOException {
    // Read input
    BufferedReader reader = new BufferedReader(new FileReader(fileName));
    long score = 0;
    String line = null;
    // Processing
    while ((line = reader.readLine()) != null) {
        score += analyzer.analyze(line);
    }
    return score;
}
Throughput: 1 GB per hour
10 GB data set ⇒ 10 hours
How can we Improve the Performance?
Faster CPU – Scale up (vertically)
More/Faster memory – Scale up (vertically)
Increase the number of cores
Increase the number of threads
Increase the number of threads and cores
Shared Memory (pthreads)
Message Passing (MPI)
Multi-threaded
Throughput: 10 GB per hour
What about 1 PB? What about faults?
What do we Need?
We need a framework that abstracts away / hides:
Scale Out (horizontally)
Parallelization
Data distribution
Fault-tolerance
Load Balancing
Apache Spark
Why Spark?
Normally, data science and analytics is done "in the small", in R/Python/MATLAB, etc.
If your dataset ever gets too large to fit into memory, these languages/frameworks won't allow you to scale
You have to re-implement everything in some other language or system
Moreover, there is a massive shift in industry to data-oriented decision making too! ⇒ data science in the large
According to the popular IT job portal Dice.com, a keyword search for the term "Spark Developer" showed 34,617 listings as of 16 December 2015
Why Spark?
Spark is
More expressive. APIs modeled after Scala collections. Look like functional lists! Richer, more composable operations possible than in MapReduce
Efficient. Not only performant in terms of running time... but also in terms of developer productivity! Interactive!
Good for data science. Not just because of performance, but because it enables iteration, which is required by most algorithms in a data scientist's toolbox (e.g., machine learning, graph analytics)
Scala Quick Tour
Scala is a high-level language for the Java VM (object-oriented + functional programming) – supports an interactive shell
// declare variables
var x: Int = 7
var x = 7          // type inferred
val y = "hi"       // read-only

// functions
def square(x: Int): Int = x * x
def square(x: Int): Int = {  // equivalent block form
  x * x
}
def announce(text: String) {
  println(text)
}

// generic types
var arr = new Array[Int](8)
var lst = List(1, 2, 3)
arr(5) = 7

// processing collections
val list = List(1, 2, 3)
list.foreach(x => println(x))
list.foreach(println)              // shortcut

val incMap = list.map(x => x + 2)
val incMap = list.map(_ + 2)       // same with placeholder notation

val f = list.filter(x => x % 2 == 1)
val f = list.filter(_ % 2 == 1)

val n = list.reduce((x, y) => x + y)
val n = list.reduce(_ + _)
// List is immutable
Visualizing Shared Memory Data Parallelism
val res = jar.map(jellyBean => doSomething(jellyBean))

Shared Memory Data Parallelism
Split the data
Workers/threads independently operate on the shared data in parallel
Combine when done (if necessary)
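A runnable sketch of this idea using Scala parallel collections; jar and doSomething are the slide's placeholder names, defined here only for illustration:

// a stand-in "jar" of jelly beans and some hypothetical per-element work
val jar = (1 to 1000000).toVector
def doSomething(jellyBean: Int): Int = jellyBean * 2

// .par splits the collection across a thread pool on one machine:
// each worker maps its own chunk, and the results are combined into res
val res = jar.par.map(jellyBean => doSomething(jellyBean))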
Visualizing Distributed Data Parallelism
val res = jar.map(jellyBean => doSomething(jellyBean))

Distributed Data Parallelism
Split the data over several nodes (machines)
Workers/threads independently operate on the data in parallel
Combine when done (if necessary)
New concern: now we need to worry about network latency (when combining)!
Apache Spark
Apache Spark is a framework for distributed data processing!
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)
Distributed Data Parallel: High-Level
Given some large dataset that cannot fit into memory on a single node
Chunk up (partition) the data
Distribute it over a cluster of machines
From there think of your distributed data like a single collection!
Example (transform the text of all wiki articles to lowercase)
val wiki: RDD[WikiArticle] = ...
val lowerWiki = wiki.map(article => article.text.toLowerCase)
Distribution
Distribution introduces important concerns beyond parallelism in the shared memory case (single node/machine)
Partial failure: crash failure of a subset of machines
Latency: certain operations (e.g., combining) have a much higher latency than other operations due to network communication
Important Latency Numbers
Main memory reference                       100 ns
Send 2K bytes over 1 Gbps network        20,000 ns
SSD random read                         150,000 ns
Read 1 MB sequentially from memory      250,000 ns
Read 1 MB sequentially from SSD       1,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet US → Europe → US        150,000,000 ns
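A back-of-envelope consequence of these numbers (my arithmetic, not from the slides): reading the same terabyte sequentially takes minutes from memory but hours from disk, which is why avoiding disk I/O matters so much:

// seconds per MB, converted from the nanosecond figures above
val memSecPerMB  = 250000e-9      // read 1 MB sequentially from memory
val diskSecPerMB = 20000000e-9    // read 1 MB sequentially from disk

val mbPerTB = 1000000L            // 1 TB = 10^6 MB (decimal units)
println(f"1 TB from memory: ${memSecPerMB * mbPerTB / 60}%.1f minutes")   // ~4.2 minutes
println(f"1 TB from disk:   ${diskSecPerMB * mbPerTB / 3600}%.1f hours")  // ~5.6 hours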
Big Data Processing and Latency
Network communication and disk operations can be very expensive!
How do these latency numbers relate to big data processing?
To answer this question, let us discuss Spark's predecessor, Hadoop
Hadoop is a widely-used large-scale batch data processing framework
It is an open source implementation of Google's MapReduce (2004)
Groundbreaking because of: (1) simplicity (map and reduce); and (2) fault tolerance
Fault tolerance is what made it possible for Hadoop MapReduce to scale up to 1000 nodes (recover from node failure)
Hadoop MapReduce
MapReduce works by breaking the processing into two phases
Each phase has key-value pairs as input and output
Map: Grab the relevant data from the source and output intermediate (key, value) pairs (local file system - disk)
Reduce: Aggregate the results for each unique key of the generated intermediate (key, value) pairs (HDFS)
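To make the two-phase contract concrete, a minimal word-count sketch of the map and reduce functions; this is a hedged illustration of the model in Scala, not Hadoop's actual Java API:

// Map: one input record => intermediate (key, value) pairs
def map(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

// Reduce: one unique key plus all its intermediate values => final pair
def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)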
Why Spark?
Fault-tolerance in Hadoop MapReduce comes at a cost
Between each map and reduce step, in order to recover from potential failures, Hadoop MapReduce shuffles its data and writes intermediate data to disk

Cons of Hadoop MapReduce
Not efficient to use the same data multiple times (iterative and interactive workloads)
Intermediate results written into stable storage
Output of reducers written on HDFS
Disk I/O, network I/O, [de]serialization
Why Spark?
Retains fault-tolerance
Different strategy for handling latency
Achieves this using ideas from functional programming
Keep all data immutable and in-memory
All operations on data are just functional transformations
Fault tolerance is achieved by replaying functional transformations over the original dataset
Spark has been shown to be up to 100x more efficient than Hadoop MapReduce, while adding even more expressive APIs!
Spark vs Hadoop Performance
Spark vs Hadoop Popularity
According to Google Trends, Spark has surpassed Hadoop in popularity!
Spark - RDD
Spark extends the MapReduce model to better support two common classes of analytics applications
Iterative algorithms (e.g., machine learning, graph processing)
Interactive: efficiently analyze data sets interactively
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)
RDDs look just like immutable sequential or parallel Scala collections
An RDD is a big parallel collection distributed (in memory or on disk) across the cluster
Spark provides high-level APIs in Java, Scala, Python and R
RDD
An RDD can be created either from stable storage (e.g., local file system, HDFS) or through parallel transformation of another RDD (e.g., map, filter)
It is also possible to execute actions on an RDD
An action returns single values (not collections) as results (e.g., reduce, count, first)
An RDD can be cached for efficient (later) reuse
Programming with RDDs - Spark Context
Spark Context
Main entry point to Spark functionality
Available in shell as variable sc
scala> val rdd = sc.textFile("input.txt")
Standalone application
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
Create RDDs
An RDD can be created either from stable storage (e.g., local file system, HDFS):
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val logData = sc.textFile("hdfs://hadoop-master/a.txt", 32) // 32 partitions
Or, through parallel transformation of another RDD
// remove lines containing the word error
val logDataFilter = logData.filter(x => !x.contains("error"))

// count the number of words per line
val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))

// combined
val countData = logData.filter(x => !x.contains("error"))
                       .map(x => x.split("\\s+").count(x => true))
Or, you can turn a Scala collection into an RDD
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
Actions on RDDs
val nums = sc.parallelize(List(1, 2, 3))

// retrieve RDD contents as a local collection
nums.collect()   // => Array(1, 2, 3)

// return the first K elements
nums.take(2)     // => Array(1, 2)

// count the number of elements
nums.count()     // => 3

// merge elements with an associative function
nums.reduce(_ + _)  // => 6 -- equivalent to nums.reduce((x, y) => x + y)

// write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

// loop over all elements
nums.foreach(println)
Lazy Operations and Caching
All transformations in Spark are lazy
They do not compute their results right away
They just remember the transformations applied to some base dataset
The transformations are only computed when an action requires a result to be returned to the driver program

Upon executing an action, a result will be computed and intermediate RDDs are stored in RAM (if possible)
Executing another action would repeat the reconstruction from the beginning
However, you can cache some RDDs!

val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val logDataFilter = logData.filter(x => !x.contains("error"))
logDataFilter.cache()

val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))
println(countData.count())

val countData2 = logDataFilter.map(x => x.split("\\s+").count(x => x != "mohamad"))
// without the cache() above, this second action would repeat from the beginning;
// here it restarts from the cached logDataFilter instead
println(countData2.count())
Pair RDDs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs

// Scala pair
val pair = (a, b)
pair._1 // => a
pair._2 // => b

// pets is a Pair RDD
val pets = sc.parallelize(Array(("cat", 1), ("dog", 1), ("cat", 2)))

Some transformations: reduceByKey, join, sortByKey, mapValues (a join/mapValues sketch follows below)

val data = sc.textFile("input.txt")
val pairData = data.map(v => {
  val split = v.split("\\s+")
  (split(0), split(1).toInt)
}).cache()

// automatically implements combiners
val rdd1 = pairData.reduceByKey((x, y) => x + y) // ("cat", 3), ("dog", 1)
val rdd2 = pairData.groupByKey()                 // ("cat", [1, 2]), ("dog", [1])
val rdd3 = pairData.sortByKey()                  // ("cat", 1), ("cat", 2), ("dog", 1)
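The slide lists join and mapValues without examples; a minimal sketch, reusing pets from above (the ages RDD is hypothetical, added for illustration):

val ages = sc.parallelize(Array(("cat", 9), ("dog", 7)))

// join: inner join on keys
// => ("cat", (1, 9)), ("cat", (2, 9)), ("dog", (1, 7))
val joined = pets.join(ages)

// mapValues: transform only the values; keys (and partitioning) are preserved
val doubled = pets.mapValues(v => v * 2)   // ("cat", 2), ("dog", 2), ("cat", 4)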
Example: Word Count
val lines = sc.textFile("input.txt")
val counts = lines.flatMap(line => line.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
Example: Simple Linear Regression
val rddData = sc.textFile("input")
var teta = Math.random()
val learningRate = 0.0001
var i = 0
val iterations = 100

val rddDataXY = rddData.map(item => {
  val itemSplit = item.split(" ")
  (itemSplit(0).toDouble, itemSplit(1).toDouble)
}).cache()

for (i <- 1 to iterations) {
  val rddInnerGradient = rddDataXY.map(item => 2 * (teta * item._1 - item._2) * item._1)
  val gradient = rddInnerGradient.reduce((v1, v2) => v1 + v2)
  teta = teta - learningRate * gradient.toDouble
}
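For reference (my annotation, not on the slide): the loop is gradient descent on a no-intercept linear model y ≈ θx. Each map computes one point's contribution to the derivative of the squared-error loss, the reduce sums them, and the driver takes a step of size η = learningRate:

L(\theta) = \sum_i (\theta x_i - y_i)^2, \qquad
\frac{dL}{d\theta} = \sum_i 2\,(\theta x_i - y_i)\,x_i, \qquad
\theta \leftarrow \theta - \eta\,\frac{dL}{d\theta}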
Fault Tolerance
One option to do fault tolerance is to replicate the data (into multiple nodes)
However, this may drastically affect the performance (disk and network I/O)
Spark uses a method called lineage (a sketch follows below)
Remember how an RDD was built from a given source
Automatically rebuild it on failure
Recompute only lost partitions on failure; that is, no cost if nothing fails!
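One way to see lineage from the shell is RDD.toDebugString, which prints the transformation chain Spark would replay to rebuild lost partitions; a minimal sketch reusing the deck's earlier example (the HDFS path is the slides' placeholder):

val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val filtered = logData.filter(x => !x.contains("error"))
val counts = filtered.map(x => x.split("\\s+").length)

// prints the lineage, e.g. MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD ...
println(counts.toDebugString)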
Spark Execution Engine - Stages and Tasks
sc.textFile("hdfs://master-node/input/data")
  .map(x => (x(0), x))
  .groupByKey()
  .mapValues(f => f.count(x => true))
[Figure: stage 1 maps the input lines Mohamad, Jad, Mary to (M, Mohamad), (J, Jad), (M, Mary); stage 2 groups them into M → (Mohamad, Mary), J → (Jad) and counts to (M, 2), (J, 1)]
[Figure: a DAG of RDD1-RDD7 connected by map, filter, and join operations, cut into stages 1-3]
DAGScheduler is the scheduling layer of Spark that implements stage-oriented scheduling
It transforms a logical execution plan into a physical execution plan (using stages)
Stages are submitted as sets of tasks
When the result generated is independent of any other data, we can pipeline! (see the sketch below)
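A hedged illustration of where stage boundaries fall (my example, consistent with the word count shown earlier): narrow transformations such as flatMap and map pipeline within one stage, while a wide transformation such as reduceByKey forces a shuffle and starts a new stage:

val counts = sc.textFile("input.txt")   // stage 1 begins at the data source
  .flatMap(_.split("\\s+"))             // narrow: pipelined within stage 1
  .map(word => (word, 1))               // narrow: still stage 1, no data movement
  .reduceByKey(_ + _)                   // wide: shuffle boundary => stage 2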
Spark Execution Flow
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (the driver)
SparkContext can connect to several types of cluster managers: Spark's own standalone cluster manager, Mesos, or YARN

> spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster \
    [options] <app jar> [app options]

Once connected, Spark acquires executors on nodes in the cluster
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors
Finally, SparkContext sends tasks to the executors to run
More about Spark
Modules on top of Spark
Spark also supports a rich set of higher-level tools (a Spark SQL sketch follows below):
GraphX for graph processing
MLlib for machine learning
Spark SQL for structured data processing
Spark Streaming for stream processing
Geo and Spatial Spark for geographical and spatial data
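As a taste of one module, a minimal Spark SQL sketch; this is not from the slides: people.json is a hypothetical input file, and the code assumes the Spark 2.x SparkSession API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")   // hypothetical input: {"name": ..., "age": ...}
people.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY age").show()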
Distributed File System
Distributed File Systems
Client/Server-based Distributed File Systems
The actual file service is offered/stored by a single machine
Network File System (NFS)
Andrew File System (AFS)
[Figure: three clients connected to a single file server with its disk]
Cluster-based Distributed File Systems
Divide files among tens, hundreds, thousands or tens of thousands of machines
Google File System (GFS - appeared in SOSP 2003)
Hadoop Distributed File System (HDFS)
[Figure: three clients connected to four servers, each with its own disk]
Hadoop Distributed File System (HDFS)
HDFS (open source) is inspired by GFS
[Figure: a NameNode holding metadata (FsImage, EditLog) and a row of DataNodes storing the blocks of a file]
HDFS Components in Cluster
HDFS Commands
HDFS provides a shell-like interface, and a list of commands is available to interact with it
# format the file system
hdfs namenode -format

# start namenode and datanode daemons
start-dfs.sh

# operations on HDFS
hdfs dfs <args>

# create a directory
hdfs dfs -mkdir /input

# list files
hdfs dfs -ls /

# transfer and store a data file from the local system to HDFS
hdfs dfs -put /home/jaber/file.txt /input

# view the data from HDFS using the cat command
hdfs dfs -cat /input/file.txt

# get the file from HDFS to the local file system
hdfs dfs -get /input/file.txt /home/jaber/Desktop
Hadoop - Yarn Cluster