
An Introduction to Big Data Analysis using Spark

Mohamad Jaber

American University of Beirut
Faculty of Arts & Sciences - Department of Computer Science

May 17, 2017


1 Big Data

2 Apache Spark

3 Distributed File System


Big Data

We live in the data age


Big Data - Some Numbers (2013-14)

Storage and Processing

Facebook hosts more than 240 billion photos, growing at 7 petabytes per month

New York Stock Exchange generates about 4 − 5 terabytes of data per day

Google processes 20 petabytes of information per day

. . .

Estimation

The size of the digital universe was about 4.4 zettabytes (1 ZB = 10^21 bytes) in 2013

Projected for 2020: 44 zettabytes


Simple Java Program to Analyze Data

public static long analyze(String fileName, Analyzer analyzer) throws IOException {
    // Read input
    BufferedReader reader = new BufferedReader(new FileReader(fileName));
    long score = 0;
    String line = null;
    // Processing
    while ((line = reader.readLine()) != null) {
        score += analyzer.analyze(line);
    }
    return score;
}

Throughput 1GB per hour

10GB data set ⇒ 10 hours


How can we Improve the Performance?

Faster CPU – Scale up (vertically)

More/Faster memory – Scale up (vertically)

Increase the number of cores and/or threads

Multi-threaded: Shared Memory (pthreads) or Message Passing (MPI); a small shared-memory sketch follows below

Throughput: 10 GB per hour

But what about 1 PB? And what about faults?
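To make the shared-memory option concrete, here is a minimal sketch (not the author's code) of the same analysis spread over the cores of one machine with Scala parallel collections; the Analyzer trait and file layout are assumed for illustration:

import scala.io.Source

trait Analyzer { def analyze(line: String): Long }

def analyzePar(fileName: String, analyzer: Analyzer): Long = {
  // read all lines (still bounded by one machine's disk and memory)
  val lines = Source.fromFile(fileName).getLines().toVector
  // .par spreads the scoring across the available cores
  // (built in up to Scala 2.12; a separate module in 2.13+)
  lines.par.map(analyzer.analyze).sum
}

Even in this form, the data and the computation still live on a single machine, which is exactly the limit the next slide addresses.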


What do we Need?

We need a framework that abstracts away / hides:

Scale Out (horizontally)

Parallelization

Data distribution

Fault-tolerance

Load Balancing


Apache Spark


Why Spark?

Normally, data science and analytics is done "in the small", in R/Python/MATLAB, etc.

If your dataset ever gets too large to fit into memory, these languages/frameworks won't allow you to scale

You have to re-implement everything in some other language or system

Moreover, there is a massive shift in industry to data-oriented decision making too! ⇒ data science in the large

According to the popular IT job portal Dice.com, a keyword search for the term "Spark Developer" showed 34,617 listings as of 16th December, 2015


Why Spark?

Spark is

More expressive. APIs modeled after Scala collections. Look like functional lists! Richer, more composable operations possible than in MapReduce

Efficient. Not only performant in terms of running time... but also in terms of developer productivity! Interactive!

Good for data science. Not just because of performance, but because it enables iteration, which is required by most algorithms in a data scientist's toolbox (e.g., machine learning, graph analytics)


Scala Quick Tour

Scala is a high-level language for the Java VM (object oriented + functional programming) – supports an interactive shell

// declare variables
var x: Int = 7
var x = 7      // type inferred
val y = "hi"   // read-only

// Functions
def square(x: Int): Int = x * x
def square(x: Int): Int = {
  x * x
}
def announce(text: String) {
  println(text)
}

// Generic Types
var arr = new Array[Int](8)
var lst = List(1, 2, 3)
arr(5) = 7

// processing collections
val list = List(1, 2, 3)
list.foreach(x => println(x))
list.foreach(println)            // shortcut

val incMap = list.map(x => x + 2)
val incMap = list.map(_ + 2)     // same with placeholder notation

val f = list.filter(x => x % 2 == 1)
val f = list.filter(_ % 2 == 1)

val n = list.reduce((x, y) => x + y)
val n = list.reduce(_ + _)
// List is immutable


Visualizing Shared Memory Data Parallelism

val res = jar.map(jellyBean => doSomething(jellyBean))

Shared Memory Data Parallelism

Split the data

Workers/threads independently operate on the shared data in parallel

Combine when done (if necessary)


Visualizing Distributed Data Parallelism

val res = jar.map(jellyBean => doSomething(jellyBean))

Distributed Data Parallelism

Split the data over several nodes (machines)

Workers/threads independently operate on the data in parallel

Combine when done (if necessary)

New concern: now we need to worry about network latency (combining)!


Apache Spark

Apache Spark is a framework for distributed data processing!

Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)


Distributed Data Parallel: High-Level

Given some large dataset that cannot fit into memory on a single node


Chunk up (partition) the data

Distribute it over a cluster of machines

From there think of your distributed data like a single collection!

Example (transform the text of all wiki articles to lowercase)

val wiki: RDD[WikiArticle] = ...
val lowerWiki = wiki.map(article => article.text.toLowerCase)


Distribution

Distribution introduces important concerns beyond parallelism in the shared-memory case (single node/machine)

Partial failure: crash failure of a subset of machines

Latency: certain operations (combining) have a much higher latency than other operations due to network communication

Important Latency Numbers

Main memory reference                     100 ns
Send 2 KB over a 1 Gbps network        20,000 ns
SSD random read                       150,000 ns
Read 1 MB sequentially from memory    250,000 ns
Read 1 MB sequentially from SSD     1,000,000 ns
Read 1 MB sequentially from disk   20,000,000 ns
Send packet US → Europe → US      150,000,000 ns
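A rough worked example from this table: scanning 10 GB (about 10,240 MB) sequentially costs roughly 10,240 × 20 ms ≈ 205 s from disk, but only 10,240 × 0.25 ms ≈ 2.6 s from memory, and every US → Europe → US round trip adds 150 ms. This is why keeping working sets in memory and minimizing network shuffles matters so much for distributed processing.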


Big Data Processing and Latency

Network communication and disk operations can be very expensive!

How do these latency numbers relate to big data processing?

To answer this question, let us discuss Spark's predecessor, Hadoop

Hadoop is a widely used large-scale batch data processing framework
It is an open-source implementation of Google's MapReduce (2004)

Groundbreaking because of: (1) simplicity (map and reduce); and (2) fault tolerance
Fault tolerance is what made it possible for Hadoop MapReduce to scale up to 1000 nodes (recover from node failure)


Hadoop MapReduce

MapReduce works by breaking the processing into two phases

Each phase has key-value pairs as input and output

Map: Grab the relevant data from the source and output intermediate (key, value) pairs (written to the local file system - disk)

Reduce: Aggregate the results for each unique key of the generated intermediate (key, value) pairs (output written to HDFS)
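To make the two phases concrete, here is a tiny single-machine sketch of the model using plain Scala collections (purely illustrative; none of the Hadoop APIs are used):

// Map phase: emit one intermediate (key, value) pair per word
val lines = List("spark is fast", "hadoop is batch", "spark is expressive")
val intermediate = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

// Shuffle + Reduce phase: group the pairs by key, then aggregate each group
val counts = intermediate.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counts: Map(spark -> 2, is -> 3, fast -> 1, hadoop -> 1, batch -> 1, expressive -> 1)

In real Hadoop, the intermediate pairs are written to local disk and shuffled over the network between the two phases, which is exactly the cost discussed on the next slide.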


Why Spark?

Fault-tolerance in Hadoop MapReduce comes at a cost

Between each map and reduce step, in order to recover from potential failures, Hadoop MapReduce shuffles its data and writes intermediate data to disk

Cons of Hadoop MapReduce

Not efficient when the same data is used multiple times (iterative and interactive workloads)

Intermediate results are written to stable storage
Output of reducers is written to HDFS
Disk I/O, network I/O, [de]serialization


Why Spark?

Retains fault-tolerance

Different strategy for handling latency

Achieves this using ideas from functional programming

Keep all data immutable and in-memory
All operations on data are just functional transformations
Fault tolerance is achieved by replaying functional transformations over the original dataset

Spark has been shown to be up to 100x faster than Hadoop MapReduce (on in-memory, iterative workloads), while adding even more expressive APIs!


Spark vs Hadoop Performance


Spark vs Hadoop Popularity

According to Google Trends, Spark has surpassed Hadoop in popularity!


Spark - RDD

Spark extends the MapReduce model to better support two common classes of analytics applications

Iterative algorithms (e.g., machine learning, graph processing)
Interactive: efficiently analyze data sets interactively

Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)

RDDs look just like immutable sequential or parallel Scala collections
An RDD is a big parallel collection that is distributed (in memory or on disk) across the cluster

Spark provides high-level APIs in Java, Scala, Python and R


RDD

An RDD can be created either from stable storage (e.g., local file system, HDFS) or through a parallel transformation of another RDD (e.g., map, filter)

It is also possible to execute actions on an RDD

An action returns a single value (not a collection/RDD) as its result (e.g., reduce, count, first)

An RDD can be cached for efficient (later) reuse


Programming with RDDs - Spark Context

Spark Context

Main entry point to Spark functionality

Available in shell as variable sc

scala> val rdd = sc.textFile("input.txt")

Standalone application

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
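Put together, a minimal standalone application around these two lines might look as follows (a sketch; the input path and the line count are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("input.txt")          // placeholder input path
    println(s"number of lines: ${lines.count()}")

    sc.stop()                                     // release the context when done
  }
}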


Create RDDs

An RDD can be created either from stable storage (e.g., the local file system, HDFS):

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val logData = sc.textFile("hdfs://hadoop-master/a.txt", 32) // 32 partitions

Or, through parallel transformation of another RDD

// remove lines containing the word error
val logDataFilter = logData.filter(x => !x.contains("error"))

// count the number of words per line
val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))

// combined
val countData = logData.filter(x => !x.contains("error"))
                       .map(x => x.split("\\s+").count(x => true))

Or, you can turn a Scala collection into an RDD

val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))


Actions on RDDs

val nums = sc.parallelize(List(1, 2, 3))

// Retrieve RDD contents as a local collection
nums.collect() // => Array(1, 2, 3)

// Return first K elements
nums.take(2) // => Array(1, 2)

// Count number of elements
nums.count() // => 3

// Merge elements with an associative function
nums.reduce(_ + _) // => 6 -- equivalent to nums.reduce((x, y) => x + y)

// Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

// Loop over all elements
nums.foreach(println)


Lazy Operations and Caching

All transformations in Spark are lazy

They do not compute their results right away
They just remember the transformations applied to some base dataset
The transformations are only computed when an action requires a result to be returned to the driver program

Upon executing an action, a result is computed and intermediate RDDs are stored in RAM (if possible)

Executing another action would repeat the reconstruction from the beginning

However, you can cache some RDDs!

val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val logDataFilter = logData.filter(x => !x.contains("error"))
logDataFilter.cache()

val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))
println(countData.count())

val countData2 = logDataFilter.map(x => x.split("\\s+").count(x => x != "mohamad"))
println(countData2.count()) // recomputed from logDataFilter, which is cached; without the cache() it would repeat from the beginning
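As an aside, cache() is shorthand for persist() with the default in-memory storage level; instead of the cache() call above, a different storage level could be requested (a small illustrative sketch):

import org.apache.spark.storage.StorageLevel

// cache() is equivalent to persist(StorageLevel.MEMORY_ONLY);
// MEMORY_AND_DISK spills partitions that do not fit in RAM to disk instead of recomputing them
logDataFilter.persist(StorageLevel.MEMORY_AND_DISK)

// release the cached data when it is no longer needed
logDataFilter.unpersist()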


Pair RDDs

Spark's "distributed reduce" transformations operate on RDDs of key-value pairs

// Scala pair
val pair = (a, b)
pair._1 // => a
pair._2 // => b

// pets is a Pair RDD
val pets = sc.parallelize(Array(("cat", 1), ("dog", 1), ("cat", 2)))

Some transformations: reduceByKey, join, sortByKey, mapValues

val data = sc.textFile("input.txt")
val pairData = data.map(v => {
  val split = v.split("\\s+")
  (split(0), split(1).toInt)
}).cache()

// reduceByKey automatically implements combiners
val rdd1 = pairData.reduceByKey((x, y) => x + y) // ("cat", 3), ("dog", 1)
val rdd2 = pairData.groupByKey()                 // ("cat", [1, 2]), ("dog", [1])
val rdd3 = pairData.sortByKey()                  // ("cat", 1), ("cat", 2), ("dog", 1)
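The slide also lists join and mapValues; a small illustrative sketch using the pairData RDD above (the owners RDD is made up for the example):

// hypothetical second pair RDD, keyed the same way
val owners = sc.parallelize(Array(("cat", "alice"), ("dog", "bob")))

// join on the key: ("cat", (1, "alice")), ("cat", (2, "alice")), ("dog", (1, "bob"))
val joined = pairData.join(owners)

// mapValues transforms only the value and keeps the key (and the partitioning)
val doubled = pairData.mapValues(v => v * 2) // ("cat", 2), ("dog", 2), ("cat", 4)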


Example: Word Count

val lines = sc.textFile("input.txt")
val counts = lines.flatMap(line => line.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
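Since counts is just a (lazy) RDD, nothing runs until an action is called; for example (the output path is a placeholder):

counts.take(5).foreach(println)                                // peek at a few (word, count) pairs
counts.saveAsTextFile("hdfs://hadoop-master/output/wordcount") // placeholder output path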


Example: Simple Linear Regression

val rddData = sc.textFile("input")
var teta = Math.random()   // initial parameter (theta)
val learningRate = 0.0001
val iterations = 100

// parse each line "x y" into a (Double, Double) pair and cache it for reuse across iterations
val rddDataXY = rddData.map(item => {
  val itemSplit = item.split(" ")
  (itemSplit(0).toDouble, itemSplit(1).toDouble)
}).cache()

// batch gradient descent: each iteration computes the gradient with one map/reduce pass
for (i <- 1 to iterations) {
  val rddInnerGradient = rddDataXY.map(item => 2 * (teta * item._1 - item._2) * item._1)
  val gradient = rddInnerGradient.reduce((v1, v2) => v1 + v2)
  teta = teta - learningRate * gradient
}
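For reference, the map/reduce pair inside the loop computes the gradient of the squared-error loss for the one-parameter model y ≈ θx (teta in the code): the map emits the per-point derivative and the reduce sums them,

\frac{\partial}{\partial\theta} \sum_i (\theta x_i - y_i)^2 = \sum_i 2\,(\theta x_i - y_i)\, x_i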


Fault Tolerance

One option for fault tolerance is to replicate the data (onto multiple nodes)

However, this may drastically affect performance (disk and network I/O)

Spark instead uses a method called lineage

Remember how an RDD was built from a given source
Automatically rebuild it on failure
Recompute only the lost partitions on failures; that is, no cost if nothing fails!
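The lineage can actually be inspected: every RDD records the chain of transformations it was derived from, and toDebugString prints it (a small sketch reusing the word-count pipeline from earlier):

val counts = sc.textFile("input.txt")
               .flatMap(_.split("\\s+"))
               .map(word => (word, 1))
               .reduceByKey(_ + _)

// prints the chain of parent RDDs (the lineage) that Spark would replay
// to recompute lost partitions after a failure
println(counts.toDebugString)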


Spark Execution Engine - Stages and Tasks

sc.textFile("hdfs ://master -node/input/data").map(x => (x(0), x)).groupByKey ().mapValues(f => f.count(x => t r u e ))

The pipeline traced on a small example, split into two stages at the groupByKey shuffle:

Input lines:         Mohamad, Jad, Mary
After map (stage 1): (M, Mohamad), (J, Jad), (M, Mary)
After groupByKey:    (M, [Mohamad, Mary]), (J, [Jad])
After mapValues (stage 2): (M, 2), (J, 1)


[Figure: a DAG of RDDs (RDD1 through RDD7) connected by map, filter, and join transformations, partitioned into three stages]

The DAGScheduler is the scheduling layer of Spark that implements stage-oriented scheduling

It transforms a logical execution plan into a physical execution plan (using stages)

Stages are submitted as tasks

When the result generated is independent of any other data, we can pipeline!


Spark Execution Flow

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (the driver)

SparkContext can connect to several types of cluster managers: either Spark's own standalone cluster manager, Mesos, or YARN

> spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster \
    [options] <app jar> [app options]

Once connected, Spark acquires executors on nodes in the cluster

Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors

Finally, SparkContext sends tasks to the executors to run
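For instance, the standalone SimpleApp sketched earlier could be submitted to a YARN cluster along these lines (executor count, memory, and the jar name are placeholders):

> spark-submit --class SimpleApp --master yarn --deploy-mode cluster \
    --num-executors 4 --executor-memory 2G simple-app.jar

When submitting to a real cluster, the hard-coded setMaster("local[*]") in the code would normally be dropped so that the --master flag takes effect.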


More about Spark

Modules on top of Spark

Spark also supports a rich set of higher-level tools

GraphX for graph processing

MLlib for machine learning

Spark SQL for structured data processing

Spark Streaming

Geo and Spatial Spark: geographical and spatial data
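As one concrete example of these modules, a minimal Spark SQL sketch (Spark 2.x DataFrame API; the JSON file is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SQL example").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")   // placeholder input
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()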


Distributed File System


Distributed File Systems

Client/Server-based Distributed File Systems

The actual file service is offered/stored by a single machine

Network File System (NFS)

Andrew File System (AFS)

[Figure: several clients (Client 1, Client 2, Client 3) connected to a single file server with its own disk]

Cluster-based Distributed File Systems

Divide files among tens, hundreds, thousands or tens of thousands of machines

Google File System (GFS, appeared in SOSP 2003)

Hadoop Distributed File System (HDFS)

[Figure: several clients (Client 1, Client 2, Client 3) connected to a cluster of file servers, each with its own disk]


Hadoop Distributed File System (HDFS)

HDFS (open source) is inspired by GFS

[Figure: HDFS architecture: a NameNode holding the metadata (FsImage, EditLog) and several DataNodes storing the blocks of each file]


HDFS Components in Cluster


HDFS Commands

HDFS provides a shell-like interface, and a list of commands is available to interact with it

# format the file system
hdfs namenode -format

# start namenode and datanode daemons
start-dfs.sh

# operations on HDFS
hdfs dfs <args>

# create a directory
hdfs dfs -mkdir /input

# list files
hdfs dfs -ls /

# transfer and store a data file from the local system to HDFS
hdfs dfs -put /home/jaber/file.txt /input

# view the data from HDFS using the cat command
hdfs dfs -cat /input/file.txt

# get the file from HDFS to the local file system
hdfs dfs -get /input/file.txt /home/jaber/Desktop


Hadoop - Yarn Cluster
