An Introduction to Big Data Analysis using Spark
Mohamad Jaber
American University of Beirut
Faculty of Arts & Sciences - Department of Computer Science
May 17, 2017
1 Big Data
2 Apache Spark
3 Distributed File System
Big Data
We live in the data age
Big Data - Some Numbers (2013-14)
Storage and Processing
Facebook hosts more than 240 billion photos, growing at 7 petabytes per month
New York Stock Exchange generates about 4-5 terabytes of data per day
Google processes 20 petabytes of information per day
...

Estimation
The size of the digital universe was 4.4 zettabytes (1 ZB = 10^21 bytes) in 2013
Projected for 2020: 44 zettabytes
Simple Java Program to Analyze Data
// requires java.io.{BufferedReader, FileReader, IOException}
public static long analyze(String fileName, Analyzer analyzer) throws IOException {
    // Read input
    BufferedReader reader = new BufferedReader(new FileReader(fileName));
    long score = 0;
    String line = null;
    // Processing
    while ((line = reader.readLine()) != null) {
        score += analyzer.analyze(line);
    }
    return score;
}
Throughput: 1 GB per hour
10 GB data set ⇒ 10 hours
How can we Improve the Performance?
Faster CPU – Scale up (vertically)
More/Faster memory – Scale up (vertically)
Increase the number of cores
Increase the number of threads
Increase the number of threads and cores
Shared Memory (pthreads)
Message Passing (MPI)
Multi-threaded
Throughput: 10 GB per hour
What about 1 PB? What about faults?
What do we Need?
We need a framework that abstracts away / hides:
Scale Out (horizontally)
Parallelization
Data distribution
Fault-tolerance
Load Balancing
Apache Spark
Why Spark?
Normally, data science and analytics is done "in the small", in R/Python/MATLAB, etc.
If your dataset ever gets too large to fit into memory, these languages/frameworks won't allow you to scale
You have to re-implement everything in some other language or system
Moreover, there is a massive shift in industry to data-oriented decision making too! ⇒ data science in the large
According to the popular IT job portal Dice.com, a keyword search for the term "Spark Developer" showed 34,617 listings as of 16 December 2015
Why Spark?
Spark is
More expressive. APIs modeled after Scala collections. Look like functional lists! Richer, more composable operations possible than in MapReduce
Efficient. Not only performant in terms of running time... but also in terms of developer productivity! Interactive!
Good for data science. Not just because of performance, but because it enables iteration, which is required by most algorithms in a data scientist's toolbox (e.g., machine learning, graph analytics)
Scala Quick Tour
Scala is a high-level language for the Java VM (object-oriented + functional programming) – supports an interactive shell
// declare variables
var x: Int = 7
var x = 7          // type inferred
val y = "hi"       // read-only

// functions
def square(x: Int): Int = x * x
def square(x: Int): Int = {  // equivalent block form
  x * x
}
def announce(text: String) {
  println(text)
}

// generic types
var arr = new Array[Int](8)
var lst = List(1, 2, 3)
arr(5) = 7

// processing collections
val list = List(1, 2, 3)
list.foreach(x => println(x))
list.foreach(println)              // shortcut

val incMap = list.map(x => x + 2)
val incMap = list.map(_ + 2)       // same with placeholder notation

val f = list.filter(x => x % 2 == 1)
val f = list.filter(_ % 2 == 1)

val n = list.reduce((x, y) => x + y)
val n = list.reduce(_ + _)
// List is immutable
Visualizing Shared Memory Data Parallelism
val res = jar.map(jellyBean => doSomething(jellyBean))

Shared Memory Data Parallelism
Split the data
Workers/threads independently operate on the shared data in parallel
Combine when done (if necessary)
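A runnable sketch of this idea using Scala parallel collections; jar and doSomething are the slide's placeholder names, defined here only for illustration:

// a stand-in "jar" of jelly beans and some hypothetical per-element work
val jar = (1 to 1000000).toVector
def doSomething(jellyBean: Int): Int = jellyBean * 2

// .par splits the collection across a thread pool on one machine:
// each worker maps its own chunk, and the results are combined into res
val res = jar.par.map(jellyBean => doSomething(jellyBean))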
Visualizing Distributed Data Parallelism
val res = jar.map(jellyBean => doSomething(jellyBean))

Distributed Data Parallelism
Split the data over several nodes (machines)
Workers/threads independently operate on the data in parallel
Combine when done (if necessary)
New concern: now we need to worry about network latency (when combining)!
Apache Spark
Apache Spark is a framework for distributed data processing!
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)
Distributed Data Parallel: High-Level
Given some large dataset that cannot fit into memory on a single node
Chunk up (partition) the data
Distribute it over a cluster of machines
From there think of your distributed data like a single collection!
Example (transform the text of all wiki articles to lowercase)
val wiki: RDD[WikiArticle] = ...
val lowerWiki = wiki.map(article => article.text.toLowerCase)
Distribution
Distribution introduces important concerns beyond parallelism in the shared memory case (single node/machine)
Partial failure: crash failure of a subset of machines
Latency: certain operations (e.g., combining) have a much higher latency than other operations due to network communication
Important Latency Numbers
Main memory reference                       100 ns
Send 2K bytes over 1 Gbps network        20,000 ns
SSD random read                         150,000 ns
Read 1 MB sequentially from memory      250,000 ns
Read 1 MB sequentially from SSD       1,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet US → Europe → US        150,000,000 ns
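A back-of-envelope consequence of these numbers (my arithmetic, not from the slides): reading the same terabyte sequentially takes minutes from memory but hours from disk, which is why avoiding disk I/O matters so much:

// seconds per MB, converted from the nanosecond figures above
val memSecPerMB  = 250000e-9      // read 1 MB sequentially from memory
val diskSecPerMB = 20000000e-9    // read 1 MB sequentially from disk

val mbPerTB = 1000000L            // 1 TB = 10^6 MB (decimal units)
println(f"1 TB from memory: ${memSecPerMB * mbPerTB / 60}%.1f minutes")   // ~4.2 minutes
println(f"1 TB from disk:   ${diskSecPerMB * mbPerTB / 3600}%.1f hours")  // ~5.6 hours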
Big Data Processing and Latency
Network communication and disk operations can be very expensive!
How do these latency numbers relate to big data processing?
To answer this question, let us discuss Spark's predecessor, Hadoop
Hadoop is a widely-used large-scale batch data processing framework
It is an open source implementation of Google's MapReduce (2004)
Groundbreaking because of: (1) simplicity (map and reduce); and (2) fault tolerance
Fault tolerance is what made it possible for Hadoop MapReduce to scale up to 1000 nodes (recover from node failure)
Hadoop MapReduce
MapReduce works by breaking the processing into two phases
Each phase has key-value pairs as input and output
Map: Grab the relevant data from the source and output intermediate (key, value) pairs (local file system - disk)
Reduce: Aggregate the results for each unique key of the generated intermediate (key, value) pairs (HDFS)
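To make the two-phase contract concrete, a minimal word-count sketch of the map and reduce functions; this is a hedged illustration of the model in Scala, not Hadoop's actual Java API:

// Map: one input record => intermediate (key, value) pairs
def map(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

// Reduce: one unique key plus all its intermediate values => final pair
def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)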
Why Spark?
Fault-tolerance in Hadoop MapReduce comes at a cost
Between each map and reduce step, in order to recover from potential failures, Hadoop MapReduce shuffles its data and writes intermediate data to disk

Cons of Hadoop MapReduce
Not efficient to use the same data multiple times (iterative and interactive workloads)
Intermediate results written into stable storage
Output of reducers written on HDFS
Disk I/O, network I/O, [de]serialization
Why Spark?
Retains fault-tolerance
Different strategy for handling latency
Achieves this using ideas from functional programming
Keep all data immutable and in-memory
All operations on data are just functional transformations
Fault tolerance is achieved by replaying functional transformations over the original dataset
Spark has been shown to be up to 100x more efficient than Hadoop MapReduce, while adding even more expressive APIs!
Spark vs Hadoop Performance
Spark vs Hadoop Popularity
According to Google Trends, Spark has surpassed Hadoop in popularity!
Spark - RDD
Spark extends the MapReduce model to better support two common classes of analytics applications
Iterative algorithms (e.g., machine learning, graph processing)
Interactive: efficiently analyze data sets interactively
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs)
RDDs look just like immutable sequential or parallel Scala collections
An RDD is a big parallel collection distributed (in memory or on disk) across the cluster
Spark provides high-level APIs in Java, Scala, Python and R
RDD
An RDD can be created either from stable storage (e.g., local file system, HDFS) or through parallel transformation of another RDD (e.g., map, filter)
It is also possible to execute actions on an RDD
An action returns single values (not collections) as results (e.g., reduce, count, first)
An RDD can be cached for efficient (later) reuse
Programming with RDDs - Spark Context
Spark Context
Main entry point to Spark functionality
Available in shell as variable sc
scala> val rdd = sc.textFile("input.txt")
Standalone application
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
Create RDDs
An RDD can be created either from stable storage (e.g., local file system, HDFS):
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val logData = sc.textFile("hdfs://hadoop-master/a.txt", 32) // 32 partitions
Or, through parallel transformation of another RDD
// remove lines containing the word error
val logDataFilter = logData.filter(x => !x.contains("error"))

// count the number of words per line
val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))

// combined
val countData = logData.filter(x => !x.contains("error"))
                       .map(x => x.split("\\s+").count(x => true))
Or, you can turn a Scala collection into an RDD
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
Actions on RDDs
val nums = sc.parallelize(List(1, 2, 3))

// retrieve RDD contents as a local collection
nums.collect()   // => Array(1, 2, 3)

// return the first K elements
nums.take(2)     // => Array(1, 2)

// count the number of elements
nums.count()     // => 3

// merge elements with an associative function
nums.reduce(_ + _)  // => 6 -- equivalent to nums.reduce((x, y) => x + y)

// write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

// loop over all elements
nums.foreach(println)
Lazy Operations and Caching
All transformations in Spark are lazy
They do not compute their results right away
They just remember the transformations applied to some base dataset
The transformations are only computed when an action requires a result to be returned to the driver program

Upon executing an action, a result will be computed and intermediate RDDs are stored in RAM (if possible)
Executing another action would repeat the reconstruction from the beginning
However, you can cache some RDDs!

val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val logDataFilter = logData.filter(x => !x.contains("error"))
logDataFilter.cache()

val countData = logDataFilter.map(x => x.split("\\s+").count(x => true))
println(countData.count())

val countData2 = logDataFilter.map(x => x.split("\\s+").count(x => x != "mohamad"))
// without the cache() above, this second action would repeat from the beginning;
// here it restarts from the cached logDataFilter instead
println(countData2.count())
Pair RDDs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs

// Scala pair
val pair = (a, b)
pair._1 // => a
pair._2 // => b

// pets is a Pair RDD
val pets = sc.parallelize(Array(("cat", 1), ("dog", 1), ("cat", 2)))

Some transformations: reduceByKey, join, sortByKey, mapValues (a join/mapValues sketch follows below)

val data = sc.textFile("input.txt")
val pairData = data.map(v => {
  val split = v.split("\\s+")
  (split(0), split(1).toInt)
}).cache()

// automatically implements combiners
val rdd1 = pairData.reduceByKey((x, y) => x + y) // ("cat", 3), ("dog", 1)
val rdd2 = pairData.groupByKey()                 // ("cat", [1, 2]), ("dog", [1])
val rdd3 = pairData.sortByKey()                  // ("cat", 1), ("cat", 2), ("dog", 1)
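The slide lists join and mapValues without examples; a minimal sketch, reusing pets from above (the ages RDD is hypothetical, added for illustration):

val ages = sc.parallelize(Array(("cat", 9), ("dog", 7)))

// join: inner join on keys
// => ("cat", (1, 9)), ("cat", (2, 9)), ("dog", (1, 7))
val joined = pets.join(ages)

// mapValues: transform only the values; keys (and partitioning) are preserved
val doubled = pets.mapValues(v => v * 2)   // ("cat", 2), ("dog", 2), ("cat", 4)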
Example: Word Count
val lines = sc.textFile("input.txt")
val counts = lines.flatMap(line => line.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
Example: Simple Linear Regression
val rddData = sc.textFile("input")
var teta = Math.random()
val learningRate = 0.0001
var i = 0
val iterations = 100

val rddDataXY = rddData.map(item => {
  val itemSplit = item.split(" ")
  (itemSplit(0).toDouble, itemSplit(1).toDouble)
}).cache()

for (i <- 1 to iterations) {
  val rddInnerGradient = rddDataXY.map(item => 2 * (teta * item._1 - item._2) * item._1)
  val gradient = rddInnerGradient.reduce((v1, v2) => v1 + v2)
  teta = teta - learningRate * gradient.toDouble
}
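For reference (my annotation, not on the slide): the loop is gradient descent on a no-intercept linear model y ≈ θx. Each map computes one point's contribution to the derivative of the squared-error loss, the reduce sums them, and the driver takes a step of size η = learningRate:

L(\theta) = \sum_i (\theta x_i - y_i)^2, \qquad
\frac{dL}{d\theta} = \sum_i 2\,(\theta x_i - y_i)\,x_i, \qquad
\theta \leftarrow \theta - \eta\,\frac{dL}{d\theta}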
Fault Tolerance
One option to do fault tolerance is to replicate the data (into multiple nodes)
However, this may drastically affect the performance (disk and network I/O)
Spark uses a method called lineage (a sketch follows below)
Remember how an RDD was built from a given source
Automatically rebuild it on failure
Recompute only lost partitions on failure; that is, no cost if nothing fails!
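One way to see lineage from the shell is RDD.toDebugString, which prints the transformation chain Spark would replay to rebuild lost partitions; a minimal sketch reusing the deck's earlier example (the HDFS path is the slides' placeholder):

val logData = sc.textFile("hdfs://hadoop-master/a.txt")
val filtered = logData.filter(x => !x.contains("error"))
val counts = filtered.map(x => x.split("\\s+").length)

// prints the lineage, e.g. MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD ...
println(counts.toDebugString)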
Spark Execution Engine - Stages and Tasks
sc.textFile("hdfs://master-node/input/data")
  .map(x => (x(0), x))
  .groupByKey()
  .mapValues(f => f.count(x => true))
[Figure: stage 1 maps the input lines Mohamad, Jad, Mary to (M, Mohamad), (J, Jad), (M, Mary); stage 2 groups them into M → (Mohamad, Mary), J → (Jad) and counts to (M, 2), (J, 1)]
[Figure: a DAG of RDD1-RDD7 connected by map, filter, and join operations, cut into stages 1-3]
DAGScheduler is the scheduling layer of Spark that implements stage-oriented scheduling
It transforms a logical execution plan into a physical execution plan (using stages)
Stages are submitted as sets of tasks
When the result generated is independent of any other data, we can pipeline! (see the sketch below)
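A hedged illustration of where stage boundaries fall (my example, consistent with the word count shown earlier): narrow transformations such as flatMap and map pipeline within one stage, while a wide transformation such as reduceByKey forces a shuffle and starts a new stage:

val counts = sc.textFile("input.txt")   // stage 1 begins at the data source
  .flatMap(_.split("\\s+"))             // narrow: pipelined within stage 1
  .map(word => (word, 1))               // narrow: still stage 1, no data movement
  .reduceByKey(_ + _)                   // wide: shuffle boundary => stage 2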
Spark Execution Flow
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (the driver)
SparkContext can connect to several types of cluster managers: Spark's own standalone cluster manager, Mesos, or YARN

> spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster \
    [options] <app jar> [app options]

Once connected, Spark acquires executors on nodes in the cluster
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors
Finally, SparkContext sends tasks to the executors to run
More about Spark
Modules on top of Spark
Spark also supports a rich set of higher-level tools (a Spark SQL sketch follows below):
GraphX for graph processing
MLlib for machine learning
Spark SQL for structured data processing
Spark Streaming for stream processing
Geo and Spatial Spark for geographical and spatial data
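As a taste of one module, a minimal Spark SQL sketch; this is not from the slides: people.json is a hypothetical input file, and the code assumes the Spark 2.x SparkSession API:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")   // hypothetical input: {"name": ..., "age": ...}
people.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY age").show()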
Distributed File System
Distributed File Systems
Client/Server-based Distributed File Systems
The actual file service is offered/stored by a single machine
Network File System (NFS)
Andrew File System (AFS)
[Figure: three clients connected to a single file server with its disk]
Cluster-based Distributed File Systems
Divide files among tens, hundreds, thousands or tens of thousands of machines
Google File System (GFS - appeared in SOSP 2003)
Hadoop Distributed File System (HDFS)
[Figure: three clients connected to four servers, each with its own disk]
Hadoop Distributed File System (HDFS)
HDFS (open source) is inspired by GFS
[Figure: a NameNode holding metadata (FsImage, EditLog) and a row of DataNodes storing the blocks of a file]
HDFS Components in Cluster
HDFS Commands
HDFS provides a shell-like interface, and a list of commands is available to interact with it
# format the file system
hdfs namenode -format

# start namenode and datanode daemons
start-dfs.sh

# operations on HDFS
hdfs dfs <args>

# create a directory
hdfs dfs -mkdir /input

# list files
hdfs dfs -ls /

# transfer and store a data file from the local system to HDFS
hdfs dfs -put /home/jaber/file.txt /input

# view the data from HDFS using the cat command
hdfs dfs -cat /input/file.txt

# get the file from HDFS to the local file system
hdfs dfs -get /input/file.txt /home/jaber/Desktop
Hadoop - Yarn Cluster