An Introduction to Apache Spark
DESCRIPTION
This is an introduction to Apache Spark.
TRANSCRIPT
An Introduction to Apache Spark
By Amir Sedighi
Datis Pars Data Technology
Slides adopted from Databricks (Paco Nathan and Aaron Davidson)
@amirsedighi
http://hexican.com
History
● Developed in 2009 at UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest open-source big-data projects, with more than 400 contributors from 50+ organizations, such as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
What is Spark?
● A fast and general cluster computing system that interoperates with Hadoop datasets.
What are Spark improvements?
● Improves efficiency through:
– In-memory computing primitives.
– General computation graphs.
● Improves usability through:
– Rich APIs in Scala, Java, and Python
– Interactive shells (Scala/Python)
MapReduce is a DAG in General
[diagram: a multi-stage MapReduce computation drawn as a directed acyclic graph]
MapReduce
● MapReduce is great for single-pass batch jobs, but many use cases need to run MapReduce in a multi-pass manner, and each pass must write its intermediate results back to disk.
What Improvements Did Spark Make over MapReduce?
● Spark improves the performance of the MapReduce model, supporting multi-pass analytics, interactive queries, and real-time distributed computation on top of Hadoop.
Note:
– Spark is often described as a successor to Hadoop MapReduce.
How Did Spark Do It?
Smarter Data Sharing!
Data Sharing in Hadoop MapReduce
[diagram: each pass reads its input from HDFS and writes its output back to HDFS]
Data Sharing in Spark
[diagram: intermediate results are kept distributed in memory between passes]
10-100x faster than network and disk!
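The in-memory sharing above is what makes multi-pass jobs cheap. A minimal sketch, assuming an existing SparkContext `sc`; the path and the "ERROR"/"timeout" filters are hypothetical:

```scala
// Hypothetical multi-pass job over one dataset (assumes a SparkContext `sc`).
val logs   = sc.textFile("hdfs://...")           // base RDD (path is illustrative)
val errors = logs.filter(_.contains("ERROR"))    // transformation: lazy, nothing runs yet
errors.cache()                                   // keep the result in memory after the first pass

// Both passes reuse the cached in-memory RDD instead of
// re-reading from HDFS, as chained MapReduce jobs would.
val total    = errors.count()
val timeouts = errors.filter(_.contains("timeout")).count()
```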
Spark Programming Model
● At a high level, every Spark application consists of a driver program that runs the user’s main function and launches parallel operations on a cluster.
● Spark encourages you to write programs in terms of transformations on distributed datasets.
Spark Programming Model
● The main abstraction Spark provides is a resilient distributed dataset (RDD).
– A collection of elements partitioned across the cluster (in memory or on disk)
– Can be operated on in parallel (map, filter, …)
– Automatically rebuilt on failure
Spark Programming Model
● RDDs Operations
– Transformations: Create a new dataset from an existing one.
● Example: map()
– Actions: Return a value to the driver program after running a computation on the dataset.
● Example: reduce()
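The transformation/action distinction can be sketched in a few lines, assuming an existing SparkContext `sc`:

```scala
// Sketch of transformations vs. actions (assumes a SparkContext `sc`).
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: lazily describes a new RDD; no computation runs yet.
val squares = nums.map(x => x * x)

// Action: triggers the computation and returns a value to the driver.
val sum = squares.reduce(_ + _)   // 1 + 4 + 9 + 16 = 30
```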
Spark Programming Model
● Another abstraction is Shared Variables
– Broadcast Variables, which can be used to cache a value in memory on all nodes.
– Accumulators, which are variables that workers can only “add” to, useful for counters and sums.
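Both kinds of shared variables can be shown in one short sketch, assuming an existing SparkContext `sc`; the lookup table and counter are illustrative:

```scala
// Sketch of shared variables (assumes a SparkContext `sc`).
val lookup     = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only value cached on every node
val badRecords = sc.accumulator(0)                     // workers may only add to it

val data = sc.parallelize(Seq("a", "b", "c"))
data.foreach { key =>
  if (!lookup.value.contains(key)) badRecords += 1     // "c" is missing from the table
}

// Only the driver reads the accumulator's final value.
println(badRecords.value)
```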
Ease of Use
● Spark offers over 80 high-level operators that make it easy to build parallel apps.
● And you can use it interactively from the Scala and Python shells.
A General Stack
[diagram: Spark Core with Spark SQL, Spark Streaming, MLlib, and GraphX layered on top]
Apache Spark Core
● Spark Core is the general engine for the Spark platform.
– In-memory computing capabilities deliver speed
– A general execution model supports a wide variety of use cases
– Ease of development: native APIs in Java, Scala, and Python (plus SQL, Clojure, and R)
Spark SQL
● Spark SQL is Spark’s module for working with structured data, letting you mix SQL queries with regular Spark programs.
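A minimal sketch of mixing SQL with Spark code. Note this uses the SparkSession/DataFrame API from later Spark releases (assumed here as `spark`); the table and column names are illustrative:

```scala
// Sketch of Spark SQL (assumes a SparkSession `spark`).
import spark.implicits._

// Build a small DataFrame and register it as a SQL-visible view.
val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Query it with plain SQL from inside the program.
val adults = spark.sql("SELECT name FROM people WHERE age >= 21")
adults.show()
```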
Spark Streaming
Spark Streaming
● Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
– DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data
val hashTags = tweets.flatMap(status => getTags(status))
– Transformation: modifies the data in one DStream to create another (new) DStream
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
– Sliding window operation: a window length of one minute, with a sliding interval of one second
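The snippets above assume an already-created StreamingContext `ssc`. A self-contained sketch of that surrounding setup, using a plain socket source instead of Twitter (names and the port are illustrative):

```scala
// Sketch of a complete streaming app (socket-based word count).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second batches

// Each batch of input lines becomes an RDD inside the DStream.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // run until stopped
```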
MLlib
● MLlib is Spark’s scalable machine learning library.
● MLlib works on any Hadoop data source, such as HDFS, HBase, or local files.
MLlib
● Algorithms:
– linear SVM and logistic regression
– classification and regression tree
– k-means clustering
– recommendation via alternating least squares
– singular value decomposition
– linear regression with L1- and L2-regularization
– multinomial naive Bayes
– basic statistics
– feature transformations
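One of the listed algorithms, k-means, in a short sketch of MLlib's RDD-based API; assumes an existing SparkContext `sc`, and the points are made up:

```scala
// Sketch of k-means clustering with MLlib (assumes a SparkContext `sc`).
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters of 2-D points (illustrative data).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train with k = 2 clusters and at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)
```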
GraphX
GraphX
● GraphX is Spark's API for graphs and graph-parallel computation.
● Works with both graphs and collections.
GraphX
● Performance comparable to the fastest specialized graph-processing systems.
GraphX
● Algorithms
– PageRank
– Connected components
– Label propagation
– SVD++
– Strongly connected components
– Triangle count
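PageRank, the first algorithm listed, in a short sketch of the GraphX API; assumes an existing SparkContext `sc`, with a made-up three-vertex cycle as the graph:

```scala
// Sketch of PageRank with GraphX (assumes a SparkContext `sc`).
import org.apache.spark.graphx.{Edge, Graph}

// A tiny illustrative graph: 1 -> 2 -> 3 -> 1.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)

// Run PageRank until scores change by less than the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)
```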
Spark Runs Everywhere
● Spark runs on Hadoop, Mesos, standalone, or in the cloud.
● Spark can access diverse data sources, including HDFS, Cassandra, HBase, and S3.
Resources
● http://spark.apache.org
● Intro to Apache Spark by Paco Nathan
● Building a Unified Data Pipeline in Spark by Aaron Davidson
● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
● Deep Dive with Spark Streaming by Tathagata Das (Spark Meetup)
● ZYMR
Thank You!
Questions?