An Introduction to Apache Spark
DESCRIPTION
This is an introduction to Apache Spark.
TRANSCRIPT
An Introduction to Apache Spark
By Amir Sedighi
Datis Pars Data Technology
Slides adopted from Databricks (Paco Nathan and Aaron Davidson)
@amirsedighi
http://hexican.com
History
● Developed in 2009 at UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest open-source big-data projects, with more than 400 contributors from 50+ organizations, such as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
What is Spark?
● A fast and general cluster computing system that interoperates with Hadoop datasets.
What are Spark improvements?
● Improves efficiency through:
– In-memory computing primitives.
– General computation graphs.
● Improves usability through:
– Rich APIs in Scala, Java, and Python
– Interactive shells (Scala/Python)
MapReduce is a DAG in General
[diagram: a multi-stage MapReduce computation drawn as a directed acyclic graph]
MapReduce
● MapReduce is great for single-pass batch jobs, but many use cases need to run MapReduce in a multi-pass manner, and each pass must write its intermediate results back to disk.
What Improvements Did Spark Make over MapReduce?
● Spark improves the performance of the MapReduce model, supporting multi-pass analytics, interactive queries, and real-time distributed computation on top of Hadoop.
Note:
– Spark is often described as a successor to Hadoop MapReduce.
How Did Spark Do It?
Smarter Data Sharing!
Data Sharing in Hadoop MapReduce
[diagram: each pass reads its input from HDFS and writes its output back to HDFS]
Data Sharing in Spark
[diagram: intermediate results are kept distributed in memory between passes]
10-100x faster than network and disk!
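The in-memory sharing above is what makes multi-pass jobs cheap. A minimal sketch, assuming an existing SparkContext `sc`; the path and the "ERROR"/"timeout" filters are hypothetical:

```scala
// Hypothetical multi-pass job over one dataset (assumes a SparkContext `sc`).
val logs   = sc.textFile("hdfs://...")           // base RDD (path is illustrative)
val errors = logs.filter(_.contains("ERROR"))    // transformation: lazy, nothing runs yet
errors.cache()                                   // keep the result in memory after the first pass

// Both passes reuse the cached in-memory RDD instead of
// re-reading from HDFS, as chained MapReduce jobs would.
val total    = errors.count()
val timeouts = errors.filter(_.contains("timeout")).count()
```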
Spark Programming Model
● At a high level, every Spark application consists of a driver program that runs the user’s main function and launches parallel operations on a cluster.
● Spark encourages you to write programs in terms of transformations on distributed datasets.
Spark Programming Model
● The main abstraction Spark provides is a resilient distributed dataset (RDD).
– A collection of elements partitioned across the cluster (in memory or on disk)
– Can be operated on in parallel (map, filter, …)
– Automatically rebuilt on failure
Spark Programming Model
● RDDs Operations
– Transformations: Create a new dataset from an existing one.
● Example: map()
– Actions: Return a value to the driver program after running a computation on the dataset.
● Example: reduce()
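The transformation/action distinction can be sketched in a few lines, assuming an existing SparkContext `sc`:

```scala
// Sketch of transformations vs. actions (assumes a SparkContext `sc`).
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: lazily describes a new RDD; no computation runs yet.
val squares = nums.map(x => x * x)

// Action: triggers the computation and returns a value to the driver.
val sum = squares.reduce(_ + _)   // 1 + 4 + 9 + 16 = 30
```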
Spark Programming Model
● Another abstraction is Shared Variables
– Broadcast Variables, which can be used to cache a value in memory on all nodes.
– Accumulators, which are variables that workers can only “add” to, useful for counters and sums.
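Both kinds of shared variables can be shown in one short sketch, assuming an existing SparkContext `sc`; the lookup table and counter are illustrative:

```scala
// Sketch of shared variables (assumes a SparkContext `sc`).
val lookup     = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only value cached on every node
val badRecords = sc.accumulator(0)                     // workers may only add to it

val data = sc.parallelize(Seq("a", "b", "c"))
data.foreach { key =>
  if (!lookup.value.contains(key)) badRecords += 1     // "c" is missing from the table
}

// Only the driver reads the accumulator's final value.
println(badRecords.value)
```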
Ease of Use
● Spark offers over 80 high-level operators that make it easy to build parallel apps.
● And you can use it interactively from the Scala and Python shells.
A General Stack
[diagram: Spark Core with Spark SQL, Spark Streaming, MLlib, and GraphX layered on top]
Apache Spark Core
● Spark Core is the general engine for the Spark platform.
– In-memory computing capabilities deliver speed
– A general execution model supports a wide variety of use cases
– Ease of development: native APIs in Java, Scala, and Python (plus SQL, Clojure, and R)
Spark SQL
● Spark SQL is Spark’s module for working with structured data, letting you mix SQL queries with regular Spark programs.
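A minimal sketch of mixing SQL with Spark code. Note this uses the SparkSession/DataFrame API from later Spark releases (assumed here as `spark`); the table and column names are illustrative:

```scala
// Sketch of Spark SQL (assumes a SparkSession `spark`).
import spark.implicits._

// Build a small DataFrame and register it as a SQL-visible view.
val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Query it with plain SQL from inside the program.
val adults = spark.sql("SELECT name FROM people WHERE age >= 21")
adults.show()
```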
Spark Streaming
Spark Streaming
● Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
– DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data
val hashTags = tweets.flatMap(status => getTags(status))
– Transformation: modifies the data in one DStream to create another (new) DStream
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
– Sliding window operation: a window length of one minute, with a sliding interval of one second
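The snippets above assume an already-created StreamingContext `ssc`. A self-contained sketch of that surrounding setup, using a plain socket source instead of Twitter (names and the port are illustrative):

```scala
// Sketch of a complete streaming app (socket-based word count).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second batches

// Each batch of input lines becomes an RDD inside the DStream.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // run until stopped
```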
MLlib
● MLlib is Spark’s scalable machine learning library.
● MLlib works on any Hadoop data source, such as HDFS, HBase, or local files.
MLlib
● Algorithms:
– linear SVM and logistic regression
– classification and regression tree
– k-means clustering
– recommendation via alternating least squares
– singular value decomposition
– linear regression with L1- and L2-regularization
– multinomial naive Bayes
– basic statistics
– feature transformations
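One of the listed algorithms, k-means, in a short sketch of MLlib's RDD-based API; assumes an existing SparkContext `sc`, and the points are made up:

```scala
// Sketch of k-means clustering with MLlib (assumes a SparkContext `sc`).
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters of 2-D points (illustrative data).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train with k = 2 clusters and at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)
```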
GraphX
GraphX
● GraphX is Spark's API for graphs and graph-parallel computation.
● Works with both graphs and collections.
GraphX
● Performance comparable to the fastest specialized graph-processing systems.
GraphX
● Algorithms
– PageRank
– Connected components
– Label propagation
– SVD++
– Strongly connected components
– Triangle count
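PageRank, the first algorithm listed, in a short sketch of the GraphX API; assumes an existing SparkContext `sc`, with a made-up three-vertex cycle as the graph:

```scala
// Sketch of PageRank with GraphX (assumes a SparkContext `sc`).
import org.apache.spark.graphx.{Edge, Graph}

// A tiny illustrative graph: 1 -> 2 -> 3 -> 1.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)

// Run PageRank until scores change by less than the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)
```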
Spark Runs Everywhere
● Spark runs on Hadoop, Mesos, standalone, or in the cloud.
● Spark can access diverse data sources, including HDFS, Cassandra, HBase, and S3.
Resources
● http://spark.apache.org
● Intro to Apache Spark by Paco Nathan
● Building a Unified Data Pipeline in Spark by Aaron Davidson
● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
● Deep Dive with Spark Streaming by Tathagata Das (Spark Meetup)
● ZYMR
Thank You!
Questions?