An Introduction to Apache Spark

Post on 07-Jul-2015

Category: Data & Analytics

DESCRIPTION

This is an introduction to Apache Spark.

TRANSCRIPT

1

An Introduction to Apache Spark

By Amir Sedighi

Datis Pars Data Technology

Slides adapted from Databricks (Paco Nathan and Aaron Davidson)

@amirsedighi
http://hexican.com

2

History

● Developed in 2009 at UC Berkeley AMPLab.

● Open sourced in 2010.

● Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations, including:

– Databricks, Yahoo!, Intel, Cloudera, IBM, …

3

What is Spark?

● A fast and general cluster-computing system, interoperable with Hadoop datasets.

4

What are Spark's improvements?

● Improves efficiency through:

– In-memory computing primitives.

– General computation graphs.

● Improves usability through:

– Rich APIs in Scala, Java, Python

– Interactive shell (Scala/Python)

5

In General, MapReduce Is a DAG

6

MapReduce

● MapReduce is great for single-pass batch jobs, but many use cases require chaining MapReduce in a multi-pass manner...

7

What improvements did Spark make to MapReduce?

● Spark improves MapReduce's performance, supporting multi-pass analytics, interactive queries, and real-time distributed computation on top of Hadoop.

Note:

– Spark is a Hadoop successor.

8

How Did Spark Do It?

Smarter data sharing!

9

Data Sharing in Hadoop MapReduce


11

Data Sharing in Spark

In-memory data sharing is 10-100x faster than network and disk!

12

Spark Programming Model

● At a high level, every Spark application consists of a driver program that runs the user’s main function.

● It encourages you to write programs in terms of transformations on distributed datasets, as in the minimal driver sketch below.
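A minimal sketch of such a driver program (the app name and logic here are illustrative, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // the driver: creates the SparkContext, defines distributed datasets,
    // and coordinates transformations and actions on the cluster
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)   // a distributed dataset
    val even = data.filter(_ % 2 == 0)     // a transformation on it
    println(even.count())                  // an action returning a value to the driver

    sc.stop()
  }
}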

13

Spark Programming Model

● The main abstraction Spark provides is a resilient distributed dataset (RDD).

– A collection of elements partitioned across the cluster (in memory or on disk)

– Can be operated on in parallel (map, filter, ...)

– Automatically rebuilt on failure
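A small sketch of these properties (assuming the shell's SparkContext sc; the input path is a placeholder):

val lines = sc.textFile("data/input.txt")        // an RDD of strings, partitioned across the cluster
val errors = lines.filter(_.contains("ERROR"))   // operated on in parallel
errors.cache()                                   // keep partitions in memory for reuse
println(errors.count())                          // lost partitions are rebuilt from lineage on failure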

14

Spark Programming Model

● RDD operations

– Transformations: Create a new dataset from an existing one.

● Example: map()

– Actions: Return a value to the driver program after running a computation on the dataset.

● Example: reduce()
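A minimal sketch of the two kinds of operations (again assuming the shell's sc):

val nums = sc.parallelize(1 to 5)     // distribute a local collection as an RDD
val squares = nums.map(n => n * n)    // transformation: lazy, yields a new RDD
val sum = squares.reduce(_ + _)       // action: runs the computation, returns 55 to the driver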


16

Spark Programming Model

● Spark's other main abstraction is shared variables:

– Broadcast variables, which can be used to cache a value in memory on all nodes.

– Accumulators, which workers can only add to, e.g. for counters and sums.
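A short sketch of both shared variables, using the Spark 1.x API (the values here are illustrative):

// broadcast variable: a read-only value cached once per node
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val codes = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value(k))

// accumulator: workers only add to it; the driver reads the result
val tens = sc.accumulator(0)
sc.parallelize(1 to 100).foreach(i => if (i % 10 == 0) tens += 1)
println(tens.value)                   // 10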


20

Ease of Use

● Spark offers over 80 high-level operators that make it easy to build parallel apps.

● Interactive Scala and Python shells, as in the word-count sketch below.
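For example, a word count typed straight into spark-shell (sc is predefined there; README.md is a placeholder input):

val lines = sc.textFile("README.md")
val words = lines.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.take(5).foreach(println)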

21

A General Stack


23

Apache Spark Core

● Spark Core is the general engine for the Spark platform.

– In-memory computing capabilities deliver speed

– A general execution model supports a wide variety of use cases

– Ease of development – native APIs in Java, Scala, Python (+ SQL, Clojure, R)

24

Spark SQL

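Spark SQL lets you run SQL queries over structured data from inside a Spark program. A minimal sketch against the Spark 1.x SQLContext API (the JSON file and its fields are hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("data/people.json")  // infers the schema from the JSON
people.registerTempTable("people")                     // expose it to SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)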


30

Spark Streaming

● Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.


35

Spark Streaming

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data

36

Spark Streaming

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap(status => getTags(status))


transformation: modify data in one DStream to create a new DStream

37

Spark Streaming

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap(status => getTags(status))

val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

sliding window operation: Minutes(1) is the window length, Seconds(1) the sliding interval

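Putting the pieces together, a runnable sketch of the same windowed count, with a local socket source standing in for the Twitter stream (host, port, and app name are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object TagCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TagCounts")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batches
    val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the tweet stream
    val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
    val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
    tagCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}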


40

MLlib

● MLlib is Spark's scalable machine learning library.

● MLlib works with any Hadoop data source, such as HDFS, HBase, or local files.

41

MLlib

● Algorithms:

– linear SVM and logistic regression

– classification and regression tree

– k-means clustering

– recommendation via alternating least squares

– singular value decomposition

– linear regression with L1- and L2-regularization

– multinomial naive Bayes

– basic statistics

– feature transformations
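As one example from this list, a k-means sketch with the Spark 1.x MLlib API (the input file of space-separated points is hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data/points.txt")
val points = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()
val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
println(model.clusterCenters.mkString(", "))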


43

GraphX

● GraphX is Spark's API for graphs and graph-parallel computation.

● Works with both graphs and collections.

44

GraphX

● Comparable performance to the fastest specialized graph processing systems

45

GraphX

● Algorithms

– PageRank

– Connected components

– Label propagation

– SVD++

– Strongly connected components

– Triangle count
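As one example, PageRank over an edge-list file (the file path is hypothetical):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "data/edges.txt")  // one "srcId dstId" pair per line
val ranks = graph.pageRank(0.0001).vertices                 // run to a convergence tolerance
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)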

46

Spark Runs Everywhere

● Spark runs on Hadoop, Mesos, standalone, or in the cloud.

● Spark accesses diverse data sources including HDFS, Cassandra, HBase, S3.

47

Resources

● http://spark.apache.org

● Intro to Apache Spark by Paco Nathan

● Building a Unified Data Pipeline in Spark by Aaron Davidson

● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark

● Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup

● ZYMR

48

Thank You!

Questions?
