spark core
TRANSCRIPT
Assumptions
• You want to learn Apache Spark, but need to know where to begin
• You need to know the fundamentals of Spark in order to progress in your learning of Spark
• You need to evaluate if Spark could be an appropriate fit for your use cases or career growth
One or more of the following
Friday, January 22, 16
In a nutshell, why spark?
• Engine for efficient large-scale processing. It’s faster than Hadoop MapReduce
• Spark can complement your existing Hadoop investments such as HDFS and Hive
• Rich ecosystem including support for SQL, Machine Learning, Steaming and multiple language APIs such as Scala, Python and Java
Friday, January 22, 16
Spark Essentials
• Resilient Distributed Datasets (RDD)
• Transformers
• Actions
• Spark Driver Programs and SparkContext
To begin, you need to know:
Friday, January 22, 16
Resilient Distributed Datasets (RDDs)
• RDDs are Spark’s primary abstraction for data interaction (lazy, in memory)
• RDDs are an immutable, distributed collection of elements separated into partitions
• There are multiple types of RDDs
• RDDs can be created from an external data sets such as Hadoop InputFormats, text files on a variety of file systems or existing RDDs via a Spark Transformations
Friday, January 22, 16
Transformations
• RDD functions which return pointers to new RDDs (remember: lazy)
• map, flatMap, filter, etc.
Friday, January 22, 16
Actions
• RDD functions which return values to the driver
• reduce, collect, count, etc.
Friday, January 22, 16
Spark RDDs, Transformations, Actions Diagram
Load from External SourceExample: textFile
Transformations Actions
RDDs
Output Value(s)Example: count, collect5, ['a','b', 'c']
Friday, January 22, 16
Spark Driver Programs and Context
• Spark driver is a program that declares transformations and actions on RDDs of data
• A driver submits the serialized RDD graph to the master where the master creates tasks. These tasks are delegated to the workers for execution.
• Workers are where the tasks are actually executed.
Friday, January 22, 16
Driver Program and SparkContext
Image borrowed from http://spark.apache.org/docs/latest/cluster-overview.html
Friday, January 22, 16
References
• For course information and discount coupons, visit http://www.supergloo.com/
• Learning Spark Book Summary http://www.amazon.com/Learning-Spark-Summary-Lightning-Fast-Deconstructed-ebook/dp/B019HS7USA/
Friday, January 22, 16