spark core


Upload: todd-mcgrath

Post on 14-Apr-2017


TRANSCRIPT

Introducing Spark Core

Friday, January 22, 16

Agenda

• Assumptions

• Why Spark?

• What you need to know to begin


Assumptions

• You want to learn Apache Spark, but need to know where to begin

• You need to know the fundamentals of Spark in order to progress in your learning of Spark

• You need to evaluate if Spark could be an appropriate fit for your use cases or career growth

One or more of the following


In a nutshell, why Spark?

• Engine for efficient large-scale data processing; typically faster than Hadoop MapReduce

• Spark can complement your existing Hadoop investments such as HDFS and Hive

• Rich ecosystem including support for SQL, Machine Learning, Streaming and multiple language APIs such as Scala, Python and Java


Introduction

• Ok, so where should I start?


Spark Essentials

• Resilient Distributed Datasets (RDD)

• Transformations

• Actions

• Spark Driver Programs and SparkContext

To begin, you need to know:


Resilient Distributed Datasets (RDDs)

• RDDs are Spark’s primary abstraction for data interaction (lazy, in memory)

• RDDs are immutable, distributed collections of elements separated into partitions

• There are multiple types of RDDs

• RDDs can be created from external data sets, such as Hadoop InputFormats or text files on a variety of file systems, or from existing RDDs via Spark transformations
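As a sketch of both creation paths (assuming a Spark shell or driver with an existing SparkContext named `sc`; the file path is illustrative):

```scala
// Assumes a running SparkContext `sc` (provided automatically in spark-shell).

// 1. From an external data set, e.g. a text file on a local or HDFS path
//    ("data.txt" is an illustrative path):
val lines = sc.textFile("data.txt")             // RDD[String], one element per line

// 2. From an existing in-memory collection:
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))   // RDD[Int]

// Both calls are lazy: no data is read or distributed until an action runs.
```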


Transformations

• RDD functions which return pointers to new RDDs (remember: lazy)

• map, flatMap, filter, etc.
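For example (a sketch assuming a SparkContext `sc`; the sample data is illustrative), each transformation only returns a new lazy RDD:

```scala
// Assumes a running SparkContext `sc`.
val lines = sc.parallelize(Seq("a b", "c d e"))

val words    = lines.flatMap(_.split(" "))   // "a", "b", "c", "d", "e"
val upper    = words.map(_.toUpperCase)      // "A", "B", "C", "D", "E"
val filtered = upper.filter(_ != "A")        // "B", "C", "D", "E"
// Nothing has executed yet -- each call only declares a new (lazy) RDD.
```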


Actions

• RDD functions which return values to the driver

• reduce, collect, count, etc.
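Continuing the sketch (assuming a SparkContext `sc`), actions are what actually trigger execution and ship values back to the driver:

```scala
// Assumes a running SparkContext `sc`.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

nums.count()         // triggers execution; returns 5 to the driver
nums.collect()       // returns Array(1, 2, 3, 4, 5) to the driver
nums.reduce(_ + _)   // returns 15 to the driver
```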


Spark RDDs, Transformations, Actions Diagram

Load from external source (example: textFile) → RDDs → Transformations → new RDDs → Actions → output value(s) (examples: count → 5, collect → ['a', 'b', 'c'])
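The diagram's flow can be sketched end to end as a word count (assuming a SparkContext `sc`; the file path is illustrative):

```scala
// Assumes a running SparkContext `sc`; "data.txt" is an illustrative path.
val counts = sc.textFile("data.txt")   // load from external source
  .flatMap(_.split(" "))               // transformation: lines -> words
  .map(word => (word, 1))              // transformation: word -> (word, 1)
  .reduceByKey(_ + _)                  // transformation: sum counts per word
counts.collect()                       // action: results returned to the driver
```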


Spark Driver Programs and Context

• Spark driver is a program that declares transformations and actions on RDDs of data

• A driver submits the serialized RDD graph to the master, which creates tasks and delegates them to the workers for execution

• Workers are where the tasks are actually executed.
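A minimal standalone driver program might set up its own SparkContext like this (the master URL and app name are illustrative; in spark-shell, `sc` is created for you):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    // The master URL ("local[*]") and app name are illustrative.
    val conf = new SparkConf().setAppName("MyDriver").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Declare transformations and actions on RDDs here, e.g.:
    val total = sc.parallelize(1 to 100).reduce(_ + _)
    println(total)

    sc.stop()
  }
}
```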


Driver Program and SparkContext

Image borrowed from http://spark.apache.org/docs/latest/cluster-overview.html


Next Steps
