Scalable Data Science with Spark and R
Zeydy Ortiz, Ph.D. & Rob Montalvo
DataCrunch Lab, LLC
Twitter: @DCrunchLab
PyData Carolinas 2016
DataCrunch
Lab
Tutorial Objectives
Understand basic concepts of Spark
Learn how to explore data in SparkR
Learn how to perform interactive analysis in SparkR
Learn to run machine learning algorithms in SparkR
What is Spark?
A distributed computing framework
Provides a programming abstraction and a parallel runtime that hide the complexities of fault tolerance and slow machines
Partitions big data across multiple machines and stores it in those machines' memory
Apache Spark Components
Why Spark?
Spark is fast...
Massively parallel
Minimizes I/O bottlenecks by storing data in memory
Spark is partitioning-aware to avoid network-intensive shuffles
Spark supports multiple languages
Spark Concepts - Driver and Executors
A Spark program consists of two programs:
• Driver program: runs on one machine ("main")
• Executor program: runs either on cluster nodes or in local threads on the same machine
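The driver/executor split can be pictured with a small plain-Python sketch. This is not Spark code: the names `executor_task` and `driver`, the four-way round-robin partitioning, and the local thread pool are all assumptions chosen to mimic the "local threads on the same machine" mode.

```python
from concurrent.futures import ThreadPoolExecutor


def executor_task(partition):
    """Work done by one (hypothetical) executor: sum its own partition."""
    return sum(partition)


def driver(data, num_partitions=4):
    """The (hypothetical) driver: split the data, hand out tasks, combine results."""
    # Round-robin split so every element lands in exactly one partition.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Local threads stand in for executors running on the same machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(executor_task, partitions))
    # Only the small partial results travel back to the driver.
    return sum(partial_sums)


total = driver(list(range(1, 101)))
print(total)
```

In a real cluster the partitions would live on different machines; the shape of the computation (split, work in parallel, combine on the driver) is the part this sketch is meant to show.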
Spark Concepts - Resilient Distributed Dataset (RDD)
“The main abstraction Spark provides is
a resilient distributed dataset (RDD), which
is a collection of elements partitioned across
the nodes of the cluster that can be operated
on in parallel.”
Spark Concepts - SparkDataFrame
"A SparkDataFrame is a distributed collection
of data organized into named columns. It is
conceptually equivalent to a table in a
relational database or a data frame in R, but
with richer optimizations under the hood."
Spark Concepts - SparkDataFrame Properties
Immutable; once created cannot be changed
Distributed across all Executors
Can be created from many sources (HDFS, text files, JSON, Parquet, Hive,...
Can be cached in memory for later reuse (an optimization you apply yourself)
Must have a schema (columns, each with name and type)
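Two of these properties, immutability and a mandatory schema, can be illustrated with a minimal plain-Python sketch. `TinyFrame` and `with_column` are hypothetical names invented here; they are not part of SparkR or any Spark API, and the sketch ignores distribution and caching entirely.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TinyFrame:
    """Hypothetical stand-in for a SparkDataFrame: a schema plus immutable rows."""
    schema: Tuple[Tuple[str, type], ...]   # (column name, column type) pairs
    rows: Tuple[tuple, ...]

    def __post_init__(self):
        # Enforce the schema: every row must match the column count and types.
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError("row length does not match schema")
            for value, (name, col_type) in zip(row, self.schema):
                if not isinstance(value, col_type):
                    raise TypeError(f"column {name!r} expects {col_type.__name__}")

    def with_column(self, name, col_type, fn):
        """Return a NEW frame with an extra column; the original is untouched."""
        new_rows = tuple(row + (fn(row),) for row in self.rows)
        return TinyFrame(self.schema + ((name, col_type),), new_rows)


# Illustrative values only.
df = TinyFrame((("name", str), ("minutes", float)), (("a", 1.5), ("b", 3.0)))
df2 = df.with_column("seconds", float, lambda row: row[1] * 60)
```

Because the class is frozen, assigning to `df.rows` raises an error; every change produces a new frame, which is the behavior the "Immutable" bullet describes.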
Spark Concepts - Operations
Transformations
• Are lazily evaluated (part of an execution plan); are not immediately executed
• Are executed only when an action is invoked
• Create a new SparkDataFrame from an existing one
Actions
• Are the mechanism to get results out of Spark
• Trigger the execution of the execution plan
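The difference between transformations and actions can be sketched in plain Python. `LazyFrame` is a hypothetical class written for this illustration, not a Spark API: its `filter` and `map` methods only record steps in a plan, and nothing runs until the `collect` action is called.

```python
class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated dataset."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan          # recorded transformations, not yet executed

    # Transformations: extend the plan and return a NEW object.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def map(self, fn):
        return LazyFrame(self._rows, self._plan + (("map", fn),))

    # Actions: run the whole plan and return results.
    def collect(self):
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []


def double(x):
    calls.append(x)        # side effect shows when the work really happens
    return x * 2


plan = LazyFrame([1, 2, 3, 4, 5]).filter(lambda x: x % 2 == 1).map(double)
calls_before = list(calls)     # still empty: only a plan exists so far
result = plan.collect()        # the action triggers execution
print(calls_before, result)
```

Deferring the work this way is what lets an engine inspect the whole plan before running it; this sketch simply replays the steps in order and does no optimization.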
Setting up for the Tutorial
Prerequisites:
Databricks Community Edition account
https://databricks.com/ce
1. Log into your Databricks account
https://community.cloud.databricks.com/
2. Import tutorial notebook
http://bit.ly/DCL-SparkR
Import Notebook
1. In a separate window (or tab), point your browser to
http://bit.ly/DCL-SparkR
2. Copy to the clipboard the full URL that the bit.ly link resolves to.
3. In the Databricks UI, click Workspace. The Workspace view opens.
4. Click the right-most dropdown arrow (next to your User ID).
5. Select Import. The Import Item window opens.
6. Select the URL radio button and paste the URL from step 2 into the field.
7. Click Import.
SparkR
Exploring Old Faithful Geyser Data
Old Faithful, named by members of the
1870 Washburn Expedition, was once called
“Eternity’s Timepiece” because of the
regularity of its eruptions. Despite the
myth, this geyser has never erupted at
exact hourly intervals, nor is it the largest
or most regular geyser in Yellowstone.
Questions to explore:
• Historically, how long has the wait between eruptions been?
• How long does an eruption usually last?
• What is the most common wait time between eruptions?
• How long do eruptions last for the most common wait time?
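The four questions map onto simple summary statistics. The sketch below uses plain Python rather than SparkR, and the observations are made-up values for illustration only; they are not the tutorial's data, and the variable names are assumptions.

```python
from collections import Counter
from statistics import mean

# Made-up (eruption_minutes, waiting_minutes) pairs for illustration only.
observations = [
    (3.5, 78), (2.0, 54), (4.5, 78), (2.0, 54), (4.0, 78), (4.5, 85),
]

waits = [wait for _, wait in observations]
durations = [duration for duration, _ in observations]

# How long has the wait between eruptions been, on average?
mean_wait = mean(waits)
# How long does an eruption usually last?
mean_duration = mean(durations)
# What is the most common wait time?
most_common_wait = Counter(waits).most_common(1)[0][0]
# How long do eruptions last for the most common wait time?
durations_at_mode = [d for d, w in observations if w == most_common_wait]
mean_duration_at_mode = mean(durations_at_mode)

print(mean_wait, mean_duration, most_common_wait, mean_duration_at_mode)
```

On a distributed dataset the same questions become grouped aggregations, with the grouping and counting done across partitions before the small result is returned.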
Identifying Irises
Use machine learning algorithms to classify irises based on their measured features
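As a rough picture of what "classify based on measured features" means, here is a nearest-centroid classifier in plain Python. This technique is chosen here for brevity and is not necessarily the algorithm used in the tutorial notebook; the measurements are made-up values, and `centroid` and `classify` are hypothetical helper names.

```python
from math import dist
from statistics import mean

# Made-up (petal_length_cm, petal_width_cm) measurements for illustration only.
training = {
    "setosa": [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3)],
    "versicolor": [(4.5, 1.5), (4.1, 1.3), (4.7, 1.4)],
    "virginica": [(6.0, 2.5), (5.6, 2.1), (5.9, 2.3)],
}


def centroid(points):
    """Average each feature across the points of one species."""
    return tuple(mean(axis) for axis in zip(*points))


# "Training" here is just computing one centroid per species.
centroids = {species: centroid(points) for species, points in training.items()}


def classify(features):
    """Predict the species whose centroid is closest to the given features."""
    return min(centroids, key=lambda species: dist(features, centroids[species]))


prediction = classify((1.6, 0.25))
print(prediction)
```

The fit-then-predict shape is the same for heavier algorithms; only the model that sits between the training data and the prediction changes.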
Summary
Explained basic concepts of Spark
Learned how to explore data in SparkR
Learned how to perform interactive analysis in SparkR
Learned to run machine learning algorithms in SparkR
Thank You!
Zeydy Ortiz, Ph.D. & Rob Montalvo
zortiz @ datacrunchlab.com
rmontalvo @ datacrunchlab.com
Twitter: @DCrunchLab