
Page 1: Scalable Data Science with Spark and R

Scalable Data Science with Spark and R
Zeydy Ortiz, Ph.D. & Rob Montalvo

DataCrunch Lab, LLC

Twitter: @DCrunchLab

PyData Carolinas 2016

Page 2: Tutorial Objectives

• Understand basic concepts of Spark
• Learn how to explore data in SparkR
• Learn how to perform interactive analysis in SparkR
• Learn to run machine learning algorithms in SparkR

Page 3: What is Spark?

• A distributed computing framework
• Provides a programming abstraction and a parallel runtime that hide the complexities of fault tolerance and slow machines
• Partitions big data across multiple machines and stores it in those machines' memory

Page 4: Apache Spark Components

Page 5: Why Spark?

Spark is fast:
• Massively parallel
• Minimizes I/O bottlenecks by storing data in memory
• Spark is partitioning-aware to avoid network-intensive shuffles

Page 6: Spark supports multiple languages

APIs are available in Scala, Java, Python, R, and SQL.

Page 7: Spark Concepts - Driver and Executors

A Spark program consists of two programs:
• Driver program: runs on one machine (the "main" program)
• Executor program: runs either on cluster nodes or in local threads on the same machine
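
A minimal sketch (not from the slides) of how this looks when a SparkR session is started: with master "local[2]", the executors run as two threads on the same machine, while on a cluster the master would point at the cluster manager. The appName value is illustrative.

library(SparkR)
# The driver runs here; "local[2]" starts two executor threads locally.
# On a cluster, master would instead be the cluster manager's URL.
sparkR.session(master = "local[2]", appName = "SparkRTutorial")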

Page 8: Spark Concepts - Resilient Distributed Dataset (RDD)

“The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”

Page 9: Spark Concepts - SparkDataFrame

"A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood."
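
For example (a sketch assuming a running SparkR session), a local R data frame can be distributed as a SparkDataFrame:

# Turn R's built-in faithful data frame into a SparkDataFrame
df <- createDataFrame(faithful)
head(df)  # fetch the first rows back from Spark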

Page 10: Spark Concepts - SparkDataFrame Properties

• Immutable; once created it cannot be changed
• Distributed across all Executors
• Can be created from many sources (HDFS, text files, JSON, Parquet, Hive, ...)
• Can be cached in memory for later reuse (an optimization done by you)
• Must have a schema (columns, each with a name and type)
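
As an illustration of the last three points (a sketch; the file path is a placeholder, not from the tutorial):

# Create a SparkDataFrame from a JSON source (path is a placeholder)
people <- read.json("/tmp/people.json")
# Cache it in memory for later reuse (the optimization is up to you)
cache(people)
# Every SparkDataFrame carries a schema
printSchema(people)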

Page 11: Spark Concepts - Operations

Transformations:
• Are lazily evaluated (part of an execution plan); they are not immediately executed
• Are executed only when an action is invoked
• Create a new SparkDataFrame from an existing one

Actions:
• The mechanism to get results out of Spark
• Trigger the execution of "the execution plan"
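
A minimal sketch of the distinction, reusing the faithful SparkDataFrame from above:

# Transformations: lazily added to the execution plan, nothing runs yet
longWaits <- select(filter(df, df$waiting > 70), "eruptions", "waiting")
# Actions: trigger the plan and return results to the driver
head(longWaits)
count(longWaits)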

Page 12: Setting up for the Tutorial

Prerequisites:
• Databricks Community Edition account: https://databricks.com/ce

1. Log into your Databricks account: https://community.cloud.databricks.com/
2. Import the tutorial notebook: http://bit.ly/DCL-SparkR

Page 13: Import Notebook

1. In a separate window (or tab), point your browser to http://bit.ly/DCL-SparkR
2. Copy to the clipboard the URL that the bit.ly link resolves to.
3. In the Databricks UI, click Workspace. The Workspace view comes up.
4. Click the right-most dropdown arrow (next to your user ID).
5. Select Import. The Import Item window comes up.
6. Select the URL radio button and paste in the URL obtained in step 2.
7. Click Import.

Page 14: SparkR

Page 15: Exploring Old Faithful Geyser Data

Old Faithful, named by members of the 1870 Washburn Expedition, was once called “Eternity’s Timepiece” because of the regularity of its eruptions. Despite the myth, this geyser has never erupted at exact hourly intervals, nor is it the largest or most regular geyser in Yellowstone.

Questions to explore:
• Historically, how long has the wait between eruptions been?
• How long does an eruption usually last?
• What is the most common wait time between eruptions?
• How long do eruptions last for the most common wait time?
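
One way these questions might be approached in SparkR (a sketch using R's built-in faithful dataset; the variable names are illustrative):

# Distribute the data: eruption duration and wait time, in minutes
df <- createDataFrame(faithful)

# Overall statistics for eruption durations and wait times
head(summary(df))

# Most common wait time: count eruptions per wait time, sort descending
waits <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waits, desc(waits$count)))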

Page 16: Identifying Irises

Use machine learning algorithms to classify irises based on their measured features.
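
A minimal sketch of one approach in SparkR (naive Bayes is one possible choice of algorithm, not necessarily the one used in the tutorial; note that SparkR replaces the dots in the iris column names with underscores):

# Distribute R's built-in iris data
irisDF <- createDataFrame(iris)

# Fit a naive Bayes classifier: species as a function of the measurements
model <- spark.naiveBayes(irisDF,
    Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width)

# Predict (on the training data, for illustration only) and inspect
predictions <- predict(model, irisDF)
head(select(predictions, "Species", "prediction"))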

Page 17: Summary

• Explained basic concepts of Spark
• Learned how to explore data in SparkR
• Learned how to perform interactive analysis in SparkR
• Learned to run machine learning algorithms in SparkR

Page 18: Thank You!

Zeydy Ortiz, Ph.D. & Rob Montalvo

zortiz @ datacrunchlab.com

rmontalvo @ datacrunchlab.com

Twitter: @DCrunchLab