getting started with sparklyr - r . manchester · 2018-05-22 · getting started with sparklyr...

Getting started with sparklyr

Chris Campbell

Senior Data Scientist

@MangoTheCat

[email protected]

Mango Solutions

• Business solutions using statistics & machine learning since 2002

• Software development services

–R

–Python

–Java

Spark

Spark

• Open source cluster computing framework

• In-memory processing, spilling to disk when needed

• Map-Reduce-like operations implemented in Spark functions

Resilient Distributed Dataset

• Data split over memory on multiple machines

• Extended to include DataFrames

Spark Memory

[Diagram with labels: Execution Memory, Storage Memory, Disc, R, RAM]

Deployment

• Processor

–Stand-alone single machine

–Stand-alone cluster

–Mesos/YARN cluster

• Data store

–mounted drives e.g. Network File System

–Hadoop Distributed File System

Spark Ecosystem

Spark and R

Spark and R

• Spark originally supported Scala, Java and Python

• SparkR was a separate project, integrated into Spark as of v1.4

• Create and work with massive data frames over a cluster of machines

SparkR

• Limited subset of Spark features

• Unfamiliar code structures

• New implementations (e.g. lm)

Sparklyr

library(sparklyr)

# wraps installation

spark_install(version = "2.1.0")

Spark is not R

• Use specific sparklyr functions

• Use Hive functions

• Not arbitrary R code
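
A minimal sketch of what "use Hive functions" can look like in practice, assuming a local Spark installation and the mtcars data. The function percentile_approx() is a Spark SQL (Hive) aggregate, not an R function; dplyr does not recognise it, so it is passed through to Spark untranslated. The column name median_mpg is illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation, e.g. one created by spark_install().
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# percentile_approx() has no R definition here: it is handed to Spark SQL
# as-is and evaluated on the Spark side, once per cyl group.
medians <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(median_mpg = percentile_approx(mpg, 0.5)) %>%
  collect()

medians

spark_disconnect(sc)
```

An arbitrary R function that Spark does not know would fail at this point, because the expression is executed by Spark rather than by R.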

What’s in it for you?

• If you’ve ever run out of memory in R or need to interface with a Hadoop cluster, sparklyr may be the solution for you.

• Create and work with vast datasets

• Work interactively or create batch jobs using a familiar dplyr syntax

Connections

Creating a spark context

# create connection

# master is required; "local" uses a single-machine installation

# provide config if needed

sc <- spark_connect(master = "local")
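
A minimal sketch of supplying a config, assuming a local Spark installation. The 2G driver-memory value is an illustrative assumption, not a setting from the talk.

```r
library(sparklyr)

# Start from the default configuration and override one setting.
# "2G" is an assumed, illustrative value.
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G"

# Connect to a local, single-machine Spark instance using that config.
sc <- spark_connect(master = "local", config = config)

# Check that the connection is live, then close it when finished.
connection_is_open(sc)
spark_disconnect(sc)
```

For a cluster, the same call would take a different master value (for example a YARN or stand-alone cluster address) in place of "local".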

Copy to Spark Memory

mtcars_tbl <- copy_to(

dest = sc,

df = mtcars)

Import to Spark Memory

# f1 holds the path to a CSV file

cost_tbl <- spark_read_csv(

sc = sc, name = "cost",

path = f1)

Spark SQL

• dplyr verbs

–select

–filter

–arrange

–summarise

–mutate

• Laziness

–Data pull (collect)

–Collates queries
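
A minimal sketch of the laziness described above, assuming a local Spark installation and the mtcars data. The object names heavy and heavy_local are illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation, e.g. one created by spark_install().
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Building the pipeline only records the operations.
heavy <- mtcars_tbl %>%
  filter(wt > 3.5) %>%
  select(mpg, wt) %>%
  arrange(desc(wt))

# The three verbs have been collated into a single SQL statement.
show_query(heavy)

# collect() runs the query and pulls the result into an R data frame.
heavy_local <- collect(heavy)
class(heavy_local)

spark_disconnect(sc)
```

Until collect() is called, heavy is a description of a query rather than data held in R.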

Spark SQL

filter(mtcars_tbl,

cyl == 6 & am == 0) %>%

select(mpg, wt, qsec)

# # Source: lazy query [?? x 3]

# # Database: spark_connection

# mpg wt qsec

# <dbl> <dbl> <dbl>

# 1 21.4 3.22 19.4

# 2 18.1 3.46 20.2

# 3 19.2 3.44 18.3

Spark SQL

m2 <- summarise(

group_by(mtcars_tbl, am),

n = n())

m2

# # Source: lazy query [?? x 2]

# # Database: spark_connection

# am n

# <dbl> <dbl>

# 1 0 19.0

# 2 1.00 13.0

Viewing Spark SQL

# print query

dbplyr::sql_render(m2)

# <SQL> SELECT `am`, count(*) AS `n`

# FROM `mtcars`

# GROUP BY `am`

Transforming Spark DataFrames

• ft_*

• Applies common feature transformations to columns

Transforming Spark DataFrames

mtcars_tbl <- ft_bucketizer(

x = mtcars_tbl,

input_col = "disp",

output_col = "fct_disp",

splits = c(71.1, 145, 301, 472))

mtcars_tbl %>%

select(mpg, am, disp, fct_disp)

# # Source: lazy query [?? x 4]

# # Database: spark_connection

# mpg am disp fct_disp

# <dbl> <dbl> <dbl> <dbl>
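
To make the meaning of fct_disp concrete, here is a sketch in plain R (no Spark needed) that mimics the bucket assignment locally. It assumes the bucketizer's usual convention: bucket indices start at 0, each bucket includes its lower bound and excludes its upper bound, and the last bucket also includes the final split. This is an illustration on the local mtcars data, not the code Spark runs.

```r
# Splits from the slide; they span the full range of mtcars$disp.
splits <- c(71.1, 145, 301, 472)

# findInterval() returns 1-based interval numbers, so subtract 1 to get
# 0-based bucket indices. rightmost.closed = TRUE places the maximum
# value (472) in the last bucket instead of beyond it.
fct_disp <- findInterval(mtcars$disp, splits, rightmost.closed = TRUE) - 1

# Number of cars falling in each bucket.
table(fct_disp)
```

With four split points there are three buckets, so every value of fct_disp is 0, 1 or 2.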

Updating Spark DataFrames

• sdf_*

• Access the Scala Spark DataFrame API directly

• Not lazy

Updating Spark DataFrames

# query rows

sdf_nrow(mtcars_tbl)

# [1] 32

# data management

mtcars_tbl <- sdf_bind_cols(

mtcars_tbl, info_tbl)

Updating Spark DataFrames

# create test set

# list of tbl_spark objects

partitions <- sdf_partition(

x = mtcars_tbl,

training = 0.7,

test = 0.3,

seed = 26325)
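
A minimal sketch of working with the returned list, assuming a local Spark installation and the mtcars data. The names n_train and n_test are illustrative. Recent sparklyr releases document sdf_random_split() as the preferred name for this operation; sdf_partition() is kept here to match the slide.

```r
library(sparklyr)

# Assumes a local Spark installation, e.g. one created by spark_install().
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

partitions <- sdf_partition(
  x = mtcars_tbl,
  training = 0.7,
  test = 0.3,
  seed = 26325)

# The list elements are named after the arguments supplied.
names(partitions)

# The weights are approximate: rows are assigned at random, so the
# split need not be exactly 70/30, but every row lands in one piece.
n_train <- sdf_nrow(partitions$training)
n_test <- sdf_nrow(partitions$test)
n_train + n_test

spark_disconnect(sc)
```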

Machine Learning

• ml_*

• Applies Spark ML library algorithms to Spark data frames
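
A minimal sketch of one ml_* function, assuming a local Spark installation and the mtcars data. The model formula (mpg ~ wt + cyl) and the object names fit and pred_local are illustrative choices, not taken from the talk.

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation, e.g. one created by spark_install().
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Train/test split, as on the earlier slide.
partitions <- sdf_partition(
  mtcars_tbl, training = 0.7, test = 0.3, seed = 26325)

# Fit a linear regression in Spark using an R-style formula.
fit <- ml_linear_regression(partitions$training, mpg ~ wt + cyl)
summary(fit)

# Score the held-out rows; Spark adds a "prediction" column.
pred_local <- ml_predict(fit, partitions$test) %>%
  select(mpg, prediction) %>%
  collect()

pred_local

spark_disconnect(sc)
```

The model is fitted by Spark, not by R's lm(), so only the small table of predictions is brought back into the R session.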

Demo Script

• http://bit.ly/manr0518sparklyr

Sparklyr

• Fast local installation for modelling pipeline development

• Familiar environment for low learning overhead

• Connect to big data infrastructure from R