Getting Started with sparklyr · ManchesterR · 2018-05-22
Mango Solutions
• Business solutions using statistics & machine learning since 2002
• Software development services
– R
– Python
– Java
Spark
• Open source cluster computing framework
• In-memory processing, with spill to disk
• Map-Reduce-like operations implemented in Spark functions
Resilient Distributed Dataset
• Data split over memory on multiple machines
• Extended to include DataFrames
Spark Memory
[Diagram: Spark memory model: execution memory and storage memory held in RAM, spilling to disk, alongside the R process and its own RAM]
Deployment
• Processor
– Stand-alone single machine
– Stand-alone cluster
– Mesos/YARN cluster
• Data store
– Mounted drives, e.g. Network File System (NFS)
– Hadoop Distributed File System (HDFS)
Spark Ecosystem
Spark and R
• Originally supported languages: Scala, Java and Python
• SparkR was a separate project, integrated into Spark as of v1.4
• Create and work with massive data frames over a cluster of machines
SparkR
• Limited subset of Spark features
• Unfamiliar code structures
• New implementations (e.g. lm)
Sparklyr
library(sparklyr)
# wraps download and installation of a local copy of Spark
spark_install(version = "2.1.0")
Spark is not R
• Use specific sparklyr functions
• Use Hive functions
• Arbitrary R code is not translated (see the sketch below)
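A minimal sketch of the distinction, assuming mtcars_tbl is a Spark table (created on a later slide): Hive built-ins such as log10 pass through dplyr's SQL translation and run in Spark, while an arbitrary R function in the same place would fail.
# log10 is translated to SQL and evaluated by Spark
mutate(mtcars_tbl, log_hp = log10(hp))
# an R-only function here (e.g. a custom closure) would error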
What’s in it for you?
• If you’ve ever run out of memory in R or need to interface with a Hadoop cluster, sparklyr may be the solution for you.
• Create and work with vast datasets
• Work interactively or create batch jobs using a familiar dplyr syntax
Connections
Creating a Spark context
# create connection to a local Spark
# instance; provide config if needed
sc <- spark_connect(master = "local")
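If the defaults are not enough, a sketch of supplying a config (the driver-memory option is documented in sparklyr; the value here is illustrative):
conf <- spark_config()
# raise driver memory (illustrative value)
conf$`sparklyr.shell.driver-memory` <- "4G"
sc <- spark_connect(master = "local", config = conf)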
Copy to Spark Memory
mtcars_tbl <- copy_to(
  dest = sc,
  df = mtcars)
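To confirm the copy, a quick sketch listing the tables registered with the connection:
# list tables in the Spark connection;
# expect to see "mtcars"
src_tbls(sc)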
Import to Spark Memory
# f1: path to a CSV file
cost_tbl <- spark_read_csv(
  sc = sc, name = "cost",
  path = f1)
Spark SQL
• dplyr verbs
– select
– filter
– arrange
– summarise
– mutate
• Laziness
– Data pull (collect)
– Collates queries
Spark SQL
filter(mtcars_tbl,
       cyl == 6 & am == 0) %>%
  select(mpg, wt, qsec)
# # Source: lazy query [?? x 3]
# # Database: spark_connection
# mpg wt qsec
# <dbl> <dbl> <dbl>
# 1 21.4 3.22 19.4
# 2 18.1 3.46 20.2
# 3 19.2 3.44 18.3
Spark SQL
m2 <- summarise(
  group_by(mtcars_tbl, am),
  n = n())
m2
# # Source: lazy query [?? x 2]
# # Database: spark_connection
#      am     n
#   <dbl> <dbl>
# 1     0  19.0
# 2  1.00  13.0
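The query is still lazy at this point; as a sketch, collect() triggers execution and pulls the result into a local R data frame:
# execute the query and bring the result into local R memory
m2_local <- collect(m2)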
Viewing Spark SQL
# print query
dbplyr::sql_render(m2)
# <SQL> SELECT `am`, count(*) AS `n`
# FROM `mtcars`
# GROUP BY `am`
Transforming Spark DataFrames
• ft_*
• Applies common feature transformations to columns
Transforming Spark DataFrames
mtcars_tbl <- ft_bucketizer(
  x = mtcars_tbl,
  input_col = "disp",
  output_col = "fct_disp",
  splits = c(71.1, 145, 301, 472))
mtcars_tbl %>%
  select(mpg, am, disp, fct_disp)
# # Source: lazy query [?? x 4]
# # Database: spark_connection
# mpg am disp fct_disp
# <dbl> <dbl> <dbl> <dbl>
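A quick check of the new column, counting rows per bucket (a sketch; the count is computed lazily in Spark):
mtcars_tbl %>%
  count(fct_disp)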
Updating Spark DataFrames
• sdf_*
• Access the Scala Spark DataFrame API directly
• Not lazy
Updating Spark DataFrames
# query number of rows (executes immediately)
sdf_nrow(mtcars_tbl)
# [1] 32
# data management
# info_tbl: another tbl_spark with matching rows
mtcars_tbl <- sdf_bind_cols(
  mtcars_tbl, info_tbl)
Updating Spark DataFrames
# create training and test sets;
# returns a named list of tbl_spark objects
partitions <- sdf_partition(
  x = mtcars_tbl,
  training = 0.7,
  test = 0.3,
  seed = 26325)
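A sketch of pulling the pieces out of the named list:
train_tbl <- partitions$training
test_tbl <- partitions$test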
Machine Learning
• ml_*
• Applies Spark ML library algorithms to Spark data frames
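A minimal sketch, assuming the partitions created above (the model formula is illustrative):
# fit a Spark ML linear regression on the training partition
fit <- ml_linear_regression(
  x = partitions$training,
  formula = mpg ~ wt + cyl)
summary(fit)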
Sparklyr
• Fast local installation for modelling pipeline development
• Familiar environment for low learning overhead
• Connect to big data infrastructure from R
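When finished, close the connection to release Spark resources (a sketch):
spark_disconnect(sc)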