Getting Started with sparklyr · ManchesterR · 2018-05-22
Mango Solutions
• Business solutions using statistics & machine learning since 2002
• Software development services
– R
– Python
– Java
Spark
• Open source cluster computing framework
• In-memory processing, with spill to disk
• Map-Reduce-like operations implemented in Spark functions
Resilient Distributed Dataset
• Data split over memory on multiple machines
• Extended to include DataFrames
Spark Memory
[Diagram: Spark memory model: execution memory and storage memory held in RAM, spilling to disk, alongside the R process and its own RAM]
Deployment
• Processor
– Stand-alone single machine
– Stand-alone cluster
– Mesos/YARN cluster
• Data store
– Mounted drives, e.g. Network File System (NFS)
– Hadoop Distributed File System (HDFS)
Spark Ecosystem
Spark and R
• Originally supported languages: Scala, Java and Python
• SparkR was a separate project, integrated into Spark as of v1.4
• Create and work with massive data frames over a cluster of machines
SparkR
• Limited subset of Spark features
• Unfamiliar code structures
• New implementations (e.g. lm)
Sparklyr
library(sparklyr)
# wraps download and installation of a local copy of Spark
spark_install(version = "2.1.0")
Spark is not R
• Use specific sparklyr functions
• Use Hive functions
• Arbitrary R code is not translated (see the sketch below)
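A minimal sketch of the distinction, assuming mtcars_tbl is a Spark table (created on a later slide): Hive built-ins such as log10 pass through dplyr's SQL translation and run in Spark, while an arbitrary R function in the same place would fail.
# log10 is translated to SQL and evaluated by Spark
mutate(mtcars_tbl, log_hp = log10(hp))
# an R-only function here (e.g. a custom closure) would error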
What’s in it for you?
• If you’ve ever run out of memory in R or need to interface with a Hadoop cluster, sparklyr may be the solution for you.
• Create and work with vast datasets
• Work interactively or create batch jobs using a familiar dplyr syntax
Connections
Creating a Spark context
# create connection to a local Spark
# instance; provide config if needed
sc <- spark_connect(master = "local")
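If the defaults are not enough, a sketch of supplying a config (the driver-memory option is documented in sparklyr; the value here is illustrative):
conf <- spark_config()
# raise driver memory (illustrative value)
conf$`sparklyr.shell.driver-memory` <- "4G"
sc <- spark_connect(master = "local", config = conf)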
Copy to Spark Memory
mtcars_tbl <- copy_to(
  dest = sc,
  df = mtcars)
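To confirm the copy, a quick sketch listing the tables registered with the connection:
# list tables in the Spark connection;
# expect to see "mtcars"
src_tbls(sc)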
Import to Spark Memory
# f1: path to a CSV file
cost_tbl <- spark_read_csv(
  sc = sc, name = "cost",
  path = f1)
Spark SQL
• dplyr verbs
– select
– filter
– arrange
– summarise
– mutate
• Laziness
– Data pull (collect)
– Collates queries
Spark SQL
filter(mtcars_tbl,
       cyl == 6 & am == 0) %>%
  select(mpg, wt, qsec)
# # Source: lazy query [?? x 3]
# # Database: spark_connection
# mpg wt qsec
# <dbl> <dbl> <dbl>
# 1 21.4 3.22 19.4
# 2 18.1 3.46 20.2
# 3 19.2 3.44 18.3
Spark SQL
m2 <- summarise(
  group_by(mtcars_tbl, am),
  n = n())
m2
# # Source: lazy query [?? x 2]
# # Database: spark_connection
#      am     n
#   <dbl> <dbl>
# 1     0  19.0
# 2  1.00  13.0
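The query is still lazy at this point; as a sketch, collect() triggers execution and pulls the result into a local R data frame:
# execute the query and bring the result into local R memory
m2_local <- collect(m2)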
Viewing Spark SQL
# print query
dbplyr::sql_render(m2)
# <SQL> SELECT `am`, count(*) AS `n`
# FROM `mtcars`
# GROUP BY `am`
Transforming Spark DataFrames
• ft_*
• Applies common feature transformations to columns
Transforming Spark DataFrames
mtcars_tbl <- ft_bucketizer(
  x = mtcars_tbl,
  input_col = "disp",
  output_col = "fct_disp",
  splits = c(71.1, 145, 301, 472))
mtcars_tbl %>%
  select(mpg, am, disp, fct_disp)
# # Source: lazy query [?? x 4]
# # Database: spark_connection
# mpg am disp fct_disp
# <dbl> <dbl> <dbl> <dbl>
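A quick check of the new column, counting rows per bucket (a sketch; the count is computed lazily in Spark):
mtcars_tbl %>%
  count(fct_disp)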
Updating Spark DataFrames
• sdf_*
• Access the Scala Spark DataFrame API directly
• Not lazy
Updating Spark DataFrames
# query number of rows (executes immediately)
sdf_nrow(mtcars_tbl)
# [1] 32
# data management
# info_tbl: another tbl_spark with matching rows
mtcars_tbl <- sdf_bind_cols(
  mtcars_tbl, info_tbl)
Updating Spark DataFrames
# create training and test sets;
# returns a named list of tbl_spark objects
partitions <- sdf_partition(
  x = mtcars_tbl,
  training = 0.7,
  test = 0.3,
  seed = 26325)
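A sketch of pulling the pieces out of the named list:
train_tbl <- partitions$training
test_tbl <- partitions$test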
Machine Learning
• ml_*
• Applies Spark ML library algorithms to Spark data frames
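A minimal sketch, assuming the partitions created above (the model formula is illustrative):
# fit a Spark ML linear regression on the training partition
fit <- ml_linear_regression(
  x = partitions$training,
  formula = mpg ~ wt + cyl)
summary(fit)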
Sparklyr
• Fast local installation for modelling pipeline development
• Familiar environment for low learning overhead
• Connect to big data infrastructure from R
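When finished, close the connection to release Spark resources (a sketch):
spark_disconnect(sc)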