R and Spark: How to Analyze Data Using RStudio's sparklyr and H2O's rsparkling Packages


TRANSCRIPT

ANALYZE DATA USING RSTUDIO'S SPARKLYR

R AND SPARK
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast

Apache Spark
• Huge investments in big data and Hadoop
• Data scientists wanting to analyze data at scale
• Rapid progress and adoption in Spark libraries

R and RStudio
• Wide range of tools and packages
• Powerful ways to share insights
• Interactive notebooks
• Great visualizations

What we hear from our customers

Best of both worlds
If you are investing in Spark, then there is nothing stopping you from using it with the full power of R.

Using R with Spark

Benefits of Spark for the R user

Apache Spark…
• Can integrate with Hadoop
• Supports familiar SQL syntax
• Has built-in machine learning
• Is designed for performance
• Great for interactive data analysis

R users can take advantage of all these investments

New! Open-source R package from RStudio
• Integrated with the RStudio IDE
• Sparklyr is a dplyr back-end for Spark
• Extensible foundation for Spark applications and R

sparklyr
http://spark.rstudio.com/

Create your own R packages with interfaces to Spark
• Interfaces to custom machine learning pipelines
• Interfaces to 3rd party Spark packages
• Many other R interfaces

sparklyr extensions
Example: Count the number of lines in a file

Extension
library(sparklyr)

# Call textFile() and count() directly on the JVM SparkContext
count_lines <- function(sc, file) {
  spark_context(sc) %>%
    invoke("textFile", file, 1L) %>%
    invoke("count")
}

Call
sc <- spark_connect(master = "local")
count_lines(sc, "hdfs://path/data.csv")

R for data science toolchain

“You’ll learn how to get your data into R [with Spark], get it into the most useful structure, transform it, visualize it and model it [with Spark].” 

Import

Create a connection
sc <- spark_connect()

Import data from file/S3/HDFS/R
spark_read_csv(sc, "table", "hdfs://<path>")
sdf_copy_to(sc, table, "table")
nyct2010_tbl <- tbl(sc, "table")

Write data
spark_write_parquet(table, "hdfs://<path>")

sparklyr
Connect to Spark. Read and write data in CSV, JSON, and Parquet formats. Data can be stored in HDFS, S3, or on the local filesystem.
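As a minimal end-to-end sketch of this import step, the snippet below connects to a local Spark instance, reads a hypothetical flights.csv file, copies an R data frame into Spark, and writes the result back out as Parquet; the paths and table names are illustrative, not from the original demo.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read a CSV file into Spark and register it as the table "flights"
flights_tbl <- spark_read_csv(sc, name = "flights", path = "flights.csv")

# Copy an in-memory R data frame up to Spark
iris_tbl <- sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)

# Reference an existing Spark table without moving any data
flights_ref <- tbl(sc, "flights")

# Write results back out in Parquet format
spark_write_parquet(flights_tbl, path = "flights_parquet")

spark_disconnect(sc)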

Wrangle

dplyr
my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)

Spark SQL
select Petal_Length, Petal_Width
from mytable
where Petal_Width < 0.3

Use dplyr to write Spark SQL

A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
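To see the translation concretely, dplyr can print the Spark SQL it generates before anything runs. This is a small sketch assuming iris_tbl was copied to Spark with sdf_copy_to().

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)

query <- iris_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)

# Print the Spark SQL that dplyr generates; nothing executes until results are requested
show_query(query)

# Run the query in Spark and bring the (small) result back into R
collect(query)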

Visualize

ggplot2
collect(mpg_tbl) %>%
  ggplot() +
  aes(displ, hwy, color = class) +
  geom_point()

Use ggplot2 to visualize data collected from Spark.

A plotting system for R that makes it easy to produce complex multi-layered graphics.
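A common pattern is to aggregate inside Spark and collect only the small summary for plotting. The sketch below uses ggplot2's mpg data as a stand-in for the mpg_tbl shown above.

library(sparklyr)
library(dplyr)
library(ggplot2)

sc <- spark_connect(master = "local")
mpg_tbl <- sdf_copy_to(sc, ggplot2::mpg, name = "mpg", overwrite = TRUE)

mpg_tbl %>%
  group_by(class) %>%
  summarise(avg_hwy = mean(hwy, na.rm = TRUE)) %>%   # computed in Spark
  collect() %>%                                      # only the summary comes back to R
  ggplot(aes(class, avg_hwy)) +
  geom_col()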

Model

Models (Spark MLlib)
K-means
Linear regression
Logistic regression
Survival regression
Generalized linear regression
Decision trees
Random forests
Gradient boosted trees
Principal component analysis
Naive Bayes
Multilayer perceptron
Latent Dirichlet allocation
One vs rest
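As a hedged sketch of what calling these MLlib algorithms looks like through sparklyr (the exact ml_* signatures have shifted across sparklyr releases), the example below clusters and fits a regression on a copy of R's iris data, with all computation happening in Spark.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris", overwrite = TRUE)

# K-means clustering on two columns (MLlib's KMeans under the hood)
kmeans_model <- ml_kmeans(iris_tbl, ~ Petal_Width + Petal_Length, k = 3)

# Linear regression with an R-style formula (MLlib's LinearRegression)
lm_model <- ml_linear_regression(iris_tbl, Petal_Length ~ Petal_Width)

summary(lm_model)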

Industry Specific
Chemometrics
Clinical Trials
Econometrics
Environmetrics
Finance
Genetics
Pharmacokinetics
Phylogenetics
Psychometrics
Social Sciences

Models (H2O)
GLMNet
Bayesian regression
Multinomial regression
Random Forest
Gradient boosted machine
Decision trees
Multi-layer perceptron
Auto-encoder
Restricted Boltzmann machine
K-means
LSH
SVD
ALS
ARIMA
Forecasting
Collaborative filtering
Solvers and optimization

General Topics
Machine Learning
Bayesian
Cluster
Design of experiments
Extreme Value
Meta Analysis
Multivariate
NLP
Robust methods
Spatial
Survival
Time Series
Graphical models

MLlib: No data movement required. Native ML algorithms. Fast growing ecosystem.
R: Over 10,000 packages. Time tested, industry specific models. Integrated with other R packages.
H2O: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.

Communicate
R Markdown
Notebooks

Make decisions
Take actions
See results

Weave together text and code to produce high quality documents, apps, and plots.
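A minimal sketch of what a sparklyr-backed R Markdown notebook might look like; the report title and the "flights" table are hypothetical.

---
title: "Spark summary"
output: html_document
---

```{r setup}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
```

```{r flights-by-carrier}
# Aggregate in Spark, collect the small result, and print it in the rendered report
tbl(sc, "flights") %>%
  group_by(carrier) %>%
  summarise(n_flights = n()) %>%
  collect()
```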

Share

Demo
Analyzing 1 billion records with Spark and R

http://colorado.rstudio.com:3939/content/262/

https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast

rsparkling extension

Spark is extensible… sparklyr is extensible

https://github.com/h2oai/rsparkling/blob/master/R/h2o_context.R#L53

Architecture: R connects to Spark through sparklyr, H2O connects to Spark through Sparkling Water, and rsparkling bridges the sparklyr and h2o packages.
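A minimal sketch of that bridge in code, based on the rsparkling README: sparklyr keeps the data in Spark, as_h2o_frame() hands it to H2O through Sparkling Water, and an H2O algorithm fits the model. The mtcars data and the GLM settings are illustrative.

# Depending on your Spark version you may need to pin Sparkling Water first, e.g.
# options(rsparkling.sparklingwater.version = "2.0.3"), before loading rsparkling.
library(rsparkling)
library(sparklyr)
library(h2o)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars", overwrite = TRUE)

# Convert the Spark DataFrame to an H2OFrame via Sparkling Water
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)

# Fit a generalized linear model with H2O's native algorithm
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = mtcars_hf)
fit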

Benefits and limitations

MLlib
Benefits: No data movement required. Native ML algorithms. Fast growing ecosystem.
Limitations: Comparatively fewer algorithms and fewer diagnostics.

H2O (rsparkling)
Benefits: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
Limitations: Data conversion requires 3-4X memory. Added complexity around introducing and learning another tool.

R
Benefits: Access to CRAN packages, visualization, reporting tools, and time tested algorithms.
Limitations: Data collection is expensive and collection size is limited (< 10 GB).

Where should I model my data?

MLlib, H2O, R, or others…

What’s new with sparklyr?

spark.rstudio.com

https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast