sparklyr - jeff allen

15
Spark + R = sparklyr Jeff Allen, RStudio @trestleJeff

Upload: jo-fai-chow

Post on 16-Apr-2017

301 views

Category:

Technology


0 download

TRANSCRIPT

Spark + R = sparklyrJeff Allen, RStudio

@trestleJeff

THE R LANGUAGE

• 5th most popular programming language in the world1

• Data analysis, statistical modeling, visualization

• Great interface to your data

• Historically limited to in-memory data

1 http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages

WHAT IS SPARK?

• Open-source Apache computing engine

• Bigger-than-memory data, low-latency distributed computing

• Can integrate with the Hadoop ecosystem

• Built-in machine learning

sparklyrIntroducing…

“SPARRK-lee ARR”

http://spark.rstudio.com

• New open-source R package from RStudio

• Complete dplyr back-end for Spark

• Integrated with the RStudio IDE

• Extensible foundation for Spark + R

sparklyr

KICK THE TIRESlibrary(sparklyr)spark_install()sc <- spark_connect(“local")my_tbl <- copy_to(sc, iris)

Driver Program

Cluster Manager

Worker

Spark Context

Task Task

Executor Cache

library(dplyr)

# use standard verbs to filter and aggregate select( filter(my_tbl, Petal_Width < 0.3), Petal_Length, Petal_Width )

# use magrittr pipes for a cleaner syntax my_tbl %>% filter(Petal_Width < 0.3) %>% select(Petal_Length, Petal_Width)

USE DPLYR TO WRITE SPARK SQL

Demo

RUN IN PRODUCTION

Spark Cluster

Master Node

Driver Program

Spark ContextWorker Node

Task Task

Executor Cache

Cluster Manager

Worker Node

Task Task

Executor Cache

Use RStudio Server on the Spark cluster master node> spark_connect(“spark://spark.company.org:7077”)> my_tbl <- tbl(sc, “tblname”)

SPARKLYR FUNCTIONALITY• Full dplyr back-end for Spark DataFrames

• R wrappers for all MLlib functions

• Easily leverage from R Markdown, Shiny, etc.

• IDE integration

• Windows, too!

SPARKLYR & H2O

• rsparkling: R interface to Sparkling Water Spark package

• A sparklyr extension

• http://spark.rstudio.com/h2o.html

RELATIONSHIP TO SPARKR

• Working together to establish a common extension API

• Some differences in approach:

• CRAN distribution

• dplyr compatibility

SPARKR DPLYR

APPLICATIONS

http://spark.rstudio.com/examples.html

QUESTIONS?http://spark.rstudio.com

@trestleJeff

https://github.com/trestletech/user2016-sparklyr