sparkr - play spark using r (20160909 hadoopcon)

SparkR - Play Spark Using R

Gil Chen @HadoopCon 2016

Demo: http://goo.gl/VF77ad

about me

• R, Python & Matlab User

• Taiwan R User Group

• Taiwan Spark User Group

• Co-founder

• Data Scientist @

HadoopCon 2015

Outline• Introduction to SparkR

• Demo

• Starting to use SparkR

• DataFrames: dplyr style, SQL style

• RDD v.s. DataFrames

• MLlib: GLM, K-means

• User Case

• Median: approxQuantile()

• ID Match: dplyr style, SQL style, SparkR function

• SparkR + Shiny

• The Future of SparkR

Introduction to SparkR

Spark Origin

• Apache Spark is an open source cluster computing framework

• Originally developed at the University of California, Berkeley's AMPLab

• The first 2 contributors of SparkR: Shivaram Venkataraman & Zongheng Yang

https://amplab.cs.berkeley.edu/

Spark History

https://en.wikipedia.org/wiki/Apache_Spark

SparkR

DataFrames

PySpark

Key Advantages of Spark & R

Flexible

Scalable

Statistical!

Interactive

Packages

https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf

ggplot2

Google Search: ggplot2

ggplot2 is a plotting system for R, based on the grammar of graphics.

http://shiny.rstudio.com/gallery/and more impressive dashboard…

A web application framework for R Turn your analyses into interactive web applications No HTML, CSS, or JavaScript knowledge required

Performance

https://amplab.cs.berkeley.edu/announcing-sparkr-r-on-spark/

The runtime performance of running group-by aggregation on 10 million integer pairs on a single machine in R, Python and Scala. (using the same dataset as https://goo.gl/iMLXnh)

https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf

RDD (Resilient Distributed Dataset)

https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD

Internally, each RDD is characterized by five main properties:

1. A list of partitions

2. A function for computing each split

3. A list of dependencies on other RDDs

4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

https://docs.cloud.databricks.com/docs/latest/courses

RDD dependencies• Narrow dependency: Each partition of the parent RDD is used by at most

one partition of the child RDD. This means the task can be executed locally and we don’t have to shuffle. (Eg: map, flatMap, Filter, sample etc.)

• Wide dependency: Multiple child partitions may depend on one partition of the parent RDD. This means we have to shuffle data unless the parents are hash-partitioned (Eg: sortByKey, reduceByKey, groupByKey, join etc.)

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

Job Scheduling

Black: if they are already in memory

Transformations

RDD Operations

map() flatmap() filter() mapPartitions() sample() union() intersection() distinct() groupByKey() reduceByKey() sortByKey() join() cogroup() …

Actionsreduce() collect() count() first() take(num) takeSample() takeOrdered() saveAsTextFile() saveAsSequenceFile() saveAsObjectFile() countByValue() countByKey() foreach() …

narrowdep.

widedep.

lazy evaluation

RDD Example

RDDRDDRDDRDD

Transformations

Action Value

rdd <- SparkR:::textFile(sc, "txt")

words <- SparkR:::flatMap(rdd, function())

wordCount <- SparkR:::lapply(words, function(word))

counts <- SparkR:::reduceByKey(wordCount, "+", 1)

op <- SparkR:::collect(counts)

R shell

SparkR

RDD & DataFrames

before v1.6

since v2.0

data.frame

+ schema

SparkDataFrame

+ schema

General

Action

Transformation

DataFrames are Faster!

http://scala-phase.org/talks/rdds-dataframes-datasets-2016-06-16/#/

Beyond SQL: Speeding up Spark with DataFrameshttp://www.slideshare.net/databricks/spark-sqlsse2015public

Spark Stack

https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/ch04.html

Storage

Cluster Manager

Processing Engine

Access & Interfaces

How does sparkR works?

https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf

Upgrading From SparkR 1.6 to 2.0

Before 1.6.2 Since 2.0.0

data type naming DataFrame SparkDataFrame

read csv Package from Databricks built-in

function(like approxQuantile) X O

ML function glm more(or use sparklyr)

SQLContext/ HiveContext sparkRSQL.init(sc) merge in

sparkR.session()

Execute Message very detailed simple

Launch on EC2 API X

https://spark.apache.org/docs/latest/sparkr.html

Demohttp://goo.gl/VF77ad

Easy Setting

1. Download

2. Decompress and Give a Path

3. Set Path and Launch SparkR in R

Documents

• If you have to use RDD, refer to AMP-Lab github: http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/and use “:::” e.g. SparkR:::textFile, SparkR:::lapply

• Otherwise, refer to SparkR official documents: https://spark.apache.org/docs/2.0.0/api/R/index.html

Starting to Use SparkR (v1.6.2)

# Set Spark path Sys.setenv(SPARK_HOME="/usr/local/spark-1.6.2-bin-hadoop2.6/")

# Load SparkR library into your R session library(SparkR,

lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Initialize SparkContext, sc:sparkContext sc <- sparkR.init(appName = "Demo_SparkR")

# Initialize SQLContext sqlContext <- sparkRSQL.init(sc)

# your sparkR script # ... # ...

sparkR.stop()

Starting to Use SparkR (v2.0.0)

# Set Spark path Sys.setenv(SPARK_HOME="/usr/local/spark-2.0.0-bin-hadoop2.7/")

# Load SparkR library into your R session library(SparkR,

lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Initialize SparkContext, sc: sparkContext sc <- sparkR.session(appName = "Demo_SparkR")

# Initialize SQLContext (don’t need anymore since 2.0.0) # sqlContext <- sparkRSQL.init(sc)

# your sparkR script # ... # ...

sparkR.stop()

DataFrames# Load the flights CSV file using read.df sdf <- read.df(sqlContext,"data_flights.csv", "com.databricks.spark.csv", header = "true")

# Filter flights from JFK jfk_flights <- filter(sdf, sdf$origin == "JFK")

# Group and aggregate flights to each destination dest_flights <- summarize(

groupBy(jfk_flights, jfk_flights$dest), count = n(jfk_flights$dest))

# Running SQL Queries registerTempTable(sdf, "tempTable") training <- sql(sqlContext,

"SELECT dest, count(dest) as cnt FROM tempTable WHERE dest = 'JFK' GROUP BY dest")

Word Count

# read data into RDD rdd <- SparkR:::textFile(sc, "data_word_count.txt")

# split word words <- SparkR:::flatMap(rdd, function(line) { strsplit(line, " ")[[1]] })

# map: give 1 for each word wordCount <- SparkR:::lapply(words, function(word) { list(word, 1) })

# reduce: count the value by key(word) counts <- SparkR:::reduceByKey(wordCount, "+", 2)

# convert RDD to list op <- SparkR:::collect(counts)

RDD v.s. DataFrames

flights_SDF <- read.df(sqlContext, "data_flights.csv", source = "com.databricks.spark.csv", header = "true")

SDF_op <- flights_SDF %>% group_by(flights_SDF$hour) %>% summarize(sum(flights_SDF$dep_delay)) %>% collect()

flights_RDD <- SparkR:::textFile(sc, "data_flights.csv")

RDD_op <- flights_RDD %>% SparkR:::filterRDD(function (x) { x >= 1 }) %>% SparkR:::lapply(function(x) { y1 <- as.numeric(unlist(strsplit(x, ","))[2]) y2 <- as.numeric(unlist(strsplit(x, ","))[6]) return(list(y1,y2))}) %>% SparkR:::reduceByKey(function(x,y) x + y, 1) %>% SparkR:::collect()

DataFrames

SparkR on MLlib

SparkR supports a subset of the available R formula operators for model fitting, including ~ . : + - ,

e.g. y ~ x1 + x2

Generalized Linear Model, GLM

# read data and cache flights_SDF <- read.df("data_flights.csv", source = "csv", header = "true", schema) %>% cache()

# drop NA flights_SDF_2 <- dropna(flights_SDF, how = "any")

# split train/test dataset train <- sample(flights_SDF_2, withReplacement = FALSE, fraction = 0.5, seed = 42) test <- except(flights_SDF_2, train)

# building model gaussianGLM <- spark.glm(train, arr_delay ~ dep_delay + dist, family = "gaussian") summary(gaussianGLM)

# prediction preds <- predict(gaussianGLM, test)

K-means

# read data and cache flights_SDF <- read.df("data_flights.csv", source = "csv", header = "true", schema) %>% cache()

# drop NA flights_SDF_2 <- dropna(flights_SDF, how = "any")

# clustering kmeansModel <- spark.kmeans(flights_SDF_2, ~ arr_delay + dep_delay + dist + flight + dest + cancelled + time + dist, k = 15) summary(kmeansModel)

cluster_op <- fitted(kmeansModel)

# clustering result kmeansPredictions <- predict(kmeansModel, flights_SDF_2)

User Case

Median (approxQuantile)

gdf <- seq(1,10,1) %>% data.frame() colnames(gdf) <- "seq"

sdf <- createDataFrame(gdf)

median_val <- approxQuantile(sdf, "seq", 0.5, 0) %>% print()

Calculate Median using SQL query…so complicated…

http://www.1keydata.com/tw/sql/sql-median.html

ID Match

##### method 1 : like dplyr + pipeline join_id_m1 <- join(sdf_1, sdf_2, sdf_1$id1 == sdf_2$id2, "inner") %>% select("id2") %>% collect()

##### method 2 : sql query createOrReplaceTempView(sdf_1, "table1") createOrReplaceTempView(sdf_2, "table2")

qry_str <- "SELECT table2.id2 FROM table1 JOIN table2 ON table1.id1 = table2.id2"

join_id_m2 <- sql(qry_str)

##### method 3 : SparkR function join_id_m2 <- intersect(sdf_1, sdf_2) %>% collect()

Play Pokemon Go Data with SparkR !!

Application on SparkR

Interactive MapsWeb FrameworkCompute Engine

Where is the Dragonite nest ?

Port: 8080 - Cluster Monitor

Capacity of each worker

Port: 4040 Jobs Monitor

cache(SparkDataFrame), long run time for first time

Advanced performance

Status of each worker

Some Tricks

• Customize spark config for launch

• cache()

• Some codes can’t run in Rstudio, try to use terminal

• Packages from 3rd party, like package of read csv file from databricks

The Future of SparkR

• More MLlib API

• Advanced User Define Function

• package(“sparklyr”) from Rstudio

Reference• SparkR: Scaling R Programs with Spark, Shivaram Venkataraman, Zongheng Yang, Davies Liu,

Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016. June 2016. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf

• SparkR: Interactive R programs at Scale, Shivaram Venkataraman, Zongheng Yang. Spark Summit, June 2014, San Francisco.https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf

• Apache Spark Official Research http://spark.apache.org/research.html - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

• Apache Spark Official Document http://spark.apache.org/docs/latest/api/scala/

• AMPlab UC Berkeley - SparkR Project https://github.com/amplab-extras/SparkR-pkg

• Databricks Official Bloghttps://databricks.com/blog/category/engineering/spark

• R-blogger: Launch Apache Spark on AWS EC2 and Initialize SparkR Using Rstudio https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/

Rstudio in Amazon EC2

Join Us• Fansboard

• Web Designer (php & JavaScript) • Editor w/ facebook & instagram

• Vpon - Data Scientist

• Taiwan Spark User Group

• Taiwan R User Group

Thanks for your attention

& Taiwan Spark User Group& Vpon Data Team

sparkr - play spark using r (20160909 hadoopcon)

Data & Analytics

sparkr the past, present and future -...

中期報告 interim report 2016 -...

hadoopcon 2015: building a stream processing system for...

generalized linear models in spark mllib and sparkr by...

using sparkr to scale data science applications in...

generalized linear models in spark mllib and sparkr

intermediatedata sciencelearningplan...

20160909 5thanniv tomaskaberger · data från bp...

mindfulness in the curriculum: preaching to the...

gearing for growth - listed...

sparkr: enabling interactive data science at scale

sparkr slides

dimensionality reduction of genomic variation with big data...

cfi faster automotive ethernet phy -...

issn workshops and resources - asia...

psychiatric disorders and trends in resource use in ...sep...

interface 2015 tutorial- sparkr ·...

parallelizing existing r packages with sparkr

a data frame abstraction layer for sparkr-(chris freeman,...

9 september 2016 asx market announcements office dear sir...