sparkr: interactive data science at scalefiles.meetup.com/1225993/sparkr-barug.pdfsparkr:...

31
SparkR: Interactive Data Science at Scale Shivaram Venkataraman

Upload: others

Post on 28-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

SparkR: Interactive Data Science at Scale

Shivaram Venkataraman

Page 2: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Fast

Scalable Flexible

Page 3: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

DataFrames

Packages Plots

Page 4: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Fast

Scalable

Flexible

DataFames

Plots

Packages

Page 5: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Outline

SparkR Distributed Lists (RDD) Design Details SparkR DataFrames Roadmap

Page 6: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

RDD

Parallel Collection

Page 7: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

RDD

Parallel Collection

Transformations map filter

groupBy …

Actions count collect

saveAsTextFile …

Page 8: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

R + RDD =

Page 9: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

R + RDD = R2D2

Page 10: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

R + RDD = RRDD

Page 11: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

R + RDD = RRDD

lapply lapplyPartition

groupByKey reduceByKey sampleRDD

collect cache filter …

broadcast includePackage

textFile parallelize

Page 12: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Example: word_count.R library(SparkR)  lines  <-­‐  textFile(sc,  “hdfs://my_text_file”)  

Page 13: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Example: word_count.R library(SparkR)  lines  <-­‐  textFile(sc,  “hdfs://my_text_file”)  words  <-­‐  flatMap(lines,                                    function(line)  {                                        strsplit(line,  "  ")[[1]]                                    })  wordCount  <-­‐  lapply(words,    

         function(word)  {            list(word,  1L)              })  

Page 14: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Example: word_count.R library(SparkR)  lines  <-­‐  textFile(sc,  “hdfs://my_text_file”)  words  <-­‐  flatMap(lines,                                    function(line)  {                                        strsplit(line,  "  ")[[1]]                                    })  wordCount  <-­‐  lapply(words,    

         function(word)  {            list(word,  1L)              })  

counts  <-­‐  reduceByKey(wordCount,  "+",  2L)  output  <-­‐  collect(counts)  

Page 15: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

How does this work ?

Page 16: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Dataflow

Local Worker

Worker

Page 17: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Dataflow

Local

R

Worker

Worker

Page 18: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Dataflow

Local

R Spark Context

Java Spark

Context

R-JVM bridge

Worker

Worker

Page 19: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Dataflow

Local Worker

Worker R Spark Context

Java Spark

Context

Spark Executor exec R

Spark Executor exec R

R-JVM bridge

Page 20: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

SparkR DataFrames

Page 21: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Need for DataFrames

Structured Data Processing Read in CSV, JSON, JDBC etc.

Data source for Machine Learning

 glm(a  ~  b  +  c,  data  =  df) Functional transformations not intuitive

Page 22: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Spark SQL Imposes a schema on RDDs Query Optimizer, Code Gen Rich DataSources API >> Hive, Parquet, JDBC, JSON

SchemaRDDs DataFrames!

User

User

Name Age Height

Name Age Height

Page 23: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

SparkR DataFrame Methods

Filter -- Select some rows  filter(df,  df$col1  >  0)  

Project -- Select some columns  df$col1  or  df[“col”]  

Page 24: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

SparkR DataFrame Methods

Filter -- Select some rows Project -- Select some columns Aggregate -- Group and Summarize data groupDF  <-­‐  groupBy(df,  df$col1)  

agg(groupDF,  sum(groupDF$col2),  max(groupDF$col3))  

Sort -- Sort data by a particular column sortDF(df,  asc(df$col1))  

 

 

Page 25: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Demo

Page 26: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Developer Community Originated in AMPLab 19 contributors AMPLab, Alteryx, Databricks, Intel

Page 27: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Merged with Spark !

Part of Apache Spark 1.4

Page 28: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Coming Soon: ML Pipelines

High-level APIs to do Machine learning Example: glm, kmeans

Pipelines with featurizers, learning

Tokenize à TF-IDF à LogisticRegression Extended models, summary methods

Page 29: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

Coming Soon

APIs for Streaming, Time series analysis Distributed matrix operations

<Your SparkR use case ?>

Page 30: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames

SparkR

Combine scalability & utility

RDD à distributed lists Re-use existing packages Distributed DataFrames

Shivaram Venkataraman [email protected]

Page 31: SparkR: Interactive Data Science at Scalefiles.meetup.com/1225993/SparkR-BARUG.pdfSparkR: Interactive Data Science at Scale Shivaram Venkataraman . Fast Scalable Flexible . DataFrames