recent development in sparkr for advanced analyticsfiles.meetup.com/4439192/recent development in...

Recent Developments in SparkR for Advanced Analytics

Xiangrui Meng [email protected]

2016/07/15 - Silicon Valley Machine Learning Meetup

mailto:[email protected]?subject=

About Apache Spark

2

• General engine for large-scale data processing under the MapReduce framework

• Resilient Distributed Dataset (RDD) with in-memory & on-disk caching

• Concise APIs in Python, Scala, Java, and R

• Apache open source license

• Spark 2.0 coming soon!

Spark

Spark SQL Streaming MLlib GraphX SparkR

Apache Spark is the Taylor Swift of Big Data Software

- Derrick Harris, Fortune

4

About Databricks

• Founded by the team who created Apache Spark • Offers a hosted service • Apache Spark in the cloud • Notebooks • Cluster management • Production environment

• Free Community Edition • http://databricks.com/try

5

http://goo.gl/mhlxUz

About Me

• Software Engineer at Databricks • tech lead of machine learning and data science

• Committer and PMC member of Apache Spark • Ph.D. from Stanford in computational mathematics

6

Outline

• Introduction to SparkR •Descriptive analytics in SparkR •Predictive analytics in SparkR • Future directions

7

Introduction to SparkR

Bridging the gap between R and Big Data

SparkR

• Introduced to Spark since 1.4 • Wrappers over DataFrames and DataFrame-based APIs

• In SparkR, we make the APIs similar to existing ones in R (or R packages), rather than Python/Java/Scala APIs. • R is very convenient for analytics and users love it. • Scalability is the main issue, not the API.

9

DataFrame-based APIs

• Storage: S3 / HDFS / local / … • Data sources: csv / parquet / json / … • DataFrame operations: • select / subset / groupBy / agg / collect / … • rand / sample / avg / var / …

• Conversion to/from R data.frame

10

DataFrame-based APIs

Compute average age per department

• df.groupBy(“dept”).avg(“age”) # Python • df.groupBy(“dept”).avg(“age”) // Scala • df.groupBy(“dept”).avg(“age”); // Java

• avg(groupBy(df, “dept”), “age”) # R • df %>% groupBy(“dept”) %>% avg(“age”) # R with magrittr

11

+----+--------+ |dept|avg(age)| +----+--------+ | eng| 25.0| | ops| 30.0| +----+--------+

SparkR Architecture

12

Spark Driver

R JVM

R Backend

JVM

Worker

JVM

Worker

Data Sources

Data Conversion between R and SparkR

13

R JVM

R BackendSparkR::collect()

SparkR::createDataFrame()

Descriptive Analytics

Big Data at a glimpse in SparkR

Summary Statistics

15

• count, min, max, mean, standard deviation, variance describe(df)

df %>% groupBy(“dept”, avgAge = avg(df$age))

• covariance, correlation df %>% select(var_samp(df$x, df$y))

• skewness, kurtosis df %>% select(skewness(df$x), kurtosis(df$x))

Sampling Algorithms

• Bernoulli sampling (without replacement)df %>% sample(FALSE, 0.01)

• Poisson sampling (with replacement)df %>% sample(TRUE, 0.01)

• stratified samplingdf %>% sampleBy(“key”, c(positive = 1.0, negative = 0.1))

16

Approximate Algorithms

• frequent items [Karp03] df %>% freqItems(c(“title”, “gender”), support = 0.01)

• approximate quantiles [Greenwald01] df %>% approxQuantile(“value”, c(0.1, 0.5, 0.9), relErr = 0.01)

• single pass with aggregate pattern • trade-off between accuracy and space

17

Implementation: Aggregation Pattern

split + aggregate + combine in a single pass • split data into multiple partitions • calculate partially aggregated result on each partition • combine partial results into final result

18

Implementation: High-Performance

• new online update formulas of summary statistics • code generation to achieve high performance

kurtosis of 1 billion values on a Macbook Pro (2 cores):

19

scipy.stats 250s

octave 120s

CRAN::moments 70s

SparkR / Spark / PySpark 5.5s

https://arxiv.org/pdf/1510.04923.pdf

Predictive Analytics

Enabling large-scale machine learning in SparkR

MLlib + SparkR

MLlib and SparkR integration started in Spark 1.5.

API design choices: 1. mimic the methods implemented in R or R packages

• no new method to learn • similar but not the same / shadows existing methods • inconsistent APIs

2. create a new set of APIs

21

MLlib Algorithm Coverage• Classification • Logistic regression • Naive Bayes • Streaming logistic regression • Linear SVMs • Decision trees • Random forests • Gradient-boosted trees • Multilayer perceptron

• Regression • Ordinary least squares • Ridge regression • Lasso • Isotonic regression • Decision trees • Random forests • Gradient-boosted trees • Survival regression • Streaming linear methods

• Frequent pattern mining • FP-growth • PrefixSpan

22

Clustering • Gaussian mixture models • K-Means • Streaming K-Means • Latent Dirichlet Allocation • Power Iteration Clustering • Bisecting k-means

Statistics • Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation • Kolmogorov–Smirnov test Linear algebra • Local dense & sparse vectors & matrices • Distributed matrices

• Block-partitioned matrix • Row matrix • Indexed row matrix • Coordinate matrix

• Matrix decompositions

Recommendation • Alternating Least Squares Feature extraction & selection • Word2Vec • Chi-Squared selection • Hashing term frequency • Inverse document frequency • Normalizer • Standard scaler • Tokenizer • One-Hot Encoder • StringIndexer • VectorIndexer • VectorAssembler • Binarizer • Bucketizer • ElementwiseProduct • PolynomialExpansion

Model import/export Pipelines

List based on Spark 1.6

Generalized Linear Models (GLMs)

• Linear models are simple but extremely popular. • A GLM is specified by the following: • a distribution of the response (from the exponential family), • a link function g such that

• maximizes the sum of log-likelihoods

23

Distributions and Link Functions

SparkR supports all families supported by R in Spark 2.0.

24

Model Distribution Link

linear least squares normal identity

logistic regression binomial logit

Poisson regression Poisson log

gamma regression gamma inverse

… … …

GLMs in SparkR

# Create the DataFrame for training df <- read.df(sqlContext, “path/to/training”)

# Fit a Gaussian linear model model <- glm(y ~ x1 + x2, data = df, family = “gaussian”) # mimic R model <- spark.glm(df, y ~ x1 + x2, family = “gaussian”)

# Get the model summary summary(model)

# Make predictions predict(model, newDF)

25

Implementation: SparkR::glm

The `SparkR::glm` is a simple wrapper over an ML pipeline that consists of the following stages:

• RFormula, which itself embeds an ML pipeline for feature preprocessing and encoding,

• an estimator (GeneralizedLinearRegression).

26

RWrapper

Implementation: SparkR::glm

27

RFormula

GLM

RWrapper

RFormula

GLM

StringIndexer

VectorAssembler

IndexToString

StringIndexer

Implementation: R Formula

28

• R provides model formula to express models. • We support the following R formula operators in SparkR:

• `~` separate target and terms • `+` concat terms, "+ 0" means removing intercept • `-` remove a term, "- 1" means removing intercept • `:` interaction (multiplication for numeric values, or binarized

categorical values) • `.` all columns except target

• The implementation is in Scala.

GLM: Row-based Distributed Storage

29

w x y

w x y

partition 1

partition 2

GLM: Gradient Descent Methods

• Stochastic gradient descent (SGD): • trade-offs on the merge scheme and convergence

• Mini-batch SGD: • hard to sample mini-batches efficiently • communication overhead on merging gradients

• Batch gradient descent: • slow convergence

30

GLM: Quasi-Newton methods

• Newton’s method converges much than GD, but it requires second-order information: • L-BFGS works for smooth objectives. It approximates the

inverse Hessian using only first-order information. • OWL-QN works for objectives with L1 regularization. • MLlib calls L-BFGS/OWL-QN implemented in breeze.

31

Direct Methods for Linear Least Squares

• Linear least squares has an analytic solution:

• The solution could be computed directly or through QR factorization, both of which are implemented in Spark. • requires only a single pass • efficient when the number of features is small (<4000) • provides R-like model summary statistics

32

GLM: Iteratively Re-weighted Least Squares

• Generalized linear models with exponential family can be solved via iteratively re-weighted least squares (IRLS). • linearizes the objective at the current solution • solves the weighted linear least squares problem • repeat above steps until convergence

• efficient when the number of features is small (<4000) • provides R-like model summary statistics • This is the implementation in R.

33

Standardization

To match the result in both R and glmnet, the most popular R package for GLMs, we provide options to standardize features and labels before training:where delta is the stddev of labels, and sigma_j is the stddev of the j-th feature column.

34

Implementation: Test against R

Besides normal tests, we also verify our implementation using R.

/* df <- as.data.frame(cbind(A, b)) for (formula in c(b ~ . -1, b ~ .)) { model <- lm(formula, data=df, weights=w) print(as.vector(coef(model))) } [1] -3.727121 3.009983 [1] 18.08 6.08 -0.60 */ val expected = Seq(Vectors.dense(0.0, -3.727121, 3.009983), Vectors.dense(18.08, 6.08, -0.60))

35

ML Models in SparkR (Spark 2.0)

• generalized linear models (GLMs) • glm / spark.glm (stats::glm)

• accelerated failure time (AFT) model for survival analysis • spark.survreg (survival)

• k-means clustering • spark.kmeans (stats:kmeans)

• Bernoulli naive Bayes • spark.naiveBayes (e1071)

36

Model Persistence in SparkR

• model persistence supported for all ML models in SparkR • thin wrappers over pipeline persistence from MLlib

model <- spark.glm(df, x ~ y + z, family = “gaussian”)

write.ml(model, path)

model <- read.ml(path)

summary(model)

• feasible to pass saved models to Scala/Java engineers

37

Work with R Packages in SparkR

• There are ~8500 community packages on CRAN. • It is impossible for SparkR to match all existing features.

• Not every dataset is large. • Many people work with small/medium datasets.

• SparkR helps in those scenarios by: • connecting to different data sources, • filtering or downsampling big datasets, • parallelizing training/tuning tasks.

38

Work with R Packages in SparkR

df <- sqlContext %>% read.df(…) %>% collect()

points <- data.matrix(df)

run_kmeans <- function(k) {

kmeans(points, centers=k)

}

kk <- 1:6

lapply(kk, run_kmeans) # R’s apply

spark.lapply(kk, run_kmeans) # parallelize the tasks on Spark

39

summary(this.talk)

• SparkR enables big data analytics on R • descriptive analytics on top of DataFrames • predictive analytics from MLlib integration

• SparkR works well with existing R packages

Thanks to the Apache Spark community for developing and maintaining SparkR: Alteryx, Berkeley AMPLab, Databricks, Hortonworks, IBM, Intel, etc, and individual contributors!!

40

Future Directions

• CRAN release of SparkR (WIP) • more consistent APIs with existing R packages: dplyr, etc • better R formula support (WIP) • more algorithms from MLlib: decision trees, ALS, etc (WIP) • better integration with existing R packages: gapply / UDFs (WIP) • integration with Spark packages: GraphFrames, CoreNLP, etc

We’d greatly appreciate feedback from the R community!

41

Try Apache Spark with Databricks

42

http://databricks.com/try

• Download a companion notebook of this talk at: http://dbricks.co/1rbujoD

• Try latest version of Apache Spark and preview of Spark 2.0

http://goo.gl/mhlxUz

http://dbricks.co/1rbujoD

Thank you.• SparkR user guide on Apache Spark website • SparkR/MLlib roadmap for Spark 2.1

• Databricks Community Edition and blog posts

http://spark.apache.org/docs/latest/sparkr.html

https://issues.apache.org/jira/browse/SPARK-15799

https://community.cloud.databricks.com/

https://databricks.com/blog

recent development in sparkr for advanced analyticsfiles.meetup.com/4439192/recent development in...

Documents