Building a Scalable Data Science Platform with R on HDInsight

BR005 Microsoft Machine Learning & Data Science Summit September 26 – 27 | Atlanta, GA

Upload: doanthuan

Post on 15-Dec-2016


Page 1: Building a Scalable Data Science Platform with R on HDInsight

BR005

Microsoft Machine Learning & Data Science Summit
September 26 – 27 | Atlanta, GA

Page 2: Building a Scalable Data Science Platform with R on HDInsight

Building a Scalable Data Science Platform with R on HDInsight

Debraj GuhaThakurta
Senior Data Scientist
Data Group – Algorithms and Data Science, Redmond

Email: [email protected]
Twitter: @d_guhathakurta

Co-contributors: Mario Inchiosa, Katherine Zhao, Hang Zhang, Max Kaznadi

Page 3: Building a Scalable Data Science Platform with R on HDInsight

Related talks

• R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight
  Mon, Sept 26, 1:30 – 2:30 PM, Maxim Lukiyanov
• Big, Fast, and Data-Furious…with Spark
  Mon, Sept 27, 12:30 – 1:30 PM, Maxim Lukiyanov
• Instructor-Led Lab: The Cortana Intelligence Suite - Part Two: Deep Dive
  Mon, Sept 26, 10:30 AM – 5 PM, Buck Woody
• Self-Paced Lab: Microsoft R Server
  Mon, Sept 26, 1 – 4 PM; Tue, Sept 27, 10:30 – 11:30 AM & 12:30 – 2:30 PM, Jeremy Reynolds
• Data Science Doesn’t Just Happen, It Takes a Process. Learn about Ours…
  Tue, Sept 27, 3 – 4 PM, Hang Zhang, Jacob Spoelstra, Gopi Kumar

Page 4: Building a Scalable Data Science Platform with R on HDInsight

Key takeaways

• Microsoft R Server: Benefits
• R Server on HDInsight (Premium, Preview): Scalable analytical platform on Azure
• How to:
  • Develop end-to-end data science process using R Server on Spark HDInsight (Premium)
  • How to adopt process and code

Page 5: Building a Scalable Data Science Platform with R on HDInsight

Agenda

• R and its benefits / limitations
• Microsoft R Server: Scalable, enterprise-class
• R Server on HDInsight (Premium) clusters
• Demo - Developing end-to-end data science processes using R Server on HDInsight Spark clusters
• Pointers to technical content: Tutorials, templates, blogs

Page 6: Building a Scalable Data Science Platform with R on HDInsight

R – its benefits and limitations

Page 7: Building a Scalable Data Science Platform with R on HDInsight

R - introduction

Community
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide

Language Platform
• The most popular statistical programming & ML language
• Data visualization & reporting tool
• Open source, transparent
• Free

Ecosystem
• 9,000+ contributed packages
• Applications & integration
• Many use cases / business problems addressed

Page 8: Building a Scalable Data Science Platform with R on HDInsight

Preferred language by Analytics Professionals

[Chart: "Which do you prefer to use: SAS, R, or Python?", compared across survey years. Source: SAS, R or Python Survey 2016, by Burtch Works]

[Chart: Unified IEEE Spectrum Ranking 2016. Source: http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages]

Page 9: Building a Scalable Data Science Platform with R on HDInsight

Common R use cases

Retail
• Sales & Marketing: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Pricing Strategy
• Customer & Channel: Personalization, Lifetime Customer Value, Product Segmentation
• Operations & Workforce: Store Location Demographics, Supply Chain Management, Inventory Management

Financial Services
• Sales & Marketing: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Risk & Compliance, Loan Defaults
• Customer & Channel: Personalization, Lifetime Customer Value
• Operations & Workforce: Call Center Optimization, Pay for Performance

Healthcare
• Sales & Marketing: Marketing Mix Optimization, Patient Acquisition
• Finance & Risk: Fraud Detection, Bill Collection
• Customer & Channel: Population Health, Patient Demographics
• Operations & Workforce: Operational Efficiency, Pay for Performance

Manufacturing
• Sales & Marketing: Demand Forecasting, Marketing Mix Optimization
• Finance & Risk: Pricing Strategy, Perf Risk Management
• Customer & Channel: Supply Chain Optimization, Personalization
• Operations & Workforce: Remote Monitoring, Predictive Maintenance, Asset Management

Page 10: Building a Scalable Data Science Platform with R on HDInsight

Processing limitations of open source R

• In-Memory Operation
• Lack of Parallelism
• Expensive Data Movement & Duplication

Page 11: Building a Scalable Data Science Platform with R on HDInsight

Open source R is not enterprise class

• Inadequacy of Community Support
• Lack of Guaranteed Support Timeliness
• No SLAs or Support Models

Page 12: Building a Scalable Data Science Platform with R on HDInsight

Microsoft R Server

Page 13: Building a Scalable Data Science Platform with R on HDInsight

R from Microsoft brings

Peace of mind · Speed and scalability · Efficiency · Flexibility

• Support and SLA
• Works on data in memory or on disk (scale)
• Wide range of scalable and distributed R functions
• Works in several compute contexts (incl. Hadoop, Spark, SQL Server) and data sources (incl. disk, HDFS, SQL)

Page 14: Building a Scalable Data Science Platform with R on HDInsight

R Server portfolio

Portability & investment assurance
Write once deploy anywhere - WODA

Platforms
• Cloud: Windows, Linux
• RDBMS: SQL Server 2016 EE, SQL Server 2016 SE
• Desktops & Servers: Windows, Linux
• Hadoop & Spark: Hortonworks, Cloudera, MapR
• EDW: SQL Server 2016, Teradata Database

R Server Technology
• R+CRAN
• Microsoft R Open
• DistributedR
• ScaleR
• ConnectR
• DeployR

Page 15: Building a Scalable Data Science Platform with R on HDInsight

ScaleR - parallel or distributed processing

• On a workstation:
  • All available cores used for math operations and parallel processes
  • Hard drive capacity sets limit for data size, not RAM
  • Works directly on XDF (external data frame) files on disk
• On a cluster:
  • Parallel utilization of nodes
  • Distributed file systems like HDFS greatly expand possible data sizes

Source: Lee Edlefsen, PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models

Page 16: Building a Scalable Data Science Platform with R on HDInsight

ScaleR PEMA: Parallel external memory algorithms

Stream data into RAM in blocks. “Big Data” can be any data size. Can handle Megabytes to Gigabytes to Terabytes…

ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.

Interim results are collected and combined analytically to produce the output on the entire data set.

XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.

Source: Lee Edlefsen, PEMAs applied to GLM, http://www.slideshare.net/RevolutionAnalytics/parallel-external-memory-algorithms-applied-to-generalized-linear-models
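The block-wise idea on this slide can be illustrated in base R. This is only a sketch and not ScaleR's implementation: it assumes ordinary least squares as the model, simulates "blocks" by slicing an in-memory data frame (a real external-memory run would read each block from disk), and uses hypothetical names such as chunk_size and partials. Each block contributes small intermediate results (X'X and X'y), which are summed and solved once at the end.

```r
# Sketch of a parallel-external-memory-style computation in base R.
# Assumption: the "model" is ordinary least squares; data are synthetic.
set.seed(42)
n <- 10000
df <- data.frame(x1 = rnorm(n), x2 = runif(n))
df$y <- 1.5 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n, sd = 0.1)

chunk_size <- 1000                      # hypothetical block size
starts <- seq(1, n, by = chunk_size)    # first row of each block

# Step 1: per-block intermediate results (could run on separate cores/nodes)
partials <- lapply(starts, function(s) {
  block <- df[s:min(s + chunk_size - 1, n), ]
  X <- cbind(1, block$x1, block$x2)     # design matrix with intercept
  list(xtx = crossprod(X), xty = crossprod(X, block$y))
})

# Step 2: combine the interim results analytically
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data set, for comparison only
beta_full <- unname(coef(lm(y ~ x1 + x2, data = df)))
print(rbind(chunked = beta_chunked, full = beta_full))
```

Because the cross-products are additive over rows, the combined estimate agrees with the full-data fit up to floating-point error, and only one block needs to be in memory at a time.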

Page 17: Building a Scalable Data Science Platform with R on HDInsight

Available ScaleR distributed algorithms

• Linear regression (rxLinMod)
• Generalized linear models (rxLogit, rxGlm)
• Decision trees (rxDTree)
• Gradient boosted decision trees (rxBTrees)
• Random forests (rxDForest)
• K-means (rxKmeans)
• Naïve Bayes (rxNaiveBayes)

Page 18: Building a Scalable Data Science Platform with R on HDInsight

ScaleR distributed algorithms

ETL
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
• Covariance & Correlation Matrices
• Predictions/scoring for models
• Residuals for all models

Clustering
• K-Means

Machine Learning
• Linear regression
• Logistic regression
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Custom Parallelization
• rxExec
• PEMA-R API

Variable Selection
• Stepwise Regression

Page 19: Building a Scalable Data Science Platform with R on HDInsight

Typical uses of open source R

• Any analysis that is more complex than simple aggregations
• Analysis with data that fit in physical memory of single machines
• Creating sophisticated visualizations (e.g. ggplot, lattice)
• Creating reports (use knitr and Markdown)
• Analyses that use domain-specific tools or cutting-edge algorithms
  • e.g. forecasting, health informatics, etc.

Page 20: Building a Scalable Data Science Platform with R on HDInsight

Typical uses of R Server

• Working with big data
• Building models that take too long to run in R
• Working with clusters and distributed file systems
  • e.g. HDInsight clusters + HDFS
• Developing portable scripts for many compute contexts

Page 21: Building a Scalable Data Science Platform with R on HDInsight

Benefits of R Server

Big Data
• Open source R: In-memory bound
• R Server: Hybrid memory & disk scalability
• Benefit: Operates on bigger volumes of data

Speed of Analysis
• Open source R: Single threaded
• R Server: Parallel threading
• Benefit: Shrinks analysis time

Enterprise Readiness
• Open source R: Community support
• R Server: Commercial support
• Benefit: Delivers full service production support

Analytic Breadth & Depth
• Open source R: 9000+ innovative analytic packages
• R Server: Leverage open source packages plus Big Data ready packages
• Benefit: Supercharges R with ScaleR functions

Commercial Viability
• Open source R: Risk of deployment of open source
• R Server: Commercial license
• Benefit: Eliminate risk with open source

Page 22: Building a Scalable Data Science Platform with R on HDInsight

R Server on HDInsight (Premium)

Page 23: Building a Scalable Data Science Platform with R on HDInsight

R Server on HDInsight (Premium)
Managed Hadoop for advanced analytics in the Cloud

• Easy setup, elastic, SLA
• R Server benefits
  • Leverage R skills
  • ScaleR functions
  • ….
• Familiar & enhanced IDEs
  • Popular IDEs (RStudio, RTVS, Notebooks, etc.)

[Stack diagram: R; RevoScaleR; Others (e.g. SparkR); Hadoop / Spark; Blob Storage (HDFS); Data Lake Storage]

Page 24: Building a Scalable Data Science Platform with R on HDInsight

Provisioning HDInsight (Premium) with R Server


Page 25: Building a Scalable Data Science Platform with R on HDInsight

Elastic - Scaling HDInsight clusters

Page 26: Building a Scalable Data Science Platform with R on HDInsight

R Server on HDInsight - Architecture

[Diagram labels: Data Scientists; R Server; Edge; Head Nodes; Data/Worker Nodes (each shown running R)]

Page 27: Building a Scalable Data Science Platform with R on HDInsight

R Server on HDInsight - Connectivity

[Diagram labels: Thin Client IDEs; Jupyter Notebooks; ssh or R Tools for Visual Studio; connections marked https://, https://, and "Remote Execution: ssh"; Edge Node; R Server Master Task; Worker Task (three shown); "or MapReduce"]

Page 28: Building a Scalable Data Science Platform with R on HDInsight

R Server on HDInsight - Data processing

Server Local Processing
• R process on Edge Node
• Data in Distributed Storage

Server Distributed Processing
• Master R process on Edge Node
• Apache YARN / Spark
• Worker R processes on Data Nodes

Page 29: Building a Scalable Data Science Platform with R on HDInsight

Write once deploy anywhere - WODA
Switching compute contexts

Code can be deployed from a server or edge node to run in Spark/Hadoop without any functional R model re-coding.

Compute context R script - sets where the model will run

Local parallel processing - Linux or Windows:

    ## SETUP LOCAL ENVIRONMENT VARIABLES ##
    myLocalCC <- "localpar"

    ## LOCAL COMPUTE CONTEXT ##
    rxSetComputeContext(myLocalCC)

In Spark/Hadoop (set one of the two):

    ## SPARK COMPUTE CONTEXT ##
    mySparkCC <- RxSpark()
    rxSetComputeContext(mySparkCC)

    ## HADOOP MAPREDUCE COMPUTE CONTEXT ##
    myHadoopCC <- RxHadoopMR()
    rxSetComputeContext(myHadoopCC)

R script – does not need to change to run in Hadoop/Spark

    ## Statistical Summary
    rxSummary(~ ArrDelay + DayOfWeek, data = AirlineData, reportProgress = 1)

    ## Linear model and plot
    hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
                                     data = AirlineData)

Page 30: Building a Scalable Data Science Platform with R on HDInsight

R Script for Execution in MapReduce

Sample R Script:

    ## Define Compute Context
    rxSetComputeContext(RxHadoopMR(…))

    ## Define Data Source
    ## (hdfsFS is assumed to be defined earlier, e.g. hdfsFS <- RxHdfsFileSystem())
    inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)

    ## Train Predictive Model
    model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Page 31: Building a Scalable Data Science Platform with R on HDInsight

Easy to Switch From MapReduce to Spark

Sample R Script:

    ## Change the Compute Context
    rxSetComputeContext(RxSpark(…))

    ## Keep other code unchanged
    ## (hdfsFS is assumed to be defined earlier, e.g. hdfsFS <- RxHdfsFileSystem())
    inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
    model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)

Page 32: Building a Scalable Data Science Platform with R on HDInsight

Creating a data science process using R Server on Spark HDInsight

Page 33: Building a Scalable Data Science Platform with R on HDInsight

Apache Spark engine and its APIs

[Diagram: Spark Core, with Spark Streaming, Spark SQL, MLlib, and GraphX built on top. Credit: Denny Lee, Databricks]

o Scale out, fault tolerant, distributed, in-memory processing
o Multi-language API (incl. R)
o Standard libraries: ML, statistics

Page 34: Building a Scalable Data Science Platform with R on HDInsight

Spark’s use cases - Diverse industries & scenarios

Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

Page 35: Building a Scalable Data Science Platform with R on HDInsight

Spark advanced analytics

• Advanced analytics is an important Spark feature
• R is rapidly gaining popularity (available since June 2015)

Source: Databricks Spark 2015 survey report, https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html

Page 36: Building a Scalable Data Science Platform with R on HDInsight

Open-source packages for ML in Spark using R

o SparkR:
  o R package - a light-weight front-end for Apache Spark from R
  o Limited in terms of ML algo bindings at this time
  o Works on MLlib functions (RDDs)
o Sparklyr-ML:
  o Developed by RStudio
  o Provides R bindings to spark.ml library

Page 37: Building a Scalable Data Science Platform with R on HDInsight

Data science / advanced analytics process

http://aka.ms/tdsp


Page 38: Building a Scalable Data Science Platform with R on HDInsight

Building intelligent applications using team data science process

• Git-based repositories with templates providing a central archive
• Standardized project structure
• Document templates
• Utility scripts
• Independent of the execution environment, to allow scientists to use multiple cloud resources as needs dictate

https://blogs.technet.microsoft.com/machinelearning/2016/09/08/building-intelligent-applications-using-the-team-data-science-process/
http://aka.ms/tdsp

Related talk: Data Science Doesn’t Just Happen, It Takes a Process. Learn about Ours…
Tue, Sept 27, 3 – 4 PM, Hang Zhang, Jacob Spoelstra, Gopi Kumar

Page 39: Building a Scalable Data Science Platform with R on HDInsight

DS process shown in demo

Prepare → Model → Operationalize

• Prepare: Assemble, cleanse, profile and transform diverse data relevant to the subject
• Model: Use statistical and machine learning algorithms to build classifiers and regression models
• Operationalize: Make predictions and visualizations to support business applications

Page 40: Building a Scalable Data Science Platform with R on HDInsight

E2E Demo/Example
Flight arrival delay prediction

1. Provisioning clusters using PowerShell scripts
2. Prep (Clean/Join) – Using SparkR from R Server
3. Model (Train/Score/Evaluate) – ScaleR
4. Deployment – to Azure ML from R Server

Page 41: Building a Scalable Data Science Platform with R on HDInsight

End-to-end data science process example

[Diagram. Services: Azure Blob Storage, HDInsight, Microsoft R Server, Azure Machine Learning, Web Application, Power BI. Stages: Data Sources, Data Partition, Feature Engineering, Model Training, Predictions, Web Services, Consumption.]

KDD 2016 (Tutorial Using R on Spark): tinyurl.com/KDD2016R
Azure Machine Learning: https://azure.microsoft.com/en-us/services/machine-learning/

Page 42: Building a Scalable Data Science Platform with R on HDInsight

Technologies / services used

• Azure blob storage (HDFS)
• R Server on Spark HDInsight (Premium)
• Azure ML R package and Azure ML web service
• Power BI (optional)

Page 43: Building a Scalable Data Science Platform with R on HDInsight

Provisioning & deleting R Server Spark HDInsight clusters using Azure PowerShell cmdlets & ARM templates

    ## CREATE CLUSTERS USING ARM TEMPLATES
    $templatePath = "https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/KDDCup2016/Scripts/Configuration/azuredeploy.json";

    ## Template parameters: cluster type, name, credentials, and worker count
    $hdiparams = @{
        clusterType = "spark";
        clusterName = $clustername;
        clusterLoginUserName = "admin";
        clusterLoginPassword = $clusterpasswd;
        sshUserName = "remoteuser";
        sshPassword = $clusterpasswd;
        clusterWorkerNodeCount = 2
    };

    New-AzureRmResourceGroupDeployment -Name $clustername -ResourceGroupName $resourcegroup -TemplateParameterObject $hdiparams -TemplateUri $templatePath;

    ## DELETE CLUSTERS
    Remove-AzureRmHDInsightCluster -ClusterName $clustername

Page 44: Building a Scalable Data Science Platform with R on HDInsight

Script-based deployment of HDInsight clusters with R Server

Page 45: Building a Scalable Data Science Platform with R on HDInsight

Prediction task: Predict flight delays

• Predict whether or not a flight arrival will be delayed by 15 minutes (binary classification), based on features:
  • Airline, flight, airport
    • Airline carrier
    • Type of airplane / vehicle
    • Departure and arrival airports
    • Flight distance
    • Month, week, day
  • Weather
    • Wind speed
    • Visibility
    • Humidity
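As a small illustration of how such a binary target can be derived, here is a base R sketch on synthetic records. The column names (carrier, distance, arr_delay, arr_del15) are hypothetical stand-ins, and treating a delay of exactly 15 minutes as "delayed" is an assumption; the slide does not say whether the threshold is inclusive.

```r
# Synthetic flight records; all values and column names are made up.
flights <- data.frame(
  carrier   = c("AA", "DL", "UA", "AA", "DL", "UA"),
  distance  = c(300, 1200, 800, 450, 2100, 950),
  arr_delay = c(-5, 22, 15, 40, 3, NA),   # minutes; NA = outcome unknown
  stringsAsFactors = FALSE
)

# Drop rows where the outcome is unknown
flights <- flights[!is.na(flights$arr_delay), ]

# Binary target: 1 if the arrival delay is 15 minutes or more (assumed inclusive)
flights$arr_del15 <- as.integer(flights$arr_delay >= 15)

# Class balance of the target
print(table(flights$arr_del15))
```

Checking the class balance early is useful for a delay problem, since on-time flights usually outnumber delayed ones and that affects how a classifier should be evaluated.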

Page 46: Building a Scalable Data Science Platform with R on HDInsight

Data-set: Airline & Weather

Airline
• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
• >20 years of data
• 300+ airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov

Weather
• Hourly land-based weather observations from NOAA (National Oceanic and Atmospheric Administration)
• >2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/

Page 47: Building a Scalable Data Science Platform with R on HDInsight

Connection: Thin client → RStudio Server, plus a glimpse of the down-sampled data (19 million rows)

Page 48: Building a Scalable Data Science Platform with R on HDInsight

Data prep
Clean and Join using SparkR in R Server

• SparkR: R package - a light-weight front-end for Apache Spark from R
• Provides distributed operations like selection, filtering, aggregation using SparkSQL
• Distributed machine learning using Apache Spark’s MLlib (limited)
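The clean-and-join step can be sketched on a single machine with base R data frames. This is a stand-in for the logic only, not the SparkR code from the demo: the tables are synthetic, the column names (origin, airport, hour, wind_speed, visibility) are hypothetical, and joining on airport plus hour is an assumption about how flights are matched to weather observations.

```r
# Synthetic flight and weather tables; names and values are made up.
flights <- data.frame(
  origin    = c("ATL", "ATL", "SEA", "ORD"),
  hour      = c(9L, 14L, 9L, 9L),
  arr_delay = c(12, NA, 30, -4),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  airport    = c("ATL", "ATL", "SEA"),
  hour       = c(9L, 14L, 9L),
  wind_speed = c(5, 12, 20),
  visibility = c(10, 8, 3),
  stringsAsFactors = FALSE
)

# Clean: drop flights whose arrival delay is missing
flights_clean <- flights[!is.na(flights$arr_delay), ]

# Join: inner join of flights to weather on (airport, hour)
joined <- merge(flights_clean, weather,
                by.x = c("origin", "hour"),
                by.y = c("airport", "hour"))
print(joined)
```

An inner join silently drops flights with no matching weather record (ORD here), so comparing row counts before and after the join is a cheap sanity check.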

Page 49: Building a Scalable Data Science Platform with R on HDInsight

Modeling
Train, score, and evaluate using ScaleR functions
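The train, score, and evaluate loop can be sketched with base R's glm as a single-machine stand-in for the ScaleR functions. Everything below is an illustrative assumption: the data are simulated, the feature names (wind_speed, distance) and the 80/20 split are hypothetical, and the auc helper is a hand-written rank-based (Mann-Whitney) estimate rather than a library call.

```r
# Simulated data: delay probability increases with wind speed (made-up relationship).
set.seed(1)
n <- 2000
d <- data.frame(wind_speed = runif(n, 0, 30),
                distance   = runif(n, 100, 2500))
d$arr_del15 <- rbinom(n, 1, plogis(-2 + 0.12 * d$wind_speed))

# Train/test split (80/20)
train_idx <- sample(n, size = 0.8 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Train: logistic regression
model <- glm(arr_del15 ~ wind_speed + distance, data = train, family = binomial())

# Score: predicted probability of a delay on held-out rows
test$score <- predict(model, newdata = test, type = "response")

# Evaluate: AUC from the rank-sum formula (ties receive average ranks)
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
test_auc <- auc(test$score, test$arr_del15)
accuracy <- mean((test$score > 0.5) == (test$arr_del15 == 1))
print(c(auc = test_auc, accuracy = accuracy))
```

Scoring on rows held out from training keeps the evaluation honest; the same three steps apply whatever engine fits the model.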

Page 50: Building a Scalable Data Science Platform with R on HDInsight

Modeling scalability with ScaleR on Spark HDInsight

Scales linearly to hundreds of nodes, billions of rows and terabytes of data

[Chart: Logistic Regression on NYC Taxi Dataset. X-axis: number of rows (billions), ticks from 0 to 13,000,000,000. Y-axis: elapsed time, ticks from 0 to 1800 (units not shown). Annotation: 2.2 TB.]

HDInsight (Premium) Spark cluster: 100 D12 (4 core, 28 GB) worker nodes

Credit: Mario Inchiosa

Page 51: Building a Scalable Data Science Platform with R on HDInsight

Comparison of ScaleR with open source algorithms (Preliminary)

Configuration:
• HDI cluster size: 7 nodes
• 1 Edge Node: 8 cores, 28GB
• 4 Worker Nodes: 8 cores, 28GB
• Dataset: Duplicated Airlines data (.csv)
• Number of columns: 26

[Chart: Logistic Regression (E2E - reading from csv files). X-axis: number of rows (million), ticks 1 to 9. Y-axis: elapsed time. Four series (Series1 to Series4); legend labels not recoverable.]

Credit: Katherine Zhao

Page 52: Building a Scalable Data Science Platform with R on HDInsight

Azure ML - Deploying web services for predictive analytics

• Easily build ML models
• Easily deploy models as web-services

Page 53: Building a Scalable Data Science Platform with R on HDInsight

Deployment
Publish Web Service from R Server in AzureML

azureml-settings.json

    {"workspace": {
        "id": "<>",
        "authorization_token": "<>",
        "api_endpoint": "https://studioapi.azureml.net",
        "management_endpoint": "https://management.azureml.net"
    }}

Page 54: Building a Scalable Data Science Platform with R on HDInsight

A prediction web service in AzureML

Page 55: Building a Scalable Data Science Platform with R on HDInsight

Adopting process and code - Resources

Page 56: Building a Scalable Data Science Platform with R on HDInsight

Tutorials - Scalable data analytics using R Server

• KDD Conference tutorial 2016
  • http://www.tinyurl.com/KDD2016R
• Public GitHub repository

Page 59: Building a Scalable Data Science Platform with R on HDInsight

Summary & acknowledgements

Page 60: Building a Scalable Data Science Platform with R on HDInsight

Summary

• R Server on Azure HDInsight (Premium): a managed distributed compute platform for data science
• Scalable end-to-end processes can be built on HDI clusters integrated with other Azure services
• Published resources (with code) available for developing analytical workflows

Page 61: Building a Scalable Data Science Platform with R on HDInsight

Acknowledgements

• Mario Inchiosa [Principal Software Engineer]
• Katherine Zhao [Data Scientist II]
• Jeremy Reynolds [Senior Data Scientist Lead]
• Max Kaznadi [Data Scientist II]
• Hang Zhang [Senior Data Scientist Manager]

Page 62: Building a Scalable Data Science Platform with R on HDInsight

Thank you!

Debraj [email protected]

Page 63: Building a Scalable Data Science Platform with R on HDInsight

© Copyright Microsoft Corporation. All rights reserved.

Page 64: Building a Scalable Data Science Platform with R on HDInsight

Backups

Page 65: Building a Scalable Data Science Platform with R on HDInsight

R Server architecture

[Diagram labels: R Open; Microsoft R Server; R+CRAN; DistributedR; ScaleR; ConnectR; DeployR; RTVS]

R+CRAN
• Open source R interpreter
• Freely-available huge range of R algorithms
• Algorithms callable by Microsoft R
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages

Microsoft R Open
• Based on open source R
• High-performance math library to speed up linear algebra functions
• Checkpoint package to easily share R code and replicate results using specific R package versions

DistributedR
• Distributed computing framework
• Delivers cross-platform portability

ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes

ConnectR
• High-speed & direct connectors
• Available for:
  • High-performance XDF
  • SAS, SPSS, delimited & fixed format text data files
  • Hadoop HDFS (text & XDF)
  • Teradata Database
  • EDWs and ADWs
  • ODBC

DeployR
• RESTful APIs for easy integration from Java, JavaScript, .NET
• Enterprise authentication & security

R Tools for Visual Studio
• State of the art, R Tools for Visual Studio IDE

Page 66: Building a Scalable Data Science Platform with R on HDInsight

Modeling
Train, Score, and Evaluate using R Server

Page 67: Building a Scalable Data Science Platform with R on HDInsight

Deployment
Publish Web Service from R

    ## Convert the ScaleR decision tree to an rpart object for scoring
    rpartModel <- as.rpart(dTreeModel)

    ## Scoring function to publish as the web service
    scoringFn <- function(newdata) {
      library(rpart)
      predict(rpartModel, newdata = newdata)
    }

azureml-settings.json

    {"workspace": {
        "id": "<>",
        "authorization_token": "<>",
        "api_endpoint": "https://studioapi.azureml.net",
        "management_endpoint": "https://management.azureml.net"
    }}