Sawmill - Integrating R and Large Data Clouds
TRANSCRIPT

Sawmill: Some Lessons Learned Running R in Large Data Clouds
Robert Grossman, Open Data Group
What Do You Do if Your Data is Too Big for a Database?
• Give up and invoke sampling.
• Buy a proprietary system and ask for a raise.
• Begin to build a custom system and explain why it is not yet done.
• Use Hadoop.
• Use an alternative large data cloud (e.g. Sector).
Basic Idea
1. Turn it into a pleasantly parallel problem.
2. Use a large data cloud to manage and prepare the data.
3. Use a Map/Bucket function to split the job.
4. Run R on each piece using Reduce/UDFs or streams.
5. Use PMML multiple models to glue the pieces together.
Why Listen?
• This approach allows you to scale R relatively easily to hundreds of TB, up to PB.
• The approach is easy.
• (A plus: it may look hard to your colleagues, boss, or clients.)
• There is at least an order of magnitude of performance to be gained with the right design.
Part 1. Stacks for Big Data
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
Map-Reduce Example
• Input is a file with one document per record.
• The user specifies the map function:
  – key = document URL
  – value = terms that the document contains

  ("doc cdickens", "it was the best of times")
    → map →
  ("it", 1), ("was", 1), ("the", 1), ("best", 1)
Example (cont'd)
• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1
    → reduce →
  ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
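To make the example concrete, here is a minimal word-count sketch in R for Hadoop streaming, one of the ways of running R discussed later in this talk. This is not from the original slides; the file names, paths, and tokenization are illustrative assumptions.

  #!/usr/bin/env Rscript
  # mapper.R -- emit (term, 1) for every term in each input record.
  # Submit via Hadoop streaming, e.g. (illustrative paths):
  #   hadoop jar hadoop-streaming.jar -input docs -output counts \
  #     -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R
  con <- file("stdin", open = "r")
  while (length(line <- readLines(con, n = 1)) > 0) {
    terms <- unlist(strsplit(tolower(line), "[^a-z]+"))
    for (t in terms[terms != ""]) cat(t, "\t1\n", sep = "")
  }
  close(con)

  #!/usr/bin/env Rscript
  # reducer.R -- input arrives sorted by key; sum the counts for each key.
  con <- file("stdin", open = "r")
  current <- NULL; total <- 0
  flush_key <- function() if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t")[[1]]
    if (!identical(parts[1], current)) { flush_key(); current <- parts[1]; total <- 0 }
    total <- total + as.integer(parts[2])
  }
  flush_key()
  close(con)

With streaming, R needs no Hadoop binding at all: it simply reads key-value lines on stdin and writes key-value lines on stdout.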
Applying MapReduce to the Data in a Storage Cloud
[Diagram: map tasks followed by a shuffle/reduce phase over files in the storage cloud]
Google’s Large Data Cloud: Google’s Stack
• Applications
• Compute Services: Google’s MapReduce
• Data Services: Google’s BigTable
• Storage Services: Google File System (GFS)
Hadoop’s Large Data Cloud: Hadoop’s Stack
• Applications
• Compute Services: Hadoop’s MapReduce
• Data Services: NoSQL databases
• Storage Services: Hadoop Distributed File System (HDFS)
Amazon-Style Data Cloud
• Compute: racks of EC2 instances behind a load balancer
• Storage: S3 storage services
• Queueing: Simple Queue Service
• Database: SimpleDB (SDB)
Sector’s Large Data Cloud: Sector’s Stack
• Applications
• Compute Services: Sphere’s UDFs
• Data Services
• Storage Services: Sector’s Distributed File System (SDFS)
• Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
Apply User Defined Functions (UDFs) to Files in the Storage Cloud
[Diagram: UDFs take the place of the map and shuffle/reduce phases, applied directly to files in the storage cloud]
Folklore
• MapReduce is great.
• But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
• And often it is easier to use Hadoop streams, Sector streams, etc.
Sphere UDF vs MapReduce

                      MapReduce                            Sphere
Storage               Disk data                            Disk data & in-memory objects
Processing            Map followed by Reduce               Arbitrary user-defined functions
Data exchange         Reducers pull results from mappers   UDFs push results to bucket files
Input data locality   Input assigned to nearest mapper     Input assigned to nearest UDF
Output data locality  N/A                                  Can be specified
Terasort Benchmark

           1 Rack    2 Racks   3 Racks   4 Racks
Nodes      32        64        96        128
Cores      128       256       384       512
Hadoop     85m 49s   37m 0s    25m 14s   17m 45s
Sector     28m 25s   15m 20s   10m 19s   7m 56s
Speedup    3.0       2.4       2.4       2.2

Sector/Sphere 1.24a and Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
MalStone
[Diagram: entities visiting sites over a sequence of time windows dk-2, dk-1, dk]
MalStone Results

                               MalStone A   MalStone B
Hadoop                         455m 13s     840m 50s
Hadoop streaming with Python   87m 29s      142m 32s
Sector/Sphere                  33m 40s      43m 44s
Speedup (Sector vs Hadoop)     13.5x        19.2x

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.
Part 2. Predictive Model Markup Language
Problems Deploying Models
• Models are deployed in proprietary formats.
• Models are application dependent.
• Models are system dependent.
• Models are architecture dependent.
• The time required to deploy models and to integrate them with other applications can be long.
Predictive Model Markup Language (PMML)
• Based on XML
• Benefits of PMML:
  – An open standard for data mining and statistical models
  – Not concerned with the process of creating a model
  – Provides independence from application, platform, and operating system
  – Simplifies the use of data mining models by other applications (consumers of data mining models)
PMML Models
• polynomial regression
• logistic regression
• general regression
• center-based clusters
• density-based clusters
• trees
• associations
• neural nets
• naïve Bayes
• sequences
• text models
• support vector machines
• rulesets
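Not on the original slide, but worth noting: fitted R models can be exported to PMML with the pmml package. A minimal sketch, assuming the pmml and XML packages are installed (the model and file name are illustrative):

  library(pmml)   # converts fitted R models to PMML documents
  library(XML)    # saveXML() writes the XML document to disk

  # Fit an ordinary linear regression on a built-in data set.
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

  # Convert the fitted model to PMML and write it out.
  saveXML(pmml(fit), file = "iris_regression.pmml")

Any PMML consumer can then score new data against iris_regression.pmml with no knowledge of R.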
PMML Producer & Consumers

[Diagram: In the modeling environment, data passes through pre-processing (1) into a model producer, which emits a PMML model (2). In the deployment environment, a model consumer reads the PMML model (3) and produces scores, which post-processing turns into rules and actions.]
Part 3. Sawmill
Step 1: Preprocess the data using MapReduce or UDFs.
Step 2: Invoke R on each segment/bucket and build a PMML model.
Step 3: Gather the models together to form a multiple-model PMML file.
The same pattern covers both building models and scoring.

Building models:
Step 1: Preprocess the data using MapReduce or UDFs.
Step 2: Build a separate model in each segment using R.

Scoring:
Step 1: Preprocess the data using MapReduce or UDFs.
Step 2: Score the data in each segment using R.
Sawmill Summary
• Use Hadoop MapReduce or Sector UDFs to preprocess the data.
• Use Hadoop Map keys or Sector buckets to segment the data to gain parallelism.
• Build a separate statistical model for each segment using R and Hadoop/Sector streams (a sketch follows below).
• Use the multiple-models specification in PMML version 4.0 to specify the segmentation.
• Example: use the Hadoop Map function to send all the data for each web site to a different segment (on a different processor).
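Here is a minimal sketch of the per-segment model-building step as a Hadoop streaming reducer written in R; it corresponds to the REDUCER binding described on a later slide (one R process per reducer, handling every key that reducer sees). The tab-separated input layout, the lm model, and the output naming are illustrative assumptions, not the production code behind the numbers that follow.

  #!/usr/bin/env Rscript
  # sawmill_reducer.R -- streaming reducer: build one PMML model per segment.
  # Assumed input lines, sorted by key: "<site_key>\t<x>\t<y>"
  library(pmml)
  library(XML)

  build_model <- function(key, rows) {
    d <- read.table(text = rows, sep = "\t", col.names = c("x", "y"))
    fit <- lm(y ~ x, data = d)            # illustrative per-segment model
    saveXML(pmml(fit), file = paste0("model_", key, ".pmml"))
  }

  con <- file("stdin", open = "r")
  current <- NULL; buf <- character(0)
  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t")[[1]]
    if (!identical(parts[1], current)) {
      if (!is.null(current)) build_model(current, buf)  # finished a segment
      current <- parts[1]; buf <- character(0)
    }
    buf <- c(buf, paste(parts[-1], collapse = "\t"))
  }
  if (!is.null(current)) build_model(current, buf)      # flush the last segment
  close(con)

Step 3 then gathers the model_<key>.pmml files into a single multiple-model PMML file, with one segment per key.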
Small Example: Scoring Engine Written in R
• R scored a typical segment in 20 minutes.
• Using R to score 2 segments concatenated together took 60 minutes.
• Using R to score 3 segments concatenated together took 140 minutes.
• In other words, scoring time grows faster than linearly with input size, so keeping segments small and scoring them in parallel pays off.
With the Sawmill Framework
• 1 month of data, about 50 GB, hundreds of segments
• 300 mapper keys / segments
• Mapping and reducing took under 2 minutes.
• Scoring took about 20 minutes times the maximum number of segments per reducer.
• Runs had anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer.
• The whole job often ran in under 2 hours.
Reducer R Process?
• There are at least three ways to tie the MapReduce process to the R process (a sketch of the SEGMENT option follows below):
  – MACHINE: one instance of the R process on each data node (or n per node)
  – REDUCER: one instance of the R process bound to each reducer
  – SEGMENT: instances launched by the reducers as necessary (as keys are reduced)
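A hedged sketch of the SEGMENT option: the streaming reducer below buffers each key's records in a temporary file and launches a fresh R process per segment. The helper script model_segment.R is hypothetical, as is the input layout.

  #!/usr/bin/env Rscript
  # segment_launcher.R -- SEGMENT binding: spawn one R process per key.
  con <- file("stdin", open = "r")
  current <- NULL; tmp <- NULL

  run_segment <- function(key, path) {
    # model_segment.R is a hypothetical script that models one segment.
    system2("Rscript", args = c("model_segment.R", key, path))
  }

  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t")[[1]]
    if (!identical(parts[1], current)) {
      if (!is.null(current)) { run_segment(current, tmp); unlink(tmp) }
      current <- parts[1]
      tmp <- tempfile(fileext = ".tsv")
    }
    cat(paste(parts[-1], collapse = "\t"), "\n", file = tmp, append = TRUE, sep = "")
  }
  if (!is.null(current)) { run_segment(current, tmp); unlink(tmp) }
  close(con)

Launching per segment isolates memory use between segments but pays R's start-up cost on every key; the tradeoffs on the next slide are about balancing exactly these costs.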
Tradeoffs
• To prevent bottlenecks, you need a general idea of:
  – how long the records for a key take to be reduced
  – how long the application takes to process a segment
  – how many keys are seen per reducer
Thank You!
www.opendatagroup.com