Sawmill - Integrating R and Large Data Clouds

TRANSCRIPT

Page 1: Sawmill - Integrating R and Large Data Clouds

Sawmill: Some Lessons Learned Running R in Large Data Clouds

Robert Grossman, Open Data Group

Page 2: Sawmill - Integrating R and Large Data Clouds

What Do You Do if Your Data is too Big for a Database?

• Give up and invoke sampling.
• Buy a proprietary system and ask for a raise.
• Begin to build a custom system and explain why it is not yet done.
• Use Hadoop.
• Use an alternative large data cloud (e.g. Sector).

Page 3: Sawmill - Integrating R and Large Data Clouds

Basic Idea

1. Turn it into a pleasantly parallel problem.
2. Use a large data cloud to manage and prepare the data.
3. Use a Map/Bucket function to split the job.
4. Run R on each piece using Reduce/UDF or streams.
5. Use PMML multiple models to glue the pieces together.

Page 4: Sawmill - Integrating R and Large Data Clouds

Why Listen?

• This approach allows you to scale R relatively easily to hundreds of TB, even PB.
• The approach is easy.
• (A plus: it may look hard to your colleagues, boss, or clients.)
• There is at least an order of magnitude of performance to be gained with the right design.

Page 5: Sawmill - Integrating R and Large Data Clouds

Part 1. Stacks for Big Data


Page 6: Sawmill - Integrating R and Large Data Clouds

The Google Data Stack

• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)

Page 7: Sawmill - Integrating R and Large Data Clouds

Map-Reduce Example

• Input is a file with one document per record.
• The user specifies a map function:
  – key = document URL
  – value = terms that the document contains

For example, the map function turns the record (“doc cdickens”, “it was the best of times”) into the pairs (“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1), …

Page 8: Sawmill - Integrating R and Large Data Clouds

Example (cont’d)

• The MapReduce library gathers together all pairs with the same key value (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

key = “it”, values = 1, 1
key = “was”, values = 1, 1
key = “best”, values = 1
key = “worst”, values = 1

The reduce function then emits (“it”, 2), (“was”, 2), (“best”, 1), (“worst”, 1).
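To make the mechanics concrete, here is a minimal word-count sketch in R, written in the style of a Hadoop streaming mapper and reducer; the function names and the local test calls are illustrative, not something the slides specify.

  # mapper: one document per input line; emit one "term<TAB>1" pair per term
  emit_terms <- function(line) {
    terms <- unlist(strsplit(tolower(line), "[^a-z]+"))
    terms <- terms[nchar(terms) > 0]
    cat(sprintf("%s\t1\n", terms), sep = "")
  }

  # reducer: streaming sorts by key, so all pairs for one term arrive together;
  # here we simply sum the counts for each key
  sum_counts <- function(lines) {
    kv <- do.call(rbind, strsplit(lines, "\t"))
    counts <- tapply(as.integer(kv[, 2]), kv[, 1], sum)
    cat(sprintf("%s\t%d\n", names(counts), as.integer(counts)), sep = "")
  }

  # local check, without Hadoop:
  # emit_terms("it was the best of times")
  # sum_counts(c("it\t1", "it\t1", "was\t1", "was\t1", "best\t1", "worst\t1"))

In a real streaming job the same functions would be fed from stdin and their output written to stdout for the framework to collect.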

Page 9: Sawmill - Integrating R and Large Data Clouds

Applying MapReduce to the Data in a Storage Cloud

Diagram: map tasks run over the files in the storage cloud, followed by a shuffle/reduce phase.

Page 10: Sawmill - Integrating R and Large Data Clouds

Google’s Large Data Cloud

Google’s stack, from applications down to storage:

• Applications
• Compute services: Google’s MapReduce
• Data services: Google’s BigTable
• Storage services: Google File System (GFS)

Page 11: Sawmill - Integrating R and Large Data Clouds

Hadoop’s Large Data Cloud

Hadoop’s stack, from applications down to storage:

• Applications
• Compute services: Hadoop’s MapReduce
• Data services: NoSQL databases
• Storage services: Hadoop Distributed File System (HDFS)

Page 12: Sawmill - Integrating R and Large Data Clouds

Amazon Style Data Cloud

Diagram: S3 storage services, the Simple Queue Service, and SimpleDB (SDB) sit alongside a load balancer that fronts pools of EC2 instances.

Page 13: Sawmill - Integrating R and Large Data Clouds

Sector’s Large Data Cloud

Sector’s stack, from applications down to transport:

• Applications
• Compute services: Sphere’s UDFs
• Data services
• Storage services: Sector’s Distributed File System (SDFS)
• Routing & transport services: UDP-based Data Transport Protocol (UDT)

Page 14: Sawmill - Integrating R and Large Data Clouds

Apply User Defined Functions (UDF) to Files in Storage Cloud

Diagram: user-defined functions take the place of the map and shuffle/reduce phases, operating directly on files in the storage cloud.

Page 15: Sawmill - Integrating R and Large Data Clouds

Folklore

• MapReduce is great.
• But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
• And often it is easier to use Hadoop streams, Sector streams, etc.

Page 16: Sawmill - Integrating R and Large Data Clouds

Sphere UDF vs MapReduce

• Storage: MapReduce works over disk data; Sphere works over disk data & in-memory objects.
• Processing: MapReduce is a map followed by a reduce; Sphere runs arbitrary user-defined functions.
• Data exchanging: in MapReduce, reducers pull results from mappers; in Sphere, UDFs push results to bucket files.
• Input data locality: MapReduce assigns input data to the nearest mapper; Sphere assigns input data to the nearest UDF.
• Output data locality: not available in MapReduce; can be specified in Sphere.

Page 17: Sawmill - Integrating R and Large Data Clouds

Terasort Benchmark

           1 Rack    2 Racks   3 Racks   4 Racks
Nodes      32        64        96        128
Cores      128       256       384       512
Hadoop     85m 49s   37m 0s    25m 14s   17m 45s
Sector     28m 25s   15m 20s   10m 19s   7m 56s
Speed up   3.0       2.4       2.4       2.2

Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.

Page 18: Sawmill - Integrating R and Large Data Clouds

MalStone

Diagram: the MalStone benchmark models entities visiting sites over time, shown as a timeline with windows d(k-2), d(k-1), d(k).

Page 19: Sawmill - Integrating R and Large Data Clouds

MalStone

                               MalStone A   MalStone B
Hadoop                         455m 13s     840m 50s
Hadoop streaming with Python   87m 29s      142m 32s
Sector/Sphere                  33m 40s      43m 44s
Speed up (Sector v Hadoop)     13.5x        19.2x

Sector/Sphere 1.20, Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.

Page 20: Sawmill - Integrating R and Large Data Clouds

Part 2. Predictive Model Markup Language

Page 21: Sawmill - Integrating R and Large Data Clouds

Problems Deploying Models

• Models are deployed in proprietary formats.
• Models are application dependent.
• Models are system dependent.
• Models are architecture dependent.
• The time required to deploy models and to integrate them with other applications can be long.

Page 22: Sawmill - Integrating R and Large Data Clouds

Predictive Model Markup Language (PMML)

• Based on XML
• Benefits of PMML:
  – An open standard for data mining & statistical models
  – Not concerned with the process of creating a model
  – Provides independence from application, platform, and operating system
  – Simplifies the use of data mining models by other applications (consumers of data mining models)
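As a hedged illustration of acting as a PMML producer from R: the pmml package on CRAN can export many common R model types to PMML. A minimal sketch, using a linear model on a built-in data set; the package choice and file name are assumptions, not something the slides prescribe.

  # Exporting an R model as PMML with the CRAN 'pmml' package
  # (one common choice of producer; the slides do not name a specific tool)
  library(pmml)

  fit <- lm(mpg ~ wt + hp, data = mtcars)      # any model type the package supports
  doc <- pmml(fit)                             # build the PMML (XML) document
  XML::saveXML(doc, file = "mpg_model.pmml")   # write it out for a PMML consumer

Any PMML consumer, such as a scoring engine embedded in another application, can then apply mpg_model.pmml without knowing anything about R.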

Page 23: Sawmill - Integrating R and Large Data Clouds

PMML Models

• polynomial regression
• logistic regression
• general regression
• center-based clusters
• density-based clusters
• trees
• associations
• neural nets
• naïve Bayes
• sequences
• text models
• support vector machines
• rulesets

Page 24: Sawmill - Integrating R and Large Data Clouds

PMML Producer & Consumers

Diagram: in the modeling environment, data is pre-processed and fed to a model producer, which outputs a PMML model. In the deployment environment, a model consumer reads the PMML model, scores incoming data, and post-processing turns the scores into rules and actions.

Page 25: Sawmill - Integrating R and Large Data Clouds

Part 3. Sawmill

Page 26: Sawmill - Integrating R and Large Data Clouds

Step 1: Preprocess the data using MapReduce or a UDF.

Step 2: Invoke R on each segment/bucket and build a PMML model.

Step 3: Gather the models together to form a multiple-model PMML file.
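A highly simplified sketch of Steps 2 and 3 in R, assuming the preprocessing step has already written one file of records per segment into a local "buckets" directory; the directory layout, the logistic-regression formula, and the output paths are illustrative assumptions rather than details from the slides.

  # Step 2: fit one model per segment/bucket file and export each as PMML
  library(pmml)

  dir.create("models", showWarnings = FALSE)
  bucket_files <- list.files("buckets", full.names = TRUE)   # one file per segment (assumed)
  for (f in bucket_files) {
    seg <- read.csv(f)                                       # records for one segment
    fit <- glm(response ~ ., data = seg, family = binomial)  # illustrative per-segment model
    XML::saveXML(pmml(fit), file = file.path("models", paste0(basename(f), ".pmml")))
  }

  # Step 3: the per-segment PMML files are then gathered into a single PMML 4.0
  # document: a MiningModel whose <Segmentation> element holds one <Segment>
  # (a predicate selecting the segment, plus that segment's model) per file.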

Page 27: Sawmill - Integrating R and Large Data Clouds

Building models:
Step 1: Preprocess the data using MapReduce or a UDF.
Step 2: Build a separate model in each segment using R.

Scoring:
Step 1: Preprocess the data using MapReduce or a UDF.
Step 2: Score the data in each segment using R.

Page 28: Sawmill - Integrating R and Large Data Clouds

Sawmill Summary

• Use Hadoop MapReduce or Sector UDFs to preprocess the data.
• Use the Hadoop map phase or Sector buckets to segment the data to gain parallelism.
• Build a separate statistical model for each segment using R and Hadoop / Sector streams (a sketch follows this list).
• Use the multiple models specification in PMML version 4.0 to specify the segmentation.
• Example: use a Hadoop map function to send all data for each web site to a different segment (on a different processor).
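For the streaming route mentioned above, the per-segment model fitting can live in a reducer written in R and run under Hadoop streaming. A minimal sketch, assuming the map phase emits lines of the form segment_key<TAB>csv_record and that streaming delivers them sorted by key; the column layout and model formula are illustrative assumptions.

  #!/usr/bin/env Rscript
  # Streaming reducer: collect all records for one segment key, fit a model,
  # write one PMML file per segment, then move on to the next key.
  library(pmml)

  fit_segment <- function(key, rows) {
    seg <- read.csv(textConnection(rows), header = FALSE)
    names(seg)[1] <- "response"                       # illustrative column layout
    fit <- glm(response ~ ., data = seg, family = binomial)
    XML::saveXML(pmml(fit), file = paste0("segment_", key, ".pmml"))
  }

  con <- file("stdin", open = "r")
  current_key <- NULL
  rows <- character(0)
  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (!is.null(current_key) && parts[1] != current_key) {
      fit_segment(current_key, rows)                  # key changed: finish the segment
      rows <- character(0)
    }
    current_key <- parts[1]
    rows <- c(rows, parts[2])
  }
  if (!is.null(current_key)) fit_segment(current_key, rows)
  close(con)

Because streaming hands each reducer its keys in sorted order, the script only ever holds one segment’s records in memory, which is what keeps the per-segment R jobs pleasantly parallel.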

Page 29: Sawmill - Integrating R and Large Data Clouds

Small Example: Scoring Engine written in R

• R processed a typical segment in 20 minutes.
• Using R to score 2 segments concatenated together took 60 minutes.
• Using R to score 3 segments concatenated together took 140 minutes.
• In other words, scoring time grew faster than linearly with data size, which is why the data is split into segments that are scored independently.

Page 30: Sawmill - Integrating R and Large Data Clouds

With the Sawmill Framework

• 1 month of data, about 50 GB, hundreds of segments
• 300 mapper keys / segments
• Mapping and reducing: < 2 minutes
• Scoring: 20 minutes × the maximum number of segments per reducer
• Had anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer
• Often ran in under 2 hours

Page 31: Sawmill - Integrating R and Large Data Clouds

Reducer R Process?

• There are at least three ways to tie the MapReduce process to the R process:
  – MACHINE: one instance of the R process on each data node (or n per node)
  – REDUCER: one instance of the R process bound to each reducer
  – SEGMENT: instances can be launched by the reducers as necessary (when keys are reduced)

Page 32: Sawmill - Integrating R and Large Data Clouds

Tradeoffs

• To prevent bottlenecks, you need to have a general idea of:
  – how long the records for a key take to be reduced
  – how long the application takes to process a segment
  – how many keys are seen per reducer

Page 33: Sawmill - Integrating R and Large Data Clouds

Thank You!

www.opendatagroup.com