Sawmill - Integrating R and Large Data Clouds
TRANSCRIPT

Sawmill: Some Lessons Learned Running R in Large Data Clouds
Robert Grossman, Open Data Group
What Do You Do if Your Data is Too Big for a Database?
• Give up and invoke sampling.
• Buy a proprietary system and ask for a raise.
• Begin to build a custom system and explain why it is not yet done.
• Use Hadoop.
• Use an alternative large data cloud (e.g. Sector).
Basic Idea
1. Turn it into a pleasantly parallel problem.
2. Use a large data cloud to manage and prepare the data.
3. Use a Map/Bucket function to split the job.
4. Run R on each piece using Reduce/UDFs or streams.
5. Use PMML multiple models to glue the pieces together.
Why Listen?
• This approach allows you to scale R relatively easily to hundreds of TB, up to PB.
• The approach is easy.
• (A plus: it may look hard to your colleagues, boss, or clients.)
• There is at least an order of magnitude of performance to be gained with the right design.
Part 1. Stacks for Big Data
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
Map-Reduce Example
• Input is a file with one document per record.
• The user specifies the map function:
  – key = document URL
  – value = terms that the document contains

  ("doc cdickens", "it was the best of times")
    → map →
  ("it", 1), ("was", 1), ("the", 1), ("best", 1)
Example (cont'd)
• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1
    → reduce →
  ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
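To make the example concrete, here is a minimal word-count sketch in R for Hadoop streaming, one of the ways of running R discussed later in this talk. This is not from the original slides; the file names, paths, and tokenization are illustrative assumptions.

  #!/usr/bin/env Rscript
  # mapper.R -- emit (term, 1) for every term in each input record.
  # Submit via Hadoop streaming, e.g. (illustrative paths):
  #   hadoop jar hadoop-streaming.jar -input docs -output counts \
  #     -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R
  con <- file("stdin", open = "r")
  while (length(line <- readLines(con, n = 1)) > 0) {
    terms <- unlist(strsplit(tolower(line), "[^a-z]+"))
    for (t in terms[terms != ""]) cat(t, "\t1\n", sep = "")
  }
  close(con)

  #!/usr/bin/env Rscript
  # reducer.R -- input arrives sorted by key; sum the counts for each key.
  con <- file("stdin", open = "r")
  current <- NULL; total <- 0
  flush_key <- function() if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t")[[1]]
    if (!identical(parts[1], current)) { flush_key(); current <- parts[1]; total <- 0 }
    total <- total + as.integer(parts[2])
  }
  flush_key()
  close(con)

With streaming, R needs no Hadoop binding at all: it simply reads key-value lines on stdin and writes key-value lines on stdout.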
Applying MapReduce to the Data in a Storage Cloud
[Diagram: map tasks followed by a shuffle/reduce phase over files in the storage cloud]
Google’s Large Data Cloud: Google’s Stack
• Applications
• Compute Services: Google’s MapReduce
• Data Services: Google’s BigTable
• Storage Services: Google File System (GFS)
Hadoop’s Large Data Cloud: Hadoop’s Stack
• Applications
• Compute Services: Hadoop’s MapReduce
• Data Services: NoSQL databases
• Storage Services: Hadoop Distributed File System (HDFS)
Amazon-Style Data Cloud
• Compute: racks of EC2 instances behind a load balancer
• Storage: S3 storage services
• Queueing: Simple Queue Service
• Database: SimpleDB (SDB)
Sector’s Large Data Cloud: Sector’s Stack
• Applications
• Compute Services: Sphere’s UDFs
• Data Services
• Storage Services: Sector’s Distributed File System (SDFS)
• Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
Apply User Defined Functions (UDFs) to Files in the Storage Cloud
[Diagram: UDFs take the place of the map and shuffle/reduce phases, applied directly to files in the storage cloud]
Folklore
• MapReduce is great.
• But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds.
• And often it is easier to use Hadoop streams, Sector streams, etc.
Sphere UDF vs MapReduce

                      MapReduce                            Sphere
Storage               Disk data                            Disk data & in-memory objects
Processing            Map followed by Reduce               Arbitrary user-defined functions
Data exchange         Reducers pull results from mappers   UDFs push results to bucket files
Input data locality   Input assigned to nearest mapper     Input assigned to nearest UDF
Output data locality  N/A                                  Can be specified
Terasort Benchmark

           1 Rack    2 Racks   3 Racks   4 Racks
Nodes      32        64        96        128
Cores      128       256       384       512
Hadoop     85m 49s   37m 0s    25m 14s   17m 45s
Sector     28m 25s   15m 20s   10m 19s   7m 56s
Speedup    3.0       2.4       2.4       2.2

Sector/Sphere 1.24a and Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
MalStone
[Diagram: entities visiting sites over a sequence of time windows dk-2, dk-1, dk]
MalStone Results

                               MalStone A   MalStone B
Hadoop                         455m 13s     840m 50s
Hadoop streaming with Python   87m 29s      142m 32s
Sector/Sphere                  33m 40s      43m 44s
Speedup (Sector vs Hadoop)     13.5x        19.2x

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.
Part 2. Predictive Model Markup Language
Problems Deploying Models
• Models are deployed in proprietary formats.
• Models are application dependent.
• Models are system dependent.
• Models are architecture dependent.
• The time required to deploy models and to integrate them with other applications can be long.
Predictive Model Markup Language (PMML)
• Based on XML
• Benefits of PMML:
  – An open standard for data mining and statistical models
  – Not concerned with the process of creating a model
  – Provides independence from application, platform, and operating system
  – Simplifies the use of data mining models by other applications (consumers of data mining models)
PMML Models
• polynomial regression
• logistic regression
• general regression
• center-based clusters
• density-based clusters
• trees
• associations
• neural nets
• naïve Bayes
• sequences
• text models
• support vector machines
• rulesets
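Not on the original slide, but worth noting: fitted R models can be exported to PMML with the pmml package. A minimal sketch, assuming the pmml and XML packages are installed (the model and file name are illustrative):

  library(pmml)   # converts fitted R models to PMML documents
  library(XML)    # saveXML() writes the XML document to disk

  # Fit an ordinary linear regression on a built-in data set.
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

  # Convert the fitted model to PMML and write it out.
  saveXML(pmml(fit), file = "iris_regression.pmml")

Any PMML consumer can then score new data against iris_regression.pmml with no knowledge of R.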
PMML Producer & Consumers

[Diagram: In the modeling environment, data passes through pre-processing (1) into a model producer, which emits a PMML model (2). In the deployment environment, a model consumer reads the PMML model (3) and produces scores, which post-processing turns into rules and actions.]
Part 3. Sawmill
Step 1: Preprocess the data using MapReduce or UDFs.
Step 2: Invoke R on each segment/bucket and build a PMML model.
Step 3: Gather the models together to form a multiple-model PMML file.
The same pattern covers both building models and scoring.

Building models:
Step 1: Preprocess the data using MapReduce or UDFs.
Step 2: Build a separate model in each segment using R.

Scoring:
Step 1: Preprocess the data using MapReduce or UDFs.
Step 2: Score the data in each segment using R.
Sawmill Summary
• Use Hadoop MapReduce or Sector UDFs to preprocess the data.
• Use Hadoop Map keys or Sector buckets to segment the data to gain parallelism.
• Build a separate statistical model for each segment using R and Hadoop/Sector streams (a sketch follows below).
• Use the multiple-models specification in PMML version 4.0 to specify the segmentation.
• Example: use the Hadoop Map function to send all the data for each web site to a different segment (on a different processor).
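Here is a minimal sketch of the per-segment model-building step as a Hadoop streaming reducer written in R; it corresponds to the REDUCER binding described on a later slide (one R process per reducer, handling every key that reducer sees). The tab-separated input layout, the lm model, and the output naming are illustrative assumptions, not the production code behind the numbers that follow.

  #!/usr/bin/env Rscript
  # sawmill_reducer.R -- streaming reducer: build one PMML model per segment.
  # Assumed input lines, sorted by key: "<site_key>\t<x>\t<y>"
  library(pmml)
  library(XML)

  build_model <- function(key, rows) {
    d <- read.table(text = rows, sep = "\t", col.names = c("x", "y"))
    fit <- lm(y ~ x, data = d)            # illustrative per-segment model
    saveXML(pmml(fit), file = paste0("model_", key, ".pmml"))
  }

  con <- file("stdin", open = "r")
  current <- NULL; buf <- character(0)
  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t")[[1]]
    if (!identical(parts[1], current)) {
      if (!is.null(current)) build_model(current, buf)  # finished a segment
      current <- parts[1]; buf <- character(0)
    }
    buf <- c(buf, paste(parts[-1], collapse = "\t"))
  }
  if (!is.null(current)) build_model(current, buf)      # flush the last segment
  close(con)

Step 3 then gathers the model_<key>.pmml files into a single multiple-model PMML file, with one segment per key.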
Small Example: Scoring Engine Written in R
• R scored a typical segment in 20 minutes.
• Using R to score 2 segments concatenated together took 60 minutes.
• Using R to score 3 segments concatenated together took 140 minutes.
• In other words, scoring time grows faster than linearly with input size, so keeping segments small and scoring them in parallel pays off.
With the Sawmill Framework
• 1 month of data, about 50 GB, hundreds of segments
• 300 mapper keys / segments
• Mapping and reducing took under 2 minutes.
• Scoring took about 20 minutes times the maximum number of segments per reducer.
• Runs had anywhere from 2 to 3 reducers per node and 2 to 8 segments per reducer.
• The whole job often ran in under 2 hours.
Reducer R Process?
• There are at least three ways to tie the MapReduce process to the R process (a sketch of the SEGMENT option follows below):
  – MACHINE: one instance of the R process on each data node (or n per node)
  – REDUCER: one instance of the R process bound to each reducer
  – SEGMENT: instances launched by the reducers as necessary (as keys are reduced)
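A hedged sketch of the SEGMENT option: the streaming reducer below buffers each key's records in a temporary file and launches a fresh R process per segment. The helper script model_segment.R is hypothetical, as is the input layout.

  #!/usr/bin/env Rscript
  # segment_launcher.R -- SEGMENT binding: spawn one R process per key.
  con <- file("stdin", open = "r")
  current <- NULL; tmp <- NULL

  run_segment <- function(key, path) {
    # model_segment.R is a hypothetical script that models one segment.
    system2("Rscript", args = c("model_segment.R", key, path))
  }

  while (length(line <- readLines(con, n = 1)) > 0) {
    parts <- strsplit(line, "\t")[[1]]
    if (!identical(parts[1], current)) {
      if (!is.null(current)) { run_segment(current, tmp); unlink(tmp) }
      current <- parts[1]
      tmp <- tempfile(fileext = ".tsv")
    }
    cat(paste(parts[-1], collapse = "\t"), "\n", file = tmp, append = TRUE, sep = "")
  }
  if (!is.null(current)) { run_segment(current, tmp); unlink(tmp) }
  close(con)

Launching per segment isolates memory use between segments but pays R's start-up cost on every key; the tradeoffs on the next slide are about balancing exactly these costs.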
Tradeoffs
• To prevent bottlenecks, you need a general idea of:
  – how long the records for a key take to be reduced
  – how long the application takes to process a segment
  – how many keys are seen per reducer
Thank You!
www.opendatagroup.com