Starfish: A Self-tuning System for Big Data Analytics

Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu

Duke University

Analysis in the Big Data Era

9/26/2011

Massive Data

Data Analysis

Insight

Key to Success = Timely and Cost-Effective Analysis

Starfish

Hadoop MapReduce Ecosystem

Popular solution to Big Data Analytics

[Figure: the Hadoop stack. Hadoop (MapReduce Execution Engine, Distributed File System) sits beneath Java / C++ / R / Python, Oozie, Hive, Pig, Elastic MapReduce, and Jaql.]

Practitioners of Big Data Analytics

Who are the users?
• Data analysts, statisticians, computational scientists…
• Researchers, developers, testers…
• You!

Who performs setup and tuning?
• The users!
• Usually lack expertise to tune the system

Tuning Challenges

• Heavy use of programming languages for MapReduce programs (e.g., Java/Python)
• Data loaded/accessed as opaque files
• Large space of tuning choices
• Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)
• Terabyte-scale data cycles

Our goal: Provide good performance automatically

Starfish: Self-tuning System

[Figure: the Hadoop stack again (MapReduce Execution Engine, Distributed File System; Java / C++ / R / Python, Oozie, Hive, Pig, Elastic MapReduce, Jaql), with Starfish and the label Analytics System added.]

What are the Tuning Problems?

• Job-level MapReduce configuration
• Workload management
• Data layout tuning
• Cluster sizing
• Workflow optimization

Starfish’s Core Approach to Tuning

1) if Δ(conf. parameters) then what …?

2) if Δ(data properties) then what …?

3) if Δ(cluster properties) then what …?

Profiler: Collects concise summaries of execution

What-if Engine: Estimates impact of hypothetical changes on execution

Optimizers: Search through space of tuning choices (Workflow, Workload, Data layout, Cluster)

Starfish Architecture

Profiler, What-if Engine

Job Optimizer, Workflow Optimizer, Workload Optimizer, Elastisizer

Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.

MapReduce Job Execution

[Figure: four input splits (split 0 to split 3) are processed by map tasks in two map waves, followed by one reduce wave whose reduce tasks write out 0 and out 1.]

job j = < program p, data d, resources r, configuration c >
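The "two map waves, one reduce wave" picture follows from simple arithmetic: when a job has more tasks than the cluster has task slots, the tasks run in rounds. A minimal sketch, assuming a cluster with two map slots and two reduce slots (the slot counts are not stated on the slide, and `num_waves` is a hypothetical helper name):

```python
import math

def num_waves(num_tasks: int, num_slots: int) -> int:
    """Rounds ("waves") needed to run num_tasks when at most
    num_slots tasks can run at the same time."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return math.ceil(num_tasks / num_slots)

# Assumed slot counts, chosen to match the slide's picture:
# 4 map tasks on 2 map slots, 2 reduce tasks on 2 reduce slots.
map_waves = num_waves(num_tasks=4, num_slots=2)
reduce_waves = num_waves(num_tasks=2, num_slots=2)
print(map_waves, reduce_waves)
```

Changing the configuration c (for example the number of reduce tasks) or the resources r (the number of slots) changes the wave counts, which is one reason the job tuple above includes both.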

What Controls MR Job Execution?

Space of configuration choices:
• Number of map tasks
• Number of reduce tasks
• Partitioning of map outputs to reduce tasks
• Memory allocation to task-level buffers
• Multiphase external sorting in the tasks
• Whether output data from tasks should be compressed
• Whether combine function should be used

job j = < program p, data d, resources r, configuration c >

Effect of Configuration Settings

Use defaults or set manually (rules-of-thumb)

Rules-of-thumb may not suffice

[Figure: two-dimensional projection of a multi-dimensional surface (Word Co-occurrence MapReduce program), with the rules-of-thumb settings marked.]

MapReduce Job Tuning in a Nutshell

Goal: given $\mathit{perf} = F(p, d, r, c)$, find $c_{opt} = \arg\min_{c \in S} F(p, d, r, c)$

Challenges: p is an arbitrary MapReduce program; c is high-dimensional; …

Profiler: Runs p to collect a job profile (concise execution summary) of <p,d1,r1,c1>

What-if Engine: Given profile of <p,d1,r1,c1>, estimates virtual profile for <p,d2,r2,c2>

Optimizer: Enumerates and searches through the optimization space S efficiently
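The goal on this slide, picking the configuration c in S that minimizes F(p, d, r, c), can be illustrated on a tiny search space by brute force. This is only a sketch: `what_if_cost` is a made-up stand-in for F with p, d, and r held fixed, and the parameter names `reducers` and `compress` are hypothetical, not actual Hadoop settings.

```python
import itertools

def what_if_cost(config: dict) -> float:
    """Toy stand-in for F(p, d, r, c); NOT a real cost model."""
    reducers = config["reducers"]
    # Made-up shape: parallelism helps, per-task overhead hurts,
    # and skipping compression adds a fixed I/O penalty.
    penalty = 0.0 if config["compress"] else 15.0
    return 100.0 / reducers + 1.0 * reducers + penalty

def best_config(space: dict) -> dict:
    """Exhaustive arg min over the cross product of all parameter values."""
    names = list(space)
    candidates = (dict(zip(names, values))
                  for values in itertools.product(*space.values()))
    return min(candidates, key=what_if_cost)

space = {"reducers": [1, 2, 5, 10, 20], "compress": [False, True]}
print(best_config(space))
```

Exhaustive enumeration only works because this toy space has ten points; with a high-dimensional c, the space has to be searched rather than enumerated.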

Job Profile

• Concise representation of program execution as a job
• Records information at the level of “task phases”
• Generated by Profiler through measurement or by the What-if Engine through estimation

[Figure: phases of a map task (Read, Map, Collect, Spill), with the labels Serialize, Partition; Memory Buffer; and Sort, [Combine], [Compress].]

Job Profile Fields

Dataflow: amount of data flowing through task phases
• Map output bytes
• Number of spills
• Number of records in buffer per spill

Costs: execution times at the level of task phases
• Read phase time in the map task
• Map phase time in the map task
• Spill phase time in the map task

Dataflow Statistics: statistical information about dataflow
• Width of input key-value pairs
• Map selectivity in terms of records
• Map output compression ratio

Cost Statistics: statistical information about resource costs
• I/O cost for reading from local disk per byte
• CPU cost for executing the Mapper per record
• CPU cost for uncompressing the input per byte
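One way to picture the four groups of profile fields is as a single record with four sub-dictionaries. The sketch below is illustrative only: the class name, the key names, and every number are invented, and it does not reflect how Starfish actually stores profiles.

```python
from dataclasses import dataclass, field

@dataclass
class JobProfile:
    """Illustrative container for the four groups of profile fields."""
    dataflow: dict = field(default_factory=dict)             # amounts of data per phase
    costs: dict = field(default_factory=dict)                # phase times, in seconds here
    dataflow_statistics: dict = field(default_factory=dict)  # e.g. selectivities
    cost_statistics: dict = field(default_factory=dict)      # e.g. per-byte costs

    def map_task_time(self) -> float:
        """Sum of the recorded map-side phase times."""
        return sum(self.costs.values())

# All values below are invented for illustration.
profile = JobProfile(
    dataflow={"map_output_bytes": 64_000_000, "num_spills": 4},
    costs={"read_phase": 1.5, "map_phase": 6.0, "spill_phase": 2.5},
    dataflow_statistics={"map_selectivity_records": 2.0,
                         "map_output_compression_ratio": 0.5},
    cost_statistics={"io_cost_per_byte": 1e-8,
                     "cpu_cost_per_record": 2e-6},
)
print(profile.map_task_time())
```

Dataflow and costs describe one particular run, while the two statistics groups are the kind of information that can be carried over when estimating a different run.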

Generating Profiles by Measurement

Goals:
• Have zero overhead when profiling is turned off
• Require no modifications to Hadoop
• Support unmodified MapReduce programs written in Java or Hadoop Streaming/Pipes (Python/Ruby/C++)

Approach: Dynamic (on-demand) instrumentation
• Event-condition-action rules are specified (in Java)
• Leads to run-time instrumentation of Hadoop internals
• Monitors task phases of MapReduce job execution
• We currently use BTrace (Hadoop internals are in Java)

Generating Profiles by Measurement

[Figure: profiling the map and reduce tasks of a job. Enabling profiling applies ECA rules to the task JVMs; the raw data collected is summarized into map profiles and reduce profiles, which form the job profile.]

Use of Sampling
• Profile fewer tasks
• Execute fewer tasks

JVM = Java Virtual Machine, ECA = Event-Condition-Action

What-if Engine

[Figure: the What-if Engine, with components Job Oracle and Task Scheduler Simulator. Inputs: the job profile for <p, d1, r1, c1> and the properties of the hypothetical job (input data properties, cluster resources, configuration settings; each possibly hypothetical). Output: a virtual job profile for <p, d2, r2, c2>.]

Virtual Profile Estimation

Given profile for job j = <p, d1, r1, c1> estimate profile for job j' = <p, d2, r2, c2>

[Figure: the profile for j (dataflow statistics, dataflow, cost statistics), the input data d2, the configuration, and the resources r2 feed cardinality models, white-box models, and relative black-box models, which produce the dataflow statistics, dataflow, and cost statistics of the (virtual) profile for j'.]
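To make the idea of a white-box model concrete, the sketch below estimates the dataflow and two phase times of a single map task for a hypothetical input split, using statistics of the kind a job profile records. It is a deliberately simplified illustration, not the models Starfish uses: the formulas, the function name, the key names (including `output_pair_width`, which is not among the listed profile fields), and all numbers are assumptions.

```python
def estimate_map_task(split_bytes: float, stats: dict, cost_stats: dict) -> dict:
    """Simplified white-box estimate for one map task on a new input split."""
    # Dataflow: scale record counts and bytes from per-record statistics.
    input_records = split_bytes / stats["input_pair_width"]
    output_records = input_records * stats["map_selectivity_records"]
    output_bytes = (output_records * stats["output_pair_width"]
                    * stats["map_output_compression_ratio"])
    # Costs: dataflow multiplied by per-unit cost statistics.
    read_time = split_bytes * cost_stats["io_cost_per_byte"]
    map_time = input_records * cost_stats["cpu_cost_per_record"]
    return {"map_output_records": output_records,
            "map_output_bytes": output_bytes,
            "read_phase_time": read_time,
            "map_phase_time": map_time}

# Invented statistics, as if taken from a profile of <p, d1, r1, c1>.
stats = {"input_pair_width": 100, "map_selectivity_records": 2.0,
         "output_pair_width": 20, "map_output_compression_ratio": 0.5}
cost_stats = {"io_cost_per_byte": 1e-8, "cpu_cost_per_record": 2e-6}

# Hypothetical data d2: a 100 MB split instead of the profiled one.
virtual = estimate_map_task(100_000_000, stats, cost_stats)
print(virtual)
```

The sketch reuses the cost statistics unchanged, which only makes sense if the resources stay the same; estimating them for different resources r2 is a separate modeling step.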

Job Optimizer

[Figure: the Just-in-Time Optimizer takes the job profile for <p, d1, r1, c1>, the input data properties, and the cluster resources. Using Subspace Enumeration and Recursive Random Search, it issues what-if calls and returns the best configuration settings <copt> for <p, d2, r2>.]
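The following is a much-simplified sketch in the spirit of recursive random search: sample the whole space, then repeatedly sample inside a shrinking box around the best point found, and restart a few times. It is not the optimizer's actual implementation. The cost function stands in for a what-if call, and the parameter names (`reducers`, `sort_buffer_mb`), bounds, and tuning constants are all assumptions.

```python
import random

def recursive_random_search(cost, bounds, samples=20, shrink=0.5,
                            rounds=6, restarts=3, seed=0):
    """Simplified recursive random search over a box-shaped space."""
    rng = random.Random(seed)

    def sample(box):
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in box.items()}

    best, best_cost = None, float("inf")
    for _ in range(restarts):
        # Explore: sample the whole space, keep the most promising point.
        center = min((sample(bounds) for _ in range(samples)), key=cost)
        center_cost = cost(center)
        width = {k: hi - lo for k, (lo, hi) in bounds.items()}
        # Exploit: sample in a box around the center, shrinking it each round.
        for _ in range(rounds):
            width = {k: w * shrink for k, w in width.items()}
            box = {k: (max(bounds[k][0], center[k] - width[k] / 2),
                       min(bounds[k][1], center[k] + width[k] / 2))
                   for k in bounds}
            for _ in range(samples):
                candidate = sample(box)
                candidate_cost = cost(candidate)
                if candidate_cost < center_cost:
                    center, center_cost = candidate, candidate_cost
        if center_cost < best_cost:
            best, best_cost = center, center_cost
    return best, best_cost

def toy_cost(c):
    """Made-up cost surface standing in for a what-if call."""
    return (c["reducers"] - 27) ** 2 + ((c["sort_buffer_mb"] - 300) / 10) ** 2

bounds = {"reducers": (1, 100), "sort_buffer_mb": (50, 500)}
best, best_cost = recursive_random_search(toy_cost, bounds)
print(best, best_cost)
```

Because every evaluation here is a cheap function call rather than an actual job run, the search can afford many samples; that is the role what-if calls play in the figure above.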

Workflow Optimization Space

[Figure: tree of the workflow optimization space. Physical: Job-level Configuration, Dataset-level Configuration. Logical: Join Selection, Partition Function Selection, Vertical Packing. Two branches carry the label Inter-job.]

Optimizations on TF-IDF Workflow

[Figure: optimizations on the TF-IDF workflow. The datasets are annotated D0 <{D},{W}>, <{D, W},{f}>, <{D},{W, f, c}>, and <{W},{D, t}>. The logical optimization panel shows jobs J1, J2 (M1, R1, M2, R2) and J3, J4 (M3, R3, M4), with the annotation Partition: {D}, Sort: {D, W}. The physical optimization panel shows two sets of settings: Reducers = 50, Compress = off, Memory = 400, … and Reducers = 20, Compress = on, Memory = 300, …]

Legend: D = docname, W = word, f = frequency, c = count, t = TF-IDF

New Challenges

What-if challenges:
• Support concurrent job execution
• Estimate intermediate data properties

Optimization challenges:
• Interactions across jobs
• Extended optimization space
• Find good configuration settings for individual jobs

Cluster Sizing Problem

Use-cases for cluster sizing:
• Tuning the cluster size for elastic workloads
• Workload transitioning from development cluster to production cluster
• Multi-objective cluster provisioning

Goal: Determine cluster resources & job-level configuration parameters to meet workload requirements

Multi-objective Cluster Provisioning

[Figure: two bar charts over EC2 instance type (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge); the y-axes range from 0 to 1,200 and from 0.00 to 8.00.]

Cloud enables users to provision clusters in minutes

Experimental Evaluation

Starfish (versions 0.1, 0.2) to manage Hadoop on EC2

Different scenarios: Cluster × Workload × Data

EC2 Node Type  CPU: EC2 units  Mem     I/O Perf.  Cost /hour  #Maps /node  #Reds /node  MaxMem /task
m1.small       1 (1 x 1)       1.7 GB  moderate   $0.085      2            1            300 MB
m1.large       4 (2 x 2)       7.5 GB  high       $0.34       3            2            1024 MB
m1.xlarge      8 (4 x 2)       15 GB   high       $0.68       4            4            1536 MB
c1.medium      5 (2 x 2.5)     1.7 GB  moderate   $0.17       2            2            300 MB
c1.xlarge      20 (8 x 2.5)    7 GB    high       $0.68       8            6            400 MB
cc1.4xlarge    33.5 (8)        23 GB   very high  $1.60       8            6            1536 MB
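The hourly prices in the table make the cost side of multi-objective provisioning easy to sketch: for each candidate instance type, multiply price, cluster size, and running time, then pick the cheapest type that still meets a deadline. The running times below are invented for illustration (they are not measurements from this talk), the function names are hypothetical, and cost is pro-rated by the minute as a simplifying assumption.

```python
# Hourly prices copied from the table above (USD per node-hour).
PRICE_PER_HOUR = {
    "m1.small": 0.085, "m1.large": 0.34, "m1.xlarge": 0.68,
    "c1.medium": 0.17, "c1.xlarge": 0.68, "cc1.4xlarge": 1.60,
}

def cluster_cost(instance_type: str, num_nodes: int, runtime_minutes: float) -> float:
    """Pro-rated cost of running num_nodes nodes for runtime_minutes."""
    return PRICE_PER_HOUR[instance_type] * num_nodes * runtime_minutes / 60.0

def cheapest_within_deadline(runtimes: dict, num_nodes: int, deadline_minutes: float):
    """Cheapest instance type whose running time meets the deadline, or None."""
    feasible = {t: cluster_cost(t, num_nodes, m)
                for t, m in runtimes.items() if m <= deadline_minutes}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)

# Invented running times (minutes) of one workload on 10-node clusters.
runtimes = {"m1.small": 900, "m1.large": 300, "m1.xlarge": 180, "c1.medium": 400}

print(cheapest_within_deadline(runtimes, num_nodes=10, deadline_minutes=360))
print(cheapest_within_deadline(runtimes, num_nodes=10, deadline_minutes=1000))
```

With these made-up numbers, tightening the deadline changes the answer, which is the trade-off between running time and monetary cost that cluster sizing has to resolve.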

Experimental Evaluation

Abbr.  MapReduce Program                    Domain                 Dataset
CO     Word Co-occurrence                   Natural Lang Proc.     Wikipedia (10GB – 22GB)
WC     WordCount                            Text Analytics         Wikipedia (30GB – 1TB)
TS     TeraSort                             Business Analytics     TeraGen (30GB – 1TB)
LG     LinkGraph                            Graph Processing       Wikipedia (compressed ~6x)
JO     Join                                 Business Analytics     TPC-H (30GB – 1TB)
TF     Term Freq. - Inverse Document Freq.  Information Retrieval  Wikipedia (30GB – 1TB)

Job Optimizer Evaluation

Hadoop cluster: 30 nodes, m1.xlarge

Data sizes: 60-180 GB

[Figure: bar chart over MapReduce programs (TS, WC, LG, JO, TF, CO) comparing Default Settings, Rule-based Optimizer, and Cost-based Optimizer.]

Estimates from the What-if Engine

Hadoop cluster: 16 nodes, c1.medium

MapReduce Program: Word Co-occurrence

Data set: 10 GB Wikipedia

[Figure: two surface plots, the true surface and the estimated surface.]

Profiling Overhead Vs. Benefit

[Figure: two charts, each plotted against Percent of Tasks Profiled (1, 5, 10, 20, 40, 60, 80, 100).]

Hadoop cluster: 16 nodes, c1.medium

MapReduce Program: Word Co-occurrence

Data set: 10 GB Wikipedia

Multi-objective Cluster Provisioning

[Figure: two bar charts comparing Actual and Predicted values for each EC2 instance type of the target cluster (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge); the y-axes range from 0 to 1,200 and from 0.00 to 8.00.]

Instance Type for Source Cluster: m1.large

More info: www.cs.duke.edu/starfish

• Job-level MapReduce configuration
• Workflow optimization
• Workload management
• Data layout tuning
• Cluster sizing
