
www.bsc.es

Automating Big Data Benchmarking and Performance Analysis with ALOJA

David Carrera, Senior Researcher


November 2015

Hadoop
- > 100+ tunable parameters
- obscure and interrelated
  • mapred.{map,reduce}.tasks.speculative.execution
  • io.sort.mb 100 (300)
  • io.sort.record.percent 5% (15%)
  • io.sort.spill.percent 80% (95 - 100%)
- Similar for Hive, Spark, HBase
Dominated by rules-of-thumb
- Number of containers in parallel
  • 0.5 - 2 per CPU core
Large stack for tuning
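The containers-per-core rule of thumb above can be turned into a quick range check. This is a minimal sketch, not part of ALOJA: the function name `container_range` and the choice to round down and allow at least one container are assumptions made for illustration.

```python
import math
from typing import Tuple


def container_range(cpu_cores: int, low: float = 0.5, high: float = 2.0) -> Tuple[int, int]:
    """Return (min, max) parallel containers per node for the 0.5 - 2 per-core rule."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    # Round down, but always allow at least one container (assumption).
    lowest = max(1, math.floor(cpu_cores * low))
    highest = max(1, math.floor(cpu_cores * high))
    return lowest, highest


if __name__ == "__main__":
    for cores in (1, 4, 16):
        print(cores, "cores ->", container_range(cores))
```

The range only bounds the search space; the best value inside it still depends on the job and the hardware, which is the point the slide makes.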

Setting up your Big Data system

Image source: Intel® Distribution for Apache Hadoop

Default values in Apache source not ideal
Large and spread ecosystem
- Different distributions
- Product claims
Each job is different
- No one-size-fits-all solution
Cloud vs. On-premise
- IaaS
  • Tens of different VMs to choose from
- PaaS
  • HDInsight, CloudBigData, EMR
New economic HW
- SSDs, InfiniBand networking

How do I set my system? Too many options!

Sample mappers and reducers for 3 popular benchmarks: Terasort, K-means, Wordcount

Ecosystem is not transparent
- Needs auditing
Product claims on performance and TCO

BSC's project ALOJA: towards cost-effective Big Data
Benchmarking and Analysis tools
Online repository and largest Big Data repository
- 50,000+ runs of HiBench, TPC-H, and [some] BigBench
- Over 100 HW configurations tested
  • Of different Node/VM, disks, and networks
  • Cloud: Multi-cloud provider, including both IaaS and PaaS
  • On-premise: High-end, HPC, commodity, low-power
Community
- Collaborations with industry and academia
- Presented in different conferences and workshops
- Visibility: 47 different countries
http://aloja.bsc.es

Big Data Benchmarking
Online Repository
Web Analytics

Commands and providers

Provisioning commands
Connect
- Node and Cluster
- Builds SSH cmd line
  • SSH proxies
Deploy
- Creates a cluster
- Sets SSH credentials
- If created, updates config as needed
- If stopped, starts nodes
Start, Stop
Delete
Queue jobs to clusters
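As an illustration of the "Builds SSH cmd line" step with an SSH proxy, here is a minimal sketch. It is not the ALOJA implementation: the function name, the user `aloja`, and the host names are hypothetical. The only SSH options used are the standard OpenSSH `-p`, `-i`, and `-J` (ProxyJump, available in OpenSSH 7.3 and later).

```python
from typing import List, Optional


def build_ssh_command(user: str, host: str, port: int = 22,
                      key_file: Optional[str] = None,
                      proxy: Optional[str] = None) -> List[str]:
    """Assemble an ssh argument list for a node, optionally via a jump host."""
    cmd = ["ssh", "-p", str(port)]
    if key_file:
        cmd += ["-i", key_file]   # identity (private key) file
    if proxy:
        cmd += ["-J", proxy]      # ProxyJump through an intermediate host
    cmd.append(f"{user}@{host}")
    return cmd


if __name__ == "__main__":
    # Hypothetical node reached through a hypothetical gateway.
    print(" ".join(build_ssh_command("aloja", "node-01",
                                     proxy="aloja@gateway.example.org")))
```

Returning a list instead of a single string keeps the result safe to pass to `subprocess.run` without shell quoting issues.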

Providers
On-premise
- Custom settings for clusters
  • Multiple disk types
  • Different architectures
Cloud IaaS
- Azure, OpenStack, Rackspace, AWS (testing)
Cloud PaaS
- HDInsight, CloudBigData, EMR soon
Code at https://github.com/Aloja/aloja/tree/master/aloja-deploy

Cluster and nodes definitions: multi-provider abstraction

Steps to define a cluster
Import defaults (if any)
- Sets OS version
Select provider
- Azure, RackSpace, AWS, On-premise, vagrant…
Name the cluster and size
Optional
- Select VM type
- Attached disks
- Define metadata
- And costs
Nodes can also be defined
- For Web, share folders, etc.
You can logically split clusters

Azure 8-datanode sample

# load AZURE defaults
source "$CONF_DIR/cluster_defaults.conf"

clusterName=azure-large-8
numberOfNodes=8
vmSize=Large
attachedVolumes=3
diskSize=1024 # in GB

# details
vmCores=4
vmRAM=7 # in GB

# costs
clusterCostHour=1.584 # in USD
clusterType=IaaS

Source sample: https://github.com/Aloja/aloja/blob/master/shell/conf/cluster_al-08.conf
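To show how such a definition can be consumed, the sketch below reads the `key=value` lines of the Azure sample and derives the total attached storage. It is an illustration only: in ALOJA the file is a shell fragment that gets sourced, and this simplified parser (the name `parse_cluster_conf` is hypothetical) ignores `source` lines and shell expansion.

```python
def parse_cluster_conf(text: str) -> dict:
    """Parse simple key=value lines, dropping '#' comments and 'source' lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip trailing comments
        if line.startswith("source ") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Values copied from the Azure 8-datanode sample above.
SAMPLE = """
# load AZURE defaults
source "$CONF_DIR/cluster_defaults.conf"
clusterName=azure-large-8
numberOfNodes=8
vmSize=Large
attachedVolumes=3
diskSize=1024 # in GB
"""

conf = parse_cluster_conf(SAMPLE)
# Total attached storage = nodes x volumes per node x size per volume.
total_gb = int(conf["numberOfNodes"]) * int(conf["attachedVolumes"]) * int(conf["diskSize"])

if __name__ == "__main__":
    print(conf["clusterName"], "total attached storage (GB):", total_gb)
```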


2) ALOJA-WEB Online Repository
Available at http://aloja.bsc.es

Entry point to explore the results collected from the executions
- Index of executions
  • Quick glance of executions
  • Searchable, sortable
- Execution details
  • Performance charts and histograms
  • Hadoop counters
  • Jobs and task details
Data management of benchmark executions
- Data importing from different clusters
- Execution validation
- Data management and backup
Cluster definitions
- Cluster capabilities (resources)
- Cluster costs
Sharing results
- Download executions
- Add external executions
Documentation and References
- Papers, links, and feature documentation

Comparing 3 runs on the same cluster, different configs

Mappers and reducers, 48-node cluster

URL: http://aloja.bsc.es/perfcharts?execs%5B%5D=90086&execs%5B%5D=90088&execs%5B%5D=90104
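The comparison URL above appears to pass one `execs[]` query parameter per execution ID. A minimal sketch of assembling such a link with the Python standard library follows. The `/perfcharts` path and the `?` separator are read off the slide's URL and are assumptions about the site; the helper name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://aloja.bsc.es/perfcharts"  # path as it appears on the slide


def perfcharts_url(exec_ids):
    """Build a comparison URL with one execs[] parameter per execution ID."""
    # urlencode percent-encodes the brackets: execs[] becomes execs%5B%5D
    query = urlencode([("execs[]", exec_id) for exec_id in exec_ids])
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    print(perfcharts_url([90086, 90088, 90104]))
```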

400s: 2 containers, local disk
600s: 2 containers, remote disk

CPU utilization, 48-node cluster
- Moderate iowait
- Higher iowait
- Very high iowait

CPU queues, 48-node cluster
- 1 blocked process
- 4 blocked processes
- 4 blocked processes (map phase)

Impact of HW configurations in Speedup
Panels: Disks and Network; Cloud remote volumes
Cloud remote volumes: Local only, 1 Remote, 2 Remotes, 3 Remotes, 3 Remotes (tmp local), 2 Remotes (tmp local), 1 Remote (tmp local)
Disks and Network: HDD-ETH, HDD-IB, SSD-ETH, SSD-IB
Y axis: Speedup (higher is better)
Results using http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
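Speedup in charts like this one is normally the baseline execution time divided by the execution time of each configuration, so values above 1 mean faster than the baseline. The sketch below assumes that definition. The execution times are made-up placeholders, not results from the ALOJA repository.

```python
def speedups(exec_times: dict, baseline: str) -> dict:
    """Speedup of each configuration relative to the baseline (higher is better)."""
    base = exec_times[baseline]
    return {name: base / seconds for name, seconds in exec_times.items()}


# Hypothetical execution times in seconds, for illustration only.
EXAMPLE_TIMES = {"HDD-ETH": 1000.0, "HDD-IB": 800.0, "SSD-ETH": 500.0, "SSD-IB": 400.0}

if __name__ == "__main__":
    for name, value in speedups(EXAMPLE_TIMES, baseline="HDD-ETH").items():
        print(f"{name}: {value:.2f}x")
```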

Clusters by cost-effectiveness
URL: http://aloja.bsc.es/clustercosteffectiveness
• Cluster ID reference
  • RL-06 = 8 performance1-8 VMs
  • RL-16 = 8 general1-8 VMs
  • RL-19 = 8 io1-15 VMs
Chart labels: Performance2-30, Io1-30, Io1-15, General1-8, Performance1-8, Io1-30

Cost/Performance Scalability of cluster size
This shows a sample of a new screen (with sample data) to find the most cost-effective cluster size
- X axis: number of datanodes (cluster size)
- Left Y: Execution time (lower is better)
- Right Y: Execution cost
Chart legend: Execution time, Execution cost; annotation: Recommended size
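One way to read the two Y axes together: the cost of an execution is the hourly cluster cost multiplied by the execution time in hours, and a recommended size can be picked as the one with the lowest execution cost. The sketch below assumes that selection rule, which the slide does not spell out. The sample data are invented for illustration.

```python
def execution_cost(exec_time_s: float, cluster_cost_hour: float) -> float:
    """Cost of one execution: hourly cluster cost times runtime in hours."""
    return cluster_cost_hour * exec_time_s / 3600.0


def recommended_size(runs: dict) -> int:
    """Pick the cluster size with the lowest execution cost (assumed criterion)."""
    return min(runs, key=lambda size: execution_cost(*runs[size]))


# Hypothetical data: datanodes -> (execution time in seconds, cluster cost per hour in USD).
SAMPLE_RUNS = {
    2: (3600.0, 0.5),
    4: (1600.0, 1.0),
    8: (900.0, 2.0),
    16: (700.0, 4.0),
}

if __name__ == "__main__":
    for size, (seconds, cost_hour) in SAMPLE_RUNS.items():
        print(size, "datanodes:", round(execution_cost(seconds, cost_hour), 3), "USD")
    print("recommended size:", recommended_size(SAMPLE_RUNS))
```

In this made-up data, doubling from 8 to 16 nodes shortens the run only slightly, so the larger cluster costs more per execution even though it is faster.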

Modeling Hadoop - Methodology

Methodology
- 3-step learning process
- Different split sizes tested (10% ≤ training ≤ 50%)
- Different learning algorithms: Regression trees, Nearest-neighbors methods, Linear/Multinomial regressions, Neural networks, Deep Learning
Learning results
- Mean Absolute Errors ~250s (ranges in [100s, 6000s])
- Relative Absolute Errors between [0.10, 0.25]
  • Depend on benchmark and number of examples per benchmark
  • Some executions are/may be anomalies
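For reference, the two error metrics quoted above can be computed as follows. This sketch assumes the usual definitions: MAE is the mean of the absolute prediction errors, and RAE is the total absolute error divided by the total absolute error of always predicting the mean. The slides do not state which exact formula was used, and the sample values are invented.

```python
def mean_absolute_error(actual, predicted):
    """Mean of |actual - predicted|, in the same unit as the data (seconds here)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_absolute_error(actual, predicted):
    """Total absolute error relative to always predicting the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    naive_error = sum(abs(a - mean_actual) for a in actual)
    return model_error / naive_error


# Hypothetical execution times (seconds) and model predictions.
ACTUAL = [100.0, 200.0, 300.0, 400.0]
PREDICTED = [110.0, 190.0, 320.0, 380.0]

if __name__ == "__main__":
    print("MAE:", mean_absolute_error(ACTUAL, PREDICTED))
    print("RAE:", relative_absolute_error(ACTUAL, PREDICTED))
```

An RAE below 1 means the model beats the naive mean predictor, which is why a range such as 0.10 to 0.25 reads as a substantial improvement.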


Diagram: the ALOJA data set is split into Training, Validation, and Testing. Train → Model → Test the model → Select this model? NO: tune algorithm, re-train. YES: Final Model → Test the model.
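The train, validate, test loop in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the ALOJA modeling code: it uses a toy 1-D nearest-neighbors regressor, synthetic data, a 50/25/25 split, and "tuning" reduced to choosing `k` by validation error. All names are hypothetical.

```python
import random


def split(data, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle and split rows into training, validation, and testing sets."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def knn_predict(train, x, k):
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mae(train, rows, k):
    """Mean absolute error of the k-NN model on the given rows."""
    return sum(abs(y - knn_predict(train, x, k)) for x, y in rows) / len(rows)


def select_model(train, val, candidates=(1, 3, 5)):
    """Steps 1 and 2: train each candidate and keep the best on validation."""
    return min(candidates, key=lambda k: mae(train, val, k))


# Synthetic (feature, execution time) pairs, for illustration only.
data = [(float(x), 100.0 + 10.0 * x) for x in range(40)]
train, val, test = split(data)
best_k = select_model(train, val)
# Step 3: report the error of the selected model on unseen test data.
test_error = mae(train, test, best_k)

if __name__ == "__main__":
    print("selected k:", best_k, "test MAE:", round(test_error, 2))
```

Keeping the test set out of model selection is what makes the final error estimate trustworthy.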

Knowledge Discovery
Make analyzing results easier
- Multi-variable visualization
- Trees separating relevant attributes
- Other interesting tools


Tree Descriptor
Disk=HDD
  Net=ETH
    IOFBuf=128KB ⇒ 2935s
  Net=IB
Disk=SSD
  Net=ETH
  Net=IB
    IOFBuf=64KB ⇒ 124s1
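A tree descriptor of this kind maps naturally onto a nested dictionary, where inner keys are attribute tests and leaves are predicted execution times. The sketch below is an assumption about one possible representation, not ALOJA's data structure. It only includes the one branch whose value is legible on the slide.

```python
def render_tree(node: dict, depth: int = 0) -> list:
    """Render a nested dict as indented descriptor lines; leaves are times in seconds."""
    lines = []
    for label, child in node.items():
        indent = "  " * depth
        if isinstance(child, dict):
            lines.append(indent + label)
            lines.extend(render_tree(child, depth + 1))
        else:
            lines.append(f"{indent}{label} => {child}s")
    return lines


# Only the branch with a readable value on the slide is reproduced here.
TREE = {"Disk=HDD": {"Net=ETH": {"IOFBuf=128KB": 2935}}}

if __name__ == "__main__":
    print("\n".join(render_tree(TREE)))
```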

Thank you
For further information please contact
david.carrera@bsc.es
www.bsc.es
