Automating Big Data Benchmarking and Performance Analysis with ALOJA

David Carrera, Senior Researcher
www.bsc.es
November 2015

Setting up your Big Data system

Hadoop
  - > 100 tunable parameters
  - Obscure and interrelated
    • mapred.{map,reduce}.tasks.speculative.execution
    • io.sort.mb: 100 (300)
    • io.sort.record.percent: 5% (15%)
    • io.sort.spill.percent: 80% (95 - 100%)
  - Similar for Hive, Spark, HBase
Dominated by rules of thumb
  - Number of containers in parallel
    • 0.5 - 2 per CPU core
Large stack for tuning

Image source: Intel® Distribution for Apache Hadoop
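
The containers-per-core rule of thumb gives a range rather than a single setting. A minimal Python sketch, assuming the bounds are 0.5 and 2 containers per core; the function name and the rounding choices are illustrative and are not part of Hadoop or ALOJA:

```python
import math

def container_range(cpu_cores, low_per_core=0.5, high_per_core=2.0):
    """Return (min, max) parallel containers per node under the rule of thumb."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    # Always allow at least one container, even on a single-core node.
    lowest = max(1, math.floor(cpu_cores * low_per_core))
    highest = max(lowest, math.ceil(cpu_cores * high_per_core))
    return lowest, highest

for cores in (2, 4, 8):
    print(cores, "cores ->", container_range(cores))
```

The range only narrows the search space; which value inside it works best still depends on the job, which is the argument for benchmarking rather than guessing.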

How do I set up my system? Too many options

Default values in the Apache source are not ideal
Large and scattered ecosystem
  - Different distributions
  - Product claims
Each job is different
  - No one-size-fits-all solution
Cloud vs. on-premise
  - IaaS
    • Tens of different VMs to choose from
  - PaaS
    • HDInsight, CloudBigData, EMR
New, economical HW
  - SSDs, InfiniBand networking

[Figure: sample mappers and reducers for 3 popular benchmarks: Terasort, K-means, Wordcount]

Ecosystem is not transparent
  - Needs auditing
Product claims on performance and TCO

BSC's project ALOJA: towards cost-effective Big Data

Benchmarking and analysis tools
Online repository and largest Big Data repository
  - 50,000+ runs of HiBench, TPC-H, and [some] BigBench
  - Over 100 HW configurations tested
    • Different node/VM types, disks, and networks
    • Cloud: multi-cloud provider, including both IaaS and PaaS
    • On-premise: high-end, HPC, commodity, low-power
Community
  - Collaborations with industry and academia
  - Presented at different conferences and workshops
  - Visibility: 47 different countries

http://aloja.bsc.es

Platform components: Big Data Benchmarking, Online Repository, Web Analytics

Commands and providers

Provisioning commands
Connect
  - Node and cluster
  - Builds the SSH command line
    • SSH proxies
Deploy
  - Creates a cluster
  - Sets SSH credentials
  - If already created, updates config as needed
  - If stopped, starts nodes
Start, Stop
Delete
Queue jobs to clusters
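
The Connect command is described as building an SSH command line, optionally through a proxy. A minimal Python sketch of that idea, using OpenSSH's `-J` (ProxyJump) option; the function name, host names, and user names are hypothetical, and this does not reproduce ALOJA's own shell scripts:

```python
import shlex

def build_ssh_command(user, host, port=22, proxy=None, remote_cmd=None):
    """Build an ssh argument list; 'proxy' is an optional jump host (user@host)."""
    args = ["ssh", "-p", str(port)]
    if proxy:
        # -J (ProxyJump) is available in OpenSSH 7.3 and later.
        args += ["-J", proxy]
    args.append(f"{user}@{host}")
    if remote_cmd:
        args.append(remote_cmd)
    return args

# Hypothetical node and gateway names, for illustration only.
cmd = build_ssh_command("hadoop", "node-01.example.org",
                        proxy="ops@gateway.example.org", remote_cmd="uptime")
print(shlex.join(cmd))
```

Returning an argument list instead of a single string keeps quoting out of the builder, so the same list can be passed to `subprocess.run` or printed for inspection.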

Providers
On-premise
  - Custom settings for clusters
    • Multiple disk types
    • Different architectures
Cloud IaaS
  - Azure, OpenStack, Rackspace, AWS (testing)
Cloud PaaS
  - HDInsight, CloudBigData, EMR soon

Code at https://github.com/Aloja/aloja/tree/master/aloja-deploy

Cluster and node definitions: multi-provider abstraction

Steps to define a cluster
Import defaults (if any)
  - Sets OS version
Select provider
  - Azure, Rackspace, AWS, on-premise, Vagrant, …
Name the cluster and set its size
Optional
  - Select VM type
  - Attached disks
  - Define metadata
  - And costs
Nodes can also be defined
  - For web, shared folders, etc.
You can logically split clusters

Azure 8-datanode sample

# load AZURE defaults
source "$CONF_DIR/cluster_defaults.conf"

clusterName=azure-large-8
numberOfNodes=8
vmSize=Large
attachedVolumes=3
diskSize=1024 # in GB

# details
vmCores=4
vmRAM=7 # in GB

# costs
clusterCostHour=1.584 # in USD
clusterType=IaaS

Source sample: https://github.com/Aloja/aloja/blob/master/shell/conf/cluster_al-08.conf
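
A cluster definition of this kind can feed a per-run cost estimate. The Python sketch below assumes three things the slide does not state: that the hourly cost reads 1.584 USD (the decimal point is missing in the transcript), that cost is pro-rated by execution time, and that the file can be read as plain `key=value` lines. The function names are illustrative, not ALOJA's API:

```python
def parse_cluster_conf(text):
    """Parse key=value lines, dropping '#' comments and 'source' directives."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("source ") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings

def execution_cost(settings, exec_seconds):
    """Pro-rate the hourly cluster cost over the run (assumed billing model)."""
    return float(settings["clusterCostHour"]) * exec_seconds / 3600.0

SAMPLE = """
clusterName=azure-large-8
numberOfNodes=8
vmCores=4
vmRAM=7 #in GB
clusterCostHour=1.584 #in USD, decimal point assumed
clusterType=IaaS
"""

conf = parse_cluster_conf(SAMPLE)
print(conf["clusterName"], "cost of a 30-minute run:",
      round(execution_cost(conf, 1800), 3), "USD")
```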


2) ALOJA-WEB Online Repository

Entry point to explore the results collected from the executions
  - Index of executions
    • Quick glance at executions
    • Searchable, sortable
  - Execution details
    • Performance charts and histograms
    • Hadoop counters
    • Job and task details
Data management of benchmark executions
  - Data importing from different clusters
  - Execution validation
  - Data management and backup
Cluster definitions
  - Cluster capabilities (resources)
  - Cluster costs
Sharing results
  - Download executions
  - Add external executions
Documentation and references
  - Papers, links, and feature documentation

Available at http://aloja.bsc.es

Comparing 3 runs on the same cluster with different configs

URL: http://aloja.bsc.es/perfcharts?execs%5B%5D=90086&execs%5B%5D=90088&execs%5B%5D=90104

Mappers and reducers, 48-node cluster
  - 400s, 2 containers, local disk
  - 800s, 3 containers, local disk
  - 600s, 2 containers, remote disk

CPU utilization, 48-node cluster
  - Moderate iowait
  - Higher iowait
  - Very high iowait

CPU queues, 48-node cluster
  - 1 blocked process
  - 4 blocked processes
  - 4 blocked processes (map phase)
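
The comparison URL carries one `execs[]` query parameter per execution ID. A minimal Python sketch of building such a link with the standard library; the base URL is reconstructed from the slide and should be treated as an assumption, and the function name is illustrative:

```python
from urllib.parse import urlencode

BASE_URL = "http://aloja.bsc.es/perfcharts"  # assumed from the slide's URL

def comparison_url(exec_ids, base_url=BASE_URL):
    """Build a performance-charts URL with one execs[] parameter per run."""
    # urlencode percent-encodes the brackets in the key as %5B%5D.
    query = urlencode([("execs[]", exec_id) for exec_id in exec_ids])
    return f"{base_url}?{query}"

print(comparison_url([90086, 90088, 90104]))
```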

Impact of HW configurations on speedup

[Figure, two panels; Y axis: speedup (higher is better)]
  - Disks and network: HDD-ETH, HDD-IB, SSD-ETH, SSD-IB
  - Cloud remote volumes: local only; 1, 2, and 3 remotes; 3, 2, and 1 remotes with /tmp local

Results using http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
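
Speedup charts of this kind are usually computed as the baseline execution time divided by each configuration's execution time. The slide does not say which configuration serves as the baseline, so the Python sketch below takes it as a parameter; the timings are hypothetical and only illustrate the arithmetic:

```python
def speedups(exec_times, baseline):
    """Speedup of each configuration relative to the baseline (higher is better)."""
    base = exec_times[baseline]
    return {name: base / seconds for name, seconds in exec_times.items()}

# Hypothetical execution times in seconds, for illustration only.
times = {"HDD-ETH": 3000.0, "HDD-IB": 2900.0, "SSD-ETH": 1500.0, "SSD-IB": 1200.0}

for name, value in speedups(times, "HDD-ETH").items():
    print(f"{name}: {value:.2f}x")
```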

Clusters by cost-effectiveness

URL: http://aloja.bsc.es/clustercosteffectiveness

Cluster ID reference
  • RL-06 = 8 performance1-8 VMs
  • RL-16 = 8 general1-8 VMs
  • RL-19 = 8 io1-15 VMs
  • RL-33 = 8 performance2-30 VMs
  • RL-30 = 8 io1-30 VMs

[Chart labels: Performance2-30, Io1-30, Io1-15, General1-8, Performance1-8]

Cost/performance scalability of cluster size

This shows a sample of a new screen (with sample data) to find the most cost-effective cluster size
  - X axis: number of datanodes (cluster size)
  - Left Y axis: execution time (lower is better)
  - Right Y axis: execution cost

[Chart series: execution time, execution cost; marker: recommended size]
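
One way to read the "recommended size" marker is as the cluster size with the lowest execution cost. The slide does not define the criterion, so the Python sketch below is an assumption: cost is taken as nodes × hourly price per node × execution time, ties go to the faster run, and all numbers are hypothetical:

```python
def run_cost(nodes, exec_seconds, cost_per_node_hour):
    """Cost of one run, pro-rating the hourly price of every node."""
    return nodes * cost_per_node_hour * exec_seconds / 3600.0

def recommend_size(runs, cost_per_node_hour):
    """Return (nodes, exec_seconds, cost) for the cheapest run; ties go to the faster one."""
    costed = [(n, t, run_cost(n, t, cost_per_node_hour)) for n, t in runs]
    return min(costed, key=lambda r: (r[2], r[1]))

# Hypothetical (datanodes, execution seconds) pairs, for illustration only.
runs = [(4, 4000), (8, 1800), (16, 1000), (32, 700)]

nodes, seconds, cost = recommend_size(runs, cost_per_node_hour=0.2)
print(f"recommended: {nodes} datanodes, {seconds}s, {cost:.2f} USD")
```

With these sample numbers, doubling from 4 to 8 nodes more than halves the execution time, so the run gets cheaper; beyond that the time savings no longer offset the extra nodes.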

Modeling Hadoop - Methodology

Methodology
  - 3-step learning process
  - Different split sizes tested (10% ≤ training ≤ 50%)
  - Different learning algorithms: regression trees, nearest-neighbor methods, linear/multinomial regressions, neural networks, deep learning
Learning results
  - Mean Absolute Errors ~250s (ranges in [100s, 6000s])
  - Relative Absolute Errors in [0.10, 0.25]
    • Depend on the benchmark and the number of examples per benchmark
    • Some executions are/may be anomalies
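
The two error measures can be computed as follows. The slide does not define Relative Absolute Error, so this Python sketch assumes the common definition: the model's total absolute error divided by that of a predictor that always outputs the mean of the observed values. The sample numbers are hypothetical:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def relative_absolute_error(actual, predicted):
    """Total absolute error relative to always predicting the mean of 'actual'."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    naive_error = sum(abs(a - mean_actual) for a in actual)
    return model_error / naive_error

# Hypothetical execution times in seconds, for illustration only.
observed = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3800]

print("MAE:", mean_absolute_error(observed, predicted), "s")
print("RAE:", round(relative_absolute_error(observed, predicted), 2))
```

An RAE below 1 means the model beats the mean predictor; values between 0.10 and 0.25 correspond to 10-25% of that naive error.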

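[Diagram: 3-step learning process. The ALOJA data set is split into training, validation, and testing sets. A model is trained on the training set and tested on the validation set. "Select this model?" NO: tune the algorithm and re-train. YES: test the final model on the testing set.]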
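
The train/validate/test loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ALOJA's implementation: a one-feature nearest-neighbor regressor stands in for the algorithms listed on the slide, "tuning" means trying several values of k, and the data set is synthetic:

```python
import random

def split_dataset(rows, train_frac, valid_frac, seed=0):
    """Shuffle and split rows into training, validation, and testing lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def knn_predict(train, x, k):
    """Average the targets of the k training rows whose feature is closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mae(train, rows, k):
    """Mean absolute error of the k-NN model on the given rows."""
    return sum(abs(y - knn_predict(train, x, k)) for x, y in rows) / len(rows)

def select_model(rows, candidate_ks=(1, 3, 5), train_frac=0.5, valid_frac=0.25):
    train, valid, test = split_dataset(rows, train_frac, valid_frac)
    # Steps 1-2: fit each candidate on the training set, compare on validation.
    best_k = min(candidate_ks, key=lambda k: mae(train, valid, k))
    # Step 3: report the chosen model's error on the untouched testing set.
    return best_k, mae(train, test, best_k)

# Synthetic data: execution time grows with input size, plus noise.
rng = random.Random(42)
data = [(size, 100.0 + 20.0 * size + rng.uniform(-15, 15)) for size in range(1, 41)]

best_k, test_mae = select_model(data)
print("selected k:", best_k, "test MAE:", round(test_mae, 1))
```

Keeping the testing set out of the selection step is what makes the final error estimate trustworthy; the validation set absorbs the bias introduced by tuning.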

Knowledge Discovery

Make analyzing results easier
  - Multi-variable visualization
  - Trees separating relevant attributes
  - Other interesting tools

Tree descriptor

Disk=HDD
  Net=ETH
    IO.FBuf=128KB ⇒ 2935s
    IO.FBuf=64KB ⇒ 2942s
  Net=IB
    IO.FBuf=128KB ⇒ 3118s
    IO.FBuf=64KB ⇒ 3125s
Disk=SSD
  Net=ETH
    IO.FBuf=128KB ⇒ 1248s
    IO.FBuf=64KB ⇒ 1256s
  Net=IB
    IO.FBuf=128KB ⇒ 1233s
    IO.FBuf=64KB ⇒ 1241s
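
A tree descriptor like this maps directly onto a nested lookup structure. The Python sketch below encodes the leaves as read from the slide; the attribute spelling `IO.FBuf` and the last leaf (garbled in the transcript, read here as 1241s) are assumptions, and the dictionary layout is illustrative:

```python
# Leaves are predicted execution times in seconds, as read from the slide.
TREE = {"attribute": "Disk", "branches": {
    "HDD": {"attribute": "Net", "branches": {
        "ETH": {"attribute": "IO.FBuf", "branches": {"128KB": 2935, "64KB": 2942}},
        "IB": {"attribute": "IO.FBuf", "branches": {"128KB": 3118, "64KB": 3125}},
    }},
    "SSD": {"attribute": "Net", "branches": {
        "ETH": {"attribute": "IO.FBuf", "branches": {"128KB": 1248, "64KB": 1256}},
        # The last leaf is garbled in the transcript; 1241 is an assumed reading.
        "IB": {"attribute": "IO.FBuf", "branches": {"128KB": 1233, "64KB": 1241}},
    }},
}}

def predict(node, config):
    """Walk the tree by the configuration's attribute values until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][config[node["attribute"]]]
    return node

config = {"Disk": "SSD", "Net": "ETH", "IO.FBuf": "128KB"}
print(config, "->", predict(TREE, config), "s")
```

Reading the tree top-down also shows which attribute matters most: the split on disk type separates the leaves far more than the network or buffer-size splits below it.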

Thank you

For further information please contact
david.carrera@bsc.es

www.bsc.es


Commands and providers

Provisioning commands

Connect
- Node and Cluster
- Builds SSH cmd line
  • SSH proxies

Deploy
- Creates a cluster
- Sets SSH credentials
- If created, updates config as needed
- If stopped, starts nodes

Start, Stop

Delete

Queue jobs to clusters

Providers

On-premise
- Custom settings for clusters
  • Multiple disk types
  • Different architectures

Cloud IaaS
- Azure, OpenStack, Rackspace, AWS (testing)

Cloud PaaS
- HDInsight, CloudBigData, EMR soon

Code at https://github.com/Aloja/aloja/tree/master/aloja-deploy
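The Connect command is described as building an SSH command line, optionally through SSH proxies. A minimal Python sketch of that idea follows. The function name, user, and host names are hypothetical, and it relies only on OpenSSH's standard `-J` (ProxyJump) option; ALOJA's own deploy scripts may build the command differently.

```python
import shlex


def build_ssh_command(user, host, proxies=(), port=22):
    """Return an argv list for ssh, hopping through optional proxy hosts.

    All names here are illustrative, not taken from ALOJA's scripts.
    """
    argv = ["ssh", "-p", str(port)]
    if proxies:
        # OpenSSH accepts a comma-separated list of jump hosts with -J.
        argv += ["-J", ",".join(proxies)]
    argv.append(f"{user}@{host}")
    return argv


# Hypothetical node reached through one gateway.
cmd = build_ssh_command("aloja", "node-01.example.org",
                        proxies=["gateway.example.org"])
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.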

Cluster and nodes definitions: multi-provider abstraction

Steps to define a cluster

Import defaults (if any)
- Sets OS version

Select provider
- Azure, RackSpace, AWS, On-premise, vagrant…

Name the cluster and size

Optional
- Select VM type
- Attached disks
- Define metadata
- And costs

Nodes can also be defined
- For Web, share folders, etc.

You can logically split clusters

Azure 8-datanode sample

    # load AZURE defaults
    source "$CONF_DIR/cluster_defaults.conf"

    clusterName=azure-large-8
    numberOfNodes=8
    vmSize=Large
    attachedVolumes=3
    diskSize=1024 # in GB

    # details
    vmCores=4
    vmRAM=7 # in GB

    # costs
    clusterCostHour=1.584 # in USD
    clusterType=IaaS

Source sample: https://github.com/Aloja/aloja/blob/master/shell/conf/cluster_al-08.conf
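A cluster definition of this kind is a list of shell-style `key=value` assignments, so aggregate figures such as total cores or total RAM can be derived from it. The sketch below parses such text in Python. The parser and the inlined sample text are illustrative assumptions, and the `1.584` hourly cost assumes a decimal point that appears to have been lost in the slide transcript.

```python
def parse_cluster_conf(text):
    """Parse simple key=value lines, ignoring comments and `source` lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("source ") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip().strip('"')
    return conf


SAMPLE = """
# load AZURE defaults
source "$CONF_DIR/cluster_defaults.conf"
clusterName=azure-large-8
numberOfNodes=8
vmSize=Large
vmCores=4
vmRAM=7 # in GB
clusterCostHour=1.584 # in USD (decimal point assumed)
clusterType=IaaS
"""

conf = parse_cluster_conf(SAMPLE)
total_cores = int(conf["numberOfNodes"]) * int(conf["vmCores"])
total_ram_gb = int(conf["numberOfNodes"]) * int(conf["vmRAM"])
cost_hour = float(conf["clusterCostHour"])
print(conf["clusterName"], total_cores, total_ram_gb, cost_hour)
```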


2) ALOJA-WEB Online Repository

Entry point for exploring the results collected from the executions
- Index of executions
  • Quick glance of executions
  • Searchable, sortable
- Execution details
  • Performance charts and histograms
  • Hadoop counters
  • Jobs and task details

Data management of benchmark executions
- Data importing from different clusters
- Execution validation
- Data management and backup

Cluster definitions
- Cluster capabilities (resources)
- Cluster costs

Sharing results
- Download executions
- Add external executions

Documentation and References
- Papers, links, and feature documentation

Available at http://aloja.bsc.es

Comparing 3 runs on same cluster, different configs (48-node cluster)

URL: http://aloja.bsc.es/perfcharts?execs%5B%5D=90086&execs%5B%5D=90088&execs%5B%5D=90104

Mappers and reducers
- 400 s, 2 containers, local disk
- 800 s, 3 containers, local disk
- 600 s, 2 containers, remote disk

CPU utilization
- Moderate iowait
- Higher iowait
- Very high iowait

CPU queues
- 1 blocked process
- 4 blocked processes
- 4 blocked processes (map phase)
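The comparison URL passes several execution IDs through a repeated `execs[]` query parameter. As a sketch, such a URL can be assembled with Python's standard library. The base address is reconstructed from the slide transcript and should be treated as an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://aloja.bsc.es/perfcharts"  # assumed reconstruction


def perfcharts_url(exec_ids, base=BASE_URL):
    """Build a comparison URL with one `execs[]` parameter per execution."""
    query = urlencode([("execs[]", exec_id) for exec_id in exec_ids])
    return f"{base}?{query}"


url = perfcharts_url([90086, 90088, 90104])
print(url)

# Round trip: decode the query string back into the list of IDs.
ids = parse_qs(urlsplit(url).query)["execs[]"]
print(ids)
```

`urlencode` percent-encodes the square brackets as `%5B%5D`, which matches the `5B5D` fragments visible in the transcribed URL.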

Impact of HW configurations in Speedup

Disks and Network; Cloud remote volumes

[Charts: speedup (higher is better). Cloud remote volumes: Local only, 1 Remote, 2 Remotes, 3 Remotes, 3 Remotes with tmp local, 2 Remotes with tmp local, 1 Remote with tmp local. Disks and network: HDD-ETH, HDD-IB, SSD-ETH, SSD-IB.]

Results using http://hadoop.bsc.es/configimprovement

Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
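Speedup is conventionally the ratio between a baseline execution time and the time of the configuration being compared, so values above 1 mean the configuration is faster. A small sketch using the three run times quoted in the run-comparison slides (400 s, 800 s, 600 s) follows. Taking the slowest run as the baseline is an assumption made for illustration and is not necessarily how ALOJA picks its baseline.

```python
def speedups(times, baseline=None):
    """Map each label to baseline_time / time (higher is better)."""
    if baseline is None:
        baseline = max(times.values())  # assumption: slowest run is baseline
    return {label: baseline / t for label, t in times.items()}


# Execution times in seconds, as quoted for the three compared runs.
runs = {
    "2 containers, local disk": 400,
    "3 containers, local disk": 800,
    "2 containers, remote disk": 600,
}

result = speedups(runs)
for label, s in result.items():
    print(f"{label}: {s:.2f}x")
```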

Clusters by cost-effectiveness

URL: http://aloja.bsc.es/clustercosteffectiveness

Cluster ID reference
- RL-06 = 8 performance1-8 VMs
- RL-16 = 8 general1-8 VMs
- RL-19 = 8 io1-15 VMs
- RL-33 = 8 performance2-30 VMs
- RL-30 = 8 io1-30 VMs

[Chart: clusters by cost-effectiveness; points labeled Performance2-30, Io1-30, Io1-15, General1-8, Performance1-8.]

Cost/Performance Scalability of cluster size

This shows a sample of a new screen (with sample data) to find the most cost-effective cluster size
- X axis: number of datanodes (cluster size)
- Left Y: Execution time (lower is better)
- Right Y: Execution cost

[Chart: execution time and execution cost versus cluster size, with the recommended size marked.]
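The execution cost on the right axis can be derived from the execution time and the hourly price of the cluster, and one simple way to recommend a size is to pick the one with the lowest cost per run. The sketch below assumes a per-node hourly price and made-up execution times; none of these figures come from ALOJA, and the real screen may weigh execution time as well.

```python
def execution_cost(exec_time_s, nodes, node_cost_hour):
    """Cost of one run: time in hours times the hourly price of all nodes."""
    return exec_time_s / 3600 * nodes * node_cost_hour


def recommend_size(exec_times, node_cost_hour):
    """Return (nodes, cost) for the cheapest run among the tested sizes."""
    costs = {n: execution_cost(t, n, node_cost_hour)
             for n, t in exec_times.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Hypothetical sample data: datanodes -> execution time in seconds.
sample_times = {4: 3600, 8: 1500, 16: 900, 32: 700}
NODE_COST_HOUR = 0.5  # hypothetical price in USD per node per hour

size, cost = recommend_size(sample_times, NODE_COST_HOUR)
print(size, round(cost, 2))
```

In this made-up data, doubling from 4 to 8 nodes more than halves the run time, so the run gets cheaper; beyond that the time savings no longer pay for the extra nodes.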

Modeling Hadoop - Methodology

Methodology
- 3-step learning process
- Different split sizes tested (10% ≤ training ≤ 50%)
- Different learning algorithms: regression trees, nearest-neighbors methods, linear/multinomial regressions, neural networks, deep learning

Learning results
- Mean Absolute Errors ~250 s (ranges in [100 s, 6000 s])
- Relative Absolute Errors between [0.10, 0.25]
  • Depend on the benchmark and on the number of examples per benchmark
  • Some executions are, or may be, anomalies

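[Diagram: the ALOJA data set is split into Training, Validation, and Testing sets. A model is trained and tested on the validation data. "Select this model?" NO: tune the algorithm and re-train. YES: it becomes the final model, which is then tested on the testing data.]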
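The three-step process (train, validate to select or tune, then test) and the two error measures can be sketched in plain Python. Everything below is an illustrative assumption: the data is synthetic, the learner is a small nearest-neighbors regressor chosen for brevity, and the relative absolute error is computed in the usual way, as the model's total absolute error divided by that of always predicting the mean.

```python
import random


def split(data, train_frac, val_frac, seed=0):
    """Shuffle and split into training, validation and testing lists."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn_predict(train, x, k):
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: distance(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mae(train, rows, k):
    """Mean absolute error of the k-NN model on the given rows."""
    return sum(abs(y - knn_predict(train, x, k)) for x, y in rows) / len(rows)


def rae(train, rows, k):
    """Absolute error relative to always predicting the mean target."""
    mean_y = sum(y for _, y in rows) / len(rows)
    model_err = sum(abs(y - knn_predict(train, x, k)) for x, y in rows)
    naive_err = sum(abs(y - mean_y) for _, y in rows)
    return model_err / naive_err


# Synthetic executions: (is_ssd, is_infiniband) -> execution time in seconds.
rng = random.Random(42)
data = []
for _ in range(200):
    ssd, ib = rng.randint(0, 1), rng.randint(0, 1)
    time_s = 3000 - 1700 * ssd - 20 * ib + rng.uniform(-50, 50)
    data.append(((ssd, ib), time_s))

train, val, test = split(data, train_frac=0.5, val_frac=0.25)

# Steps 1-2: fit candidates and keep the one with the lowest validation error.
best_k = min((1, 3, 5, 9), key=lambda k: mae(train, val, k))
# Step 3: report the error of the selected model on the untouched test set.
test_mae = mae(train, test, best_k)
test_rae = rae(train, test, best_k)
print(best_k, round(test_mae, 1), round(test_rae, 3))
```

Keeping the testing set out of model selection is what makes the final error estimate trustworthy.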

Knowledge Discovery

Make analyzing results easier
- Multi-variable visualization
- Trees separating relevant attributes
- Other interesting tools

Tree Descriptor

Disk=HDD
  Net=ETH
    IOFBuf=128KB ⇒ 2935s
    IOFBuf=64KB ⇒ 2942s
  Net=IB
    IOFBuf=128KB ⇒ 3118s
    IOFBuf=64KB ⇒ 3125s
Disk=SSD
  Net=ETH
    IOFBuf=128KB ⇒ 1248s
    IOFBuf=64KB ⇒ 1256s
  Net=IB
    IOFBuf=128KB ⇒ 1233s
    IOFBuf=64KB ⇒ 1241s
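A descriptor tree like this one can be read as nested lookups from attribute values to a predicted execution time. The sketch below encodes the listed leaves as a Python dictionary and compares disk types. The data structure and helpers are illustrative, and the last leaf is entered as 1241 s on the assumption that the transcript's `124s1` is a garbled `1241s`.

```python
# Leaves are predicted execution times in seconds, keyed by
# disk type, network type and I/O file buffer size.
TREE = {
    "HDD": {
        "ETH": {"128KB": 2935, "64KB": 2942},
        "IB": {"128KB": 3118, "64KB": 3125},
    },
    "SSD": {
        "ETH": {"128KB": 1248, "64KB": 1256},
        "IB": {"128KB": 1233, "64KB": 1241},  # 1241 is an assumed reading
    },
}


def predict(disk, net, iofbuf):
    """Walk the tree: Disk, then Net, then IOFBuf."""
    return TREE[disk][net][iofbuf]


def mean_time(disk):
    """Average predicted time over all leaves under one disk type."""
    leaves = [t for net in TREE[disk].values() for t in net.values()]
    return sum(leaves) / len(leaves)


print(predict("SSD", "ETH", "128KB"))
print(round(mean_time("HDD") / mean_time("SSD"), 2))
```

Read this way, the disk type is the attribute that separates the results most, while the buffer size shifts the prediction by only a few seconds.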

www.bsc.es

Thank you

For further information please contact
david.carrera@bsc.es
