the effect of system utilization on application performance … · 2019. 6. 26. ·...

20
The Effect of System Utilization on Application Performance Variability Boyang Li*, Sudheer Chunduri+ , Kevin Harms+ , Yuping Fan* , Zhiling Lan* Illinois Institute of Technology* Argonne National Laboratory+

Upload: others

Post on 28-Aug-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

The Effect of System Utilization on ApplicationPerformance Variability

Boyang Li*, Sudheer Chunduri+ , Kevin Harms+ , Yuping Fan* , Zhiling Lan*

Illinois Institute of Technology*Argonne National Laboratory+

Page 2: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Outline

Related Work

Motivation

1

Project Contributions

Summary

Page 3: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Dragonfly topology becomes popular

• High-radix• Low-diameter

Theta at Argonne

• 4,392 nodes• Peak performance of 11.69 petaflops• 2D-Dragonfly topology

Dragonfly topology

Motivation

2

Performance variability due to network sharing!

Page 4: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

• Job placement

• Routing policy

• Task mapping

q Communication interference due to network contention is a dominant cause of performance variability.

q Existing studies of exploiting job scheduling to mitigate communication interference:

Related Work

[1] Nikhil Jain, Abhinav Bhatele, Xiang Ni, Nicholas J Wright, and LaxmikantV Kale. 2014. Maximizing throughput on a dragonfly network SC14’

[2] Xu Yang, John Jenkins, Misbah Mubarak, Robert B Ross, and Zhiling Lan. 2016. Watch out for the bully! job interference study on dragonfly network. SC16’

[3]Xin Wang, Misbah Mubarak, Xu Yang, Robert B Ross, and Zhiling Lan. 2018. Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System. IPDPS18’

3

Page 5: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Distinct from previous studies, we investigate how system utilization influences application runtime variability.

Overview

4

• Empirical analysis:

- Log analysis- Application experiments (over 4000 tests)

• New scheduling design:

-CEIL (Cut-off Extreme hIgh utiLization) design

Page 6: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Empirical Study - Log Analysis

5

• Records belong to the same application: all of the above Aprun log information is matched• Fifteen applications that have multiple executions are identified.• Top five applications with high repetition frequency for various job sizes are presented.

Table: Theta Aprun log field names and description

Table: Logs of Theta at ALCF

Page 7: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Application runtimes (Jan-March of 2018 on Theta) under different system utilization rates.

Empirical Study - Log Analysis

Positive correlation between high system utilization and application performance degradation (up to 21%)

Maximum runtime always occurred during high utilization periods.

6

Page 8: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Empirical Study - Application Experiments

Ø Four applications: MILC, Reordered MILC, Nek5000, NEKBONE

Ø Over 4000 application tests in total on different days and times

Ø Cobalt log => average system utilization during these application runs.

Table: Experiment description

7

Page 9: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Empirical Study - Application Experiments

8

Same observation as from log analysis!

Page 10: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Rethink HPC Scheduling Design

8

Q: Shall we solely target high system utilization on Dragonfly system for scheduling?

Page 11: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Illustrative Example

Scheduling for utilization vs for productivity

High system utilization does not necessarily mean high system productivity

9

• Nine 9-node jobs and nine 1-node jobs, each having a runtime estimate of 5 hours

• Assume each application’s runtime will be increased by 20% (thus becoming 6 hours) due to network sharing when system utilization is greater than a threshold (e.g., 95%).

Page 12: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

• Resource utilization exhibits a fluctuating pattern throughout a day.

• Not all the users are in a hurry for the job completion.

CEIL: Scheduling Design

Two assumptions:

10

Day 10.0

0.2

0.4

0.6

0.8

1.0

Sys

tem

utili

zatio

n

Actual system utilizationSystem utilization under CEIL95% utilization

Day 20.0

0.2

0.4

0.6

0.8

1.0

Actual system utilizationSystem utilization under CEIL95% utilization

Day 30.0

0.2

0.4

0.6

0.8

1.0

Actual system utilizationSystem utilization under CEIL95% utilization

Page 13: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

CEIL: Scheduling Design

CEIL (Cut-off Extreme hIgh utiLization) scheduling design:

Ø There is an additional Postpone Queue besides traditional Waiting Queue

Ø Only the jobs in the Waiting Queue can be scheduled for execution.

Ø One of the following conditions is satisfied, jobs move from Postpone Queue to Waiting Queue

• Empty Waiting Queue• Low utilization• Approaching user’s expected job completion time

Flowchart of CEIL design11

Page 14: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Scheduling Evaluation

Table: Workload traces from Theta at ALCF

Table: Workloads with various postponed rates

• Theta workload logs

• Synthetic logs

• Trace-based scheduling simulator: CQSimCQSim github link: https://github.com/SPEAR-IIT/CQSim

12

Page 15: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Evaluation Metrics

System centric metrics:

Ø Makespan (e.g., to evaluate scheduling throughput)-Total length of the schedule to complete all the jobs.

Ø Percentage of high utilization periods-Proportion of the time when the system utilization is higher than 95% in this study

User centric metrics:

Ø User wait time -Time period between a job’s expected end time and its actual end time.

Ø Job bounded slowdown-Ratio of job response time (user wait time plus job runtime) to the job runtime

13

Page 16: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

System Centric Results

Table: Comparison of system-level scheduling metrics

CEIL can significantly reduce the percentage of high utilization periods.

CEIL does not does not impact system throughput.

14

• We compare CEIL with WFP (original scheduling policy deployed on Theta).

• EASY Backfilling is used to mitigate resource fragmentation.

Page 17: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

User Centric Results

Comparison of CEIL and WFP

CEIL can effectively reduce average user wait time by 12.5%-35.3%.

Job bounded slowdown is reduced by 7.4%−20.2%.

15

Page 18: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Summary

In this work, our contributions are summarized as below:

• There is a strong correlation between application runtime and system utilization.

• We have investigated a scheduling strategy CEIL to proactively avoid job allocation under high system utilization.

This is a proof of concept study. Limitations:

• Selection of 95% as the high utilization is specific to the Theta workload.

• Not suitable for the systems which are always heavily utilized.

16

Page 19: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

Acknowledgement

17

Page 20: The Effect of System Utilization on Application Performance … · 2019. 6. 26. · PPT模板下载: p t.co m/mo b a n/ 行业PPT模板: pt.co m/h a ngye / 节日PPT模板:

P P T模板下载:w w w .1 p p t.co m /m o b a n/ 行业P P T模板:w w w .1 p pt.co m /h a ngye / 节日P P T模板:w w w .1 p p t.co m /jie ri/ P P T素材下载:w w w .1 p p t.co m /su ca i/P P T背景图片:w w w .1 p p t.co m /b e ijin g/ P P T图表下载:w w w .1 p p t.co m /tu b ia o /

优秀P P T下载:w w w .1 p p t.co m /xia za i/ P P T教程:w w w .1 p p t.co m /p o w erp o in t/ W o rd 教程:w w w .1 p p t.co m /w o rd / E xce l教程:w w w .1 p p t.co m /e xce l/ 资料下载:w w w .1 p p t.co m /zilia o / P P T课件下载:w w w .1 p p t.co m /k e jia n / 范文下载:w w w .1 p p t.co m /fa n w e n / 试卷下载:w w w .1 p p t.co m /sh iti/ 教案下载:w w w .1 p p t.co m /jia o a n / P P T论坛:w w w .1 p p t.cn

Thank you!

Questions