lessons from post-processing climate data on modern flash … · 2020. 1. 7. · adnan haider 1,...

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Adnan Haider 1 , Sheri Mickelson 2 , John Dennis 2 1 Illinois Institute of Technology, USA; 2 National Center of Atmospheric Research, USA


Post on 17-Aug-2020


Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems

Adnan Haider (1), Sheri Mickelson (2), John Dennis (2)

(1) Illinois Institute of Technology, USA; (2) National Center of Atmospheric Research, USA


Post-processing Climate Data Software

• Post-processing software analyzes climate data
• Important goal of post-processing software:
  • Allow scientists to do more science in less time
• Two post-processing tools:
  • PyAverager: computes averages
  • PyReshaper: converts input to a different file layout


I/O Workload Characteristics

• I/O bound!
• Varying I/O workloads
• Can flash-based HPC systems reduce execution time?

[Figure: PyReshaper. Percentage of runtime spent doing I/O (0% to 100%) and average I/O request size in MB (0.0625 to 16) for each dataset: Ice, Land, Atmosphere, Atmosphere S.E., Ocean.]
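The two quantities plotted for each dataset (share of runtime spent in I/O, and average I/O request size) can be derived from simple per-run counters. The following is a minimal sketch only: the `IOProfile` class and its fields are hypothetical names chosen for illustration, not part of PyReshaper, and the sample values are made up rather than measured.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # bytes per MB; a binary megabyte is assumed here


@dataclass
class IOProfile:
    """Hypothetical per-run counters (illustrative names, not a PyReshaper API)."""
    total_seconds: float   # wall-clock runtime of the whole job
    io_seconds: float      # time spent inside read/write/metadata calls
    bytes_moved: int       # total bytes read plus written
    num_requests: int      # number of I/O requests issued

    def io_time_percent(self) -> float:
        """Percentage of runtime spent doing I/O."""
        return 100.0 * self.io_seconds / self.total_seconds

    def avg_request_mb(self) -> float:
        """Average I/O request size in MB."""
        return self.bytes_moved / self.num_requests / MB


# Made-up numbers, only to show the arithmetic.
profile = IOProfile(total_seconds=200.0, io_seconds=150.0,
                    bytes_moved=64 * MB, num_requests=256)
print(profile.io_time_percent())  # 75.0
print(profile.avg_request_mb())   # 0.25
```

A run is usually called I/O bound when the first number dominates the runtime; the second number indicates whether the workload issues many small requests or fewer large ones.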


What is Flash?

• Faster hardware which accelerates I/O

• Flash in HPC systems


Two Flash System Architectures

• Flash Devices
  • Gordon: SSD
  • Wrangler: DSSD
• Storage Architecture
  • Gordon: Local
  • Wrangler: Pooled
• Interconnect
  • Gordon: InfiniBand
  • Wrangler: PCI Express
• Yellowstone has disks

[Figure: Local flash design (Gordon) and pooled flash design (Wrangler). Labels: compute nodes, flash-based I/O node (Mem, SSD), RDMA via InfiniBand, PCI Express interface, DSSD racks, parallel file system/object store. Successive builds of the slide highlight the difference in flash device, in storage architecture, and in interconnect.]


PyReshaper on 1 Compute Node – Ocean(large)

[Figure: metadata time, read time, and write time in seconds (axis 0 to 1400) for each configuration.
Wrangler: Read & Write HDD; Read & Write DSSD; Read DSSD, Write HDD; Read HDD, Write DSSD.
Gordon: Read & Write HDD; Read & Write SSD*; Read SSD, Write HDD; Read HDD, Write SSD.
Yellowstone: Read & Write HDD.]

• Single SSD runs out of capacity!
• Reading from SSD increases runtime by 75%
• 3.6x reduction in execution time compared to Yellowstone


PyReshaper on 1 Compute Node – Ice(small)

[Figure: metadata time, read time, and write time in seconds (axis 0 to 100) for each configuration.
Wrangler: Read & Write HDD; Read & Write DSSD; Read DSSD, Write HDD; Read HDD, Write DSSD.
Gordon: Read & Write HDD; Read & Write SSD; Read SSD, Write HDD; Read HDD, Write SSD.
Yellowstone: Read & Write HDD.]

• SSDs decrease runtime by 47%
• Hybrid I/O decreases runtime by 6x
• 11x reduction in execution time compared to Yellowstone
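The results are quoted both as percentages ("decrease runtime by 47%") and as factors ("6x", "11x"). As a rough aid for comparing them, the sketch below converts between the two forms. It assumes that an "Nx reduction" means old runtime divided by new runtime, which the slides do not state explicitly; the function names are illustrative.

```python
def speedup_from_percent_decrease(percent: float) -> float:
    """Runtime drops by `percent` percent; return old_time / new_time."""
    return 1.0 / (1.0 - percent / 100.0)


def percent_decrease_from_speedup(factor: float) -> float:
    """Runtime improves by `factor` (old/new); return the percent decrease."""
    return 100.0 * (1.0 - 1.0 / factor)


# Figures quoted on the slides, converted under the assumption above.
print(round(speedup_from_percent_decrease(47.0), 2))   # 1.89
print(round(percent_decrease_from_speedup(6.0), 1))    # 83.3
print(round(percent_decrease_from_speedup(11.0), 1))   # 90.9
print(round(percent_decrease_from_speedup(3.6), 1))    # 72.2
```

Under that reading, a 47% decrease corresponds to a little under a 2x speedup, while a 6x factor corresponds to removing roughly five sixths of the runtime.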


Lesson 1

• A mismatch between storage architecture and I/O workload can hide the benefits of flash devices by increasing runtime by 4x.

[Figure: Local flash design (Gordon). Labels: compute nodes, flash-based I/O node (Mem, SSD), RDMA via InfiniBand. Highlighted: single SSD and interconnect.]


Lesson 2

• Local flash architecture is more common
• Number of flash devices per compute node should increase

[Figure: Performance on Gordon with 16 processes. Runtime in seconds (axis 0 to 400) versus # of compute nodes (SSDs) / # of processes per node (1/16, 2/8, 4/4, 8/2, 16/1) for the Ice, Land, Atmosphere, and Atmosphere S.E. datasets, with the optimal number of SSDs marked.]
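The Gordon experiment keeps the total at 16 processes and varies how they are spread over compute nodes (and therefore over SSDs): 1/16, 2/8, 4/4, 8/2, 16/1. A minimal sketch of how such a sweep could be enumerated and its best point selected follows. The helper names are hypothetical, and the timings in the example are placeholder values for illustration only, not measurements from the slides.

```python
def node_process_splits(total_processes: int) -> list[tuple[int, int]]:
    """All (nodes, processes_per_node) pairs whose product is total_processes."""
    return [(nodes, total_processes // nodes)
            for nodes in range(1, total_processes + 1)
            if total_processes % nodes == 0]


def best_split(timings: dict[tuple[int, int], float]) -> tuple[int, int]:
    """Return the split with the lowest runtime in seconds."""
    return min(timings, key=timings.get)


splits = node_process_splits(16)
print(splits)  # [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]

# Placeholder runtimes (seconds), NOT data from the talk.
dummy_timings = dict(zip(splits, [300.0, 180.0, 120.0, 140.0, 160.0]))
print(best_split(dummy_timings))  # (4, 4)
```

In a real sweep each split would be timed per dataset, since the slide marks a separate optimum for each one.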


Lesson 3

• Hybrid I/O (reading from and writing to different device types) decreases flash storage consumption by half while decreasing runtime by 6x.
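The "half" in this lesson follows from simple bookkeeping if the output is about the same size as the input, which is an assumption made here for illustration (the slides do not give dataset sizes). A minimal sketch, with a hypothetical helper name:

```python
GB = 1024 ** 3  # bytes per GB (binary), assumed for the example


def flash_bytes_used(input_bytes: int, output_bytes: int,
                     read_from_flash: bool, write_to_flash: bool) -> int:
    """Flash capacity consumed when input and/or output live on flash."""
    used = 0
    if read_from_flash:
        used += input_bytes   # input files staged on flash
    if write_to_flash:
        used += output_bytes  # output files written to flash
    return used


# Illustrative sizes only: input and output assumed equal.
size = 100 * GB
all_flash = flash_bytes_used(size, size, read_from_flash=True, write_to_flash=True)
hybrid = flash_bytes_used(size, size, read_from_flash=False, write_to_flash=True)
print(hybrid / all_flash)  # 0.5
```

Either hybrid direction (read from flash and write to HDD, or the reverse) gives the same footprint under this equal-size assumption; which one runs faster depends on the workload.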


Conclusion

• Pooled architecture performs better than local architecture, but if the local architecture alleviates bottlenecks it can be a more feasible solution.
• Moving from Yellowstone’s HDD to Wrangler’s HDD provided up to a 3.6x reduction in execution time.
• Moving from Yellowstone’s HDD to Wrangler’s flash provided up to an 11x reduction in execution time.
• As data volumes keep growing, a cost-effective I/O architecture must be considered.


Acknowledgements

• Sheri Mickelson and John Dennis
• Kevin Paul & the ASAP group


Flash Based Systems in Future

[Figure: timeline of flash-based systems with year marks 2012, 2015, and 2019. Systems shown: Gordon, Wrangler, Comet, Trinity, Aurora.]


Evolution of Flash Systems

[Figure: timeline with year marks 2012, 2012, 2013, 2015, 2015, 2016, 2016, and 2019. Systems shown: Gordon-std, Gordon-vsmp, Catalyst, Wrangler, Comet, Cori, Trinity, Aurora. Entries, in slide order:]

• Local Flash Architecture: flash devices (SSD) on remote nodes
• Pooled Flash: aggregates 16 flash devices at job config
• Local Flash: 800 GB of flash on compute node via PCI Express
• Pooled Flash: DSSD devices as flash; all-to-all connection
• Local Flash: 320 GB of flash on each compute node
• Burst Buffer: 750 TB of flash and 750 GB/s bandwidth
• Burst Buffer
• Burst Buffer: Xeon processor based burst buffer nodes


Gordon Performance Analysis

[Figure: throughput of SSD / throughput of HDD (axis 0 to 12, binned 0-2 through 10-12) as a function of I/O request size in KB (2, 4, 8, 16, 32, 64) and # of processes / amount of data written in GB (1/4, 2/8, 4/16, 8/32, 16/64).]

• Scalability
• Workload


Wrangler Performance Analysis

[Figure: throughput of SSD / throughput of HDD (axis 0 to 100, binned 0-10 through 70-80) as a function of I/O request size in KB (2, 4, 8, 16, 32, 64) and # of processes / amount of data written in GB (1/4, 2/8, 4/16, 8/32, 16/64).]

• Consistent


Performance Comparison

[Figure: speedup provided by flash (axis 0 to 5) on Gordon and Wrangler for each dataset: Ice, Land, ATM, ATM S.E.]


[Figure: speedup (axis 0 to 4) for the Atmosphere, Atmosphere S.E., and Ocean datasets. Series: Gordon best time over Wrangler HDD time; Wrangler HDD time over Wrangler flash time.]