![Page 1: Lessons from Post-processing Climate Data on Modern Flash … · 2020. 1. 7. · Adnan Haider 1, Sheri Mickelson 2, John Dennis 2 1 Illinois Institute of Technology, USA; 2 National](https://reader036.vdocument.in/reader036/viewer/2022071001/5fbd380f996b0d2c52508daf/html5/thumbnails/1.jpg)
Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems
Adnan Haider¹, Sheri Mickelson², John Dennis²
¹ Illinois Institute of Technology, USA; ² National Center for Atmospheric Research, USA
Post-processing Climate Data Software
• Post-processing software analyzes climate data
• Important goal of post-processing software:
  • Allow scientists to do more science in less time
• Two post-processing tools:
  • PyAverager: computes averages
  • PyReshaper: converts input to a different file layout
I/O Workload Characteristics
• I/O bound!
• Varying I/O workloads
• Can flash-based HPC systems reduce execution time?
[Figure: PyReshaper, by dataset (Ice, Land, Atmosphere, Atmosphere S.E., Ocean): percentage of runtime spent doing I/O (0% to 100%) and average I/O request size in MB (axis from 0.0625 to 16).]
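The two quantities plotted on this slide can be derived from a per-request I/O trace. The sketch below is illustrative only and is not how the authors collected their numbers: `IORequest`, `io_profile`, and the convention that 1 MB equals 2^20 bytes are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class IORequest:
    """One read or write call from a (hypothetical) I/O trace."""
    nbytes: int      # bytes moved by the call
    seconds: float   # wall-clock time spent in the call


def io_profile(requests: Iterable[IORequest], runtime_seconds: float) -> Tuple[float, float]:
    """Return (percent of runtime spent in I/O, average request size in MB)."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("at least one I/O request is required")
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    io_seconds = sum(r.seconds for r in reqs)
    pct_io = 100.0 * io_seconds / runtime_seconds
    # Assumption: 1 MB is taken as 2**20 bytes.
    avg_mb = sum(r.nbytes for r in reqs) / len(reqs) / (1024 * 1024)
    return pct_io, avg_mb


if __name__ == "__main__":
    trace = [IORequest(1024 * 1024, 2.0), IORequest(3 * 1024 * 1024, 6.0)]
    pct, avg = io_profile(trace, runtime_seconds=10.0)
    print(f"{pct:.1f}% of runtime in I/O, average request {avg:.2f} MB")
```

A workload with a high I/O percentage and small requests stresses per-request latency, while one with large requests stresses bandwidth, which is why the slide reports both.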
What is Flash?
• Faster hardware which accelerates I/O
• Flash in HPC systems
Two Flash System Architectures
• Flash devices
  • Gordon: SSD
  • Wrangler: DSSD
• Storage architecture
  • Gordon: local
  • Wrangler: pooled
• Interconnect
  • Gordon: InfiniBand
  • Wrangler: PCI Express
• Yellowstone has disks
[Diagram: local flash design (Gordon) with compute nodes, a flash-based I/O node (memory and SSDs), and RDMA via InfiniBand; pooled flash design (Wrangler) with compute nodes, a PCI Express interface, and six DSSD racks; parallel file system/object store.]
PyReshaper on 1 Compute Node – Ocean (large)
[Figure: stacked bars of metadata time, read time, and write time in seconds (0 to 1400). Yellowstone: read & write HDD. Gordon: read & write HDD; read & write SSD*; read SSD, write HDD; read HDD, write SSD. Wrangler: read & write HDD; read & write DSSD; read DSSD, write HDD; read HDD, write DSSD.]
• Single SSD runs out of capacity!
• Reading from SSD increases runtime by 75%
• 3.6x reduction in execution time compared to Yellowstone
PyReshaper on 1 Compute Node – Ice (small)
[Figure: stacked bars of metadata time, read time, and write time in seconds (0 to 100). Yellowstone: read & write HDD. Gordon: read & write HDD; read & write SSD; read SSD, write HDD; read HDD, write SSD. Wrangler: read & write HDD; read & write DSSD; read DSSD, write HDD; read HDD, write DSSD.]
• SSDs decrease runtime by 47%
• Hybrid I/O decreases runtime by 6x
• 11x reduction in execution time compared to Yellowstone
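The callouts on these PyReshaper slides mix percentages ("decrease runtime by 47%", "increases runtime by 75%") with factors ("6x", "11x"). The helpers below put them on a common scale. They assume that an "Nx reduction" means old runtime divided by new runtime equals N, which is the usual reading but is not stated in the slides; the function names are hypothetical.

```python
def factor_to_percent_reduction(factor: float) -> float:
    """Percent of runtime saved by an N-fold reduction (new = old / N)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 100.0 * (1.0 - 1.0 / factor)


def percent_reduction_to_factor(percent: float) -> float:
    """N-fold reduction equivalent to saving `percent` of the runtime."""
    if not 0.0 <= percent < 100.0:
        raise ValueError("percent must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


def percent_increase_to_slowdown(percent: float) -> float:
    """Slowdown factor equivalent to a runtime increase of `percent`."""
    if percent < 0.0:
        raise ValueError("percent must be non-negative")
    return 1.0 + percent / 100.0


if __name__ == "__main__":
    for n in (3.6, 6.0, 11.0):
        print(f"{n}x reduction saves {factor_to_percent_reduction(n):.1f}% of runtime")
    print(f"47% decrease is a {percent_reduction_to_factor(47.0):.2f}x reduction")
    print(f"75% increase is a {percent_increase_to_slowdown(75.0):.2f}x slowdown")
```

Under that reading, a 3.6x reduction saves roughly 72% of the runtime, an 11x reduction saves about 91%, and a 47% decrease corresponds to a reduction of about 1.9x.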
Lesson 1
• A mismatch between storage architecture and I/O workload can hide the benefits of flash devices, increasing runtime by 4x.
[Diagram: local flash design (Gordon) with compute nodes, RDMA via InfiniBand, and a flash-based I/O node (memory and SSDs). Callout: single SSD and interconnect.]
Lesson 2
• Local flash architecture is more common
• Number of flash devices per compute node should increase
[Figure: Performance on Gordon with 16 processes. x-axis: # of compute nodes (SSDs) / # of processes per node (1/16, 2/8, 4/4, 8/2, 16/1); y-axis: seconds (0 to 400); series: Ice, Land, Atmosphere, Atmosphere S.E.; a marker indicates the optimal number of SSDs.]
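The x-axis of the Gordon experiment holds the total process count at 16 and varies how many compute nodes (and therefore SSDs) the processes are spread across. A minimal sketch of how such a sweep could be enumerated follows; `node_layouts` is a hypothetical helper, not part of the authors' tooling.

```python
from typing import List, Tuple


def node_layouts(total_processes: int) -> List[Tuple[int, int]]:
    """Return every (nodes, processes_per_node) pair whose product is total_processes."""
    if total_processes < 1:
        raise ValueError("total_processes must be positive")
    return [
        (nodes, total_processes // nodes)
        for nodes in range(1, total_processes + 1)
        if total_processes % nodes == 0
    ]


if __name__ == "__main__":
    # Prints one layout per line in the same nodes/processes-per-node form as the axis.
    for nodes, per_node in node_layouts(16):
        print(f"{nodes}/{per_node}")
```

Each layout trades processes per node against the number of SSDs in use, which is the quantity the "optimal number of SSDs" marker refers to.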
Lesson 3
• Hybrid I/O (reading from one device type and writing to another) halves flash storage consumption while decreasing runtime by 6x.
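Hybrid I/O can be pictured as pointing the input and output of a job at different storage tiers, so that only one side occupies flash. The sketch below is a generic illustration under that assumption, not PyReshaper's implementation: `hybrid_copy`, its directory arguments, and the chunk size are hypothetical, and temporary directories stand in for the HDD and flash mount points.

```python
import tempfile
from pathlib import Path


def hybrid_copy(read_root: Path, write_root: Path, chunk_bytes: int = 4 * 1024 * 1024) -> int:
    """Stream every file in read_root to write_root and return the bytes written.

    read_root and write_root are meant to live on different device types,
    for example input on HDD and output on flash (assumed mount points).
    """
    write_root.mkdir(parents=True, exist_ok=True)
    total = 0
    for src in sorted(read_root.iterdir()):
        if not src.is_file():
            continue
        dst = write_root / src.name
        with src.open("rb") as fin, dst.open("wb") as fout:
            # Read in fixed-size chunks so memory use stays bounded.
            while chunk := fin.read(chunk_bytes):
                fout.write(chunk)
                total += len(chunk)
    return total


if __name__ == "__main__":
    # Temporary directories stand in for the two storage tiers.
    with tempfile.TemporaryDirectory() as hdd, tempfile.TemporaryDirectory() as flash:
        (Path(hdd) / "sample.bin").write_bytes(b"\0" * 1_000_000)
        written = hybrid_copy(Path(hdd), Path(flash) / "out")
        print(f"wrote {written} bytes to the second tier")
```

If input and output are of similar size, keeping only one of them on flash would account for the halved flash consumption the lesson reports.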
Conclusion
• The pooled architecture performs better than the local architecture, but if the local architecture alleviates bottlenecks it can be a more feasible solution.
• Moving from Yellowstone’s HDD to Wrangler’s HDD provided up to a 3.6x reduction in execution time.
• Moving from Yellowstone’s HDD to Wrangler’s flash provided up to an 11x reduction in execution time.
• With data volumes mounting, a cost-effective I/O architecture must be considered.
Acknowledgements
• Sheri Mickelson and John Dennis
• Kevin Paul & the ASAP group
Flash Based Systems in Future
[Timeline marked 2012, 2015, and 2019, showing Gordon, Wrangler, Comet, Aurora, and Trinity.]
Evolution of Flash Systems
[Timeline with entries dated 2012, 2013, 2015, 2016, and 2019, showing Wrangler, Cori, Catalyst, Gordon-vsmp, Aurora, Comet, Gordon-std, and Trinity.]
• Local flash architecture
  • Flash devices (SSD) on remote nodes
• Pooled flash
  • Aggregates 16 flash devices at job config
• Local flash
  • 800 GB of flash on compute node via PCI Express
• Pooled flash
  • DSSD devices as flash
  • All-to-all connection
• Local flash
  • 320 GB of flash on each compute node
• Burst buffer
  • 750 TB of flash and 750 GB/s bandwidth
• Burst buffer
• Burst buffer
  • Xeon processor based burst buffer nodes
Gordon Performance Analysis
• Scalability
• Workload
[Figure: throughput of SSD / throughput of HDD (0 to 12, legend bands of 2) as a function of I/O request size (2, 4, 8, 16, 32, 64 KB) and # of processes / amount of data written in GB (1/4, 2/8, 4/16, 8/32, 16/64).]
Wrangler Performance Analysis
• Consistent
[Figure: throughput of SSD / throughput of HDD (axis 0 to 100, legend bands of 10 from 0-10 to 70-80) as a function of I/O request size (2, 4, 8, 16, 32, 64 KB) and # of processes / amount of data written in GB (1/4, 2/8, 4/16, 8/32, 16/64).]
Performance Comparison
[Figure: speedup provided by flash (0 to 5) by dataset (Ice, Land, ATM, ATM S.E.); series: Gordon, Wrangler.]
[Figure: speedup (0 to 4) by dataset (Atmosphere, Atmosphere S.E., Ocean); series: Gordon best time over Wrangler HDD time, and Wrangler HDD time over Wrangler flash time.]