Lessons from Post-processing Climate Data on Modern Flash-based HPC
Systems
Adnan Haider1, Sheri Mickelson2, John Dennis2
1Illinois Institute of Technology, USA; 2National Center for Atmospheric Research, USA
Post-processing Climate Data Software
• Post-processing software analyzes climate data
• Important goal of post-processing software:
  • Allow scientists to do more science in less time
• Two post-processing software packages:
  • PyAverager: Computes averages
  • PyReshaper: Converts input to a different file layout (see the sketch below)
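As a rough illustration of what PyReshaper does, here is a minimal sketch (this is not PyReshaper's actual API) of converting time-slice files into time-series files with the netCDF4 library; the file pattern, variable names, and the "time" dimension name are hypothetical assumptions.

```python
# Minimal sketch of the time-slice -> time-series conversion idea.
# NOT PyReshaper's API; file pattern, variable names, and the "time"
# dimension name are hypothetical assumptions.
import glob
from netCDF4 import Dataset

slice_files = sorted(glob.glob("case.ocn.h.*.nc"))   # one file per time step
variables = ["TEMP", "SALT"]                          # hypothetical variables

for var in variables:
    with Dataset(f"{var}.timeseries.nc", "w") as out:
        with Dataset(slice_files[0]) as first:
            # copy dimensions; make "time" unlimited so slices can be appended
            for name, dim in first.dimensions.items():
                out.createDimension(name, None if name == "time" else len(dim))
            vout = out.createVariable(var, first[var].dtype, first[var].dimensions)
        # write one time slice per input file into the per-variable output
        for t, path in enumerate(slice_files):
            with Dataset(path) as src:
                vout[t, ...] = src[var][0, ...]   # assumes 1 time step per slice file
```

PyAverager reads the same kind of input but reduces it to averages instead of concatenating it.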
I/O Workload Characteristics
• I/O bound!
• Varying I/O workloads
• Can flash-based HPC systems reduce execution time?
[Figure: PyReshaper I/O workload characteristics: percentage of runtime spent doing I/O and average I/O request size (MB) for the Ice, Land, Atmosphere, Atmosphere S.E., and Ocean datasets.]
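A hedged sketch of how the two quantities in the chart could be measured, by wrapping each read/write call with a timer and a byte counter; the class and method names below are hypothetical, not the instrumentation the authors used.

```python
# Hypothetical instrumentation sketch: estimate % of runtime spent in I/O
# and the average I/O request size by timing each read/write call.
import time

class IOStats:
    def __init__(self):
        self.io_time = 0.0
        self.total_bytes = 0
        self.requests = 0

    def record(self, func, nbytes, *args, **kwargs):
        """Run one I/O call that moves `nbytes` bytes and accumulate its cost."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        self.io_time += time.perf_counter() - start
        self.total_bytes += nbytes
        self.requests += 1
        return result

    def report(self, total_runtime):
        pct = 100.0 * self.io_time / total_runtime
        avg_mb = (self.total_bytes / self.requests) / 2**20 if self.requests else 0.0
        print(f"I/O: {pct:.0f}% of runtime, average request size {avg_mb:.3f} MB")
```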
What is Flash?
• Faster storage hardware that accelerates I/O
• Flash in HPC systems
Two Flash System Architectures
• Flash Devices
  • Gordon: SSD
  • Wrangler: DSSD
• Storage Architecture
  • Gordon: Local
  • Wrangler: Pooled
• Interconnect
  • Gordon: Infiniband
  • Wrangler: PCI Express
• Yellowstone has disks

[Figure: Local flash design (Gordon): compute nodes reach SSDs on a flash-based I/O node (with memory) over RDMA via Infiniband, backed by a parallel file system/object store. Pooled flash design (Wrangler): compute nodes connect to racks of DSSD devices over a PCI Express interface.]
PyReshaper on 1 Compute Node – Ocean (large)

[Figure: Execution time in seconds (0–1400), split into metadata, read, and write time, for Yellowstone (read & write HDD), Gordon (read & write HDD; read & write SSD*; read SSD, write HDD; read HDD, write SSD), and Wrangler (read & write HDD; read & write DSSD; read DSSD, write HDD; read HDD, write DSSD).]

• A single SSD runs out of capacity!
• Reading from SSD increases runtime by 75%
• 3.6x reduction in execution time compared to Yellowstone
PyReshaper on 1 Compute Node – Ice (small)

[Figure: Execution time in seconds (0–100), split into metadata, read, and write time, for Yellowstone (read & write HDD), Gordon, and Wrangler under the same read/write device combinations as above.]

• SSDs decrease runtime by 47%
• Hybrid I/O decreases runtime by 6x
• 11x reduction in execution time compared to Yellowstone
Lesson 1
• Incorrect matching between storage architecture and I/O workload can hide the benefits of flash devices, increasing runtime by 4x.

[Figure: Local flash design (Gordon): compute nodes reach a single SSD on a flash-based I/O node over RDMA via Infiniband, highlighting the single SSD and the interconnect.]
Lesson 2
• The local flash architecture is more common
• The number of flash devices per compute node should increase (see the toy model below)

[Figure: Performance on Gordon with 16 total processes: runtime in seconds (0–400) vs. # of compute nodes (SSDs) / # of processes per node (1/16, 2/8, 4/4, 8/2, 16/1) for the Ice, Land, Atmosphere, and Atmosphere S.E. datasets, with the optimal number of SSDs marked.]
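A toy back-of-the-envelope model (all bandwidth numbers below are assumptions, not measured values) of why adding flash devices per compute node helps: with 16 total processes, a single shared SSD quickly becomes the bottleneck.

```python
# Toy model with assumed numbers, not measurements: 16 total processes
# spread over N nodes, one SSD per node. Aggregate throughput is limited
# either by the SSDs or by what the processes can drive.
def aggregate_gbs(n_nodes, procs_per_node, ssd_gbs=0.5, per_proc_gbs=0.3):
    return min(n_nodes * ssd_gbs, n_nodes * procs_per_node * per_proc_gbs)

for n_nodes, ppn in [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]:
    print(f"{n_nodes} SSDs x {ppn} procs/node -> {aggregate_gbs(n_nodes, ppn):.1f} GB/s")
```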
Lesson 3
• Hybrid I/O (reading from and writing to different device types) decreases flash storage consumption by half while decreasing runtime by 6x (see the sketch below).
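A minimal sketch of the hybrid I/O idea under assumed mount points (the paths below are hypothetical, not the actual file systems used): keep only one side of the transfer on flash, then stage results back to the parallel file system.

```python
# Sketch of hybrid I/O with hypothetical paths: read input from the
# HDD-backed parallel file system, write output to node-local flash,
# then stage the results back. Only the output ever occupies flash,
# roughly halving flash consumption versus keeping input and output there.
import os
import shutil

INPUT_DIR = "/pfs/scratch/case.ocn"       # hypothetical HDD-backed PFS path
FLASH_DIR = "/flash/tmp/timeseries"       # hypothetical node-local flash path
FINAL_DIR = "/pfs/scratch/timeseries"     # hypothetical final destination

os.makedirs(FLASH_DIR, exist_ok=True)
os.makedirs(FINAL_DIR, exist_ok=True)

# ... run the post-processing tool here, reading from INPUT_DIR and
# ... writing to FLASH_DIR so flash absorbs the write traffic ...

for name in os.listdir(FLASH_DIR):                        # stage results back
    shutil.copy2(os.path.join(FLASH_DIR, name), FINAL_DIR)
```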
Conclusion
• The pooled architecture performs better than the local architecture, but if the local architecture alleviates its bottlenecks it can be a more feasible solution.
• Moving from Yellowstone's HDD to Wrangler's HDD provided up to a 3.6x reduction in execution time.
• Moving from Yellowstone's HDD to Wrangler's flash provided up to an 11x reduction in execution time.
• With data volumes growing, consideration must be given to a cost-effective I/O architecture.
Acknowledgements
• Sheri Mickelson and John Dennis
• Kevin Paul & the ASAP group
Flash-Based Systems in the Future

[Timeline: Gordon (2012), Wrangler and Comet (2015), Trinity, Aurora (2019).]
Evolution of Flash Systems
• Gordon-std (2012): Local flash architecture; flash devices (SSD) on remote nodes
• Gordon-vsmp (2012): Pooled flash; aggregates 16 flash devices at job configuration
• Catalyst (2013): Local flash; 800 GB of flash on each compute node via PCI Express
• Wrangler (2015): Pooled flash; DSSD devices as flash; all-to-all connection
• Comet (2015): Local flash; 320 GB of flash on each compute node
• Cori (2016): Burst buffer; 750 TB of flash and 750 GB/s bandwidth
• Trinity (2016): Burst buffer
• Aurora (2019): Burst buffer; Xeon processor-based burst buffer nodes
Gordon Performance Analysis
[Figure: Throughput of SSD / throughput of HDD (0–12) as a function of I/O request size (2–64 KB) and # of processes / amount of data written (1/4 to 16/64 GB).]

• Scalability
• Workload
Wrangler Performance Analysis
[Figure: Throughput of SSD / throughput of HDD (0–100) as a function of I/O request size (2–64 KB) and # of processes / amount of data written (1/4 to 16/64 GB).]

• Consistent
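A sketch of the kind of write microbenchmark that could produce throughput ratios like those in the two analyses above; the paths and sizes are hypothetical assumptions. It writes the same amount of data with varying request sizes to a flash-backed path and an HDD-backed path and compares throughput.

```python
# Hypothetical microbenchmark sketch: compare write throughput of a
# flash-backed path and an HDD-backed path across request sizes.
import os
import time

def write_throughput(path, total_bytes=256 * 2**20, request_size=64 * 1024):
    buf = os.urandom(request_size)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_bytes // request_size):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())          # count the time to reach the device
    return total_bytes / (time.perf_counter() - start)

for kb in (2, 4, 8, 16, 32, 64):
    ssd = write_throughput("/flash/bench.dat", request_size=kb * 1024)  # hypothetical path
    hdd = write_throughput("/hdd/bench.dat", request_size=kb * 1024)    # hypothetical path
    print(f"{kb:>2} KB requests: flash/HDD throughput ratio = {ssd / hdd:.1f}")
```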
Performance Comparison
[Figure: Speedup provided by flash (0–5) for the Ice, Land, ATM, and ATM S.E. datasets on Gordon and Wrangler.]
[Figure: Speedup (0–4) for the Atmosphere, Atmosphere S.E., and Ocean datasets: Gordon best time over Wrangler HDD time, and Wrangler HDD time over Wrangler flash time.]