Context Approach Evaluation Memory Partitioning Conclusion
I/O optimizations with Data Aggregation
Florin Isaila‡∗, Emmanuel Jeannot†, Preeti Malakar∗, François Tessier∗, Venkatram Vishwanath∗
∗Argonne National Laboratory, USA; †Inria Bordeaux Sud-Ouest, France; ‡University Carlos III, Spain
October 8, 2018
Data Movement at Scale
I JLESC project started in early 2015
I Focus shifting from flops to bytes:
  allocate data
  move data
  store data
  bring data to the right place at the right time
I Computer simulations: climate, heart or brain modelling, cosmology, etc.
Data Movement at Scale
I Large I/O needs: high resolution, high fidelity
Table: Examples of large-simulation I/O from diverse papers

Scientific domain     Simulation     Data size
Cosmology             Q Continuum    2 PB / simulation
High-Energy Physics   Higgs Boson    10 PB / year
Climate / Weather     Hurricane      240 TB / simulation
I Supercomputers grow to meet these performance needs, but with increasing gaps
Figure: Ratio of I/O (TB/s) to FLOPS (TF/s), in percent, of the #1 Top 500 system for the past 20 years (1997-2017, log scale). Computing capability has grown at a faster rate than the I/O performance of supercomputers.
Complex Architectures
I Complex network topologies tending to reduce the distance between the data and the storage
  Multidimensional tori, dragonfly, ...
I Partitioning of the architecture to avoid I/O interference
  IBM BG/Q with I/O nodes (Figure), Cray with LNET nodes
I New tiers of storage/memory for data staging
  MCDRAM on KNL, NVRAM, burst buffer nodes
Figure: IBM BG/Q (Mira) I/O architecture. Compute nodes (PowerPC A2, 16 cores, 16 GB of DDR3) connect over a 5D torus network at 2 GBps per link to bridge nodes (2 per I/O node, one Pset of 128 nodes), at 2 GBps per link to the I/O nodes (running an I/O forwarding daemon and a GPFS client), and at 4 GBps per link through a QDR InfiniBand switch to the GPFS storage.
Mira: 49,152 nodes / 786,432 cores; 768 TB of memory; 27 PB of storage at 330 GB/s (GPFS); 5D torus network; peak performance: 10 PetaFLOPS.
Two-phase I/O
I Present in MPI I/O implementations such as ROMIO
I Optimizes collective I/O performance by reducing network contention and increasing I/O bandwidth
I Chooses a subset of processes (aggregators) that gather data before writing it to the storage system

Limitations:
I Better suited to large messages (from experiments)
I No truly efficient aggregator placement policy
I Information about upcoming data movements could help
Figure: Two-phase I/O mechanism. During the aggregation phase, processes send their interleaved variables (X, Y, Z) to a subset of aggregators; during the I/O phase, the aggregators write the reorganized, contiguous data to the file.
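The mechanism in the figure can be sketched in a few lines of plain Python. This is a serial toy model, not MPI/ROMIO code: the SoA file order and the contiguous split of the file among aggregators are illustrative assumptions.

```python
# Toy model of two-phase collective I/O. Ranks hold interleaved
# variables (X, Y, Z); aggregators each gather one contiguous region
# of the final file before issuing a single large write.

def two_phase_write(rank_data, n_aggregators):
    """rank_data: one {variable: payload} dict per rank.
    Returns the file contents as (variable, rank, payload) tuples."""
    variables = ["X", "Y", "Z"]
    # Target file layout: all X blocks, then all Y, then all Z.
    file_order = [(v, r) for v in variables for r in range(len(rank_data))]
    # Aggregation phase: each aggregator collects a contiguous slice
    # of the final file from the ranks owning the corresponding data.
    per_agg = -(-len(file_order) // n_aggregators)  # ceil division
    buffers = [file_order[i * per_agg:(i + 1) * per_agg]
               for i in range(n_aggregators)]
    # I/O phase: aggregators write their buffers in file-offset order.
    out = []
    for buf in buffers:
        out.extend((v, r, rank_data[r][v]) for v, r in buf)
    return out

ranks = [{"X": f"x{r}", "Y": f"y{r}", "Z": f"z{r}"} for r in range(4)]
contents = two_phase_write(ranks, n_aggregators=2)
```

The point of the scheme shows in the shape of `contents`: the fine-grained data exchange happens over the network during aggregation, so the I/O phase only issues a few large contiguous writes.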
Outline
1 Context
2 Approach
3 Evaluation
4 Number of Aggregators
5 Conclusion and Perspectives
Approach
I Relevant aggregator placement taking into account:
  The topology of the architecture
  The data pattern
I Efficient implementation of the two-phase I/O scheme
  I/O scheduling with the help of information about upcoming reads and writes
  Pipelining of the aggregation and I/O phases to optimize data movements
  One-sided communications and non-blocking operations to reduce synchronizations
I TAPIOCA (Topology-Aware Parallel I/O: Collective Algorithm)
  MPI-based library for two-phase I/O
  Portable through an abstracted representation of the machine
  Generalizes toward data movement at scale on HPC facilities
  Results published at Cluster 2017
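The benefit of pipelining the two phases can be illustrated with a toy timing model; the function names and constant costs below are illustrative assumptions, not TAPIOCA's actual implementation (which overlaps the phases using MPI one-sided and non-blocking operations).

```python
# Toy timing model of pipelined two-phase I/O. With two aggregation
# buffers, filling round k+1 overlaps writing round k, so the total
# time is dominated by the slower phase instead of their sum.

def sequential_time(rounds, t_aggr, t_io):
    # Single buffer: each round aggregates, then writes.
    return rounds * (t_aggr + t_io)

def pipelined_time(rounds, t_aggr, t_io):
    # Two buffers: first fill, overlapped middle rounds, last write.
    return t_aggr + (rounds - 1) * max(t_aggr, t_io) + t_io

seq = sequential_time(8, t_aggr=2.0, t_io=3.0)   # 40.0
pipe = pipelined_time(8, t_aggr=2.0, t_io=3.0)   # 26.0
```

With 8 rounds at 2 s of aggregation and 3 s of I/O each, pipelining brings the total from 40 s down to 26 s in this model.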
Aggregator Placement
I Goal: find a compromise between aggregation and I/O costs
I Four tested strategies:
  Shortest path: smallest distance to the I/O node
  Longest path: longest distance to the I/O node
  Greedy: lowest rank in the partition (comparable to an MPICH strategy)
  Topology-aware

How can the topology be taken into account when mapping aggregators?
Figure: Data aggregation for I/O: simple partitioning and aggregator election on a grid. The legend shows compute nodes, the bridge node, the storage system, and the aggregator elected in each aggregation partition by each strategy (S: shortest path, L: longest path, G: greedy, T: topology-aware).
Aggregator Placement - Topology-aware strategy
I ω(u, v): Amount of data exchanged between nodes u and v
I d(u, v): Number of hops from node u to node v
I l: The interconnect latency
I Bi→j: The bandwidth from node i to node j

I C1 = max over i ∈ VC of ( l × d(i, A) + ω(i, A) / Bi→A )
I C2 = l × d(A, IO) + ω(A, IO) / BA→IO

VC: Compute nodes
IO: I/O node
A: Aggregator

Objective function:
TopoAware(A) = min (C1 + C2)

I Computed by each process independently in O(n), n = |VC|
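The objective function translates almost directly into code. A minimal sketch, assuming a toy 1D topology with uniform bandwidth; the node numbering, hop function, and traffic volumes are made up for the example and are not TAPIOCA's actual machine description.

```python
# Sketch of the topology-aware aggregator election: pick the candidate
# A minimizing C1 + C2 as defined on the slide. d, omega, latency and
# bandwidth are toy stand-ins for the real machine model.

def elect_aggregator(compute_nodes, io_node, d, omega, latency, bandwidth):
    """Return the node A in compute_nodes minimizing C1(A) + C2(A)."""
    def cost(A):
        # C1: worst cost of moving any compute node's data to A.
        c1 = max(latency * d(i, A) + omega(i, A) / bandwidth
                 for i in compute_nodes)
        # C2: cost of forwarding the aggregated data from A to I/O.
        c2 = latency * d(A, io_node) + omega(A, io_node) / bandwidth
        return c1 + c2
    return min(compute_nodes, key=cost)

# Toy 1D line: compute nodes 0..7, I/O node at 8, hops = |u - v|.
nodes = list(range(8))
hops = lambda u, v: abs(u - v)
traffic = lambda u, v: 1.0 if v == 8 else 0.5   # MB per transfer
best = elect_aggregator(nodes, 8, hops, traffic,
                        latency=1e-3, bandwidth=1.8)
```

In TAPIOCA each process evaluates the cost for itself in O(n); this serial sketch loops over all candidates only for illustration.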
Micro-benchmark - Placement strategies
I Evaluation on Mira (BG/Q), 512 nodes, 16 ranks/node
I Each rank sends an amount of data distributed randomly between 0 and 2 MB
I Write to /dev/null on the I/O node (measures the performance of the aggregation and I/O phases only)
I Aggregation settings: 16 aggregators, 16 MB buffer size
Table: Impact of the aggregator placement strategy

Strategy         I/O Bandwidth (MBps)   Aggr. time/round (ms)
Topology-aware   2638.40                310.46
Shortest path    2484.39                327.08
Longest path     2202.91                370.40
Greedy           1927.45                421.33
Micro-benchmark - Theta
I 10 PetaFLOPS Cray XC40 supercomputer
I 512 Theta nodes, 16 ranks per node
I 48 aggregators, 8 MB aggregation buffer size
Figure: TAPIOCA vs. MPI I/O bandwidth (GBps) on Theta as a function of the data size per rank (MB).

Figure: Theta (Cray XC40) architecture. Compute nodes (Intel KNL 7250: 36 tiles of 2 cores sharing an L2, 16 GB MCDRAM, 192 GB DDR4, 128 GB SSD) attach 4 per Aries router; routers form a 2D all-to-all structure (96 routers per group) with electrical links at 14 GBps (levels 1 and 2); 2-cabinet groups (9 groups, 18 cabinets, 16 x 6 routers hosted) connect all-to-all through optical links at 12.5 GBps (level 3) in a Dragonfly network; service nodes (LNET, gateway, …, irregular mapping) reach the Lustre filesystem over IB FDR at 56 GBps.
HACC-IO – MIRA (BG/Q)
I I/O part of a large-scale cosmological application simulating the mass evolution of the universe with particle-mesh techniques
I Each process manages particles defined by 9 variables (XX, YY, ZZ, VX, VY, VZ, phi, pid and mask)
Figure: Data layouts implemented in HACC. In the array-of-structures (AoS) layout, each rank's variables are written contiguously (X Y Z, X Y Z, ...); in the structure-of-arrays (SoA) layout, each variable is grouped across ranks (X X X X, Y Y Y Y, Z Z Z Z).
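The two layouts boil down to the order of one nested loop. A small sketch with the variable set reduced to X, Y, Z for brevity:

```python
# AoS vs SoA file layouts for per-rank particle variables, as in the
# HACC figure: same blocks, different concatenation order in the file.

def aos_layout(ranks, variables):
    # Array of structures: rank 0's full record, then rank 1's, ...
    return [(r, v) for r in ranks for v in variables]

def soa_layout(ranks, variables):
    # Structure of arrays: all X blocks, then all Y blocks, ...
    return [(r, v) for v in variables for r in ranks]

ranks = [0, 1, 2, 3]
variables = ["X", "Y", "Z"]
aos = aos_layout(ranks, variables)
soa = soa_layout(ranks, variables)
```

AoS keeps each rank's record contiguous in the file, while SoA groups each variable across ranks; the following slides compare collective I/O performance on both orders.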
Figure: TAPIOCA vs. MPI I/O bandwidth (GBps) on Mira, AoS and SoA layouts, as a function of the data size per rank (MB).
I BG/Q (Mira), one file per Pset
I 4096 nodes (16 ranks/node)
I TAPIOCA: 16 aggregators per Pset, 16 MB aggregator buffer size

Very poor performance from MPI I/O on AoS
Up to 2.5× faster than MPI I/O on SoA
HACC-IO – Theta (Cray XC40)
Figure: TAPIOCA vs. MPI I/O bandwidth (GBps) on Theta, AoS and SoA layouts, as a function of the data size per rank (MB).
I Theta, 2048 nodes (16 ranks/node)
I Lustre: 48 OSTs, 16 MB stripe size
I TAPIOCA: 384 aggregators, 16 MB aggregator buffer size

AoS, 3.6 MB: 4× faster than MPI I/O
Number of aggregators
Figure: Array of structures vs. structure of arrays (same data layouts as in HACC).
Trade-off:
I Many aggregators: lot of I/O rounds
I Few aggregators: I/O latency dominates
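The trade-off can be put in numbers with a toy counting model; the total volume, buffer size, and counting scheme below are illustrative and are not the model simulated on the next slide.

```python
import math

# Counting model for the aggregator trade-off: with a fixed aggregation
# buffer, more aggregators mean more, smaller writes in total, while
# fewer aggregators mean fewer, larger writes with less parallelism.

def writes_profile(total_mb, n_agg, buffer_mb=16.0):
    """Return (total number of writes, size in MB of each write)."""
    per_agg = total_mb / n_agg                       # data per aggregator
    rounds = max(1, math.ceil(per_agg / buffer_mb))  # rounds per aggregator
    return n_agg * rounds, min(per_agg, buffer_mb)

many = writes_profile(4096.0, n_agg=4096)   # 4096 writes of 1 MB
few = writes_profile(4096.0, n_agg=4)       # 256 writes of 16 MB
```

Many aggregators multiply the number of I/O operations issued to the filesystem; with few aggregators the volume per aggregator grows and the rounds serialize behind one another.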
Model Simulation
Figure: AoS and SoA aggregation (read) time vs. number of aggregators (1 to 4096, log scale on both axes).
I 4096 nodes
I 9 × 1 MB structures
I 16 ranks/node
I 56 I/O streams
Conclusion and Perspectives
Conclusion
I I/O library based on the two-phase scheme, developed to optimize data movements
  Topology-aware aggregator placement
  Optimized buffering (two pipelined buffers, one-sided communications, block-size awareness)
I Very good performance at scale, outperforming standard approaches
I On the I/O part of a cosmological application, up to 15× improvement

Perspectives
I Start modeling I/O performance