TRANSCRIPT
MVAPICH Performance on Arm at Scale
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Arm HPC User Group Talk (SC ’19)
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PFlops in 2017
• 143 PFlops in 2018
• 1 EFlops in 2020-2021?
Expected to have an ExaFlop system in 2020-2021!
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: Multi-core Processors; High-Performance Interconnects (InfiniBand, <1 usec latency, 200 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Systems pictured: K Computer, Sunway TaihuLight, Summit, Sierra.]
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: co-design stack, from applications down to hardware]
• Application Kernels/Applications (HPC and DL)
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking Technologies (InfiniBand, 40/100/200 GigE, Slingshot, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA)
• Middleware co-design opportunities and challenges across the various layers: performance, scalability, resilience
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
  – Low memory footprint
• Scalable collective communication
  – Offload
  – Non-blocking (see the sketch after this list)
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for accelerators (GPGPUs and FPGAs)
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, …)
• Virtualization
• Energy-awareness
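Several of the items above, offloaded and non-blocking collectives in particular, hinge on overlapping communication with computation. As a concrete illustration, here is a minimal, self-contained C sketch using the standard MPI-3 non-blocking collective MPI_Iallreduce; it is illustrative only and not code from the talk.

/* Minimal sketch of a non-blocking collective (MPI_Iallreduce).
 * The reduction is started, independent work can proceed, and the
 * result is consumed only after MPI_Wait completes the operation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction without blocking the caller. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation can proceed here, overlapping
     * with the collective in progress ... */

    /* Complete the collective before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}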
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jun ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the TACC Frontera System
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from the OSU site, Sep-04 through Apr-19, rising to nearly 600,000. Release milestones marked along the timeline: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, MV2-AWS 2.3.]
Architecture of MVAPICH2 Software Family (HPC and DL)
[Figure: layered architecture]
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)
MVAPICH2 Software Family
Requirements                                                    Library
MPI with IB, iWARP, Omni-Path, and RoCE                         MVAPICH2
Advanced MPI features/support, OSU INAM, PGAS, and
  MPI+PGAS with IB, Omni-Path, and RoCE                         MVAPICH2-X
MPI with IB, RoCE & GPU, and support for Deep Learning          MVAPICH2-GDR
HPC Cloud with MPI & IB                                         MVAPICH2-Virt
Energy-aware MPI with IB, iWARP, and RoCE                       MVAPICH2-EA
MPI energy monitoring tool                                      OEMT
InfiniBand network analysis and monitoring                      OSU INAM
Microbenchmarks for measuring MPI and PGAS performance          OMB
Features and Improvements in MVAPICH2-X for ARM
• Enhanced architecture and IB HCA detection for various ARM systems
• Optimization and tuning for
  – Intra-node and inter-node point-to-point operations
  – Intra-node shared-memory communication protocols
  – Collective operations for different message sizes and job/system sizes using the existing collective algorithms in MVAPICH2-X
• Optimizations to job startup performance to achieve scalable job startup when running large-scale jobs on ARM systems
• Support for the latest GCC and ARM compilers
Performance Evaluation of Optimized MVAPICH2-X
• EPCC Fulhame Cluster
  – Nodes: 16 x ARM ThunderX2
  – Processor: 2x 32-core ARM ThunderX2
  – Network: EDR 100Gbps MT4119
  – Operating System: Linux 4.12.14-23-default
  – MPI and Communication Libraries
    • MVAPICH2-X (latest)
    • HPCX-v2.4.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse15.0-aarch64
    • OpenMPI-4.0.2 w/ latest UCX
  – OSU Micro-benchmarks v5.6.2
• Mayer Cluster
  – Nodes: 14 x ARM ThunderX2
  – Processor: 2x 28-core ARM ThunderX2
  – Network: EDR 100Gbps MT4119
  – Operating System: Linux 4.14.0-115.13
  – MPI and Communication Libraries
    • MVAPICH2-X (latest)
    • OpenMPI 4.0.1
    • UCX 1.5.2
  – OSU Micro-benchmarks v5.6.2
Evaluation of Point-to-point on EPCC ARM System
• EPCC Fulhame ARM cluster with up to 16 dual-socket 32-core ThunderX2 nodes
• Comparison among MVAPICH2-X (Next), OpenMPI+UCX, and HPCX communication libraries
• OSU Micro-benchmark Suite (OMB) v5.6.2
• Measure MPI-level communication performance: latency, bandwidth, bi-directional bandwidth, and message rate (a minimal latency sketch follows this list)
• Three different configurations
  – Intra-socket
  – Inter-socket
  – Inter-node
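For context on what these microbenchmarks measure, below is a minimal C sketch of a ping-pong latency test in the spirit of OMB's osu_latency; the message size and iteration count are assumptions for illustration, and this is not the OMB source.

/* Minimal ping-pong latency sketch. Rank 0 and rank 1 exchange a
 * fixed-size message many times and report the one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8        /* message size in bytes (assumed) */
#define ITERS    10000    /* timed iterations (assumed) */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a', MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* One round trip carries two messages, hence the divide by 2. */
    if (rank == 0)
        printf("%d bytes: %.2f us one-way latency\n",
               MSG_SIZE, (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}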
Point-to-point: Latency & Bandwidth (Intra-socket)
[Charts: intra-socket latency for small (0-32 B), medium (128 B-8 KB), and large (32 KB-2 MB) messages, and bandwidth for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 70% better.]
Point-to-point: Bi-Bandwidth (Intra-socket)
[Charts: intra-socket bi-directional bandwidth and message rate for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 37% better.]
Point-to-point: Latency & Bandwidth (Inter-socket)
[Charts: inter-socket latency for small (0-32 B), medium (128 B-8 KB), and large (32 KB-2 MB) messages, and bandwidth for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 42% and 76% better.]
Point-to-point: Bi-Bandwidth (Inter-socket)
[Charts: inter-socket bi-directional bandwidth and message rate for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 24% better.]
Point-to-point: Latency & Bandwidth (Inter-Node)
[Charts: inter-node latency for small (0-32 B), medium (128 B-8 KB), and large (32 KB-2 MB) messages, and bandwidth for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX.]
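The bandwidth numbers above are typically gathered with a windowed test: the sender keeps many non-blocking sends in flight so the link stays busy. Below is a minimal C sketch in the spirit of OMB's osu_bw; the message size, window depth, and iteration count are assumptions, and this is not the OMB source.

/* Minimal windowed bandwidth sketch. Rank 0 posts a window of
 * non-blocking sends to rank 1, waits for an acknowledgement, and
 * reports the achieved MB/s. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)  /* 1 MB messages (assumed) */
#define WINDOW   64         /* sends in flight per iteration (assumed) */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank;
    MPI_Request reqs[WINDOW];
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            /* Wait for the receiver's zero-byte acknowledgement. */
            MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%.2f MB/s\n",
               (double)MSG_SIZE * WINDOW * ITERS / 1e6 / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}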
Evaluation of Collective Communication on EPCC ARM System
• Fulhame cluster with up to 16 dual-socket 32-core ThunderX2 nodes
• Comparison among MVAPICH2-X (Next), OpenMPI+UCX, and HPCX communication libraries
• OSU Micro-benchmark Suite (OMB) v5.6.2
• Measure MPI-level collective communication latency (a minimal timing sketch follows this list)
• Evaluate single-socket (half-subscription) and dual-socket (full-subscription) scenarios at varying scales
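As a reference for how such collective latencies are typically measured, here is a minimal C sketch in the spirit of OMB's osu_allreduce; the buffer size and iteration count are assumptions, and this is not the OMB source.

/* Minimal collective-latency sketch. All ranks time a loop of
 * MPI_Allreduce calls on a fixed-size buffer; rank 0 reports the
 * average per-call latency across ranks. */
#include <mpi.h>
#include <stdio.h>

#define COUNT 1024   /* doubles per reduction (assumed) */
#define ITERS 1000   /* timed iterations (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    double in[COUNT], out[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = 0; i < COUNT; i++)
        in[i] = 1.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    /* Average the per-rank latency over all ranks. */
    double local = (t1 - t0) * 1e6 / ITERS, sum;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Allreduce: %.2f us average latency\n", sum / size);

    MPI_Finalize();
    return 0;
}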
Collectives: Single Node (64-ppn)
[Charts: single-node latency (64 ppn) of Allreduce, Bcast, Reduce, and Scatter for message sizes up to 1 MB; MVAPICH2-X-Next vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 3x, 2x, and 2.2x better.]
Collectives: 4 & 16 Nodes (32-ppn)
[Charts: Allreduce and Bcast latency on 4 nodes and 16 nodes (32 ppn) for message sizes up to 1 MB; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 5.7x, 7.6x, 3.7x, and 9.5x better.]
Collectives: 16 Nodes (64-ppn)
[Charts: 16-node latency (64 ppn) of Allreduce, Bcast, Reduce, and Scatter for message sizes up to 1 MB; MVAPICH2-X-Next vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 10x, 5x, 7.5x, and 8x better.]
MPI Job Startup Evaluation on Different ARM Clusters
[Charts: startup time (ms) vs. number of processes on EPCC Fulhame (64-2048 processes, 64 ppn) and Mayer (56-448 processes); MVAPICH2-X, OpenMPI+UCX, and MVAPICH2-X-Next. Up to 1.6x better on Fulhame and 6.4x better on Mayer.]
• Up to 1.6x speedup over OpenMPI w/ UCX on the Fulhame system
• Up to 6.4x speedup over OpenMPI w/ UCX on the Mayer system (a startup-timing sketch follows)
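Startup time here is the wall-clock time for all ranks to launch and complete MPI initialization. Below is a minimal C sketch of one way to measure it; this is an illustration under stated assumptions, not the harness used for the results above.

/* Minimal job-startup timing sketch. Wall-clock time is taken from
 * program entry until every rank has initialized and synchronized.
 * CLOCK_MONOTONIC is used because MPI_Wtime is unavailable before
 * MPI_Init. Launch with the desired process count, e.g. 2048. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    double t0 = now_sec();

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks are up and connected */

    double elapsed = now_sec() - t0, max_elapsed;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Report the slowest rank's startup time. */
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("startup time: %.1f ms\n", max_elapsed * 1e3);

    MPI_Finalize();
    return 0;
}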
Evaluation of Application Kernels
• Evaluation of the NAS Parallel Benchmarks, MiniAMR, and CloverLeaf kernels
• Comparison among MVAPICH2-X (Next), OpenMPI+UCX, and HPCX communication libraries
• Measure application communication performance at varying scales, with full-subscription scenarios on up to 1,024 processes
• Significant performance improvement is observed when using MVAPICH2-X
Application Evaluation – NAS Parallel Benchmarks
[Charts: NPB-CG execution time (s) on 32-512 processes and NPB-FT execution time (s) on 32-256 processes (32 ppn); MVAPICH2-X-Next vs. HPCX. Up to 30% (CG) and 29% (FT) better.]
• NPB-3.4 Class D, comparing MVAPICH2-X (upcoming) and HPCX on EPCC Fulhame
• Up to 30% and 29% improvement over HPCX for the CG and FT kernels, respectively
Application Evaluation – MiniAMR
[Chart: MiniAMR execution time (s) on 32-256 processes (32 ppn); MVAPICH2-X-Next vs. HPCX. Up to 23% better.]
Input parameters: --percent_sum 0 --num_vars 10 --stencil 21 --report_diffusion 0 --report_perf 2 --num_tsteps 100 --num_spikes 1
• MiniAMR kernel comparing MVAPICH2-X (upcoming) and HPCX on EPCC Fulhame
• Up to 23% improvement over HPCX is observed
Conclusions
• ARM has emerged as a new platform for HPC systems
• It requires high-performance middleware designs that exploit modern interconnects (InfiniBand)
• Presented the approaches taken by the MVAPICH2 project to provide high-performance MPI support
• Will continue to optimize and tune the MVAPICH2 stack for higher performance and scalability on ARM platforms
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Multiple Events at SC ‘19
• Presentations at the OSU and X-Scale booth (#2094)
  – Members of the MVAPICH, HiBD, and HiDL teams
  – External speakers
• Presentations in the SC main program (Tutorials, Workshops, BoFs, Posters, and Doctoral Showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/
Funding Acknowledgments
[Logos: funding support by sponsoring agencies; equipment support by vendors]
Personnel Acknowledgments
Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), C.-H. Chu (Ph.D.), J. Hashmi (Ph.D.), A. Jain (Ph.D.), K. S. Kandadi (M.S.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), Kamal Raj (M.S.), A. Quentin (Ph.D.), B. Ramesh (M.S.), S. Xu (M.S.), Q. Zhou (Ph.D.)
Current Students (Undergraduate): V. Gangal (B.S.), N. Sarkauskas (B.S.)
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), R. Biswas (M.S.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), S. Chakraborthy (Ph.D.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), D. Shankar (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientist: H. Subramoni
Past Research Scientists: K. Hamidouche, S. Sur, X. Lu
Current Post-docs: M. S. Ghazimeersaeed, K. Manian, A. Ruhela
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Current Research Specialist: J. Smith
Past Research Specialist: M. Arnold
Past Programmers: D. Bureddy, J. Perkins
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/