Hans Weeks and Patrick Bridges
Department of Computer Science
University of New Mexico
Matthew Dosanjh and Ryan Grant
Sandia National Laboratories
Problem
} We want to write performant OpenSHMEM programs
◦ How many processing elements?
◦ How many threads?
◦ Optimal message size?
◦ What synchronization methods?
} Very few benchmarks for OpenSHMEM
} None that address multi-threading
Solution
} We need a benchmark suite with:
◦ Synchronization methods
◦ Multi-threading
◦ Variable message sizes
} All these elements are present in RMA-MT
◦ Adapt the RMA-MT suite to OpenSHMEM!
RMA-MT Benchmark Suite
} Threaded, one-sided MPI benchmarks
} Based on:
◦ Thakur and Gropp's multi-threaded benchmarks
◦ Sandia Micro Benchmarks
◦ Mantevo mini-apps
} 4 synchronization methods
RMA-MT Benchmark Suite
} Microbenchmarks (sources: Thakur and Gropp, Sandia Micro Benchmarks):
◦ Latency
◦ Bandwidth
◦ Message rate: single direction, halo exchange
} Mini-apps (source: Mantevo Mini Apps):
◦ HPCCG
◦ MiniFE
◦ MiniMD
Converting benchmarks to OpenSHMEM
} RMA operations:
◦ MPI: MPI_Get(), MPI_Put()
◦ OpenSHMEM: shmem_getmem(), shmem_putmem()
} Initialization and allocation:
◦ MPI: MPI_Init_thread(), MPI_Win_create()
◦ OpenSHMEM: shmem_init_thread(), shmem_malloc()
} Synchronizations:
◦ MPI: Fence, Lock, Lock-all, Post/Start/Complete/Wait
◦ OpenSHMEM: Barrier, Quiet
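The OpenSHMEM side of this mapping can be sketched as a small ring of one-sided puts. This is an illustration only, not code from the RMA-MT suite: the function name `ring_put_demo` is hypothetical, the call signatures are assumed to follow the OpenSHMEM 1.4 specification (the Cray-shmem release used in the talk may differ), and the `#else` branch supplies single-PE stand-ins that are not part of OpenSHMEM so the sketch compiles without the library.

```c
/* Sketch only: a ring of one-sided puts using the OpenSHMEM calls listed above.
 * Build with an OpenSHMEM compiler wrapper plus -DUSE_OPENSHMEM to use the
 * real library. Without that flag, the single-PE stand-ins below are used;
 * they are hypothetical helpers for illustration, not part of OpenSHMEM. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifdef USE_OPENSHMEM
#include <shmem.h>
#else
#define SHMEM_THREAD_MULTIPLE 3 /* placeholder value for the stand-in build */
static int shmem_init_thread(int requested, int *provided)
{ *provided = requested; return 0; }
static void shmem_finalize(void) {}
static int shmem_my_pe(void) { return 0; }
static int shmem_n_pes(void) { return 1; }
static void *shmem_malloc(size_t size) { return malloc(size); }
static void shmem_free(void *ptr) { free(ptr); }
static void shmem_putmem(void *dest, const void *src, size_t n, int pe)
{ (void)pe; memcpy(dest, src, n); }
static void shmem_quiet(void) {}
static void shmem_barrier_all(void) {}
#endif

/* Every PE puts nbytes into its right neighbour's symmetric buffer, then
 * checks what its left neighbour wrote. Returns 0 on success.
 * Call at most once per program: it initializes and finalizes the library. */
int ring_put_demo(size_t nbytes)
{
    int provided = 0;
    assert(nbytes > 0);
    if (shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided) != 0)
        return -1;

    int me = shmem_my_pe();
    int npes = shmem_n_pes();
    int right = (me + 1) % npes;
    int left = (me + npes - 1) % npes;

    /* Symmetric (remotely accessible) target; plays the role of the MPI window. */
    unsigned char *recv_buf = shmem_malloc(nbytes);
    /* Private source buffer; a put may read from ordinary local memory. */
    unsigned char *send_buf = malloc(nbytes);
    assert(recv_buf != NULL && send_buf != NULL);

    memset(recv_buf, 0, nbytes);
    memset(send_buf, me + 1, nbytes);  /* tag the payload with the sender */

    shmem_barrier_all();  /* every receive buffer is zeroed before any put */
    shmem_putmem(recv_buf, send_buf, nbytes, right);
    shmem_quiet();        /* wait for this PE's outstanding puts to complete */
    shmem_barrier_all();  /* every PE has completed its put */

    int status = 0;
    unsigned char expected = (unsigned char)(left + 1);
    for (size_t i = 0; i < nbytes; i++)
        if (recv_buf[i] != expected)
            status = 1;

    free(send_buf);
    shmem_free(recv_buf);
    shmem_finalize();
    return status;
}
```

The second barrier alone would also complete the put; `shmem_quiet()` is kept to show where the Quiet synchronization from the mapping would sit when no barrier follows.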
Converting benchmarks to OpenSHMEM
} Status
◦ Bandwidth and latency converted to passive target
◦ Message rate converted to active target
◦ 2 mini-apps converted to active target, with MPI collectives remaining in non-critical paths
◦ Coarse granularity of threading
} http://www.cs.sandia.gov/smb/rma-mt.html
Experimental setup
} Cray XE30m cluster
◦ Sandia National Labs Volta cluster
◦ Per node: 2x Xeon Ivy Bridge 2.4 GHz 12-core processors with hyper-threading enabled
◦ 32 GB RAM
◦ Cray Aries network interface
◦ Cray-shmem 7.3.2
} 10 runs with 10,000 iterations per run
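The "10 runs with 10,000 iterations per run" structure can be sketched as a small timing harness. The slides do not say how the runs were aggregated, so averaging the per-run means is an assumption here, as are all names (`time_one_run`, `time_all_runs`, `count_call`). The sketch uses C11 `timespec_get`, which is wall-clock but not monotonic; a real benchmark would likely prefer a monotonic timer.

```c
/* Sketch only: timing harness for 10 runs of 10,000 iterations each.
 * Names and the aggregation choice are assumptions, not from RMA-MT. */
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define NUM_RUNS 10
#define ITERS_PER_RUN 10000

/* Wall-clock time in seconds (C11 timespec_get; not monotonic). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time `iters` back-to-back calls of op(arg); return mean seconds per call. */
double time_one_run(void (*op)(void *), void *arg, int iters)
{
    assert(op != NULL && iters > 0);
    double start = now_seconds();
    for (int i = 0; i < iters; i++)
        op(arg);
    return (now_seconds() - start) / iters;
}

/* Repeat the run NUM_RUNS times. Stores each run's mean in per_run and
 * returns the average of those means. */
double time_all_runs(void (*op)(void *), void *arg, double per_run[NUM_RUNS])
{
    double sum = 0.0;
    for (int r = 0; r < NUM_RUNS; r++) {
        per_run[r] = time_one_run(op, arg, ITERS_PER_RUN);
        sum += per_run[r];
    }
    return sum / NUM_RUNS;
}

/* Hypothetical placeholder for the measured operation: counts its calls. */
void count_call(void *arg)
{
    ++*(long *)arg;
}
```

In a benchmark, `count_call` would be replaced by the operation under test, for example one put followed by the chosen synchronization.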
Micro-benchmarks
Latency
[Figure: latency (s, axis 0 to 0.0006) versus message size (1 B to 1 MiB, doubling) for 1, 4, and 16 threads]
Latency: Sub-optimal Protocol Switch
[Figure: latency (s, axis 0 to 0.0006) versus message size (1 B to 1 MiB, doubling) for 1, 4, and 16 threads]
Bandwidth
[Figure: bandwidth (MiB/s, axis 0 to 8000) versus message size (1 B to 1 MiB, doubling) for 1, 4, and 16 threads]
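The message-size axis and the MiB/s unit of the bandwidth plot can be sketched in a few lines of C. The slides do not show how bandwidth was computed; dividing the bytes moved by the elapsed time is the conventional definition and is assumed here. The function names are hypothetical.

```c
/* Sketch only: message-size sweep and bandwidth conversion.
 * Names and the bandwidth formula are assumptions, not from RMA-MT. */
#include <assert.h>
#include <stddef.h>

#define MIN_MSG_BYTES ((size_t)1)
#define MAX_MSG_BYTES ((size_t)1 << 20) /* 1 MiB */

/* Fill `sizes` with 1 B, 2 B, 4 B, ..., 1 MiB and return how many were written. */
size_t fill_message_sizes(size_t *sizes, size_t capacity)
{
    size_t count = 0;
    for (size_t s = MIN_MSG_BYTES; s <= MAX_MSG_BYTES; s *= 2) {
        assert(count < capacity);
        sizes[count++] = s;
    }
    return count;
}

/* Bandwidth in MiB/s for `bytes` transferred in `seconds`. */
double bandwidth_mib_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}
```

Doubling from 1 B to 1 MiB gives 21 message sizes, which matches the 21 tick labels on the latency and bandwidth plots.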
Bandwidth: 4 threads overtake single thread
[Figure: bandwidth (MiB/s, axis 0 to 8000) versus message size (1 B to 1 MiB, doubling) for 1, 4, and 16 threads]
Message Rate: 8 Node Halo Exchange
[Figure: message rate (axis 0 to 1.2 x 10^7) versus message size (8 B to 1 MiB, doubling) for 1, 4, and 16 threads]
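For a multi-threaded run, an aggregate message rate can be sketched as the total number of messages issued by all threads divided by the elapsed time. The slides do not define the metric or its unit, so this formula and the name `message_rate` are assumptions.

```c
/* Sketch only: aggregate message rate over several threads.
 * The definition (messages per second) is an assumption, not from the slides. */
#include <assert.h>
#include <stddef.h>

/* Sum the per-thread message counts and divide by the elapsed seconds. */
double message_rate(const unsigned long *per_thread_msgs, size_t num_threads,
                    double seconds)
{
    assert(per_thread_msgs != NULL && num_threads > 0 && seconds > 0.0);
    unsigned long long total = 0;
    for (size_t t = 0; t < num_threads; t++)
        total += per_thread_msgs[t];
    return (double)total / seconds;
}
```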
Mini-apps
Future work
} Thread-based synchronization methods
} Remove MPI calls from non-critical paths in mini-apps
} Add multi-threading to the computation component of mini-apps
} Add mini-apps with asynchronous algorithms
} Investigate SHMEM-MT performance on various network architectures
Acknowledgements
} Oak Ridge Leadership Computing Facility
} Scalable Systems Lab
} Sandia National Laboratories
◦ Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Questions?