Stratified Magnetohydrodynamics Accelerated Using GPUs: SMAUG
DESCRIPTION

Stratified Magnetohydrodynamics Accelerated Using GPUs: SMAUG. The Sheffield Advanced Code (SAC) is a novel fully non-linear MHD code, based on the Versatile Advection Code (VAC), designed for simulations of linear and non-linear wave propagation.

TRANSCRIPT

Page 1: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Stratified Magnetohydrodynamics Accelerated Using GPUs: SMAUG

Page 3: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Full Perturbed MHD Equations for Stratified Media
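A brief schematic of the perturbed-variable formulation these equations are built on (a sketch of the general SAC approach, not a reproduction of the full set shown on the slide): each quantity is written as a static, gravitationally stratified background plus a perturbation, for example

    \rho = \rho_b + \tilde{\rho}, \qquad e = e_b + \tilde{e}, \qquad \mathbf{B} = \mathbf{B}_b + \tilde{\mathbf{B}},

and the equations are advanced for the perturbed quantities only, so that, for instance, the continuity equation takes the form

    \frac{\partial \tilde{\rho}}{\partial t} + \nabla \cdot \left[ (\rho_b + \tilde{\rho})\,\mathbf{v} \right] = 0 .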

Page 4: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Numerical Diffusion

• Central differencing can generate numerical instabilities
• It is difficult to find solutions for shocked systems
• We define a hyperviscosity parameter as the ratio of the third-order to the first-order forward difference of a variable (see the sketch after this list)
• By tracking the evolution of the hyperviscosity we can identify numerical noise and apply smoothing where necessary
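A minimal sketch of this ratio for a one-dimensional field, using simple forward differences (the function name and the 1D setting are illustrative assumptions, not SMAUG source):

    #include <math.h>

    /* Ratio of the largest third-order to the largest first-order forward
       difference of u; a large value flags grid-scale numerical noise. */
    double hyperviscosity_ratio(const double *u, int n)
    {
        double max_d1 = 0.0, max_d3 = 0.0;
        for (int i = 0; i + 1 < n; ++i) {
            double d1 = fabs(u[i + 1] - u[i]);          /* first-order forward difference */
            if (d1 > max_d1) max_d1 = d1;
        }
        for (int i = 0; i + 3 < n; ++i) {
            /* third-order forward difference: u[i+3] - 3u[i+2] + 3u[i+1] - u[i] */
            double d3 = fabs(u[i + 3] - 3.0 * u[i + 2] + 3.0 * u[i + 1] - u[i]);
            if (d3 > max_d3) max_d3 = d3;
        }
        return (max_d1 > 0.0) ? max_d3 / max_d1 : 0.0;
    }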

Page 5: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Why MHD Using GPUs?

[Figure: 3x3 central-difference stencil around F(i,j):

    F(i-1,j+1)  F(i,j+1)  F(i+1,j+1)
    F(i-1,j)    F(i,j)    F(i+1,j)
    F(i-1,j-1)  F(i,j-1)  F(i+1,j-1)

used to advance the flux equation ∂F/∂t = S in time.]

• Excellent scaling with GPUs, but:
• Central differencing requires numerical stabilisation
• Stabilisation with GPUs is trickier, requiring:
  – a reduction/maximum routine
  – an additional and larger mesh
• Consider a simplified 2D problem (see the sketch after this list):
  – solving the flux equation
  – derivatives using central differencing
  – time stepping using Runge-Kutta
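As an illustration of this simplified 2D problem, a minimal CUDA sketch of one central-difference stage for a linearly advected scalar (the kernel name, the constant advection speeds vx and vy, and the interior-only update are illustrative assumptions, not SMAUG source):

    /* One explicit stage of du/dt + d(vx*u)/dx + d(vy*u)/dy = 0 using
       central differences; Runge-Kutta would combine several such stages. */
    __global__ void central_diff_stage(const double *u, double *u_new,
                                       double vx, double vy,
                                       double dt, double dx, double dy,
                                       int nx, int ny)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;  /* interior points only */

        int idx = j * nx + i;
        /* central differences of the fluxes F = vx*u and G = vy*u */
        double dFdx = vx * (u[idx + 1]  - u[idx - 1])  / (2.0 * dx);
        double dGdy = vy * (u[idx + nx] - u[idx - nx]) / (2.0 * dy);

        u_new[idx] = u[idx] - dt * (dFdx + dGdy);
    }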

Page 6: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Halo Messaging

• Each processor has a "ghost" layer:
  – used in the calculation of the update
  – obtained from the neighbouring left and right processors
  – each processor passes its top and bottom layers to the neighbouring processors, where they become the neighbours' ghost layers
• Distribute the rows over the processors, N/nproc rows per processor; every processor stores all N columns (see the sketch after this list)
• SMAUG-MPI implements messaging using a 2D halo model for 2D and a 3D halo model for 3D
• Consider a 2D model: for simplicity, distribute the layers over a line of processes
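A small sketch of the row decomposition described above (the function and variable names are illustrative assumptions, not SMAUG source):

    #include <mpi.h>

    /* N rows split over nproc ranks; each rank keeps all N columns plus one
       ghost row above and below, filled by halo messaging. */
    void decompose_rows(int N, MPI_Comm comm,
                        int *local_rows, int *up, int *down)
    {
        int rank, nproc;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nproc);

        int base = N / nproc, rem = N % nproc;
        *local_rows = base + (rank < rem ? 1 : 0);   /* interior rows owned by this rank */

        *up   = (rank == 0)         ? MPI_PROC_NULL : rank - 1;  /* neighbour above */
        *down = (rank == nproc - 1) ? MPI_PROC_NULL : rank + 1;  /* neighbour below */
        /* Local storage is (local_rows + 2) x N; the two extra rows are the ghosts. */
    }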

Page 7: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

[Figure: halo exchange along a line of four processors. Each processor sends its top layer to the processor above and its bottom layer to the processor below, and receives the corresponding layers from its neighbours into its own ghost rows (the original figure labels the exchanged layer pairs, e.g. p1min/p2max and p2min/p3max).]

Page 8: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

MPI Implementation

• Based on the halo messaging technique employed in the SAC code:

    void exchange_halo(vector v) {
        gather halo data from v into gpu_buffer1
        cudaMemcpy(host_buffer1, gpu_buffer1, ...);
        MPI_Isend(host_buffer1, ..., destination, ...);
        MPI_Irecv(host_buffer2, ..., source, ...);
        MPI_Waitall(...);
        cudaMemcpy(gpu_buffer2, host_buffer2, ...);
        scatter halo data from gpu_buffer2 to halo regions in v
    }
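A slightly more concrete sketch of this staged exchange, with counts, tags and buffer handling filled in purely for illustration (assumed names, not the SMAUG source):

    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_halo_staged(double *gpu_buffer1, double *gpu_buffer2,
                              double *host_buffer1, double *host_buffer2,
                              int count, int destination, int source, MPI_Comm comm)
    {
        MPI_Request req[2];

        /* halo data has already been packed into gpu_buffer1 by a gather kernel */
        cudaMemcpy(host_buffer1, gpu_buffer1, count * sizeof(double),
                   cudaMemcpyDeviceToHost);

        /* exchange through host memory */
        MPI_Isend(host_buffer1, count, MPI_DOUBLE, destination, 0, comm, &req[0]);
        MPI_Irecv(host_buffer2, count, MPI_DOUBLE, source,      0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        /* copy the received halo back to the device; a scatter kernel then
           unpacks gpu_buffer2 into the ghost cells of the field */
        cudaMemcpy(gpu_buffer2, host_buffer2, count * sizeof(double),
                   cudaMemcpyHostToDevice);
    }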

Page 9: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Halo Messaging with GPU Direct

    void exchange_halo(vector v) {
        gather halo data from v into gpu_buffer1
        MPI_Isend(gpu_buffer1, ..., destination, ...);
        MPI_Irecv(gpu_buffer2, ..., source, ...);
        MPI_Waitall(...);
        scatter halo data from gpu_buffer2 to halo regions in v
    }

• Simpler, faster call structure: the GPU buffers are passed directly to MPI, removing the host staging copies
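For comparison with the staged version, a hedged sketch of this GPU-direct variant, which hands the device buffers straight to a CUDA-aware MPI library (assumed names and arguments, not the SMAUG source):

    #include <mpi.h>

    /* requires an MPI build that accepts device pointers (CUDA-aware / GPUDirect) */
    void exchange_halo_gpudirect(double *gpu_buffer1, double *gpu_buffer2,
                                 int count, int destination, int source, MPI_Comm comm)
    {
        MPI_Request req[2];
        MPI_Isend(gpu_buffer1, count, MPI_DOUBLE, destination, 0, comm, &req[0]);
        MPI_Irecv(gpu_buffer2, count, MPI_DOUBLE, source,      0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }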

Page 10: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Progress with MPI Implementation

• Successfully running two-dimensional models under GPU direct:
  – Wilkes GPU cluster at the University of Cambridge
  – N8 GPU facility, Iceberg
• The 2D MPI version is verified
• Currently optimising communications performance under GPU direct
• The 3D MPI implementation is complete but still requires testing

Page 11: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Orszag-Tang Test

200x200 Model at t=0.1, t=0.26, t=0.42 and t=0.58s

Page 12: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

A Model of Wave Propagation in the Magnetised Solar Atmosphere

The model features a flux tube with a torsional driver, in a fully stratified quiet solar atmosphere based on VALIIIC.

The grid size is 128x128x128, representing a box in the solar atmosphere of dimensions 1.5x2x2 Mm.

The flux tube has a magnetic field strength of 1000 G.

Driver amplitude: 200 km/s.

Page 13: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Timing for Orszag-Tang Using SAC/SMAUG with Different Architectures

[Figure: time for 100 iterations (seconds, 0 to 600) versus grid dimension (0 to 5000) for NVIDIA M2070, NVIDIA K20, Intel E5-2670 8-core, NVIDIA K40, K20 (2x2) and K20 (4x4).]

Page 14: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Performance Results (Hyperdiffusion disabled)

Grid size (number of GPUs)   With GPU direct (time in s)   Without GPU direct (time in s)
1000x1000 (1)                31.54                         31.5
1000x1000 (2x2)              11.28                         11.19
1000x1000 (4x4)              12.89                         13.7
2044x2044 (2x2)              41.3                          41.32
2044x2044 (4x4)              42.4                          43.97
4000x4000 (4x4)              77.37                         77.44
8000x8000 (8x8)              63.3                          61.7
8000x8000 (10x10)            41.9                          41.0

• Timings in seconds for 100 iterations (Orszag-Tang test)

Page 15: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Performance Results (With Hyperdiffusion enabled)

Grid size (number of GPUs)   Without GPU direct (time in s)
2044x2044 (2x2)              184.1
2044x2044 (4x4)              199.89
4000x4000 (4x4)              360.71
8000x8000 (8x8)              253.8
8000x8000 (10x10)            163.6

• Timings in seconds for 100 iterations (Orszag-Tang test)

Page 16: Stratified  Magnetohydrodynamics  Accelerated Using  GPUs:SMAUG

Conclusions

• We have demonstrated that we can successfully compute large problems by distributing them across multiple GPUs
• For 2D problems the performance of messaging with and without GPU direct is similar
  – This is expected to change when 3D models are tested
• It is likely that much of the communications overhead arises from the routines used to transfer data within GPU memory
  – Performance enhancements are possible through modification of the application architecture
• Further work is needed with larger models for comparisons with the x86 implementation using MPI
• The algorithm has been implemented in 3D; testing of 3D models will be undertaken over the forthcoming weeks