accelerating atmospheric simulation on gpu, fpga, and mic · 19/09/2013  · 19/31...

38
Accelerating Atmospheric Simulation on GPU, FPGA, and MIC Haohuan Fu [email protected] Center for Earth System Science Tsinghua University, Beijing Sep/19/2013 @ NCAR

Upload: others

Post on 25-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Accelerating Atmospheric Simulation

on GPU, FPGA, and MIC

Haohuan Fu

[email protected]

Center for Earth System Science

Tsinghua University, Beijing

Sep/19/2013 @ NCAR

Page 2: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

The Center for Earth System Science,

Tsinghua University

Started in 2009

Study of the earth as an integrated system

to investigate interactions between atmosphere, land, water, ice, biosphere, societies, technologies, and economics

observing, understanding, and predicting global changes

to guide political / economical / technical decisions at different scales for assuring sustainable development

2

CMIP5: LASG-CESS

Page 3: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

The Present Faculty

3

Page 4: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Tsinghua HPGC Group

HPGC: high performance geo-computing http://www.thuhpgc.org

High performance computational solutions for geoscience applications simulation-oriented research: providing highly efficient and highly

scalable simulation applications (climate modeling, exploration geophysics)

data-oriented research: data processing, data compression, and data mining

Combine optimizations from three different perspectives (Application, Algorithm, and Architecture), especially focused on new accelerator architectures

Page 5: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Application Climate Modeling

global-scale atmospheric simulation (800 Tflops Shallow Water Equation)

GPU-based acceleration for GEOS-Chem

FPGA-based acceleration for weather forecasting acceleration

Exploration Geophysics

forward modeling / inversion / migration

Remote Sensing Data Processing

data analysis, visualization, correlation of different data sets

Algorithm parallel Stencil on Different HPC Architectures

parallel Sparse Matrix Solver

parallel Data Compression (PLZMA)

hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup

Architecture multi-core/many-core (CPU, GPU, MIC)

reconfigurable hardware (FPGA)

Page 6: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Application Climate Modeling

global-scale atmospheric simulation (800 Tflops Shallow Water Equation)

GPU-based acceleration for GEOS-Chem

FPGA-based acceleration for weather forecasting acceleration

Exploration Geophysics

forward modeling / inversion / migration

Remote Sensing Data Processing

data analysis, visualization, correlation of different data sets

Algorithm parallel Stencil on Different HPC Architectures

parallel Sparse Matrix Solver

parallel Data Compression (PLZMA)

hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup

Architecture multi-core/many-core (CPU, GPU, MIC)

reconfigurable hardware (FPGA)

Page 7: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Outline

Tianhe-1A: GPU

Maxeler DFE: FPGA

Tianhe-2: MIC

Future Plan & Discussion

Page 8: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Multidisciplinary Collaborations

Prof. Chao Yang Institute of Software, CAS

computational mathematics

Dr. Wei Xue Department of Computer Science, Tsinghua University

HPC (MPI, OpenMP, MIC)

Dr. Haohuan Fu Center for Earth System Science, Tsinghua University

HPC (accelerators, GPU, FPGA, MIC)

Prof. Lanning Wang College of Global Change and Earth System Science, Beijing

Normal University (BNU-ESM)

climate scientist

Page 9: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Outline

Tianhe-1A: GPU

Maxeler DFE: FPGA

Tianhe-2: MIC

Future Plan & Discussion

Page 10: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Highly-Scalable Framework for Global

Atmospheric Simulation on Tianhe-1A

Starting from shallow wave equation cubed-sphere mesh grid

adjustable partition between CPU and GPU

scale to 40,000 CPU cores and 3750 GPUs with a sustainable performance of 800 TFlops

Page 11: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Mesh and Stencil

13-point stencil

• Spatially discretized with a cell-

centred finite volume method

• Integrated with a second-order

accurate TVD Runge-Kutta method

Interp. Across patches

1-d linear interpolation

Page 12: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Improved hybrid CPU-GPU Algorithm

Note: halo1/2/3/4 — the 4 steps of the “pipe-flow” communication scheme

adjustable partition between CPU and GPU

Page 13: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

“Pipe-Flow” Scheme for Message-Passing

on Cubed-Sphere

Four steps to arrange

conflict-free message-

passing on cubed-

sphere.

The arrows indicate

directions of data

entering or exiting

patches as a pipe flow.

For more details, please refer to our PPoPP 2013 paper: “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric

Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

(PPoPP), pp. 1-12, Shenzhen, 2013. .

Page 14: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Outline

Tianhe-1A: GPU

Maxeler DFE: FPGA

Tianhe-2: MIC

Future Plan & Discussion

Page 15: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Highly-Scalable Atmospheric Simulation

on Data-Flow Engines

Maxeler Data-Flow Engine (DFE) Field-Programmable Gate Arrays (FPGA)

24 GB onboard memory

PCIE connection to host

MaxRing connection between cards

Page 16: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

halo

hal

o

hal

o

halo

For each stencil cycle

FPGA side:

①Inner-part stencil

CPU side:

①Update halos

②Interpolate if

necessary

③Outer-part stencils

BARRIER:

CPU-FPGA exchange

FPGA

CPU

Inner part

Outer part

Hybrid CPU+FPGA Design

Page 17: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

[email protected] 13/31

CPU-FPGA workflow

CPU

FPGA

Halo

Updating Interpolation Outer-part stencil

One stencil cycle

Inner-part stencil

C2F

F2C

① ② ③ ④ ⑤ BARRIER

Work Flow

Page 18: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

[email protected] 19/31

Baseline: a straightforward double-precision SWEs

Resource baseline

LUTs 299 %

FFs 220 %

BRAMs 20 %

DSPs 189 %

Resource Cost of SWEs

on Virtex-6 SX457T

Operations num

ADD/SUB 434

MUL 570

DIV 99

Others 45

Floating point operations

of SWE stencil

Precision-based optimization to further decrease the resource usage

Go for a Mixed-Precision Design

Page 19: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Analysis of the Dynamic Range

fixed-point fixed-point

fixed-point

reduced-precision

Page 20: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Precision Exploration

Page 21: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

General Architecture of the Mixed-

Precision Design

Page 22: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Resource Cost of SWEs on Virtex-6

SX457T

Baseline: a straightforward double-precision SWEs

Mixed-precision: fixed-point and reduced-precision floating-point

Resource baseline mixed-

precision

LUTs 299 % 76.17%

FFs 220 % 53.41%

BRAMs 20 % 12.59 %

DSPs 189 % 44.84 %

Page 23: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Hardware Platform: Maxeler DFEs

Environment

Maxcompiler development tool

MaxWorkstation • One Intel i7 quad-core CPU

• One Accelerator card (Virtex-6 SX 475T & 24 GB DRAM)

MaxNode • 12 Intel Xeon CPU cores

• four Accelerator cards (Virtex-6 SX 475T & 24 GB DRAM)

Page 24: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Performance Results

Platform Performance

(𝑝𝑜𝑖𝑛𝑡𝑠/𝑠𝑒𝑐𝑜𝑛𝑑)

Speedup

6-core CPU 4.66K 1

Tianhe-1A node 110.38K 23x

MaxWorkstation 468.1K 100x

MaxNode 1.54M 330x

14x Meshsize: 1024 × 1024 × 6

MaxNode speedup over Tianhe node: 14 times

Page 25: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Power Efficiency

Platform Efficiency (𝑝𝑜𝑖𝑛𝑡𝑠/(𝑠𝑒𝑐𝑜𝑛𝑑 × 𝑤𝑎𝑡𝑡) )

Speedup

6-core CPU 20.71 1

Tianhe-1A node 306.6 14.8x

MaxWorkstation 2.52K 121.6x

MaxNode 3K 144.9x

Meshsize: 1024 × 1024 × 6

MaxNode is 9 times more power efficient

9 x

For more details, please refer to our FPL 2013 paper: “Accelerating Solvers for Global Atmospheric Equations Through Mixed-

Precision Data Flow Engine”, in Proceedings of the 23rd International Conference on Field Programmable Logic and

Applications, 2013.

Page 26: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Outline

Tianhe-1A: GPU

Maxeler DFE: FPGA

Tianhe-2: MIC

Future Plan & Discussion

Page 27: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Tianhe-2: Brief Introduction

Tianhe-2

16,000 nodes

each node contains two 12-core Intel Ivy Bridge

CPUs, and 3 Intel Xeon Phi Acceleration Cards

peak: 54.9 PFlops

LINPACK: 33.8 PFlops

Page 28: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Running SWE on Tianhe-2

Hierarchical 2D

domain

decomposition

Balanced CPU/MIC

utilization

0-3 MICs

adjustable blocks

28

Without MICs With 1 MIC

With 2 MICs With 3 MICs

Page 29: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Running SWE on Tianhe-2: Workflow

29 Note: C2M — data movement from CPU to MIC

M2C — data movement from MIC to CPU

halo1/2/3/4 — the 4 steps of the “pipe-flow” communication scheme

Page 30: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Optimization Scheme

base version

• serial

• no-VEC

multi-thread

• OpenMP

multi-thread + VEC

• compiler

• Cilk

Cache blocking

• different level

• auto-search

30

Page 31: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Scaling the Performance

Page 32: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

SWE Performance on Tianhe-2

MIC against 24 CPU cores 1

.34

2.1

1

2.6

2

1.2

2

2.3

3

2.9

1

1.1

8 2

.26

3.1

5

0

0.5

1

1.5

2

2.5

3

3.5

1 MIC 2 MICs 3 MICs

1024*1024*6

2048*2048*6

4096*4096*6

Page 33: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

SWE Performance on Tianhe-2

Weak Scaling Using up to 8,652 nodes (207,648 CPU cores + 1,583,316 MIC

cores)

90.00%

92.00%

94.00%

96.00%

98.00%

100.00%

6 24 96 384 1536 1944 2400 3456 4056

# of

unknowns

for the

largest run:

200 billion

Page 34: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Outline

Tianhe-1A: GPU

Maxeler DFE: FPGA

Tianhe-2: MIC

Future Plan & Discussion

Page 35: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Highly-Scalable Framework for Global

Atmospheric Simulation

evolve from “2D Shallow Water Wave Equations”

to “3D Euler Equations”

35

Page 36: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Highly-Scalable Framework for Global

Atmospheric Simulation

Model development:

from 2D SWE to 3D Euler

coupling the 3D Euler dynamics with physics

processes to test for local and global scenarios

HPC:

an FPGA-based cluster for climate modeling?

larger-scale runs on 100P supercomputer

(dynamic + physics)

Page 37: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

Thank You!

[email protected]

Page 38: Accelerating Atmospheric Simulation on GPU, FPGA, and MIC · 19/09/2013  · 19/31 lin.an27@gmail.com Baseline: a straightforward double-precision SWEs Resource baseline LUTs 299

38