
Page 1

Sunway TaihuLight: Designing and Tuning Scientific Applications at the Scale of 10 Million Cores

Haohuan Fu

National Supercomputing Center in Wuxi

Department of Earth System Science, Tsinghua University

ICAS, September …

Page 2

Outline

Sunway Machine: the Challenges and Opportunities

Scientific Computing with 10 Million Cores

Long Term Plan for Sunway TaihuLight

Page 3

The Sunway Machine Family

Sunway-I:
- CMA service, 1998
- commercial chip
- 0.384 Tflops
- 48th of TOP500

Sunway BlueLight:
- NSCC-Jinan, 2011
- 16-core processor
- 1 Pflops
- 14th of TOP500

Sunway TaihuLight:
- NSCC-Wuxi, 2016
- 260-core processor
- 125 Pflops
- 1st of TOP500

Page 4

SW26010: Sunway Core Processor

[Figure: four core groups (Core Group 0 to 3) connected by a NoC and a data transfer network. Each core group has an MPE, an 8*8 CPE mesh, a PPU, an iMC and its own memory. Inside the 8*8 CPE mesh, each computing core has registers and an LDM and is linked by row and column communication buses, a control network and a transfer agent (TA). Levels shown: memory level, LDM level, register level, computing level.]
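The core counts quoted throughout the talk follow from the core-group layout above. Below is a small worked check. It assumes 4 core groups per processor (Core Group 0 to 3), one MPE plus an 8*8 CPE mesh per group, and one MPI process per core group for the 163,840-process run described in the atmospheric solver slides.

```python
# Worked check of SW26010 / TaihuLight core counts (illustrative arithmetic only).
CORE_GROUPS_PER_CHIP = 4   # Core Group 0..3 in the diagram
MPE_PER_CG = 1             # one MPE per core group
CPE_MESH = 8 * 8           # 8x8 mesh of CPEs per core group

cores_per_cg = MPE_PER_CG + CPE_MESH                   # 65
cores_per_chip = CORE_GROUPS_PER_CHIP * cores_per_cg   # 260

# Assumption: one MPI process per core group, one thread per core in that group.
processes = 163_840
threads_per_process = cores_per_cg
total_cores = processes * threads_per_process
chips = processes // CORE_GROUPS_PER_CHIP

print(cores_per_cg, cores_per_chip, chips, total_cores)
```

Under these assumptions, 163,840 processes with 65 threads each give about 10.6 million cores. That is the "10 Million Cores" of the title.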

Page 5

High-Density Integration of the Computing System

A Five-Level Integration Hierarchy:
- computing node
- computing board
- super node
- cabinet
- entire computing system

Page 14

How to Connect the 10 Million Cores?

- 2D core array with row and column buses
- Network on Chip
- Customized Network Board to Fully Connect 256 Nodes
- Sunway Net

Page 19

Tweet Comments from Prof. Satoshi Matsuoka

Sunway Micro

Page 21

Machine Capability Comparison

[Figure: bar charts comparing TaihuLight, Tianhe-2, Titan and the K Computer on a relative scale from 0 to 3. Left panel: peak performance, memory size, Gflops/Watt, Tflops/m^3, memory bandwidth, communication bandwidth. Right panel: Linpack, Graph, HPCG and HPGMG.]

Page 23

Major Features to Consider

Sunway TaihuLight:
- 125 Pflops
- 32 GB and 136 GB/s per node: 22 flops/byte
- 10 million cores
- MPE + CPE
- user-controlled 64 KB LDM
- register communication among CPEs

For comparison: Intel KNL 7250 of Cori: 6.5 flops/byte; NVIDIA P100 of Piz Daint: 7.2 flops/byte
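The flops/byte figures are machine-balance ratios: peak compute divided by memory bandwidth. Below is a rough check of the TaihuLight value, followed by a simple roofline bound. It assumes the 125 Pflops peak is spread evenly over 40,960 nodes (163,840 core groups at four per node). The helper names and the 1 flop/byte example kernel are illustrative only.

```python
def machine_balance(peak_flops: float, mem_bw_bytes: float) -> float:
    """Flops a node can perform per byte moved from main memory."""
    return peak_flops / mem_bw_bytes

def roofline(peak_flops: float, mem_bw_bytes: float, intensity: float) -> float:
    """Attainable flop rate for a kernel with the given flops/byte intensity."""
    return min(peak_flops, intensity * mem_bw_bytes)

NODES = 163_840 // 4            # assumption: 4 core groups per node
node_peak = 125e15 / NODES      # ~3.05 Tflops per node
node_bw = 136e9                 # 136 GB/s per node (slide)

balance = machine_balance(node_peak, node_bw)
print(f"TaihuLight node balance: {balance:.1f} flops/byte")  # ~22.4

# A hypothetical memory-bound kernel at 1 flop/byte is capped by bandwidth:
attainable = roofline(node_peak, node_bw, intensity=1.0)
print(f"fraction of peak at 1 flop/byte: {attainable / node_peak:.1%}")
```

The higher this ratio, the more a kernel must reuse data in LDM or registers before it stops being bandwidth-bound. That is why the talk treats the memory wall as a major challenge.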

Page 24

Major Challenge 1: Scaling

Page 26

Major Challenge 2: Memory Wall

Refactoring and Redesigning

Page 30

Incomplete List of Full-Scale Applications

2016:
- Fully Implicit Solver for Atmospheric Dynamics (2016 Gordon Bell Prize)
- Surface Wave Modeling
- Phase Field Simulations of Coarsening Dynamics
- Atomistic Simulation of Silicon Nanowires
- Run-away Electron Trajectory Simulation
- Genome Functional Annotation and Homeotic Gene Building
- Spacecraft CFD Numerical Simulation

2017 (including 2017 Gordon Bell Finalists):
- Extreme-scale Graph Processing Framework
- Simulation of Planetary Rings
- Simulations of Quantum Spin Liquid States via PEPS++
- Molecular Dynamics Simulation of Condensed Covalent Materials
- cryo-EM Macromolecule Structure Determination
- Redesigning CAM-SE
- Nonlinear Earthquake Simulation

Page 31

Application I: Implicit Solver for Atmospheric Dynamics

[Figure: hierarchy of racks, chips, core groups and cores, giving the total number of cores: 163,840 processes × 65 threads.]

- DD-MG K-cycle
- Very shallow
- Uniform DD
- Plug & Play

Now let's find a way to design a subdomain solver.

Page 32

[Figure: racks, chips, core groups and cores: 163,840 processes × 65 threads; DD-MG K-cycle.]

Geometry-based pipelined ILU (GP-ILU)

[Figure: subdomain matrix of 1st order with geometric index; two-level pipeline over 8×8 CPE blocks along the X, Y and Z directions, with block height blk_height and synchronization avoiding; the choice of blk_height is tied to dim_z, num_cores, cell_size and reg_size.]

Our design goals:
1. Single sweep
2. Synchronization-free
3. Improved data locality
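GP-ILU relies on the geometric structure of the subdomain matrix. In a forward (lower-triangular) sweep, a cell depends only on neighbours with smaller indices, so all cells on the same diagonal wavefront are independent and can be processed in parallel.

The sketch below is a generic 2D wavefront illustration with made-up coefficients. It assumes a toy 5-point stencil whose lower part couples cell (i, j) to (i-1, j) and (i, j-1). It is not the authors' GP-ILU, which pipelines in 3D over 8×8 CPE blocks.

```python
# Toy lower-triangular system L y = b on a structured 2D grid (assumption):
#   y[i][j] = (b[i][j] - w * y[i-1][j] - s * y[i][j-1]) / d

def _update(y, b, i, j, w, s, d):
    prev_i = y[i - 1][j] if i > 0 else 0.0
    prev_j = y[i][j - 1] if j > 0 else 0.0
    y[i][j] = (b[i][j] - w * prev_i - s * prev_j) / d

def sweep_lexicographic(b, w, s, d):
    """Reference order: row by row, strictly sequential."""
    nx, ny = len(b), len(b[0])
    y = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            _update(y, b, i, j, w, s, d)
    return y

def sweep_wavefront(b, w, s, d):
    """Same sweep, ordered by anti-diagonals i + j == wave."""
    nx, ny = len(b), len(b[0])
    y = [[0.0] * ny for _ in range(nx)]
    for wave in range(nx + ny - 1):
        # Cells on one anti-diagonal only depend on the previous wave,
        # so a parallel version could hand them to different cores.
        for i in range(max(0, wave - ny + 1), min(nx, wave + 1)):
            _update(y, b, i, wave - i, w, s, d)
    return y

b = [[float(i * 7 + j) for j in range(5)] for i in range(4)]
assert sweep_wavefront(b, 0.3, 0.2, 4.0) == sweep_lexicographic(b, 0.3, 0.2, 4.0)
```

Both orders perform identical arithmetic per cell, so the results match exactly. Only the schedule changes, and that schedule is what a pipelined design spreads across cores.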

Page 33

Strong Scaling Results

[Figure: parallel efficiency (0% to 100%) versus total number of cores (1M to 11M), with labeled efficiencies of 33% (GB'15), 67% and 45%.]

Page 34

Weak Scaling Results

[Figure: simulation speed in SYPD (log scale, 0.00125 to 0.16) versus total number of cores (0.33M, 0.67M, 1.33M, 2.66M, 5.32M, 10.64M) and resolution (2.480, 1.389, 0.920, 0.620 and 0.488 km), implicit versus explicit solver. Labeled speedups of 34X and 89.5X, sustained performance of 7.95 DP-PF and 23.66 DP-PF, DOFs = 772B, and the annotation "Exa-scale" for exp…]

The 488-m resolution run: 0.07 SYPD on 10.6M cores with dt = 240 s, an 89.5X speedup over the explicit solver.
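SYPD (simulated years per day) can be turned into the wall-clock budget per time step. Below is a worked example for the 488-m run quoted above (0.07 SYPD, dt = 240 s), assuming a 365-day model year.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MODEL_YEAR = 365    # assumption: no-leap model calendar

def steps_per_model_year(dt_seconds: float) -> float:
    """Number of time steps needed to simulate one model year."""
    return DAYS_PER_MODEL_YEAR * SECONDS_PER_DAY / dt_seconds

def wallclock_per_step(sypd: float, dt_seconds: float) -> float:
    """Wall-clock seconds spent per time step at the given SYPD."""
    steps_per_wall_day = sypd * steps_per_model_year(dt_seconds)
    return SECONDS_PER_DAY / steps_per_wall_day

steps = steps_per_model_year(240)          # 131,400 steps per model year
t_step = wallclock_per_step(0.07, 240)     # roughly 9.4 s per step
print(f"{steps:,.0f} steps/year, {t_step:.2f} s per step")
```

Under this assumption, each implicit time step on 10.6M cores takes a few seconds of wall-clock time. This makes the large dt of the implicit solver worthwhile compared with the small steps an explicit scheme would need.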

Page 35

Application II: Porting CESM and Redesigning CAM-SE for Sunway TaihuLight

[Figure: CESM1.2.0 with the CPL7 coupler and its component models, including CAM5.0.]

Tsinghua + BNU: 30+ professors and students

• Four component models, millions of lines of code
• Large-scale run on Sunway TaihuLight
• 24,000 MPI processes
• Over one million cores
• 10-20x speedup for kernels
• 2-3x speedup for the entire model

“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of SC 2016.

Page 37

Major Challenges

- high application complexity and a heavy legacy code base (millions of lines of code)
- an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
- a misfit between the in-place design philosophy and the new architecture
- a lack of people with interdisciplinary knowledge and experience

Page 38

OpenACC-based Refactoring of CAM

[Figure: CAM execution stages CAM initial, Dyn_run, Phy_run1 and Phy_run2, with state variables, state variables and tracers, and tracers (u, v) passed between them and to dynamics.]

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
          Qtens(k,q,ie) = …
        end do
      end do
    end do

Euler_step:

    do ie = nets, nete
      compute Q min/max values for lim8
      compute Biharmonic mixing term f
    end do
    do ie = nets, nete
      2D advection step
      data packing
    end do
    Boundary exchange
    Data extracting

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_1()
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(dp(k))
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = …
        end do
      end do
    end do
    Data packing

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          …
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        dp0 = func_3()
        dpdiss = func_4()
        do q = 1, qsize
          Qtens(k,q,ie) = func_5(dp0, dpdiss)
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_5()
        Vstar(k) = func_6()
      end do
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
        end do
        do k = 1, nlev
          dp_star(k) = func_8(dp(k))
        end do
        do k = 1, nlev
          Qtens(k,q,ie) = func_9(dp_star(k))
        end do
      end do
      Data packing
    end do

optimized:

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(func_1())
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          …
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_5(func_3(), func_4())
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(func_5(), func_6())
        end do
        do k = 1, nlev
          Qtens(k,q,ie) = func_9(func_8(func_5()))
        end do
      end do
      Data packing
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
          Qtens(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = …
        end do
      end do
    end do
    Data packing

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets+1)
      do k = 1, nlev
        q = func(ie_q)
        ie = func(ie_q)
        qmin(k,q,ie) = …
        qmax(k,q,ie) = …
        Qtens(k,q,ie) = …
      end do
    end do

    !$ACC PARALLEL LOOP
    do ie_q = 1, qsize*(nete-nets+1)
      do k = 1, nlev
        q = func(ie_q)
        ie = func(ie_q)
        Qtens(k,q,ie) = …
      end do
    end do

    !$ACC PARALLEL LOOP
    Data packing

• manual transformation of loops
• manual OpenACC parallelization and optimization on code and data structures
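The OpenACC version collapses the element loop (ie) and the tracer loop (q) into a single index ie_q. One parallel loop then exposes qsize times the number of elements as independent iterations. The slide leaves the decoding as func(ie_q).

Below is a minimal sketch of one possible mapping, assuming q varies fastest and Fortran-style 1-based indices. The function name decode_ie_q is hypothetical.

```python
def decode_ie_q(ie_q: int, qsize: int, nets: int) -> tuple[int, int]:
    """Map a 1-based collapsed index to (ie, q), with q varying fastest.

    Hypothetical stand-in for the slide's func(ie_q)."""
    ie_offset, q_offset = divmod(ie_q - 1, qsize)
    return nets + ie_offset, 1 + q_offset

nets, nete, qsize = 3, 6, 4
# Elements run from nets to nete inclusive: nete - nets + 1 of them.
n_iter = qsize * (nete - nets + 1)
pairs = [decode_ie_q(ie_q, qsize, nets) for ie_q in range(1, n_iter + 1)]
expected = [(ie, q) for ie in range(nets, nete + 1) for q in range(1, qsize + 1)]
assert pairs == expected   # every (ie, q) pair is visited exactly once
```

Keeping q fastest means neighbouring iterations touch neighbouring tracers of the same element. With a (k, q, ie) array layout, that tends to keep memory accesses close together.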

    do begin_chunk to end_chunk
      tphysbc(){
        convect_deep_tend (6.47%)
        convect_shallow_tend (15.57%)
        macrop_driver_tend (8.38%)
        microp_aero_run (4.29%)
        microp_driver_tend (7.13%)
        aerosol_wet_intr (4.29%)
        convect_deep_tend_2 (0.51%)
        radiation_tend (54.07%)
      }
    enddo

    tphysbc(){
      do begin_chunk to end_chunk
        convect_deep_tend (6.47%)
        convect_shallow_tend (15.57%)
        macrop_driver_tend (8.38%)
        microp_aero_run (4.29%)
        microp_driver_tend (7.13%)
        aerosol_wet_intr (4.29%)
        convect_deep_tend_2 (0.51%)
        radiation_tend (54.07%)
      enddo
    }

    tphysbc(){
      do begin_chunk to end_chunk
        convect_deep_tend (6.47%)
      enddo
      ……
      do begin_chunk to end_chunk
        microp_driver_tend (7.13%)
      enddo
      ……
      do begin_chunk to end_chunk
        radiation_tend (54.07%)
      enddo
    }

    do begin_chunk to end_chunk
      convect_deep_tend (6.47%){
        zm_conv_tend (6.47%){
          zm_convr (2.03%)
          zm_conv_evap()
          montran()
          convtranc (0.06%)
        }
      }
    enddo

    convect_deep_tend (6.47%){
      zm_conv_tend (6.47%){
        do begin_chunk to end_chunk
          zm_convr (2.03%)
        enddo
        do begin_chunk to end_chunk
          zm_conv_evap()
        enddo
        do begin_chunk to end_chunk
          montran()
        enddo
        do begin_chunk to end_chunk
          convtranc (0.06%)
        enddo
      }
    }

• tool-based transformation of loops
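The physics restructuring moves the chunk loop inside tphysbc and then splits it, so each routine runs over all chunks in its own loop. This loop fission preserves results only when no routine reads another chunk's state within the same pass.

Below is a toy sketch of the transformation. The two routines are hypothetical stand-ins that update per-chunk state independently.

```python
import copy

def convect(state):      # hypothetical stand-in for one physics routine
    state["t"] += 0.1 * state["q"]

def radiation(state):    # hypothetical stand-in for another physics routine
    state["t"] -= 0.05 * state["t"]

def fused(chunks):
    for c in chunks:             # original: chunk loop outside, all routines inside
        convect(c)
        radiation(c)

def fissioned(chunks):
    for c in chunks:             # restructured: one chunk loop per routine
        convect(c)
    for c in chunks:
        radiation(c)

chunks = [{"t": 280.0 + i, "q": 0.01 * i} for i in range(8)]
a, b = copy.deepcopy(chunks), copy.deepcopy(chunks)
fused(a)
fissioned(b)
assert a == b   # identical because each update touches only its own chunk
```

After fission, each loop body contains a single routine. A tool can then offload and tune each routine separately, for example the radiation_tend loop that dominates the profile.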

Page 39

CAM Model: Scalability and Speedup

• million-core scale, … SYPD
• many-core refactoring for the entire model
• simulation speed competitive with the same model on NCAR Yellowstone

[Figure: simulation speed in model years per day (MYPD) versus number of CGs (1024, 2400, 4096, 5120, 7350, 9600, 12000, 24000; each CG includes 1 MPE and 64 CPEs) for three configurations: MPE only; MPE+CPE for the dynamic core; MPE+CPE for both the dynamic core and the physics schemes. Labeled values: 0.04, 0.15, 0.24, 0.25, 0.6, 0.78, 0.87, 1.54, 1.2, 1.62, 1.75, 2.81.]

Page 40

Athread-based Fine-grained Redesign

• Step 1: rewrite of Fortran OpenACC code to Athread C code
  • finer memory control through a specific DMA scheme
  • more efficient vectorization
• Step 2: register-communication-based redesign
  • remove data dependency
  • expose more parallelism

[Figure: three-stage scheme on the 8×8 CPE mesh (C0,0 to C7,7). Stage 1: each CPE Ci,j computes running sums of its 16 elements a16*i, ..., a16*i+15. Stage 2: block totals (a0+...+a15, a16+...+a31, ..., with partial results p15, p31, ..., p127) are accumulated across CPEs. Stage 3: each CPE adds the incoming partial result to its local running sums.]
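Read as a parallel prefix sum (scan) over 64 CPEs with 16 elements each, the three stages can be sketched sequentially as below. This is an interpretation of the diagram, not the authors' Athread code. Register communication between CPEs is modelled as ordinary list passing.

```python
from itertools import accumulate

N_CPE, PER_CPE = 64, 16   # 8x8 CPE mesh, 16 elements per CPE (as in the figure)

def three_stage_scan(a):
    """Inclusive prefix sum of a, organised as the three stages above."""
    assert len(a) == N_CPE * PER_CPE
    blocks = [a[c * PER_CPE:(c + 1) * PER_CPE] for c in range(N_CPE)]
    # Stage 1: each CPE computes running sums of its own 16 elements.
    local = [list(accumulate(blk)) for blk in blocks]
    # Stage 2: block totals are scanned across CPEs; on the hardware this
    # is the step that would use register communication.
    totals = [blk[-1] for blk in local]
    offsets = [0] + list(accumulate(totals))[:-1]   # exclusive scan of totals
    # Stage 3: each CPE adds its incoming offset to its local running sums.
    return [x + off for blk, off in zip(local, offsets) for x in blk]

data = list(range(N_CPE * PER_CPE))
assert three_stage_scan(data) == list(accumulate(data))
```

Only Stage 2 crosses CPE boundaries, and it moves just one value per CPE. That makes it a good fit for low-latency register communication rather than round trips through main memory.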

Page 41

Performance Improvement through Redesign

Depending on the level of redesign, one Sunway CG (64 CPEs) can be equivalent to 0.1x, 1.8x, 7.2x or, in certain cases, 43.1x of an Intel core.

Page 42

Scaling the Dynamic Core to Millions of Cores

[Figure: PFlops (log scale, 0.01 to 5) versus number of processes (512, 2048, 8192, 32768, 131072) for 48, 192, 768 and 650 elements per process. Labeled results: 3.3 PFlops (parallel efficiency 98.55%), 2.4 PFlops (92.2%), 1.76 PFlops (88.3%), 2.72 PFlops (92.9%).]

Page 43

Simulation of Hurricane Katrina

Page 44

Application III: Nonlinear Earthquake Simulation on Sunway TaihuLight

• Dynamic rupture source generator originated from CG-FDM
• Seismic wave propagation originated from AWP-ODC
• Other utilities:
  • source partitioner
  • 3D model interpolator
  • restart controller

[Figure: software workflow. Inputs: 3D Vel/Den Model, Fault Stress Init, Friction Law Ctrl. Dynamic Rupture Source Generator (based on CG-FDM) and Seismic Wave Propagation (based on AWP-ODC), whose wave equation solver loop covers Velocity Update, Stress Update, Source Injection, Stress Adjustment for Plasticity and Next Timestep. Utilities: Source Partitioner, Restart Controller, 3D Model Interpolator, Snapshot/Seismo Recorder. I/O: LZ4 compression, group I/O, balanced I/O forwarding.]
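The wave-propagation loop alternates a velocity update and a stress update on a staggered grid. Below is a minimal 1D sketch of that velocity-stress leapfrog. It assumes a homogeneous elastic medium, second-order differences and made-up parameters. AWP-ODC itself is 3D and higher order, and adds source injection and plasticity.

```python
import math

def simulate(nx=400, steps=150, c=1.0, rho=1.0, dx=1.0, cfl=0.5):
    """1D velocity-stress leapfrog on a staggered grid (illustrative only)."""
    mu = rho * c * c                  # elastic modulus giving wave speed c
    dt = cfl * dx / c                 # CFL-limited time step
    center = nx // 2
    # Velocity lives on integer points; stress on half points i + 1/2.
    v = [math.exp(-(((i - center) * dx) / 10.0) ** 2) for i in range(nx)]
    sigma = [0.0] * (nx - 1)
    for _ in range(steps):
        # Velocity update: rho * dv/dt = d(sigma)/dx (end points stay fixed).
        for i in range(1, nx - 1):
            v[i] += dt / rho * (sigma[i] - sigma[i - 1]) / dx
        # Stress update: d(sigma)/dt = mu * dv/dx, using the new velocities.
        for i in range(nx - 1):
            sigma[i] += dt * mu * (v[i + 1] - v[i]) / dx
    return v

v = simulate()
peak = max(range(len(v)), key=lambda i: abs(v[i]))
print("pulse travelled", abs(peak - len(v) // 2), "cells")
```

The initial velocity pulse splits into two half-amplitude pulses moving in opposite directions. After 150 steps of dt = 0.5, the strongest one should sit roughly 75 cells from the center.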

Page 45

Multi-Level Domain Decomposition

[Figure: (1) MPI decomposition (Mx, My); (2) CG blocking (By, Bz); (3) Athread decomposition (Cy, Cz); (4) LDM buffering scheme with window sizes wx, wy, wz and finished, computing, buffer and unfinished areas along the compute direction (axes x, y, z).]
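Each level of the decomposition splits the current block along some axes into near-equal pieces. Below is a minimal sketch of a 1D splitting rule applied level by level. The grid size and split factors are hypothetical stand-ins for Mx, My (MPI), By, Bz (CG blocking) and Cy, Cz (Athread).

```python
def split(n: int, parts: int) -> list[tuple[int, int]]:
    """Split range [0, n) into `parts` contiguous pieces differing by at most 1."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def block_shape(shape, factors):
    """Largest block extent per axis after splitting `shape` by `factors`."""
    return tuple(max(hi - lo for lo, hi in split(n, f)) for n, f in zip(shape, factors))

grid = (1000, 800, 300)                # hypothetical (x, y, z) grid points
proc = block_shape(grid, (10, 8, 1))   # level 1: MPI over x and y (Mx=10, My=8)
cg = block_shape(proc, (1, 4, 3))      # level 2: CG blocks over y and z
cpe = block_shape(cg, (1, 8, 8))       # level 3: 64 CPEs over y and z
print(proc, cg, cpe)
```

The innermost block is what a CPE works on, so its size has to be chosen together with the 64 KB LDM budget. The LDM buffering scheme in panel (4) addresses exactly this constraint.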

Page 46

1. Array fusion
2. Halo exchange through register communication
3. Optimized blocking configuration guided by an analytical model
4. … memory scheme

[Figure: array fusion for the kernels dvelcx and dstrqc. Before: each stress component (xx, yy, zz, xy, xz, yz) and velocity component (u, v, w) is stored in its own array over points 1 to n. After: the components of each point are stored together (1xx 1yy 1zz 1xy 1xz 1yz, 2xx ...; 1u 1v 1w, 2u 2v 2w ...). Halo exchange panel labels: left boundary, inner part, right boundary, DMA transfer, and register communication between CPEs ID:00 to ID:63.]
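Array fusion turns several separate component arrays into one interleaved array. The components needed at one grid point then sit next to each other, and a single contiguous DMA transfer can fetch them together. Below is a sketch with plain Python lists, using the stress components from the figure. The helpers fuse and fused_index are hypothetical.

```python
COMPONENTS = ("xx", "yy", "zz", "xy", "xz", "yz")

def fuse(arrays: dict[str, list[float]]) -> list[float]:
    """Interleave per-component arrays: [1xx,1yy,...,1yz, 2xx,2yy,...]."""
    n = len(arrays[COMPONENTS[0]])
    return [arrays[c][i] for i in range(n) for c in COMPONENTS]

def fused_index(point: int, comp: str) -> int:
    """Position of component `comp` of grid point `point` in the fused array."""
    return point * len(COMPONENTS) + COMPONENTS.index(comp)

n_points = 5
separate = {c: [k * 100.0 + i for i in range(n_points)] for k, c in enumerate(COMPONENTS)}
fused = fuse(separate)
# All six components of one point now form a contiguous slice.
start = fused_index(3, "xx")
assert fused[start:start + 6] == [separate[c][3] for c in COMPONENTS]
```

With separate arrays, a point's six components need six separate transfers from six distant addresses. After fusion, one contiguous transfer suffices, which suits DMA engines that favor large contiguous blocks.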

Page 47

On-the-Fly Compression

[Figure: (a) statistics (min/max) collected from the coarse grid; (b) computation workflow on the coarse and fine grids; (c) decompress-compute-compress scheme: each CPE fetches a compressed block from host (main) memory into LDM with dma_get, decompresses it from 16 to 32 bits, performs the general 32-bit computation (a 13-point stencil, in place), compresses the result from 32 to 16 bits and writes it back with dma_put; (d) compression algorithms.]

Compression algorithms (16-bit formats derived from the IEEE754 32-bit format: sign, exp (8b), frac (23b)):
- (vel, ww0, phi, cohes, taxx, ..., taxz): IEEE754 32b-to-16b FP conversion to sign, exp (5b), frac (10b)
- (str, r1, r2, ..., r6, sigma2, yldfac): sign, exp (0-8b), frac (7-15b), with the exponent width e determined from log2(max/min) and the fraction width 15 - e
- (d1, lam, mu, qp, qs, vx1, vx2, ww): sign, frac (15b), with values normalized by min/max
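Two of these 16-bit formats can be sketched with NumPy. The IEEE754 32-bit to 16-bit conversion maps directly to float16. The min/max-based format is approximated here by scaling each value into [0, 1] with the block's min and max and storing it as a 15-bit integer code.

That normalization formula is an assumption read off the figure, and the adaptive-exponent format is not reproduced. The sample block of material-parameter-like values is made up.

```python
import numpy as np

def to_fp16(x: np.ndarray) -> np.ndarray:
    """IEEE754 32-bit -> 16-bit (1 sign, 5 exponent, 10 fraction bits)."""
    return x.astype(np.float16)

def compress_minmax(x: np.ndarray):
    """Assumed scheme: scale into [0, 1] with block min/max, keep 15 bits."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) or 1.0                  # avoid dividing by zero
    codes = np.round((x - lo) / scale * (2**15 - 1)).astype(np.uint16)
    return codes, lo, hi

def decompress_minmax(codes: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Inverse of compress_minmax, back to float32."""
    scale = (hi - lo) or 1.0
    return (codes.astype(np.float32) / (2**15 - 1) * scale + lo).astype(np.float32)

rng = np.random.default_rng(0)
block = rng.uniform(2.0e3, 4.5e3, size=1000).astype(np.float32)  # made-up block

codes, lo, hi = compress_minmax(block)
restored = decompress_minmax(codes, lo, hi)
half = to_fp16(block).astype(np.float32)

print("min/max 15-bit max error:", np.abs(restored - block).max())
print("float16 max error:", np.abs(half - block).max())
```

For a block whose values share a narrow range, the min/max code spends all 15 bits on that range. Plain float16 wastes bits on exponents the block never uses, which is why a per-variable choice of format can pay off.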

Page 48

Speedup: CPE vs. 1 MPE

[Figure: speedup (0 to 60) of the MPE, PAR, MEM and CMPR versions; labeled values 40.6, 47.8, 45.4, 28.9, 27.6, 22.9, 4.2, 39.3, 12.9, 13.1.]

Page 49

Memory Bandwidth Utilization

[Figure: DMA bandwidth (0 to 30) achieved by the PAR, MEM and CMPR versions; labeled values with utilization: 21.2 (62%), 21.2 (62%), 18.5 (54%), 18.5 (54%), 23.8 (70%), 24.8 (73%), 12.4 (36%), 26.9 (79%), 27 (79%), 27 (79%).]

Page 50

Weak Scaling

[Figure: PFLOPS (log scale, 0.6 to 18) versus number of processes (8K, 12K, 16K, 24K, 32K, 40K, 48K, 64K, 80K, 96K, 120K, 160K), with ideal curves for the linear and non-linear cases, with and without compression.]

- Linear: peak 10.7 PFlops, parallel efficiency 97.9%
- Non-linear: peak 15.2 PFlops, parallel efficiency 80.1%
- Linear+Compress: peak 14.2 PFlops, parallel efficiency 96.5%
- Non-linear+Compress: peak 18.9 PFlops, parallel efficiency 79.5%

Page 51

Strong Scaling

[Figure: four panels (Linear, Non-Linear, Linear+Compress, Non-Linear+Compress) of speedup (log scale, 1 to 22) versus number of processes (8K to 160K) for grid spacings dx = 100 m, 50 m and 16 m, against ideal scaling. Labeled parallel efficiencies: Linear 79.9%, 73.6%, 63.6%; Non-Linear 75.5%, 75.6%, 53.3%; Linear+Compress 75.8%, 72.4%, 51.2%; Non-Linear+Compress 67.5%, 67.2%, 51.7%.]
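The parallel efficiencies quoted for weak and strong scaling follow the usual definitions. In weak scaling the work per process is fixed, so efficiency compares throughput per process with the baseline. In strong scaling the problem is fixed, so efficiency compares the achieved speedup with the growth in process count. Below is a sketch with hypothetical measurements, not data from these runs.

```python
def weak_efficiency(perf_base: float, procs_base: int, perf: float, procs: int) -> float:
    """Throughput per process relative to the baseline run."""
    return (perf / procs) / (perf_base / procs_base)

def strong_efficiency(time_base: float, procs_base: int, time: float, procs: int) -> float:
    """Achieved speedup divided by ideal speedup for a fixed problem size."""
    speedup = time_base / time
    return speedup / (procs / procs_base)

# Hypothetical: 20x more processes and 16x faster -> 80% strong-scaling efficiency.
print(strong_efficiency(time_base=100.0, procs_base=8_000, time=6.25, procs=160_000))
# Hypothetical: 20x more processes and 18x the flop rate -> 90% weak-scaling efficiency.
print(weak_efficiency(perf_base=1.0, procs_base=8_000, perf=18.0, procs=160_000))
```

Both measures are relative to the smallest run shown. Here that is the 8K-process point, so the choice of baseline affects the quoted percentages.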

Page 52

Simulation Results: 200 m vs. 16 m

[Figure: Tangshan earthquake scenario over 116˚E to 119˚E and 38˚N to 40˚N (Beijing, Tangshan, Shunyi, Tianjin, Ninghe, Luanxian, Cangzhou, Bohai, Luannan, Wuqing). (a), (b): waveforms at Ninghe and Cangzhou over 0 to 100 s at 200 m and 16 m resolution; (c), (d): ground velocity maps (m/s, −0.2 to 0.2); (e), (f): intensity maps (5 to 11).]

Page 54

Long Term Plan

- Traditional HPC Applications (Science -> Service)
- Deep Learning Related Applications
- Sunway Micro

Page 55

Long Term Plan

- Traditional HPC Applications (Science -> Service)
- Deep Learning Related Applications
- Sunway Micro

“15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios”, Gordon Bell Prize Finalist, SC 2017.

Page 57

Long Term Plan

- Traditional HPC Applications (Science -> Service)
- Deep Learning Related Applications
- Sunway Micro

Training AlexNet with swCaffe

[Figure: average time per iteration (0 to 80) for Intel (24 cores), swBLAS (1 CG) and swDNN (1 CG), broken down into total, convolution and fully connected; labeled relative speedups: 1.0x, 2.2x, 3.5x; 1.0x, 2.7x, 4.5x; 1.0x, 7.1x, 8.5x.]

Page 59

Acknowledgements

• MOST China: major sponsors of the HPC hardware and software development
• NRCPC: vendor of the machine
• NCAR (Rich Loft, John Dennis, Allison Baker, Haiying Xu): support and advice on the CAM-SE work
• SCEC (Yifeng Cui, Steve Day, Daniel Roten, Kim Olsen, Josh Tobin, Alex Breuer and Dawei Mu): discussion and advice on the earthquake simulation work
