Critical Issues at Exascale for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov 2012
Jack Dongarra, University of Tennessee, Tennessee, USA


Page 1: Critical Issues at Exascale for Algorithm and Software Design

Critical Issues at Exascale for Algorithm and Software Design

SC12, Salt Lake City, Utah, Nov 2012. Jack Dongarra, University of Tennessee, Tennessee, USA

Page 2: Critical Issues at Exascale for Algorithm and Software Design

Performance Development in Top500

[Figure: Top500 performance development, 1994-2020; log-scale performance axis from 100 Mflop/s to 1 Eflop/s, with trend lines for the N=1 and N=500 systems.]

Page 3: Critical Issues at Exascale for Algorithm and Software Design

Potential System Architecture

Systems                    | 2012 (Titan)                    | 2022                 | Difference (today vs. 2022)
---------------------------|---------------------------------|----------------------|----------------------------
System peak                | 27 Pflop/s                      | 1 Eflop/s            | O(100)
Power                      | 8.3 MW (2 Gflops/W)             | ~20 MW (50 Gflops/W) |
System memory              | 710 TB (38 * 18,688)            | 32 - 64 PB           | O(10)
Node performance           | 1,452 GF/s (1,311 + 141)        | 1.2 or 15 TF/s       | O(10) - O(100)
Node memory BW             | 232 GB/s (52 + 180)             | 2 - 4 TB/s           | O(1000)
Node concurrency           | 16 CPU cores + 2,688 CUDA cores | O(1k) or 10k         | O(100) - O(1000)
Total node interconnect BW | 8 GB/s                          | 200 - 400 GB/s       | O(10)
System size (nodes)        | 18,688                          | O(100,000) or O(1M)  | O(100) - O(1000)
Total concurrency          | 50 M                            | O(billion)           | O(1,000)
MTTI                       | unknown (??)                    | O(<1 day)            | - O(10)
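As a quick sanity check on the power row of this table: 1 Eflop/s inside a ~20 MW envelope implies about 50 Gflops/W, roughly 25x Titan's ~2 Gflops/W. A minimal arithmetic sketch (not from the talk; the input numbers are the slide's values):

```c
/* Back-of-the-envelope check of the efficiency target implied by the table. */
#include <stdio.h>

int main(void) {
    double exa_flops   = 1e18;  /* 1 Eflop/s target        */
    double power_watts = 20e6;  /* ~20 MW power envelope   */
    double titan_gfw   = 2.0;   /* Titan's ~2 Gflops/W     */

    double target_gfw = exa_flops / power_watts / 1e9;  /* Gflops per watt */
    printf("Required efficiency : %.0f Gflops/W\n", target_gfw);
    printf("Improvement vs Titan: %.0fx\n", target_gfw / titan_gfw);
    return 0;
}
```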

Page 4: Critical Issues at Exascale for Algorithm and Software Design

Potential System Architecture with a cap of $200M and 20 MW

(The same table as on the previous slide, now presented under the stated $200M cost and 20 MW power caps.)

Page 5: Critical Issues at Exascale for Algorithm and Software Design

Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain the lower bounds on communication.
• Mixed-precision methods: roughly 2x the speed for operations and 2x for data movement (a refinement sketch follows this list).
• Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we cannot guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
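To illustrate the mixed-precision item above: classical iterative refinement does the O(n^3) factorization in fast, low precision and recovers accuracy by computing residuals and corrections in higher precision. A minimal sketch, assuming LAPACKE and CBLAS are available (e.g., from OpenBLAS or MKL); this is illustrative, not code from the talk:

```c
/* Mixed-precision iterative refinement for Ax = b: factor once in single
 * precision, then refine the solution using double-precision residuals.
 * Sketch only; error handling omitted. */
#include <stdlib.h>
#include <math.h>
#include <cblas.h>
#include <lapacke.h>

/* A is n x n row-major (double); on return x holds the refined solution. */
void mixed_precision_solve(int n, const double *A, const double *b,
                           double *x, int max_iter, double tol)
{
    float  *As = malloc(sizeof(float)  * (size_t)n * n);  /* single-precision copy of A */
    float  *ws = malloc(sizeof(float)  * n);               /* single-precision work vector */
    double *r  = malloc(sizeof(double) * n);               /* double-precision residual    */
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];    /* demote A */
    for (int i = 0; i < n; i++)     ws[i] = (float)b[i];    /* demote b */

    /* LU factorization and first solve, both in single precision. */
    LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, As, n, ipiv);
    LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, ws, 1);
    for (int i = 0; i < n; i++) x[i] = (double)ws[i];

    for (int it = 0; it < max_iter; it++) {
        /* Residual r = b - A*x, computed in double precision. */
        for (int i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);

        double rnorm = 0.0;
        for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
        if (rnorm < tol) break;

        /* Correction solve reuses the single-precision LU factors. */
        for (int i = 0; i < n; i++) ws[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, ws, 1);
        for (int i = 0; i < n; i++) x[i] += (double)ws[i];
    }

    free(As); free(ws); free(r); free(ipiv);
}
```

The O(n^3) work runs at single-precision speed; each refinement step only adds O(n^2) double-precision work for the residual and update.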

Page 6: Critical Issues at Exascale for Algorithm and Software Design

Major Changes to Algorithms/Software

• Must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.

Page 7: Critical Issues at Exascale for Algorithm and Software Design

Fork-Join Parallelization of LU and QR

Parallelize the update:
• Easy, and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOP count.
• Can be done efficiently with LAPACK plus a multithreaded BLAS (dgemm); a usage sketch follows below.

[Figure: execution trace, cores vs. time.]
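A minimal usage sketch of this fork-join style: one LAPACK call, with all parallelism coming from the multithreaded BLAS underneath (the trailing-matrix dgemm carries the 2/3 n^3 flops). Assumes LAPACKE linked against a threaded BLAS such as MKL or OpenBLAS; not code from the talk:

```c
/* Fork-join LU in the LAPACK style: the library forks threads inside the
 * BLAS for each trailing-matrix update and joins them at every step. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const int n = 4096;
    double *A = malloc(sizeof(double) * (size_t)n * n);
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);

    /* Fill A with a nonsingular (diagonally dominant) test matrix. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[(size_t)i * n + j] = (i == j) ? n : 1.0 / (1.0 + i + j);

    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, n, ipiv);
    printf("dgetrf info = %d, flops ~ (2/3)n^3 = %.2e\n",
           (int)info, 2.0 / 3.0 * (double)n * n * n);

    free(A); free(ipiv);
    return 0;
}
```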

Page 8: Critical Issues at Exascale for Algorithm and Software Design

Synchronization (in LAPACK LU)

Fork-join, bulk synchronous processing.

Allowing for delayed update, out-of-order, asynchronous, dataflow execution (a tasking sketch follows below).
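One way to express this delayed, out-of-order, dataflow execution is with task dependences, as in the schematic tile LU below (no pivoting, for illustration only). Each gemm task becomes runnable as soon as its two input tiles are ready, so updates from step k overlap with the panel work of step k+1 instead of waiting at a global barrier. This is a sketch using OpenMP task depend clauses, not PLASMA/QUARK code; the per-tile kernels are hypothetical helpers:

```c
/* Dataflow-style tile LU skeleton. Real codes (e.g., PLASMA) handle
 * pivoting and use their own runtime; the kernels below are placeholders. */
#include <omp.h>

void dgetrf_tile(int nb, double *Akk);                    /* LU of one tile     */
void dtrsm_tile (int nb, const double *Akk, double *Aij); /* triangular solve   */
void dgemm_tile (int nb, const double *Aik, const double *Akj,
                 double *Aij);                            /* Aij -= Aik * Akj   */

void tile_lu(int nt, int nb, double *A[nt][nt])  /* A[i][j] points to an nb*nb tile */
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        #pragma omp task depend(inout: A[k][k])
        dgetrf_tile(nb, A[k][k]);                          /* factor diagonal tile */

        for (int j = k + 1; j < nt; j++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[k][j])
            dtrsm_tile(nb, A[k][k], A[k][j]);              /* row of U   */
        }
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            dtrsm_tile(nb, A[k][k], A[i][k]);              /* column of L */
        }
        for (int i = k + 1; i < nt; i++)
            for (int j = k + 1; j < nt; j++) {
                #pragma omp task depend(in: A[i][k], A[k][j]) depend(inout: A[i][j])
                dgemm_tile(nb, A[i][k], A[k][j], A[i][j]); /* trailing update */
            }
    }
    /* All tasks complete at the implicit barrier ending the single region. */
}
```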

Page 9: Critical Issues at Exascale for Algorithm and Software Design

PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures

• Objectives: high utilization of each core; scaling to a large number of cores; synchronization-reducing algorithms.
• Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout (a layout sketch follows below).
• Arbitrary DAGs with dynamic scheduling.

[Figure: execution traces over time, fork-join parallelism vs. DAG-scheduled parallelism.]
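The "fine granularity / block data layout" item refers to storing each small tile contiguously, so a task touches one compact memory block rather than strided columns. A minimal conversion sketch (a hypothetical helper, not PLASMA's actual routine), assuming n is a multiple of the tile size nb:

```c
/* Convert a column-major (LAPACK-style) matrix into tile/block layout:
 * tile (ti, tj) becomes one contiguous nb*nb block, itself column-major. */
#include <stdlib.h>

double *to_tile_layout(int n, int nb, const double *A /* column-major, lda = n */)
{
    int nt = n / nb;                       /* number of tile rows/columns */
    double *T = malloc(sizeof(double) * (size_t)n * n);

    for (int ti = 0; ti < nt; ti++)            /* tile row               */
        for (int tj = 0; tj < nt; tj++)        /* tile column            */
            for (int j = 0; j < nb; j++)       /* column inside the tile */
                for (int i = 0; i < nb; i++)   /* row inside the tile    */
                    T[(size_t)(tj * nt + ti) * nb * nb + j * nb + i] =
                        A[(size_t)(tj * nb + j) * n + (ti * nb + i)];
    return T;
}
```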

Page 10: Critical Issues at Exascale for Algorithm and Software Design

Communication Avoiding QR Example

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.

[Diagram: the tall matrix is split into domains D0-D3; each domain is factored with Domain_Tile_QR, producing R0-R3; the R factors are then merged pairwise in a binary reduction tree (R0 with R1, R2 with R3, then the final R).]
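The idea behind the diagram can be sketched as a TSQR-style reduction: factor each domain of a tall-skinny matrix independently, then merge the small R factors with one more QR. The sketch below, assuming LAPACKE, uses a flat one-level merge rather than the pairwise binary tree on the slide, and forms only the R factor (Q is left implicit); it is not the talk's code:

```c
/* Communication-avoiding QR (TSQR flavor) for a tall-skinny matrix.
 * A is m x n, row-major, m >> n, m divisible by ndom, mloc >= n.
 * The resulting R matches a direct QR of A up to sign conventions. */
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

int tsqr_r_factor(int m, int n, int ndom, const double *A, double *R_out)
{
    int mloc = m / ndom;                                  /* rows per domain   */
    double *work = malloc(sizeof(double) * (size_t)mloc * n);
    double *tau  = malloc(sizeof(double) * n);
    double *Rs   = calloc((size_t)ndom * n * n, sizeof(double)); /* stacked Rs */

    for (int d = 0; d < ndom; d++) {
        /* Local QR of domain d (the "Domain_Tile_QR" step). */
        memcpy(work, A + (size_t)d * mloc * n, sizeof(double) * (size_t)mloc * n);
        LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, mloc, n, work, n, tau);

        /* Copy the upper-triangular R of domain d into the stack. */
        for (int i = 0; i < n; i++)
            for (int j = i; j < n; j++)
                Rs[((size_t)d * n + i) * n + j] = work[(size_t)i * n + j];
    }

    /* One more QR merges the ndom small R factors into the final R. */
    LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, ndom * n, n, Rs, n, tau);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            R_out[(size_t)i * n + j] = (j >= i) ? Rs[(size_t)i * n + j] : 0.0;

    free(work); free(tau); free(Rs);
    return 0;
}
```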

Pages 11-14: Communication Avoiding QR Example (continued). These slides repeat the diagram and citation from the previous page as successive build steps of the reduction tree.

Page 15: Critical Issues at Exascale for Algorithm and Software Design

PowerPack 2.0

The PowerPack platform consists of software and hardware instrumentation. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Page 16: Critical Issues at Exascale for Algorithm and Software Design

Power for QR Factorization

Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS; the matrix is very tall and skinny (m x n = 1,152,000 by 288).

[Figure: power traces over time for four QR variants:]
• PLASMA's communication-reducing QR factorization (DAG based)
• MKL's QR factorization (fork-join based)
• LAPACK's QR factorization (fork-join based)
• PLASMA's conventional QR factorization (DAG based)