Critical Issues at Exascale for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov 2012
Jack Dongarra, University of Tennessee, Tennessee, USA


Page 1: Critical Issues at Exascale for Algorithm and Software Design

Critical Issues at Exascale for Algorithm and Software Design

SC12, Salt Lake City, Utah, Nov 2012. Jack Dongarra, University of Tennessee, Tennessee, USA

Page 2: Critical Issues at Exascale for Algorithm and Software Design

Performance Development in Top500

[Figure: Top500 performance development, 1994-2020; log-scale performance axis from 100 Mflop/s to 1 Eflop/s, with trend lines for the N=1 and N=500 systems.]

Page 3: Critical Issues at Exascale for Algorithm and Software Design

Potential System Architecture

Systems                    | 2012 (Titan)                    | 2022                 | Difference (today vs. 2022)
---------------------------|---------------------------------|----------------------|----------------------------
System peak                | 27 Pflop/s                      | 1 Eflop/s            | O(100)
Power                      | 8.3 MW (2 Gflops/W)             | ~20 MW (50 Gflops/W) |
System memory              | 710 TB (38 * 18,688)            | 32 - 64 PB           | O(10)
Node performance           | 1,452 GF/s (1,311 + 141)        | 1.2 or 15 TF/s       | O(10) - O(100)
Node memory BW             | 232 GB/s (52 + 180)             | 2 - 4 TB/s           | O(1000)
Node concurrency           | 16 CPU cores + 2,688 CUDA cores | O(1k) or 10k         | O(100) - O(1000)
Total node interconnect BW | 8 GB/s                          | 200 - 400 GB/s       | O(10)
System size (nodes)        | 18,688                          | O(100,000) or O(1M)  | O(100) - O(1000)
Total concurrency          | 50 M                            | O(billion)           | O(1,000)
MTTI                       | unknown (??)                    | O(<1 day)            | - O(10)
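As a quick sanity check on the power row of this table: 1 Eflop/s inside a ~20 MW envelope implies about 50 Gflops/W, roughly 25x Titan's ~2 Gflops/W. A minimal arithmetic sketch (not from the talk; the input numbers are the slide's values):

```c
/* Back-of-the-envelope check of the efficiency target implied by the table. */
#include <stdio.h>

int main(void) {
    double exa_flops   = 1e18;  /* 1 Eflop/s target        */
    double power_watts = 20e6;  /* ~20 MW power envelope   */
    double titan_gfw   = 2.0;   /* Titan's ~2 Gflops/W     */

    double target_gfw = exa_flops / power_watts / 1e9;  /* Gflops per watt */
    printf("Required efficiency : %.0f Gflops/W\n", target_gfw);
    printf("Improvement vs Titan: %.0fx\n", target_gfw / titan_gfw);
    return 0;
}
```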

Page 4: Critical Issues at Exascale for Algorithm and Software Design

Potential System Architecture with a cap of $200M and 20 MW

(The same table as on the previous slide, now presented under the stated $200M cost and 20 MW power caps.)

Page 5: Critical Issues at Exascale for Algorithm and Software Design

Critical Issues at Peta & Exascale for Algorithm and Software Design

• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain the lower bounds on communication.
• Mixed-precision methods: roughly 2x the speed for operations and 2x for data movement (a refinement sketch follows this list).
• Autotuning: today's machines are too complicated; build "smarts" into the software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we cannot guarantee this; we understand the issues, but some of our "colleagues" have a hard time with this.
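To illustrate the mixed-precision item above: classical iterative refinement does the O(n^3) factorization in fast, low precision and recovers accuracy by computing residuals and corrections in higher precision. A minimal sketch, assuming LAPACKE and CBLAS are available (e.g., from OpenBLAS or MKL); this is illustrative, not code from the talk:

```c
/* Mixed-precision iterative refinement for Ax = b: factor once in single
 * precision, then refine the solution using double-precision residuals.
 * Sketch only; error handling omitted. */
#include <stdlib.h>
#include <math.h>
#include <cblas.h>
#include <lapacke.h>

/* A is n x n row-major (double); on return x holds the refined solution. */
void mixed_precision_solve(int n, const double *A, const double *b,
                           double *x, int max_iter, double tol)
{
    float  *As = malloc(sizeof(float)  * (size_t)n * n);  /* single-precision copy of A */
    float  *ws = malloc(sizeof(float)  * n);               /* single-precision work vector */
    double *r  = malloc(sizeof(double) * n);               /* double-precision residual    */
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);

    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];    /* demote A */
    for (int i = 0; i < n; i++)     ws[i] = (float)b[i];    /* demote b */

    /* LU factorization and first solve, both in single precision. */
    LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, As, n, ipiv);
    LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, ws, 1);
    for (int i = 0; i < n; i++) x[i] = (double)ws[i];

    for (int it = 0; it < max_iter; it++) {
        /* Residual r = b - A*x, computed in double precision. */
        for (int i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);

        double rnorm = 0.0;
        for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
        if (rnorm < tol) break;

        /* Correction solve reuses the single-precision LU factors. */
        for (int i = 0; i < n; i++) ws[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, As, n, ipiv, ws, 1);
        for (int i = 0; i < n; i++) x[i] += (double)ws[i];
    }

    free(As); free(ws); free(r); free(ipiv);
}
```

The O(n^3) work runs at single-precision speed; each refinement step only adds O(n^2) double-precision work for the residual and update.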

Page 6: Critical Issues at Exascale for Algorithm and Software Design

Major Changes to Algorithms/Software

• Must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.

Page 7: Critical Issues at Exascale for Algorithm and Software Design

Fork-Join Parallelization of LU and QR

Parallelize the update:
• Easy, and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOP count.
• Can be done efficiently with LAPACK plus a multithreaded BLAS (dgemm); a usage sketch follows below.

[Figure: execution trace, cores vs. time.]
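A minimal usage sketch of this fork-join style: one LAPACK call, with all parallelism coming from the multithreaded BLAS underneath (the trailing-matrix dgemm carries the 2/3 n^3 flops). Assumes LAPACKE linked against a threaded BLAS such as MKL or OpenBLAS; not code from the talk:

```c
/* Fork-join LU in the LAPACK style: the library forks threads inside the
 * BLAS for each trailing-matrix update and joins them at every step. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void) {
    const int n = 4096;
    double *A = malloc(sizeof(double) * (size_t)n * n);
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);

    /* Fill A with a nonsingular (diagonally dominant) test matrix. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[(size_t)i * n + j] = (i == j) ? n : 1.0 / (1.0 + i + j);

    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A, n, ipiv);
    printf("dgetrf info = %d, flops ~ (2/3)n^3 = %.2e\n",
           (int)info, 2.0 / 3.0 * (double)n * n * n);

    free(A); free(ipiv);
    return 0;
}
```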

Page 8: Critical Issues at Exascale for Algorithm and Software Design

Synchronization (in LAPACK LU)

Fork-join, bulk synchronous processing.

Allowing for delayed update, out-of-order, asynchronous, dataflow execution (a tasking sketch follows below).
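One way to express this delayed, out-of-order, dataflow execution is with task dependences, as in the schematic tile LU below (no pivoting, for illustration only). Each gemm task becomes runnable as soon as its two input tiles are ready, so updates from step k overlap with the panel work of step k+1 instead of waiting at a global barrier. This is a sketch using OpenMP task depend clauses, not PLASMA/QUARK code; the per-tile kernels are hypothetical helpers:

```c
/* Dataflow-style tile LU skeleton. Real codes (e.g., PLASMA) handle
 * pivoting and use their own runtime; the kernels below are placeholders. */
#include <omp.h>

void dgetrf_tile(int nb, double *Akk);                    /* LU of one tile     */
void dtrsm_tile (int nb, const double *Akk, double *Aij); /* triangular solve   */
void dgemm_tile (int nb, const double *Aik, const double *Akj,
                 double *Aij);                            /* Aij -= Aik * Akj   */

void tile_lu(int nt, int nb, double *A[nt][nt])  /* A[i][j] points to an nb*nb tile */
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        #pragma omp task depend(inout: A[k][k])
        dgetrf_tile(nb, A[k][k]);                          /* factor diagonal tile */

        for (int j = k + 1; j < nt; j++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[k][j])
            dtrsm_tile(nb, A[k][k], A[k][j]);              /* row of U   */
        }
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            dtrsm_tile(nb, A[k][k], A[i][k]);              /* column of L */
        }
        for (int i = k + 1; i < nt; i++)
            for (int j = k + 1; j < nt; j++) {
                #pragma omp task depend(in: A[i][k], A[k][j]) depend(inout: A[i][j])
                dgemm_tile(nb, A[i][k], A[k][j], A[i][j]); /* trailing update */
            }
    }
    /* All tasks complete at the implicit barrier ending the single region. */
}
```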

Page 9: Critical Issues at Exascale for Algorithm and Software Design

PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures

• Objectives: high utilization of each core; scaling to a large number of cores; synchronization-reducing algorithms.
• Methodology: dynamic DAG scheduling (QUARK); explicit parallelism; implicit communication; fine granularity / block data layout (a layout sketch follows below).
• Arbitrary DAGs with dynamic scheduling.

[Figure: execution traces over time, fork-join parallelism vs. DAG-scheduled parallelism.]
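The "fine granularity / block data layout" item refers to storing each small tile contiguously, so a task touches one compact memory block rather than strided columns. A minimal conversion sketch (a hypothetical helper, not PLASMA's actual routine), assuming n is a multiple of the tile size nb:

```c
/* Convert a column-major (LAPACK-style) matrix into tile/block layout:
 * tile (ti, tj) becomes one contiguous nb*nb block, itself column-major. */
#include <stdlib.h>

double *to_tile_layout(int n, int nb, const double *A /* column-major, lda = n */)
{
    int nt = n / nb;                       /* number of tile rows/columns */
    double *T = malloc(sizeof(double) * (size_t)n * n);

    for (int ti = 0; ti < nt; ti++)            /* tile row               */
        for (int tj = 0; tj < nt; tj++)        /* tile column            */
            for (int j = 0; j < nb; j++)       /* column inside the tile */
                for (int i = 0; i < nb; i++)   /* row inside the tile    */
                    T[(size_t)(tj * nt + ti) * nb * nb + j * nb + i] =
                        A[(size_t)(tj * nb + j) * n + (ti * nb + i)];
    return T;
}
```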

Page 10: Critical Issues at Exascale for Algorithm and Software Design

Communication Avoiding QR Example

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.

[Diagram: the tall matrix is split into domains D0-D3; each domain is factored with Domain_Tile_QR, producing R0-R3; the R factors are then merged pairwise in a binary reduction tree (R0 with R1, R2 with R3, then the final R).]
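The idea behind the diagram can be sketched as a TSQR-style reduction: factor each domain of a tall-skinny matrix independently, then merge the small R factors with one more QR. The sketch below, assuming LAPACKE, uses a flat one-level merge rather than the pairwise binary tree on the slide, and forms only the R factor (Q is left implicit); it is not the talk's code:

```c
/* Communication-avoiding QR (TSQR flavor) for a tall-skinny matrix.
 * A is m x n, row-major, m >> n, m divisible by ndom, mloc >= n.
 * The resulting R matches a direct QR of A up to sign conventions. */
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

int tsqr_r_factor(int m, int n, int ndom, const double *A, double *R_out)
{
    int mloc = m / ndom;                                  /* rows per domain   */
    double *work = malloc(sizeof(double) * (size_t)mloc * n);
    double *tau  = malloc(sizeof(double) * n);
    double *Rs   = calloc((size_t)ndom * n * n, sizeof(double)); /* stacked Rs */

    for (int d = 0; d < ndom; d++) {
        /* Local QR of domain d (the "Domain_Tile_QR" step). */
        memcpy(work, A + (size_t)d * mloc * n, sizeof(double) * (size_t)mloc * n);
        LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, mloc, n, work, n, tau);

        /* Copy the upper-triangular R of domain d into the stack. */
        for (int i = 0; i < n; i++)
            for (int j = i; j < n; j++)
                Rs[((size_t)d * n + i) * n + j] = work[(size_t)i * n + j];
    }

    /* One more QR merges the ndom small R factors into the final R. */
    LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, ndom * n, n, Rs, n, tau);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            R_out[(size_t)i * n + j] = (j >= i) ? Rs[(size_t)i * n + j] : 0.0;

    free(work); free(tau); free(Rs);
    return 0;
}
```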

Pages 11-14: Communication Avoiding QR Example (continued). These slides repeat the diagram and citation from the previous page as successive build steps of the reduction tree.

Page 15: Critical Issues at Exascale for Algorithm and Software Design

PowerPack 2.0

The PowerPack platform consists of software and hardware instrumentation. Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Page 16: Critical Issues at Exascale for Algorithm and Software Design

Power for QR Factorization

Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS; the matrix is very tall and skinny (m x n = 1,152,000 by 288).

[Figure: power traces over time for four QR variants:]
• PLASMA's communication-reducing QR factorization (DAG based)
• MKL's QR factorization (fork-join based)
• LAPACK's QR factorization (fork-join based)
• PLASMA's conventional QR factorization (DAG based)