
Page 1: Hardware Evolution Trends of Extreme Scale Computing

TeraFlop Embedded
PetaFlop Departmental
ExaFlop Data Center

Peter M. Kogge
McCourtney Chair in Computer Science & Engineering, Univ. of Notre Dame
IBM Fellow (retired)

CalTech AyBi 199b, April 26, 2011

Page 2: Topics

• What Is Exascale
• Moore's Law, Multicore, & Energy 101
• HPC Computing: How We Got To Today
• The DARPA Exascale Study
• Exascale Strawmen
• A Deep Dive into Memory Energies
• Rockster: The Problem Isn't Just at the High End

Materials presented here are based on an 18-month DARPA-sponsored study on "Exascale Computing" and later studies.

Page 3: Background - Computing

• Classical "Measures of Performance":
  - Flop/s: floating point operations per second
  - Bytes per Flop/s: memory capacity per unit of performance. Classical goal: 1 byte per flop/s
  - Bytes/s per Flop/s: memory bandwidth per unit of performance. Classical goal: 1 byte/s per flop/s
• Last 30 years of HPC: focus on flop/s
  - 1980s: "Gigaflop" (10^9/s) first in a single vector processor
  - 1994: "Teraflop" (10^12/s) first via thousands of microprocessors
  - 2009: "Petaflop" (10^15/s) first via several hundred thousand cores
• Let's call the technology to reach "xyz flop" "xyz scale"
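A tiny sketch of the two classical balance ratios above (the machine numbers are placeholders, not any real system):

```python
def memory_ratios(peak_flops_per_s, mem_bytes, mem_bw_bytes_per_s):
    """Classical HPC balance metrics: bytes per flop/s and bytes/s per flop/s."""
    return mem_bytes / peak_flops_per_s, mem_bw_bytes_per_s / peak_flops_per_s

# Hypothetical 1 PF/s machine with 0.3 PB of memory and 0.1 PB/s of bandwidth:
cap, bw = memory_ratios(1e15, 0.3e15, 0.1e15)
print(cap, bw)   # 0.3 bytes per flop/s and 0.1 bytes/s per flop/s: below the classical 1.0 goals
```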

Page 4: Background - Technology

• Commercial technology to date:
  - Riding Moore's Law to denser, "faster" chips
  - Always shrunk prior "XXX"-scale systems to a smaller form factor
  - Shrink, with clock speedup, enabled the next "XXX" scale
• But microprocessor clock rates flat-lined in 2004, driven by power dissipation limits
• "Exascale" is now on the horizon (Exaflop = 10^18 flops/s), but technology change is radically changing our path to get there, especially energy/power

Page 5: Exascale

• Exascale = 1,000X the capability of today
• Exascale != Exaflops, but:
  - Exascale at the data center size => Exaflops
  - Exascale at the "rack" size => Petaflops for departmental systems
  - Exascale embedded => Teraflops in a cube
• It took us 14+ years, from the 1st Petaflops workshop in 1994, through NSF studies, HTMT, HPCS, ..., to reach Petaflops in 2009
• Study questions: Can we ride silicon to Exa? What will such systems look like? Where are the challenges?

Page 6: For Those of You Who Want an Exaflop...

You should be OVERJOYED if all you will need is:
• JUST a million cores
• ONLY 1 nuclear power plant
• MINIMAL programming support

Page 7: Moore's Law, Energy 101, & Multicore

Page 8: Today's CMOS Silicon Structures

[Figure: a single transistor with multiple layers of wire, and a trench capacitor for memory; labels include cell plate Si, capacitor insulator, storage node poly, refilling poly, 2nd field oxide, and Si substrate]

"Feature size": minimum linear dimension
• Width of a transistor gate
• ½ pitch between two wires at the lowest levels of metal

Page 9: Moore's Law

• 1965, Gordon Moore: the number of transistors that can be integrated on a die would double every 18 to 24 months
  - Equivalent to feature size decreasing by a factor of √2
  - Memory density 4X every 3 years

[Chart: DRAM capacity per chip (Kb) vs. year, 1980-2010, growing 4X every 3 years from 64 Kb toward 1,000,000+ Kb as feature size shrinks from 1.6-2.4 µm to 0.13 µm, with reference points (page of text, 30 sec HDTV, 2 hrs CD audio, book, encyclopedia, human DNA, human memory); and logic transistor density (GT/cm²) projected out to 2025 at 0.1 µm and 0.07 µm features]
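A minimal sketch of the scaling rules quoted above (an idealized reading: one density doubling per 18 months, with linear feature size shrinking by √2 per doubling):

```python
def density_growth(years, doubling_period_years=1.5):
    """Relative transistor/bit density after `years`, one doubling per period."""
    return 2 ** (years / doubling_period_years)

def feature_shrink(years, doubling_period_years=1.5):
    """Relative linear feature size: each density doubling shrinks features by sqrt(2)."""
    return density_growth(years, doubling_period_years) ** -0.5

print(density_growth(3))   # ~4x in 3 years, matching the "4X every 3 years" memory rule
print(feature_shrink(3))   # ~0.5x linear feature size over the same span
```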

Page 10: Smaller Feature Sizes Also Enable Higher Clock Rates

[Chart: clock rate (MHz, log scale) vs. year, 1975-2020, showing historical clocks, the ITRS max clock rate projection (12 inverters), and feature size; historical clocks flatten at "The 3 GHz Wall"]

Page 11: Why the Clock Flattening? POWER

[Charts: watts per die and watts per square cm vs. year, 1976-2006; power density climbing from light-bulb levels past an iron toward rocket-nozzle levels. Hot, Hot, Hot!]

Page 12: CMOS Energy 101

[Figure: an RC model of a CMOS gate driving a capacitive load C, with input Vin, gate voltage Vg, and output Vo; plot of the output voltage swinging between 0 and Vdd over a clock cycle]

• Charging the load through the pull-up dissipates CV²/2 and stores CV²/2 on the capacitance
• Discharging dissipates the stored CV²/2
• One full clock cycle therefore dissipates C·V²
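A minimal sketch of the C·V² rule stated above (idealized: every cycle fully charges and discharges the load; leakage and short-circuit current are ignored, and the example node is hypothetical):

```python
def dynamic_energy_per_cycle(c_farads, vdd_volts):
    """Energy dissipated per full charge/discharge cycle of a capacitive load: C*V^2."""
    return c_farads * vdd_volts ** 2   # half in the pull-up, half in the pull-down

def switching_power(c_farads, vdd_volts, freq_hz, activity=1.0):
    """Average switching power if a fraction `activity` of cycles toggle the node."""
    return activity * dynamic_energy_per_cycle(c_farads, vdd_volts) * freq_hz

# Hypothetical 1 fF node at 1.0 V toggling every cycle at 3 GHz:
print(switching_power(1e-15, 1.0, 3e9))  # ~3e-6 W, i.e. ~3 microwatts per node
```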

Page 13: How Did CV² Improve With Time?

Assume the capacitance of a circuit scales as feature size.

[Charts: (left) feature size and Vdd vs. year, 1985-2025, for high-performance logic, low-power logic, and DRAM; (right) CV² relative to 90 nm high-performance logic vs. year, 1988-2024, for high-performance logic, low-operating-power logic, and the memory process, with annotations of roughly 330X improvement before the 90 nm point and only ~15X after it]

90 nm is picked as the breakpoint because that is when Vdd, and thus clocks, flattened.
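A minimal sketch of the scaling argument above (assumption: capacitance scales with feature size, so CV² relative to a 90 nm reference is just a feature-size ratio times a Vdd ratio squared; the node values below are illustrative, not the study's roadmap numbers):

```python
def cv2_relative(feature_nm, vdd, feature_ref_nm=90.0, vdd_ref=1.0):
    """CV^2 of a node relative to the 90 nm reference, assuming C ~ feature size."""
    return (feature_nm / feature_ref_nm) * (vdd / vdd_ref) ** 2

# Before 90 nm, both terms shrank; once Vdd flattens, only the linear term helps:
print(cv2_relative(500, 3.3))   # an older, high-voltage node: far worse than 90 nm
print(cv2_relative(32, 0.9))    # a post-90 nm node with nearly flat Vdd: only a modest gain
```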

Page 14: Loss of Clock Increase Left Brute Parallelism as the Only Performance Option

[Figure: memory accessed through "The Same Straw," comparing "Up to ~2002" (circa 2002) with "Now"]

Page 15: The Multi-core Architectural Spectrum

[Figure: block diagrams of (a) hierarchical designs, with cores sharing a cache/memory hierarchy over an interconnect and control; (b) pipelined designs, with cores chained between memories; and (c) array designs of replicated core-plus-memory tiles]

(a) Hierarchical designs: Intel Core Duo, IBM Power5, AMD Opteron, Sun Niagara, ...
(b) Pipelined designs: IBM Cell, most router chips, many video chips
(c) Array designs: Tilera, Terasys, Execube, Yukon, Intel Teraflop

And then there's "heterogeneous" multi-core.

Page 16: NVIDIA GeForce Heterogeneous Architecture (not shown: explicit graphics engines)

[Figure: a conventional computer (host processor plus host memory), a.k.a. the "Host," attached to a GPU (Graphics Processing Unit), a.k.a. the "Device." The GPU contains a kernel scheduler and many multithreaded Streaming Multiprocessors (SMs), each with multiple stream processors, local memory, and an instruction unit, connected through L2 and memory-channel interfaces to graphics memory chips, a.k.a. the "Device Global Memory."]

Page 17: Die Photos

[Die photos: AMD 12-core (hierarchical), IBM Execube (tiled), Tile64 (tiled), IBM Cell (pipelined/heterogeneous), Nvidia (heterogeneous)]

Page 18: Modern High Performance SuperComputers

Page 19: A Modern "Heavyweight" HPC System

[Pie charts:
Silicon area distribution: memory 86%, random logic 8%, processors 3%, routers 3%.
Power distribution: processors 56%, routers 33%, memory 9%, random logic 2%.
Board area distribution: white space 50%, processors 24%, memory 10%, routers 8%, random logic 8%.]

Page 20: A "Lightweight" Node Alternative

2 nodes per "Compute Card." Each node:
• A low-power compute chip
• Some memory chips
• "Nothing else"

Compute chip:
• 2 simple dual-issue cores, each with dual FPUs
• Memory controller
• Large eDRAM L3
• 3D message interface
• Collective interface
• All at a sub-GHz clock

System architecture:
• Multiple identical boards per rack
• Each board holds multiple Compute Cards
• "Nothing else"

"Packaging the Blue Gene/L supercomputer," IBM J. R&D, March/May 2005
"Blue Gene/L compute chip: Synthesis, timing, and physical design," IBM J. R&D, March/May 2005

Page 21: Heterogeneous Systems

• Mix of heavyweight masters and GPU compute engines
• GPUs: a hierarchy of cores:
  - Host microprocessors
  - Streaming multiprocessors
  - Multiple thread engines
  - Specialized function engines

http://www.nvidia.com/object/fermi_architecture.html

Page 22: Where Are We Now? A Tour through TOP500 Land

Page 23: Top500 & The LinPack Benchmark

• TOP500: www.top500.org
  - Ranking of the top 500 supercomputers in the world
  - Updated every 6 months
• Benchmark used: LinPack
  - Solves a dense system of linear equations
  - Assumes LU matrix factorization with partial pivoting
  - Key parameter: floating point ops/second
• LinPack parameters:
  - Rpeak: maximum possible # of flops per second
  - Rmax: max observed flops/sec for any size LinPack
  - Nmax: size of problem when achieving Rmax
  - N1/2: problem size when ½ of Rmax is achieved
• Key kernel: DAXPY: Y[i] = Y[i] + αX[i]
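A minimal sketch of the DAXPY kernel named above (plain NumPy; the array sizes are arbitrary, nothing like an Nmax-scale LinPack run):

```python
import numpy as np

def daxpy(alpha, x, y):
    """DAXPY: y[i] = y[i] + alpha * x[i] for all i (double-precision a*x plus y)."""
    return y + alpha * x

# Tiny illustrative run:
x = np.random.rand(8)
y = np.random.rand(8)
print(daxpy(2.5, x, y))
```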

Page 24: Overall Metrics

[Chart: TOP500 Rpeak, Rmax, and memory capacity (log scale) vs. year, 1992-2016, each with its trend line]

Page 25: What Kind of Systems Are These?

[Chart: Rpeak vs. year, ~2004-2012, with systems classified as heavyweight, lightweight, or heterogeneous, plus the Rpeak trend line; annotations mark the 1st BlueGene and Roadrunner, the 1st heterogeneous system]

Page 26: Flop Efficiency

[Chart: Rmax/Rpeak, 0.0-1.0, vs. year, 1987-2011, for heavyweight, lightweight, and heterogeneous systems]

But 80% is NOT typical of more complex applications.

Page 27: Bytes per Flop (Peak)

[Chart: bytes per flop/s (relative to Rpeak), 0.01-10 (log scale), vs. year, 1987-2011, for heavyweight, lightweight, and heterogeneous systems]

Systems are getting less memory rich.

Page 28: Bytes per Flop (Rmax)

[Chart: bytes per flop/s (relative to Rmax), 0.01-10 (log scale), vs. year, 1987-2011, for heavyweight, lightweight, and heterogeneous systems]

Even on sustained performance, heterogeneous systems are memory poor.

Page 29: How Has "Processor Count" Changed?

[Chart: total socket count and total core count (log scale) vs. year, 1992-2012]

Page 30: What About Clock Rate?

[Chart: clock rate (MHz), 10-10,000 (log scale), vs. year, 1992-2012, for heavyweight, lightweight, and heterogeneous systems]

Page 31: And Thread Level Concurrency?

[Chart: thread-level concurrency, TLC (flops per core per cycle), 0-16, vs. year, 1987-2011, for heavyweight, lightweight, and heterogeneous systems]

Page 32: Finally, What About Total Concurrency?

[Chart: total concurrency, TC, 10-10,000,000 (log scale), vs. year, 1992-2012, for heavyweight, lightweight, and heterogeneous systems]

Page 33: Total Power

[Chart: total power (MW), 0-10, vs. year, 1987-2011, for heavyweight, lightweight, and heterogeneous systems]

Page 34: Energy per Flop

[Chart: energy per flop (pJ), 10-10,000,000 (log scale), vs. year, 1987-2011, for heavyweight, lightweight, and heterogeneous systems, against the historical CMOS projection]

Page 35: The DARPA Exascale Study

http://www.darpa.mil/tcto/docs/ExaScale_Study_Initial.pdf

Page 36: The Exascale Study

Study directive: can we ride silicon to Exa by 2015?

1. Baseline today's: commodity technology, architectures, performance (Linpack)
2. Articulate scaling of potential application classes
3. Extrapolate roadmaps for "mainstream" technologies, possible offshoots of mainstream technologies, and alternative and emerging technologies
4. Use the technology roadmaps to extrapolate their use in "strawman" designs
5. Analyze results & identify "Challenges"

Page 37: Architectures Considered

• Evolutionary strawmen:
  - "Heavyweight" strawman based on commodity-derived microprocessors
  - "Lightweight" strawman based on custom microprocessors
• "Aggressive" strawman: "clean sheet of paper" CMOS silicon

In the process of adding heterogeneous.

Page 38: 1 EFlop/s "Clean Sheet of Paper" Strawman

Sizing done by "balancing" power budgets with achievable capabilities.

• 4 FPUs + RegFiles per core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s)
  - 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s)
  - 384 nodes per rack
  - 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
  - 166 MILLION cores
  - 680 MILLION FPUs
  - 3.6 PB = 0.0036 bytes/flop
  - 68 MW with aggressive assumptions

Largely due to Bill Dally, Stanford.
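A quick sanity check of the strawman's roll-up arithmetic, using only the numbers on this slide (small mismatches against the slide are its own rounding):

```python
GF_PER_CORE     = 6      # 4 FPUs per core at 1.5 GHz
CORES_PER_CHIP  = 742    # one processor chip per node
NODES_PER_GROUP = 12
GROUPS_PER_RACK = 32
RACKS           = 583
GB_PER_NODE     = 16

chip_tf   = GF_PER_CORE * CORES_PER_CHIP / 1e3      # ~4.5 TF/s per chip
group_tf  = chip_tf * NODES_PER_GROUP                # ~54 TF/s per group
rack_pf   = group_tf * GROUPS_PER_RACK / 1e3         # ~1.7 PF/s per rack
system_ef = rack_pf * RACKS / 1e3                    # ~1 EF/s for the system

nodes  = NODES_PER_GROUP * GROUPS_PER_RACK * RACKS
cores  = nodes * CORES_PER_CHIP                      # ~166 million cores
fpus   = cores * 4                                   # ~0.67 billion FPUs (slide rounds to 680M)
mem_pb = nodes * GB_PER_NODE / 1e6                   # ~3.6 PB of DRAM
print(chip_tf, group_tf, rack_pf, system_ef, cores, fpus, mem_pb)
```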

Page 39: A Single Node (No Router)

[Figure: a processor chip with "stacked" memory, shown as (a) quilt packaging and (b) through-via chip stack]

Node power distribution: leakage 28%, FPU 21%, cache access 20%, off-chip 13%, register file 11%, on-chip 6%, DRAM access 1%.

Characteristics:
• 742 cores; 4 FPUs/core
• 16 GB DRAM
• 290 watts
• 1.08 TF peak @ 1.5 GHz
• ~3000 flops per cycle

Page 40: Strawman Power Distribution

[Pie charts: (a) overall system power, (b) memory power, and (c) interconnect power, broken down among FPUs, register files, L1, L2/L3, DRAM, leakage, processor, router, network, and disk (disk ~14% and processor ~13% of overall power)]

• Assumed a 60X per-bit decrease in access energy
• Growing DRAM capacity 30X at least doubles overall memory power
• 12 nodes per group, 32 groups per rack, 583 racks
• 1 EFlops / 3.6 PB
• 166 million cores
• 67 MWatts

Page 41: Data Center Performance Projections

[Chart: projected GFlops (log scale) vs. year, 2000-2020, for evolutionary heavyweight and lightweight systems, with the Exascale level (10^9 GFlops) marked]

Projections made in mid 2007. But not at 20 MW!

Page 42: Data Center Core Parallelism

[Chart: processor parallelism (log scale) vs. year, 1996-2020, for historical data, the top 10, the top system, the Exa strawman, and evolutionary light- and heavy-node projections; the Exa strawman reaches ~170 million cores]

AND we will need 10-100X more threading for latency management.

Page 43: Data Center Total Concurrency

[Chart: total concurrency (log scale) vs. year, 1996-2020, for historical data, the top 10, the top system, the top-1 trend, the Exa strawman, and evolutionary light- and heavy-node projections; concurrency climbs from thousand-way through million-way to billion-way]

Page 44: "Business as Usual" Energy Projections

[Chart: energy per flop (pJ, log scale) vs. year, 1980-2020, for historical data, the top 10, the Green 500, UHPC targets (57 KW, 50 GF/W, 80 GF/W), the top-system trend line, CMOS technology, and the Exa heavyweight/lightweight optimistic and pessimistic projections; the technology-as-usual curve shows a "bend" when voltage stopped dropping]

Page 45: But At What Memory Capacity?

[Chart: bytes per flop-per-second, 0.001-1 (log scale), vs. year, 1993-2015, for the top 10 systems and the Exa strawman; a reference line marks 300 TB per Petaflop]

How many useful apps are down here?

Page 46: Key Take-Aways

• Developing Exascale systems will be really tough, in any time frame, for any of the 3 classes
• Evolutionary progression is at best 2020ish, with limited memory
• 4 key challenge areas:
  - Power!!!!!!!!
  - Concurrency
  - Memory capacity
  - Resiliency
• Will require coordinated, cross-disciplinary efforts

Page 47: A Deep Dive into Interconnect to Deliver Operands

Page 48: Assumed Technology Numbers

Operation                    Energy (pJ/bit)
Register file access         0.16
SRAM access                  0.23
DRAM access                  1
On-chip movement             0.0187
Through-silicon vias (TSV)   0.011
Chip-to-board                2
Chip-to-optical              10
Router on-chip               2
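A small sketch of how these per-bit numbers can be tallied along an access path. The path below is a hypothetical example (not one of the study's cases), and the operation names are just shorthand for the table rows; energies are pJ per bit, so a 64-bit word costs 64x these values.

```python
# Per-bit energies from the table above (pJ/bit).
ENERGY_PJ_PER_BIT = {
    "regfile": 0.16,
    "sram": 0.23,
    "dram": 1.0,
    "on_chip_move": 0.0187,
    "tsv": 0.011,
    "chip_to_board": 2.0,
    "chip_to_optical": 10.0,
    "router_on_chip": 2.0,
}

def path_energy_pj(steps, bits=64):
    """Total pJ to move `bits` through a sequence of (operation, count) steps."""
    return bits * sum(ENERGY_PJ_PER_BIT[op] * n for op, n in steps)

# Hypothetical path: fetch a 64-bit word from another node's DRAM, crossing a
# TSV stack, two on-chip routers, two board links, and some on-chip wiring.
remote_word = [("dram", 1), ("tsv", 1), ("router_on_chip", 2),
               ("chip_to_board", 2), ("on_chip_move", 4), ("regfile", 1)]
print(path_energy_pj(remote_word))   # pJ for this illustrative path
```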

Page 49: Aggressive Strawman Access Path: Interconnect-Driven

[Figure: a core generating a request on a microprocessor, through an on-chip router and possibly more routers, to some sort of memory structure where the actual memory access occurs]

Page 50: Tracking the Energy

Energy per access (pJ), with the percentage contributed by each component:
• L1 hit: 39 pJ (RF access 30%, tag access 28%, 32KB SRAM access 42%)
• L2/L3 hit: 385 pJ (RF access 7%, tag access 6%, SRAM access 34%, on-chip transport 53%)
• Local: 1,380 pJ (RF access 1%, tag access 2%, SRAM access 2%, DRAM access 68%, on-chip transport 26%, TSV 1%)
• Global: 13,819 pJ (router logic 24%, DRAM access 1%, on-chip transport 13%, chip-to-board 10%, chip-to-optical 52%)
• Ave read: 706 pJ (RF access 2%, router logic 19%, tag access 2%, SRAM access 4%, DRAM access 8%, on-chip transport 16%, chip-to-board 8%, chip-to-optical 41%)
• Write to L1: 39 pJ (same breakdown as an L1 hit)
• Flush L1: 891 pJ (tag access 1%, SRAM access 26%, DRAM access 32%, on-chip transport 39%, TSV 1%)

Page 51: Linpack Energy

• Linpack: the standard benchmark for TOP500 rankings
• Solve Ax = b by LU matrix decomposition & back solve
• Computationally intensive step: update rows after pivot
• Basic function: Y[j,k] = α·Y[i,k] + Y[j,k]
  - Read in C αs (C = "size" of a cache line in dwords) (remote memory)
  - Read in C Y's from row i (remote memory)
  - Repeat for all elements of each row k:
    - Read in C Y's from row k (local memory)
    - Do processing
    - Write C Y's (to cache)
    - At some time, flush C Y's back to local memory from cache
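A minimal sketch of the row-update kernel described above (plain NumPy, ignoring the cache-line blocking; the array names and sizes are illustrative):

```python
import numpy as np

def update_rows_after_pivot(Y, i, alphas):
    """For each row j below pivot row i: Y[j, :] = alphas[j] * Y[i, :] + Y[j, :]."""
    for j in range(i + 1, Y.shape[0]):
        Y[j, :] += alphas[j] * Y[i, :]   # one multiply + one add per element: 2 flops
    return Y

# Tiny illustrative run:
Y = np.random.rand(6, 6)
alphas = np.random.rand(6)
update_rows_after_pivot(Y, 0, alphas)
```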

Page 52: Resulting Energy per Flop

Step               Target    pJ       Occurrences   Total pJ   % of total
Read alphas        Remote    13,819        4          55,276      16.5%
Read pivot row     Remote    13,819        4          55,276      16.5%
Read 1st Y[i]      Local      1,380       88         121,400      36.3%
Read other Y[i]s   L1            39      264          10,425       3.1%
Write Y's          L1            39      352          13,900       4.2%
Flush Y's          Local        891       88          78,380      23.4%
Total                                                334,656
Ave per flop                                              475

And a core flop in 2015 is only 10 pJ!
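A quick re-tally of the table above (a sketch: the per-step totals are copied from the slide, and the implied flop count of roughly 334,656 / 475 ≈ 704 flops is an inference from the slide's own numbers):

```python
# Per-step energy totals (pJ) copied from the table above.
step_totals_pj = {
    "read alphas (remote)":    55_276,
    "read pivot row (remote)": 55_276,
    "read 1st Y[i] (local)":  121_400,
    "read other Y[i]s (L1)":   10_425,
    "write Y's (L1)":          13_900,
    "flush Y's (local)":       78_380,
}
total_pj = sum(step_totals_pj.values())   # ~334,657 pJ for this block of work
flops = round(total_pj / 475)             # ~704 flops implied by the slide's average
print(total_pj, total_pj / flops)         # ~475 pJ of data movement per flop,
                                          # vs ~10 pJ for the 2015-era flop itself
```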

Page 53: Rockster on a Tiled Processor

"Extreme computing" in another dimension

Page 54: The Problem: Take a Picture – Find the Rocks

• NxN pixel image (1Kx1K)
• Grey-scale pixels (8 bits)
• Intermediate "rock mask"
• Output: "rock descriptions"
• Image starts in memory (streamed from sensors)

Page 55: The Rockster Algorithm

1. Move the image from memory to the tiles
2. 5x5 Gaussian smoothing
3. 3x3 Canny edge sharpener
   - Include magnitude/angle computations per pixel
4. Horizon determination (sky fill)
   - Homogeneous regions touching the "top"
5. Edge following, including gap fix
6. Background fill to generate a binary "rock mask"
   - 1 bit/pixel: rock/not rock
   - Write the mask back to memory
7. Connected components to trace & report an "edge list" of rocks

Algorithm from JPL.

Page 56: Simplified Tiled Multicore Architecture for Study

[Figure: a P-row by Q-column grid of identical tiles, Tile(1,1) through Tile(P,Q), with the memory controller attached at Tile(1,1)]

• PxQ "tiles", all tiles identical
• Each tile has 64 KB of L2 today
• Only X-Y neighbor routing
• Memory access only from tile (1,1)

Page 57: What Does All This Mean?

Under some reasonable assumptions, with no off-tile memory references, a "compute cycle" of energy breaks down as: core logic 44%, L2 21%, L1 20%, register file 11%, instruction fetch 4%.

Each off-tile memory reference costs the equivalent of an additional 225 compute cycles of energy.
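A minimal sketch of what that rule of thumb implies for total energy, measured in units of compute-cycle energy (the workload numbers below are made-up placeholders):

```python
def energy_in_compute_cycles(compute_cycles, off_tile_refs, cost_per_ref=225):
    """Total energy, in 'compute cycles' of energy, when each off-tile memory
    reference costs the equivalent of cost_per_ref extra compute cycles."""
    return compute_cycles + cost_per_ref * off_tile_refs

# Hypothetical example: 1M compute cycles, with 1% of them touching off-tile memory.
cycles = 1_000_000
refs = cycles // 100
print(energy_in_compute_cycles(cycles, refs))  # 3.25M: data movement dominates even at 1%
```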

Page 58: How Much Energy to Move the Image?

• Assume PxP tiles
• Each tile accesses N²/P² pixels = 2^14/P² lines
• Each access: 1 read from DRAM + 1 write to L2, with transport through (i + j - 1) tiles
• Total: 3.2 + 0.15·P million compute cycles
• Note: more tiles means more energy!
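A sketch of that tally. The per-line costs of the DRAM read, the L2 write, and each tile-to-tile hop are left as symbolic parameters (the slide gives only the final total in compute cycles); the structure, summing (i + j - 1) hops over a P x P grid, follows the bullets above and reproduces the "constant plus a term linear in P" shape.

```python
def image_move_energy(P, lines_total=2**14, dram_read=1.0, l2_write=1.0, hop=1.0):
    """Energy (in whatever units the per-line costs use) to move the image onto a
    P x P tile array: each tile pulls lines_total/P**2 cache lines, and each line
    costs one DRAM read, one L2 write, and (i + j - 1) tile-to-tile hops."""
    lines_per_tile = lines_total / P**2
    total = 0.0
    for i in range(1, P + 1):
        for j in range(1, P + 1):
            total += lines_per_tile * (dram_read + l2_write + hop * (i + j - 1))
    return total

# More tiles -> more transport energy, even though each tile moves less data:
for P in (2, 4, 8, 16):
    print(P, image_move_energy(P))
```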

Page 59: Partitioning the Image onto PxQ Tiles

[Figure: (a) a subimage per tile, p = N/P rows by q = N/Q columns; (b) a slice per tile, p = N/(PQ) rows by q = N columns]

• Subimages are of size p rows of pixels by q columns of pixels
• Equivalently: pxq = N²/(PQ) pixels
• Partitioning works only if 2N²KB/(PQ) < the L2 cache size

Page 60: Ghost Cells

• Many steps need "neighboring cell values"
• Ghost cells: copies of neighboring cells from neighboring tiles
• Data exchanges cost energy

Total energy vs. P:
P             2          4          8          16           32
Subimage    149,760    299,520    599,040    1,198,080    2,396,160
Sliced       78,336     92,160    147,456      368,640    1,253,376

Explosive growth in energy as we grow the # of tiles!

Page 61: Cache Access Energy for Filters

Tiles           4         16        64        256       1024
(P)             2         4         8         16        32
Subimage     3.9E+07   3.9E+07   3.9E+07   4.0E+07   4.0E+07
Sliced       3.2E+07   3.2E+07   3.2E+07   3.2E+07   3.2E+07

• The numbers above are in units of equivalent compute cycles, summed over all tiles
• Equivalent to about 30-40 compute cycles of energy per pixel
• Remember the need to add in the ghost-cell energy

Page 62: Rockster Conclusions

• Computation is cheap energy-wise
• Accessing local memory is expensive
• Accessing DRAM global memory is huge!
• The way you partition data affects energy
• Some odd, counter-intuitive effects: some energy grows as the # of tiles increases

Page 63: New DARPA Program: UHPC

• UHPC: Ubiquitous High Performance Computing
• Foci on "Extreme Scale" (1000X) technology:
  - Power consumption
  - Cyber resiliency
  - Productivity
• Target research machines by 2018:
  - Petascale in a rack
  - Terascale module
• Challenge problems:
  - Streaming sensor data
  - A large, dynamic, graph-based informatics problem
  - A decision-class problem that encompasses search, hypothesis testing, and planning
  - 2 problems from the HPCMP or CREATE suites
• 4 teams: Intel, Nvidia, Tilera, Sandia National Labs

Page 64: Overall Conclusions

• The world has gone to multi-core to continue Moore's Law
• Pushing performance another 1000X will be tough
• The major problem is in energy
• And that energy is in memory & interconnect
  - For both scientific and embedded applications
  - And even for data search applications
• We need to begin rearchitecting to reflect this
  - Need alternative metrics to "flops"
  - DON'T MOVE THE DATA!
• We need to begin redesigning our architectures & codes for energy minimization, now!