architectural considerations for petaflops and beyond bill camp sandia national lab’s march 4,2003...

Architectural Considerations for Petaflops and beyond

Bill Camp
Sandia National Labs

March 4, 2003
SOS7

Durango, CO, USA

Programming Models: A historical perspective

1948--53 Machine language rules
1953--1973 Single-threaded Fortran
1973--1980 Single-threaded vector Fortran
1978--1995 Shared memory parallel vector Fortran (directives: multi-, auto- and microtasking)
1987--present Massively parallel, message-passing Fortran and C
1995--present Threads-based, shared memory parallelism
1996--present Hybrid threads + message passing

Programming Models: Some false starts

Late 80’s--early 90’s SIMD Fortran for heterogeneous problems

Mid-eighties--present Dataflow parallelism and Functional programming

Mid-eighties--late eighties AI-based languages, e.g., LISP

Mid-nineties: CRAFT-90 (shared memory approach to MPPs)

Early-nineties to ~2000 MPP Threads

Programming Models--Observations

Shared memory programming models have never scaled well

Directives-based approaches lead to code explosion and are not effective at dealing with Amdahl’s Law

Outer-loop, distributed memory parallelism requires a “physics-centric” approach. I.e., it changed the way we think about parallelism but (largely) preserved our code base, didn’t lead to code explosion, and made it easier to marginalize the effects of Amdahl’s Law.

People will change approaches only for a huge perceived gain

Petaflops-- can we get there with what we have now?

YES

What’s Important?

SURE:
- Scalability
- Usability
- Reliability
- Expense minimization

A more REAListic Amdahlian Law

The actual scaled speedup is more like

$S(N) \sim S_{\mathrm{Amdahl}}(N) / [\, 1 + f_{\mathrm{comm}} \times R_{p/c} \,]$,

where $f_{\mathrm{comm}}$ is the fraction of work devoted to communications and $R_{p/c}$ is the ratio of processor speed to communications speed.

REAL Law Implications

$S_{\mathrm{real}}(N) / S_{\mathrm{Amdahl}}(N)$

Let’s consider three cases on two computers:

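The two computers are identical except that one has an Rp/c of 1 and the second is listed with an Rp/c of 0.05 (the tabulated results correspond to communications running at 0.05 of processor speed, i.e., a processor-to-communications speed ratio of 20)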

The three cases are fcomm = 0.01, 0.05 and 0.10

REAL Law Implications: S(N) / SAmdahl(N)

Rp/c      fcomm = 0.01    fcomm = 0.05    fcomm = 0.10
1.0       0.99            0.95            0.9
0.05      0.83            0.50            0.33

Bottom line:

A well-balanced architecture is nearly insensitive to communications overhead

By contrast a system with weak communications can lose over half its power for applications in which communications is important
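The tabulated ratios can be reproduced directly from the correction factor 1 / (1 + fcomm x Rp/c). The following is a minimal sketch, not the author's code. It assumes that the second machine's entry of 0.05 means communications run at 0.05 of processor speed, so the processor-to-communications ratio used in the formula is 20; that reading is inferred from the table values and is not stated on the slide. The function and variable names are illustrative.

```python
def amdahl_correction(f_comm: float, r_pc: float) -> float:
    """Return S(N) / S_Amdahl(N) = 1 / (1 + f_comm * R_p/c)."""
    return 1.0 / (1.0 + f_comm * r_pc)


# The three communication fractions considered on the slide.
F_COMM = (0.01, 0.05, 0.10)

# Assumption: the weak machine's communications run at 0.05 of processor
# speed, so its processor-to-communications ratio is 20.
MACHINES = {"balanced": 1.0, "weak communications": 20.0}


def correction_table() -> dict:
    """Build the S/S_Amdahl table, rounded to two decimals."""
    return {
        name: [round(amdahl_correction(f, r_pc), 2) for f in F_COMM]
        for name, r_pc in MACHINES.items()
    }


if __name__ == "__main__":
    for name, row in correction_table().items():
        print(f"{name:22s}", row)
```

Under these assumptions the balanced machine keeps 0.99, 0.95 and 0.91 of its Amdahl speedup (the slide rounds the last value to 0.9), while the weak machine keeps 0.83, 0.50 and 0.33.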

Petaflops-- Why can we get there with what we have now?

We only need 3 more spins of Moore’s Law

-- Today’s 6-GF Hammer becomes a 48-GF processor by 2009
-- 10-Gigabit Ethernet becomes 40 or 80-Gbit Ethernet
-- Memory capacities and prices continue to improve on current trend until 2009

Disk technology continues on its current trajectory for 6 more years

We use small, optical switches to give us 40--80 Gbyte/sec interconnects
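The "spins" arithmetic is just repeated doubling. As a rough sketch, assuming one spin of Moore's Law means one doubling (three spins between 2003 and 2009 then implies about two years per spin), the processor and Ethernet figures follow from the function below. The name `after_spins` is illustrative.

```python
def after_spins(value: float, spins: int) -> float:
    """Scale a quantity by one doubling per spin of Moore's Law."""
    return value * 2 ** spins


if __name__ == "__main__":
    # 6-GF processor after three doublings.
    print(after_spins(6, 3))
    # 10-Gbit Ethernet after two or three doublings.
    print(after_spins(10, 2), after_spins(10, 3))
```

Three doublings take 6 GF to 48 GF; two or three doublings take 10 Gbit to 40 or 80 Gbit.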

Petaflops-- Why can we get there with what we have now?

We need 12,000--25,000 processors to get a peak PETAFLOP.
It will have 250--1000 TB memory
It will have several hundred petabytes disk storage
It will sustain about a half terabyte/sec I/O (more costs more)
It will have about 30 TB/sec XC BW
It will have about 5--10 PB/sec memory BW
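The sizing figures above can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions: a petaflop is taken as 10^15 flop/s, TB and PB are decimal units, and the 48-GF per-processor peak is the projected figure from the Moore's Law slide. The function names are illustrative.

```python
import math

PETAFLOP = 1.0e15  # peak flop/s target (decimal units assumed)


def processors_needed(peak_gf: float) -> int:
    """Processors required to reach one peak petaflop at peak_gf Gflops each."""
    return math.ceil(PETAFLOP / (peak_gf * 1.0e9))


def implied_peak_gf(processors: int) -> float:
    """Per-processor peak (Gflops) implied by a given processor count."""
    return PETAFLOP / processors / 1.0e9


def bytes_per_flop(bandwidth_bytes_per_s: float) -> float:
    """Bandwidth available per peak flop, a simple balance measure."""
    return bandwidth_bytes_per_s / PETAFLOP


if __name__ == "__main__":
    print(processors_needed(48.0))
    print(implied_peak_gf(12_000), implied_peak_gf(25_000))
    print(bytes_per_flop(5.0e15), bytes_per_flop(10.0e15))
    print(bytes_per_flop(30.0e12))
```

At 48 GF per processor the count comes to roughly 20,800, inside the quoted range; the ends of the range correspond to about 83 GF and 40 GF per processor. The quoted memory bandwidth works out to 5--10 bytes per peak flop.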

BALANCE REMAINS ESSENTIALLY LIKE THAT IN THE RED STORM DESIGN

COST: in 2009: $100M--$250M in then-year dollars

Petaflops-- Design issues

It will use commodity processors with multiple cores per chip
It will run a partitioned OS based on Linux
It could have partitions with fast vector processors in a mix-and-match architecture
It won’t look like the Earth Simulator
It won’t run IA-64 based on current Intel design intent
It will probably run Power PC or HAMMER follow-ons

Petaflops-- Why not Earth Simulator?

On our codes, commodity processors are nearly as fast as the ES nodes and they have a cost/performance advantage of 1.5--2.0 orders of magnitude

BTW this is also true-- but with not as huge a difference-- for the McKinley versus the Pentium-4

Example: The geometric mean of Livermore Loops on ES is only 60% faster than on a 2 GHz Pentium-4

Example: A real CTH problem is about as fast on that P-4 as it is on the ES

Petaflops-- Why not Earth Simulator?

Amdahl’s Law and the high cost of custom processors

Why not Earth Simulator?

Amdahl’s Law

$S = T_S / T_V$

$S = 1 / \{ [\, pW/(sN) + (1-p)W/(s/M) \,] / [\, W/s \,] \}$

$S = [\, p/N + M(1-p) \,]^{-1}$

Let $N = M = 4$:

$S = 1 / [\, p/4 + 4(1-p) \,]$.

Why not Earth Simulator?

Amdahl’s Law (p = vector fraction of work)

$S = [\, p/N + M(1-p) \,]^{-1}$

Let $N = M = 4$:

$S = 1 / [\, p/4 + 4(1-p) \,]$.

p must be greater than or equal to 0.8 for breakeven!
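The breakeven point follows from setting S = 1 in the formula above and solving for p, which gives p = (M - 1) / (M - 1/N). The sketch below evaluates both expressions. It reads N as the factor by which the vector processor speeds up vectorizable work and M as the factor by which it slows down scalar work relative to the commodity processor; that reading is taken from the form of the equation, and the function names are illustrative.

```python
def vector_speedup(p: float, n: float = 4.0, m: float = 4.0) -> float:
    """Speedup S = 1 / (p/N + M(1-p)) of the vector machine over the scalar one."""
    return 1.0 / (p / n + m * (1.0 - p))


def breakeven_fraction(n: float = 4.0, m: float = 4.0) -> float:
    """Vector fraction p at which S = 1, from p/N + M(1-p) = 1."""
    return (m - 1.0) / (m - 1.0 / n)


if __name__ == "__main__":
    print(breakeven_fraction())
    for p in (0.5, 0.8, 0.9, 1.0):
        print(p, vector_speedup(p))
```

With N = M = 4 the breakeven fraction is 0.8; below it the vector machine is slower than the commodity processor, and even a fully vectorized code gains only the factor N.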

Petaflops-- Why not IA-64?

- Heat
- Size
- Complexity
- Cost
- High latency / low BW
- Difficulty in compilability
- Competition from Intel….

Processor           Peak Speed      fma3d ratio    Normalized fma3d ratio
Intel Itanium II    4.0 Gflops      776            190
Intel Pentium-4     3.06 Gflops*    1038           340
IBM Power4          5.2 Gflops      1020           200
HP Alpha EV7        2.3 Gflops      1380           600
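The normalized column appears to be the fma3d ratio divided by the peak speed in Gflops, rounded to about two significant figures. That is an inference from the numbers, not something the slide states, so the check below is only a sketch under that assumption. The data are copied from the table; the names are illustrative.

```python
# (peak Gflops, fma3d ratio, normalized fma3d ratio) as listed in the table.
PROCESSORS = {
    "Intel Itanium II": (4.0, 776, 190),
    "Intel Pentium-4": (3.06, 1038, 340),
    "IBM Power4": (5.2, 1020, 200),
    "HP Alpha EV7": (2.3, 1380, 600),
}


def ratio_per_gflop(peak_gflops: float, fma3d_ratio: float) -> float:
    """Assumed normalization: fma3d ratio per Gflop of peak speed."""
    return fma3d_ratio / peak_gflops


if __name__ == "__main__":
    for name, (peak, ratio, listed) in PROCESSORS.items():
        print(f"{name:18s} computed {ratio_per_gflop(peak, ratio):6.1f}  listed {listed}")
```

Under this reading the Alpha EV7 delivers about three times as much fma3d performance per peak Gflop as the Itanium II.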

The Bad News

Somewhere between a petaflop and an Exaflop, we will run the string out on this approach to computing

The Good News

- For ExaFlops computing, there is lots of potential for innovation:

New approaches:
- DNA computers
- New memory-centric technologies (e.g., spin computers)
- (Not) quantum computers
- Very low power semiconductor-based systems

The Good News

- For ExaFlops computing, there is lots of potential for innovation:

The Requirements for SURE will not change!

The Good News

I’ll be gone fishing!

The END (almost)
