Architectural Considerations for Petaflops and Beyond
Bill Camp
Sandia National Labs
March 4, 2003
SOS7
Durango, CO, USA
Programming Models--A historical perspective
1948--53 Machine Language Rules
1953--1973 Single-threaded Fortran
1973--1980 Single-threaded vector Fortran
1978--1995 Shared memory parallel vector Fortran
  Directives: multi-, auto- and microtasking
1987--present Massively parallel, message-passing Fortran and C
1995--present Threads-based, shared memory parallelism
1996--present Hybrid threads + message passing
Programming Models--Some false starts
Late 80’s--early 90’s SIMD Fortran for heterogeneous problems
Mid-eighties--present Dataflow parallelism and functional programming
Mid-eighties--late eighties AI-based languages, e.g., LISP
Mid-nineties CRAFT-90 (shared memory approach to MPPs)
Early-nineties to ~2000 MPP threads
Programming Models--Observations
Shared memory programming models have never scaled well
Directives-based approaches lead to code explosion and are not effective at dealing with Amdahl’s Law
Outer-loop, distributed memory parallelism requires a “physics-centric” approach; i.e., it changed the way we think about parallelism but (largely) preserved our code base, didn’t lead to code explosion, and made it easier to marginalize the effects of Amdahl’s Law.
People will change approaches only for a huge perceived gain
A more REAListic Amdahlian Law
The actual scaled speedup is more like
S_real(N) ~ S_Amdahl(N) / [1 + f_comm x R_p/c],
where f_comm is the fraction of work devoted to communications and R_p/c is the ratio of processor speed to communications speed.
REAL Law Implications: S_real(N) / S_Amdahl(N)
Let’s consider three cases on two computers:
The two computers are identical except that one has an R_p/c of 1 and the second has communications running at only 0.05 of processor speed (an R_p/c of 20).
The three cases are f_comm = 0.01, 0.05 and 0.10.
S_real(N) / S_Amdahl(N) for each case:

Communications speed /        f_comm
processor speed           0.01   0.05   0.10
1.0  (R_p/c = 1)          0.99   0.95   0.9
0.05 (R_p/c = 20)         0.83   0.50   0.33
Bottom line:
A well-balanced architecture is nearly insensitive to communications overhead
By contrast, a system with weak communications can lose over half its power for applications in which communications is important
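As a quick check on the table, the ratio S_real(N) / S_Amdahl(N) = 1 / (1 + f_comm x R_p/c) can be evaluated directly. The sketch below is illustrative only; the function and constant names are hypothetical. It assumes the weak-communications machine has R_p/c = 20 (communications at 0.05 of processor speed), since that is the reading under which the formula gives 0.83, 0.50 and 0.33.

```python
def real_to_amdahl_ratio(f_comm, r_pc):
    """Return S_real(N) / S_Amdahl(N) = 1 / (1 + f_comm * R_p/c)."""
    return 1.0 / (1.0 + f_comm * r_pc)


# The three communication fractions considered on the slides.
F_COMM_CASES = (0.01, 0.05, 0.10)

# Assumption: R_p/c = 1 for the balanced machine, and R_p/c = 20
# (communications at 0.05 of processor speed) for the weak one.
R_PC_CASES = (1.0, 20.0)


def ratio_table():
    """Map each R_p/c value to its row of ratios, rounded to 2 places."""
    return {
        r_pc: [round(real_to_amdahl_ratio(f, r_pc), 2) for f in F_COMM_CASES]
        for r_pc in R_PC_CASES
    }


if __name__ == "__main__":
    for r_pc, row in ratio_table().items():
        print(f"R_p/c = {r_pc:5.1f}: {row}")
```

With f_comm = 0.05 the balanced machine keeps about 95% of its Amdahl speedup, while the weak-communications machine keeps only half.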
Petaflops-- Why can we get there with what we have now?
We only need 3 more spins of Moore’s Law
--Today’s 6-GF Hammer becomes a 48-GF processor by 2009
--10-Gigabit ethernet becomes 40 or 80-Gbit ethernet
--Memory capacities and prices continue to improve on current trend until 2009
--Disk technology continues on its current trajectory for 6 more years
--We use small, optical switches to give us 40--80 Gbyte/sec interconnects
Petaflops-- Why can we get there with what we have now?
We need 12,000--25,000 processors to get a peak PETAFLOP.
It will have 250--1000 TB memory
It will have several hundred petabytes disk storage
It will sustain about a half terabyte/sec I/O (more costs more)
It will have about 30 TB/sec XC BW
It will have about 5--10 PB/sec memory BW
BALANCE REMAINS ESSENTIALLY LIKE THAT IN THE RED STORM DESIGN
COST: in 2009: $100M--$250M in then-year dollars
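The sizing arithmetic behind these slides can be sketched in a few lines. This is a back-of-the-envelope illustration, not the author's calculation: the function names are hypothetical, and a "spin" of Moore's Law is assumed to mean one doubling of per-processor peak speed.

```python
import math


def projected_peak_gflops(base_gflops, doublings):
    """Peak speed after a number of doublings (assumed: one per 'spin')."""
    return base_gflops * 2 ** doublings


def processors_for_peak(target_flops, per_processor_gflops):
    """Smallest processor count whose combined peak reaches target_flops."""
    return math.ceil(target_flops / (per_processor_gflops * 1e9))


def bytes_per_flop(bytes_or_bytes_per_sec, peak_flops):
    """Balance ratio: memory size or bandwidth per unit of peak speed."""
    return bytes_or_bytes_per_sec / peak_flops


if __name__ == "__main__":
    petaflop = 1e15
    future = projected_peak_gflops(6, 3)  # 6 GF today, 3 doublings
    print("Projected processor peak (GF):", future)
    print("Processors for 1 PF peak:", processors_for_peak(petaflop, future))
    # Low end of the quoted 5--10 PB/sec memory bandwidth.
    print("Memory BW bytes/flop:", bytes_per_flop(5e15, petaflop))
    # Low end of the quoted 250--1000 TB memory.
    print("Memory bytes/flop:", bytes_per_flop(250e12, petaflop))
```

Three doublings of a 6-GF processor give 48 GF, and a peak petaflop then needs roughly 21,000 such processors, which falls inside the 12,000--25,000 range quoted above.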
Petaflops-- Design issues
It will use commodity processors with multiple cores per chip
It will run a partitioned OS based on Linux
It could have partitions with fast vector processors in a mix-and-match architecture
It won’t look like the Earth Simulator
It won’t run IA-64 based on current Intel design intent
It will probably run Power PC or HAMMER follow-ons
Petaflops-- Why not Earth Simulator?
On our codes, commodity processors are nearly as fast as the ES nodes, and they have a cost/performance advantage of 1.5--2.0 orders of magnitude
BTW, this is also true, though the difference is not as large, for the McKinley versus the Pentium-4
Example: The geometric mean of Livermore Loops on ES is only 60% faster than on a 2 GHz Pentium-4
Example: A real CTH problem is about as fast on that P-4 as it is on the ES
Why not Earth Simulator?
Amdahl’s Law
S = T_S / T_V
S = 1 / { [ pW/(sN) + (1-p)W/(s/M) ] / [ W/s ] }
S = [ p/N + M(1-p) ]^-1
Let N = M = 4,
S = 1/[ p/4 + 4(1-p) ].
Why not Earth Simulator?
Amdahl’s Law (p = vector fraction of work)
S = [ p/N + M(1-p) ]^-1
Let N = M = 4,
S = 1 / [ p/4 + 4(1-p) ].
Breakeven (S = 1) requires p/4 + 4(1-p) = 1, i.e. p = 0.8.
p must be greater than or equal to 0.8 for breakeven!
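The breakeven point follows from setting S = 1 in S = 1 / [ p/N + M(1-p) ] and solving for p, which gives p = (M - 1) / (M - 1/N). The sketch below works through that algebra; the function names are hypothetical, and N and M are read as the vector machine's speed advantage on vector work and its slowdown on scalar work relative to the commodity processor.

```python
def relative_speedup(p, n, m):
    """S = 1 / (p/N + M*(1 - p)): vector machine vs. commodity processor.

    p: vector fraction of the work
    n: factor by which the vector machine is faster on vector work
    m: factor by which it is slower on scalar work
    """
    return 1.0 / (p / n + m * (1.0 - p))


def breakeven_fraction(n, m):
    """Vector fraction p at which S = 1.

    From p/N + M*(1 - p) = 1 it follows that p = (M - 1) / (M - 1/N).
    """
    return (m - 1.0) / (m - 1.0 / n)


if __name__ == "__main__":
    p_star = breakeven_fraction(4, 4)
    print("Breakeven vector fraction:", p_star)
    for p in (0.5, 0.8, 0.9, 1.0):
        print(f"p = {p:.1f}: S = {relative_speedup(p, 4, 4):.2f}")
```

For N = M = 4 a code that is only half vectorizable runs slower on the vector machine than on the commodity processor (S below 1), which is the point the slide is making.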
Petaflops-- Why not IA-64?
Heat
Size
Complexity
Cost
High latency / low BW
Difficulty in compilability
Competition from Intel….
Processor           Peak speed      fma3d ratio   Normalized fma3d ratio
Intel Itanium II    4.0 Gflops      776           190
Intel Pentium-4     3.06 Gflops*    1038          340
IBM Power4          5.2 Gflops      1020          200
HP Alpha EV7        2.3 Gflops      1380          600
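The slides do not say how the normalized column is computed. Dividing each fma3d ratio by the processor's peak speed in Gflops reproduces the listed values to within rounding, so the sketch below uses that as an assumption; the names are hypothetical and the data are copied from the table.

```python
# (processor, peak Gflops, fma3d ratio, normalized ratio as listed)
FMA3D_TABLE = [
    ("Intel Itanium II", 4.0, 776, 190),
    ("Intel Pentium-4", 3.06, 1038, 340),
    ("IBM Power4", 5.2, 1020, 200),
    ("HP Alpha EV7", 2.3, 1380, 600),
]


def normalized_ratio(fma3d_ratio, peak_gflops):
    """Assumed normalization: benchmark ratio per Gflop of peak speed."""
    return fma3d_ratio / peak_gflops


def normalized_column():
    """Recompute the normalized column for every row of the table."""
    return {
        name: normalized_ratio(ratio, peak)
        for name, peak, ratio, _listed in FMA3D_TABLE
    }


if __name__ == "__main__":
    computed = normalized_column()
    for name, _peak, _ratio, listed in FMA3D_TABLE:
        print(f"{name:18s} computed {computed[name]:6.1f}  listed {listed}")
```

Under this reading, the processor with the lowest peak speed in the table (Alpha EV7) delivers the most fma3d performance per peak Gflop.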
The Bad News
Somewhere between a petaflop and an Exaflop, we will run the string out on this approach to computing
The Good News
- For ExaFlops computing, there is lots of potential for innovation:
New approaches:
DNA computers
New memory-centric technologies (e.g., spin computers)
(Not) quantum computers
Very low power semiconductor-based systems
The Requirements for SURE will not change!