1
Hardware Aware Programming
Exploiting the memory hierarchy and parallel multicore processors
Lehrstuhl für Informatik 10 (Systemsimulation)
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Canberra, July 2008
U. Rüde (LSS Erlangen, [email protected]), joint work with
J. Götz, M. Stürmer, K. Iglberger, S. Donath, C. Feichtinger, T. Preclik, T. Gradl, C. Freundl, H. Köstler, T. Pohl, D. Ritter, D. Bartuschat, P. Neumann,
G. Wellein, G. Hager, T. Zeiser, J. Habich (RRZE), N. Thürey (ETH Zürich)
2
Overview
- To PetaScale and Beyond
- Optimizing Memory Access and Cache-Aware Programming
- Massively Parallel Multigrid: Performance Results
- MultiCore Architectures
- Case study: Lattice Boltzmann Methods for Flow Simulation on the PlayStation
- Conclusions
3
Part I
Towards PetaScale and Beyond
4
HHG Motivation I: Structured vs. Unstructured Grids
[Bar chart: MFlops rates for matrix-vector multiplication on one node of the Hitachi SR 8000 (y-axis 0-8000 MFlops), JDS sparse-matrix format versus stencil-based structured format, for problem sizes 729; 4,913; 35,937; 274,625; 2,146,689]
Structured versus sparse matrix: many emerging architectures have similar properties (Cell, GPU).
Extinct dinosaur HLRB-I: Hitachi SR 8000, No. 5 in the TOP 500 in 2000, 2 TFlops.
5
HHG Motivation II: The DiMe Project
Started jointly with Linda Stals in 1996 in Augsburg!
Cache optimizations for sparse matrix/stencil codes (1996-2007): efficient hardware-optimized
- multigrid solvers
- lattice Boltzmann CFD with free-surface flow and fluid-structure interaction
www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/
DiMe: Data-Local Iterative Methods for the Efficient Solution of Partial Differential Equations (1996-2007)
6
Evolution of Semiconductor Technology
The International Technology Roadmap for Semiconductors (ITRS) collects trends in semiconductor technology; see http://www.itrs.net/reports.html
7
Where does Computer Architecture Go?
Computer architects have capitulated: it may no longer be possible to exploit progress in semiconductor technology for automatic performance improvements.
Even today a single-core CPU is a highly parallel system: superscalar execution, a complex pipeline, and additional tricks. This internal parallelism is a major reason for the performance increases up to now, but there is a limited amount of parallelism that can be exploited automatically.
Multi-core systems concede the architects' defeat:
- Architects can no longer build faster single-core CPUs, even given more transistors.
- Clock rates increase only slowly (due to power considerations).
Therefore architects have started to put several cores on a chip: programmers must use them directly.
8
What are the consequences?
For application developers, "the free lunch is over": without explicitly parallel algorithms, the performance potential cannot be exploited any more.
For HPC: CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores - maybe sooner than we are ready for it. We will have to deal with systems with millions of cores.
9
Memory access as a major bottleneck
25+ years ago, Telefunken TR440: 16,000 words of memory. The memory fills one rack of about 0.8 m × 2 m with 8 × 20 drawers, each holding 100 data cards.
Today, HLRB-II (Altix 4700): 5 × 10^12 words. At the same density, the memory fills a rack 2 m high reaching roughly from the earth to the moon - or, better organized, a rack system 500 m wide (500 rows of racks), 2 m high (20 drawers of 100 cards each), and 500 km long (5,000,000 columns).
10
Part II
Optimizing Memory Access andCache-Aware Programming
11
Increasing single-CPU performance by optimizing data locality
Caches work due to the locality of memory accesses (instructions + data).
(Numerically intensive) codes should exhibit:
- Spatial locality: data items accessed within a short time period are located close to each other in memory
- Temporal locality: data that has been accessed recently is likely to be accessed again in the near future
Goal: increase spatial and temporal locality in order to enhance cache utilization (cache-aware programming)
12
Cache performance optimizations
- Data layout optimizations: change the data layout in memory to enhance spatial locality
- Data access optimizations: change the order of data accesses to enhance spatial and temporal locality
These transformations preserve numerical results, and their introduction can (theoretically) be automated!
13
Data access optimizations: Loop fusion
Example: red/black Gauss-Seidel iteration in 2D
14
Data access optimizations: Loop fusion (cont'd)
Code before applying the loop fusion technique (standard implementation with efficient loop ordering; Fortran semantics, i.e. column-major order):

for it = 1 to numIter do
  // Red nodes
  for i = 1 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for
15
Data access optimizations: Loop fusion (cont'd)

  // Black nodes
  for i = 1 to n-1 do
    for j = 1+i%2 to n-1 by 2 do
      relax(u(j,i))
    end for
  end for
end for

This requires two sweeps through the whole data set per single GS iteration!
16
Data access optimizations: Loop fusion (cont'd)
How the fusion technique works:
17
Data access optimizations: Loop fusion (cont'd)
Code after applying the loop fusion technique:

for it = 1 to numIter do
  // Update red nodes in first grid row
  for j = 1 to n-1 by 2 do
    relax(u(j,1))
  end for
18
Data access optimizations: Loop fusion (cont'd)

  // Update red and black nodes in pairs
  // (i starts at 2: the red nodes of row 1 were already updated above)
  for i = 2 to n-1 do
    for j = 1+(i+1)%2 to n-1 by 2 do
      relax(u(j,i))      // red node in row i
      relax(u(j,i-1))    // black node in row i-1
    end for
  end for
19
Data access optimizations: Loop fusion (cont'd)

  // Update black nodes in last grid row
  for j = 2 to n-1 by 2 do
    relax(u(j,n-1))
  end for
end for

The solution vector u passes through the cache only once instead of twice per GS iteration!
20
Data access optimizations: Loop split
- The inverse transformation of loop fusion
- Divide the work of one loop into two loops to make each loop body less complicated
- Leverage compiler optimizations
- Enhance instruction cache utilization
21
Data access optimizations: Loop blocking
- Loop blocking = loop tiling
- Divide the data set into subsets (blocks) that are small enough to fit in cache
- Perform as much work as possible on the data in cache before moving on to the next block
- This is not always easy to accomplish because of data dependencies
22
Data access optimizations: Loop blocking
Example: 1D blocking for red/black GS - respect the data dependencies!
23
Data access optimizations: Loop blocking
Code after applying the 1D blocking technique (B = number of GS iterations to be blocked/combined):

for it = 1 to numIter/B do
  // Special handling: rows 1, …, 2B-1
  // Not shown here
24
Data access optimizations: Loop blocking

  // Inner part of the 2D grid
  for k = 2*B to n-1 do
    for i = k to k-2*B+1 by -2 do
      for j = 1+(k+1)%2 to n-1 by 2 do
        relax(u(j,i))
        relax(u(j,i-1))
      end for
    end for
  end for
25
Data access optimizations: Loop blocking

  // Special handling: rows n-2B+1, …, n-1
  // Not shown here
end for

Result: data is loaded into the cache only once per B Gauss-Seidel iterations, provided that 2*B+2 grid rows fit in the cache simultaneously. If grid rows are too large, 2D blocking can be applied.
26
Data access optimizations: Loop blocking
More complicated blocking schemes exist. Illustration: 2D square blocking.
27
Part III
Towards Scalable FE Software
28
Multigrid: V-Cycle
Goal: solve A_h u_h = f_h using a hierarchy of grids
[Diagram of the V-cycle:]
1. Relax on the fine grid
2. Compute the residual
3. Restrict the residual to the coarser grid
4. Solve the coarse-grid problem (by recursion)
5. Interpolate the coarse-grid correction
6. Correct the fine-grid approximation
29
Cache-optimized multigrid: the DiMEPACK library
- DFG project DiME: data-local iterative methods
- Fast algorithm + fast implementation
- Correction scheme: V-cycles, FMG
- Rectangular domains
- Constant 5-/9-point stencils
- Dirichlet/Neumann boundary conditions
http://www10.informatik.uni-erlangen.de/dime
30
V(2,2) cycle - bottom line

MFlops | For what
13     | Standard 5-pt. operator
56     | Cache optimized (loop orderings, data merging, simple blocking)
150    | Constant coeff. + skewed blocking + padding
220    | Eliminating rhs if 0 everywhere but boundary
31
Parallel High Performance FE Multigrid
Parallelize "plain vanilla" multigrid:
- partition the domain
- parallelize all operations on all grids
- use clever data structures
Do not worry (so much) about coarse grids:
- idle processors?
- short messages?
- sequential dependency in the grid hierarchy?
Why we do not use conventional domain decomposition:
- DD without a coarse grid does not scale (algorithmically) and is suboptimal for large problems / many processors
- DD with coarse grids may be as efficient as multigrid, but is as difficult to parallelize (the difficulty lies in parallelizing the coarse grid)
32
Hierarchical Hybrid Grids (HHG)
- Unstructured input grid: resolves the geometry of the problem domain
- Patch-wise regular refinement: generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
- New: modify storage formats and operations on the grid to exploit the regular substructures
Does an unstructured grid with 1,000,000,000,000 elements make sense?
HHG - ultimate parallel FE performance!
33
HHG refinement example
Input Grid
34
HHG Refinement example
Refinement Level One
35
HHG Refinement example
Refinement Level Two
36
HHG Refinement example
Structured Interior
37
HHG Refinement example
Structured Interior
38
HHG Refinement example
Edge Interior
39
HHG Refinement example
Edge Interior
40
Parallel HHG Framework - Design Goals
To realize good parallel scalability:
- Minimize latency by reducing the number of messages that must be sent
- Optimize for high-bandwidth interconnects ⇒ large messages
- Avoid local copying into MPI buffers
41
HHG for Parallelization
Use regular HHG patches for partitioning the domain
42
HHG Parallel Update Algorithm

for each vertex do
  apply operation to vertex
end for
// update vertex primary dependencies

for each edge do
  copy from vertex interior
  apply operation to edge
  copy to vertex halo
end for
// update edge primary dependencies

for each element do
  copy from edge/vertex interiors
  apply operation to element
  copy to edge/vertex halos
end for
// update secondary dependencies
43
Towards Scalable FE Software
Performance Results
44
Node Performance is Difficult! (B. Gropp)
DiMe project: cache-aware multigrid (1996-...)

Performance of a 3D multigrid smoother for a 7-point stencil, in MFlops on Itanium 1.4 GHz:

grid size    | 17^3 | 33^3 | 65^3 | 129^3 | 257^3 | 513^3
standard     | 1072 | 1344 |  715 |   677 |   490 |   579
no blocking  | 2445 | 1417 |  995 |  1065 |   849 |   819
2x blocking  | 2400 | 1913 | 1312 |  1319 |  1284 |  1282
3x blocking  | 2420 | 2389 | 2167 |  2140 |  2134 |  2049

Array padding; temporal blocking - in EPIC assembly language; software pipelining in the extreme (M. Stürmer - J. Treibig)

Node Performance is Possible!
45
Single Processor HHG Performance on Itanium for Relaxation of a Tetrahedral Finite Element Mesh
46
#Proc | #unknowns x 10^6 | Ph. 1: sec | Ph. 2: sec | Time to sol.
4     | 134.2            | 3.16       | 6.38*      | 37.9
8     | 268.4            | 3.27       | 6.67*      | 39.3
16    | 536.9            | 3.35       | 6.75*      | 40.3
32    | 1,073.7          | 3.38       | 6.80*      | 40.6
64    | 2,147.5          | 3.53       | 4.92       | 42.3
128   | 4,295.0          | 3.60       | 7.06*      | 43.2
252   | 8,455.7          | 3.87       | 7.39*      | 46.4
504   | 16,911.4         | 3.96       | 5.44       | 47.6
2040  | 68,451.0         | 4.92       | 5.60       | 59.0
3825  | 128,345.7        | 6.90       |            | 82.8
4080  | 136,902.0        | 5.68       |            |
6102  | 205,353.1        | 6.33       |            |
8152  | 273,535.7        | 7.43*      |            |
9170  | 307,694.1        | 7.75*      |            |

Parallel scalability of a scalar elliptic problem discretized by tetrahedral finite elements.
Times for 12 V(2,2) cycles on SGI Altix: Itanium 2, 1.6 GHz.
Largest problem solved to date: 3.07 × 10^11 DOFs on 9170 processors, 7.8 s per V(2,2) cycle.
B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 × 10^10 unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.
47
So what?
With scalable algorithms, well implemented, we can solve (scalar) PDEs with
- > 10 million unknowns on a desktop
- > 300 billion unknowns on a TOP-50 class machine (HLRB-II, 63 TFlops peak, 40 TByte memory)
In the future we will be able to handle
- around 2010: ≈ 5 trillion unknowns on a PetaScale machine (assuming 1 PByte of memory)
- around 2012-2015: ≈ 50 trillion unknowns on a machine delivering a petaflop for real applications (assuming 10 PByte of memory)
This is, e.g., sufficient to resolve all of the earth's atmosphere with
- 10 km grid resolution (current desktop)
- 250 m mesh (current supercomputer)
- 100 m mesh (peak-petascale system in 2010?)
- 50 m mesh (application-petascale system in 2015?)
This is a building block for many other applications.
48
Programming techniques
Seemingly conflicting goals:
- Portability/flexibility: code should run on a variety of (parallel) target platforms, including PC clusters, NUMA machines, etc.
- Efficiency: code should run as efficiently as possible on each target platform
How can this conflict be resolved?
49
Part IV
Multicore Architectures
50
The STI Cell Processor
A hybrid multicore processor based on the IBM Power architecture:
- (simplified) PowerPC core
  - runs the operating system
  - controls execution of programs
- multiple co-processors (8; on the Sony PS3 only 6 are available)
  - operate on fast, private on-chip memory
  - optimized for computation
- DMA controller copies data from/to main memory
  - multi-buffering can hide main memory latencies completely for streaming-like applications
  - loading local copies has low and known latencies
- memory with multiple channels and links can be exploited if many memory transactions are in flight
51
The STI Cell Broadband Engine
52
Cell LBM Simulations
Goal: demanding (flow) simulations at moderate cost but very fast, e.g. simulation of blood flow in an aneurysm for therapy and surgery planning
Available Cell systems:
- Blades
- PlayStation 3
53
Synergistic Processor Unit
"a very small computer of its own"
- 128 all-purpose 128-bit registers
- operates on 256 kB of Local Store (LS)
- nearly all operations are SIMD: a single scalar operation is more expensive than a SIMD operation, and only loads and stores of 16 naturally aligned bytes from/to the LS are possible
- 25.6 GFlops (single-precision fused multiply-add); rounding supports only truncation; fast double precision will be available soon
- no dynamic branch prediction, only hints in software - but around a 20-cycle branch-miss penalty
- no system calls or privileged operations
54
Memory Flow Controller
- communication interface (to the PPE and other SPEs):
  - mailboxes and signal notification
  - memory mapping of the Local Store and register file, utilized by the PPU to upload programs and control the SPU
- asynchronous data transfers (DMA):
  - LS <-> main memory, other LSes, or devices
  - 16 DMAs in flight
  - list transfers possible (scatter/gather)
  - only naturally aligned transfers of 1, 2, 4, 8, or n·16 bytes
  - usually multiple transfers on multiple MFCs are necessary to saturate main memory bandwidth
- all interaction with the SPU goes through the channel interface
55
Programming the Cell-BE
The hard way:
- control SPEs using management libraries
- issue DMAs via language extensions
- do address calculations manually
- exchange main memory addresses, array sizes, etc.
- synchronize using mailboxes, signals, or libraries
Frameworks:
- Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM
- RapidMind SDK
- accelerated libraries
Single-source compiler:
- IBM's xlc-cbe-sse is in alpha stage, uses OpenMP
56
Naive SPU implementation: A[] = A[]*c

volatile vector float ls_buffer[8] __attribute__((aligned(128)));

void scale( unsigned long long gs_buffer,   // main memory address of vector
            int number_of_chunks,           // number of chunks of 32 floats
            float factor ) {                // scaling factor
  vector float v_fac = spu_splats(factor);  // create SIMD vector with all
                                            // four elements being factor
  for ( int i = 0 ; i < number_of_chunks ; ++i ) {
    mfc_get( ls_buffer , gs_buffer , 128 , 0 ,0,0);    // DMA reading i-th chunk
    mfc_write_tag_mask( 1 << 0 );                      // wait for DMA...
    mfc_read_tag_status_all();                         // ...to complete
    for ( int j = 0 ; j < 8 ; ++j )
      ls_buffer[j] = spu_mul( ls_buffer[j] , v_fac );  // scale local copy using SIMD
    mfc_put( ls_buffer , gs_buffer , 128 , 0 ,0,0);    // DMA writing i-th chunk
    mfc_write_tag_mask( 1 << 0 );                      // wait for DMA...
    mfc_read_tag_status_all();                         // ...to complete
    gs_buffer += 128;                                  // advance main memory pointer
  }
}
![Page 57: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/57.jpg)
57
Remove latencies using multi-buffering
```c
volatile vector float ls_buffer[3][8] __attribute__((aligned(128)));
// ...

mfc_get( ls_buffer[0], gs_buffer, 128, 0, 0, 0 );   // request first chunk
for ( int i = 0; i < number_of_chunks; ++i ) {
    int cur  = ( i ) % 3; // buffer no. and DMA tag for the i-th chunk
    int next = (i+1) % 3; //   "    for the (i-2)-th and (i+1)-th chunks
    if ( i < number_of_chunks-1 ) {
        mfc_write_tag_mask( 1 << next );   // make sure the (i-2)-th chunk...
        mfc_read_tag_status_all();         // ...has been stored
        mfc_get( ls_buffer[next], gs_buffer+128, 128, next, 0, 0 ); // request (i+1)-th chunk
    }
    mfc_write_tag_mask( 1 << cur );        // wait until the i-th chunk...
    mfc_read_tag_status_all();             // ...is available
    for ( int j = 0; j < 8; ++j )
        ls_buffer[cur][j] = spu_mul( ls_buffer[cur][j], v_fac );
    mfc_put( ls_buffer[cur], gs_buffer, 128, cur, 0, 0 ); // store i-th chunk
    gs_buffer += 128;
}
mfc_write_tag_mask( 1 | 2 | 4 );   // wait for any...
mfc_read_tag_status_all();         // ...outstanding DMA
```
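Stripped of the SPU specifics, the buffer rotation in the slide above can be sketched in portable C. Here `memcpy` stands in for `mfc_get`/`mfc_put`, so the actual benefit (overlapping transfer and compute) is lost, but the chunked indexing over a small set of local buffers is the same. `scale_buffered` and `CHUNK` are illustrative names.

```c
#include <string.h>

#define CHUNK 32  /* floats per buffer, mirroring the 128-byte chunks above */

/* Portable sketch of the buffered scaling loop: each chunk is copied into
 * a small local buffer (the "Local Store"), scaled there, and copied back. */
static void scale_buffered(float *gs, int number_of_chunks, float factor) {
    float ls[3][CHUNK];                          /* three rotating buffers */
    for (int i = 0; i < number_of_chunks; ++i) {
        int cur = i % 3;                                  /* buffer for chunk i */
        memcpy(ls[cur], gs + i * CHUNK, sizeof ls[cur]);  /* "mfc_get"  */
        for (int j = 0; j < CHUNK; ++j)
            ls[cur][j] *= factor;                         /* compute on local copy */
        memcpy(gs + i * CHUNK, ls[cur], sizeof ls[cur]);  /* "mfc_put"  */
    }
}
```

On the SPU the two copies become asynchronous DMAs on distinct tags, which is what lets chunk i+1 arrive while chunk i is being scaled.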
![Page 58: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/58.jpg)
58
Part V
Case study: Lattice Boltzmann Methods for Flow Simulation on the Play Station
![Page 59: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/59.jpg)
59
Example: OpenMP-parallel Flow Animation
Resolution: 880×880×336 (260M cells, 6.5M active on average)
![Page 60: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/60.jpg)
60
Simulation of Metal Foams
Joint work with C. Körner, WTM Erlangen
![Page 61: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/61.jpg)
61
Aneurysms
• Aneurysms are local dilatations of blood vessels
• Localized mostly at large arteries in soft tissue (e.g. aorta, brain vessels)
• Can be diagnosed by modern imaging techniques (e.g. MRT, DSA)
• Can be treated e.g. by clipping or coiling
![Page 62: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/62.jpg)
62
A data structure for simulating flow in blood vessels
• In a brain geometry only about 3-10% of the nodes are fluid
• We use a domain decomposition into equally sized blocks, so-called patches, and allocate only patches containing fluid cells
• This reduces the memory requirements and the computational time significantly
• For the Cell processor we use patches of size 8×8×8, fitting into the SPU Local Store
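The patch idea above can be sketched in a few lines: a patch is allocated only if its block of the domain contains at least one fluid cell; obstacle-only blocks stay unallocated. This is an illustrative sketch, not the actual data structure used in the solver; `patch_t` and `allocate_fluid_patches` are hypothetical names.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PATCH 8  /* 8x8x8 cells per patch, sized to fit the SPU Local Store */

/* One patch holds the cell data of an 8x8x8 block of the domain. */
typedef struct { float cells[PATCH][PATCH][PATCH]; } patch_t;

/* Allocate only those patches whose block contains fluid; the rest stay
 * NULL, so sparse geometries (3-10% fluid) cost a fraction of the memory. */
static int allocate_fluid_patches(patch_t **grid, const bool *has_fluid, int n) {
    int allocated = 0;
    for (int p = 0; p < n; ++p) {
        grid[p] = has_fluid[p] ? calloc(1, sizeof(patch_t)) : NULL;
        if (grid[p]) ++allocated;
    }
    return allocated;
}
```

During the update sweep, NULL entries are simply skipped, which is where the savings in computational time come from.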
![Page 63: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/63.jpg)
63
Results
Velocity near the wall in an aneurysm
Oscillatory shear stress near the wall in an aneurysm
![Page 64: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/64.jpg)
64
Pulsating Blood Flow in an Aneurysm
Collaboration between Neuro-Radiology (Prof. Dörfler, Dr. Richter) and Computer Science
[Diagram: dataset, imaging, simulation, CFD]
![Page 65: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/65.jpg)
65
LBM Optimized for Cell
• memory layout
  • optimized for DMA transfers
  • information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange
• code optimization
  • kernels hand-optimized in assembly code
  • SIMD-vectorized streaming and collision
  • branch-free handling of bounce-back boundary conditions
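Branch-free bounce-back works by selecting per distribution value between the streamed value and the reflected one via a precomputed bitmask, the way the SPU's select intrinsic does, so the inner loop has no data-dependent branch. Below is a portable scalar sketch of that selection; `select_f` is an illustrative name, standing in for the SIMD select on the SPU.

```c
#include <stdint.h>
#include <string.h>

/* Branch-free select: mask is all-ones at obstacle cells (take the
 * bounced-back value) and all-zeros at fluid cells (take the streamed
 * value). Mimics a SIMD select; no branch in the inner loop. */
static float select_f(float streamed, float bounced, uint32_t mask) {
    uint32_t a, b, r;
    memcpy(&a, &streamed, sizeof a);
    memcpy(&b, &bounced,  sizeof b);
    r = (a & ~mask) | (b & mask);
    float out;
    memcpy(&out, &r, sizeof out);
    return out;
}
```

The masks are computed once from the obstacle geometry, so the per-cell cost is a handful of logic operations regardless of how the fluid/obstacle pattern looks.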
![Page 66: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/66.jpg)
66
Performance Results
[Bar chart, MLUP/s: straight-forward C code vs. SIMD-optimized assembly on Xeon 5160, PPE, and SPE*; measured values 2.0, 4.8, 10.4, and 49.0]
*on Local Store without DMA transfers
![Page 67: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/67.jpg)
67
Performance Results
[Bar chart: throughput for 1 to 6 SPEs: 42, 81, 93, 94, 94, 95]
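Reading the chart above as roughly 42 (presumably MLUP/s) on one SPE rising to 95 on six, speedup and parallel efficiency follow directly; the plateau from three SPEs on is consistent with the memory bandwidth saturation discussed earlier in the talk. A minimal helper to make that reading explicit:

```c
/* Parallel efficiency from measured throughput:
 * speedup = perf(p) / perf(1), efficiency = speedup / p. */
static double efficiency(double perf_p, double perf_1, int p) {
    return perf_p / perf_1 / (double)p;
}
```

With the values above, efficiency(95, 42, 6) is about 0.38: beyond three SPEs, added cores mostly wait for memory.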
![Page 68: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/68.jpg)
68
Performance Results
[Bar chart, MLUP/s: 1 core vs. 1 CPU on Xeon 5160* and Playstation 3; measured values 9.1, 11.7, 21.1, and 43.8]
*performance optimized code by LB-DC
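A back-of-the-envelope bandwidth bound puts such numbers in perspective: a memory-bound LBM kernel cannot update more sites per second than the memory bandwidth divided by the bytes moved per update. The numbers in the sketch below are illustrative assumptions, not measurements from this talk: a D3Q19 lattice in single precision reads and writes 19 values of 4 bytes each (152 bytes per update), and the Cell's XDR memory has a nominal 25.6 GB/s.

```c
/* Upper bound on site updates for a memory-bound kernel, in MLUP/s:
 * bandwidth divided by the bytes transferred per lattice-site update.
 * Ignores boundary handling and any extra traffic for obstacle data. */
static double mlups_bound(double bandwidth_gb_s, double bytes_per_update) {
    return bandwidth_gb_s * 1e9 / bytes_per_update / 1e6;
}
```

Under these assumptions the bound is about 168 MLUP/s, so measured Cell results well below that leave headroom only insofar as the code falls short of streaming at full bandwidth.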
![Page 69: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/69.jpg)
Other work:LBM on Graphics Hardware
see also: work by Jonas Tölke and M. Krafczyk at TU Braunschweig
Master thesis by J. Habich (co-supervised with G. Wellein, RRZE Erlangen)
nVidia GeForce 8800 GTX (G80 processor): up to 250 fluid MLUP/s, after careful tuning!
69
![Page 70: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/70.jpg)
Multigrid on Cell ProcessorMaster Thesis by Daniel Ritter:
A Fast Multigrid Solver for Molecular Dynamics on the Cell Broadband Engine
• Performance limited by available memory bandwidth
• Local Store too small (?) for blocking techniques
70
![Page 71: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/71.jpg)
71
Part VI
Conclusions
![Page 72: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/72.jpg)
72
What have we learned?
• The future is parallel, on multicore CPUs
• Memory bandwidth per core will be a severe bottleneck ("inverse Moore’s law")
• Programming current leading-edge multicore architectures to exploit their performance potential requires expert knowledge of the architecture
  • better tool and system support is needed to manage the complexity of the architecture
![Page 73: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/73.jpg)
73
An HPC Tutorial! Getting Supercomputer Performance is Easy!
• If parallel efficiency is bad, choose a slower serial algorithm
  • it is probably easier to parallelize
  • and will make your speedups look much more impressive
• Introduce the “CrunchMe” variable for getting high Flops rates
  • advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start
• Introduce the “HitMe” variable to get good cache hit rates
  • advanced version: disguise HitMe within “clever data structures” that introduce a lot of overhead
• Never cite “time-to-solution”
  • who cares whether you solve a real-life problem anyway
  • it is the MachoFlops that interest the people who pay for your research
• Never waste your time trying to use a complicated algorithm in parallel (such as multigrid)
  • the more primitive the algorithm, the easier it is to maximize your MachoFlops
![Page 74: Hardware Aware Programming - FAU...1 Hardware Aware Programming Exploiting the memory hierarchy and parallel multicore processors Lehrstuhl für Informatik 10 (Systemsimulation) Universität](https://reader034.vdocument.in/reader034/viewer/2022050407/5f8440d6e2627d120f429cf3/html5/thumbnails/74.jpg)
74
Talk is Over
Questions?