The Chances and Challenges of Parallelism · 2007-01-23 · Robert Strzodka, Stanford University, Max Planck Center
TRANSCRIPT
Robert Strzodka, Stanford University, Max Planck Center
The Chances and Challenges of Parallelism
Comparison of Hardwired (GPU) and Reconfigurable (FPGA) Devices
[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time: seconds per grid node vs. domain size in grid nodes (1000 to 1e+07); curves 1x1 CG: Opteron 250, 1x1 CG: GF7800GTX, 2x2 MG__MG: Opteron 250, 2x2 MG__MG: GF7800GTX]
[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG): number of slices vs. bits of mantissa (20 to 50); curves Adder, Multiplier, CG kernel normalized (1/30)]
2
The Chances
• GPU: 249 GFLOPS single precision, 166 GB/s internal bandwidth, 51.2 GB/s external bandwidth
• FPGA: 192 mad25x18 at 550 MHz + logic, almost unrestricted internal bandwidth, 120.0 GB/s external bandwidth (for all IO pins)
• Clearspeed: 50 GFLOPS double precision, 200.0 GB/s internal bandwidth, 6.4 GB/s external bandwidth
• Cell BE: 230 GFLOPS single precision, 21 GFLOPS double precision, 204.8 GB/s internal bandwidth, 25.6 GB/s external bandwidth
3
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
4
Instruction-Stream-Based Processing
[Diagram: a processor receives an instruction stream through a cache while data moves between memory and the processor on both sides; the instructions are the software, the processor is the hardware]
5
Data-Stream-Based Processing
[Diagram: data streams from memory through a configurable pipeline back to memory; the data streams are the flowware, the pipeline configuration is the configware, the device is hardware/morphware]
Nomenclature from [Reiner Hartenstein. Data-stream-based computing: Models and architectural resources, MIDEM 2003]
6
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
7
PDE Example: The Poisson Problem

Given a function $b: \Omega \to \mathbb{R}$, find $u: \Omega \to \mathbb{R}$ such that
$-\Delta u = b$ inside the domain $\Omega$, and $u = 0$ on the boundary $\partial\Omega$.

In 2D the Laplace operator is given as
$\Delta u(x,y) = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$

[Figure: domain $\Omega$ with boundary $\partial\Omega$ and solution $u(x,y)$]
8
PDE Example: Discretization and Solvers

After discretization the Poisson problem $-\Delta u = b$ becomes a linear equation system $Au = b$.

For large systems, $Au = b$ is typically solved with iterative schemes:
$u^0 :=$ initial guess, $\quad u^{k+1} = u^k + G(u^k)$
We obtain a convergent series: $(u^k)_k = u^0, u^1, u^2, \ldots \longrightarrow u^*$

For small systems, $Au = b$ is typically solved with an LU decomposition:
$PA = LU, \quad Ly = Pb, \quad Uu = y$
9
Matrix Vector Product as Stencil Operation

[Figure: values at step n are combined by a local stencil into values at step n+1]

$v^{n+1}_\alpha = F_h\big( (v^n_\beta)_{|\beta-\alpha| \le C} \big) = \sum_{|\beta-\alpha| \le C} A_{\alpha\beta}\, v^n_\beta$
10
Maths: Banded Matrix Vector Product r = Av

• Configware (the per-node operation):
$r_\alpha = F(A_{\alpha,\cdot}, v) = \sum_\beta A_{\alpha\beta}\, v_\beta = A_{\alpha\beta_0} v_{\beta_0} + A_{\alpha\beta_1} v_{\beta_1} + A_{\alpha\beta_2} v_{\beta_2} + \ldots$

• Flowware (the streams):
$r, v \in \mathbb{R}^{\mathrm{WIDTH} \cdot \mathrm{HEIGHT}}$, $A \in \mathbb{R}^{\mathrm{WIDTH} \cdot \mathrm{HEIGHT} \times \mathrm{WIDTH} \cdot \mathrm{HEIGHT}}$ with $A_{\alpha\beta} \ne 0$ only on the $3 \cdot 3$ bands $\beta_0, \beta_1, \beta_2, \beta_3, \ldots$ around each $\alpha$
$r_{\alpha_0} = F(A_{\alpha_0,\cdot}, v), \quad r_{\alpha_1} = F(A_{\alpha_1,\cdot}, v), \quad r_{\alpha_2} = F(A_{\alpha_2,\cdot}, v), \ldots$
11
CPU: Banded Matrix Vector Product r = Av

• Configware in C/C++

```cpp
float kernel( float v[HEIGHT][WIDTH], float A[HEIGHT][WIDTH][3][3],
              int x, int y ) {
    float r= 0;
    for( int yo= -1; yo <= 1; yo++ ) {
        for( int xo= -1; xo <= 1; xo++ ) {
            r+= A[y][x][yo+1][xo+1] * v[y+yo][x+xo];
        }
    }
    return r;
}
```

• Flowware in C/C++

```cpp
extern float A[HEIGHT][WIDTH][3][3];
extern float r[HEIGHT][WIDTH], v[HEIGHT][WIDTH];

for( int y= 0; y < HEIGHT; y++ ) {
    for( int x= 0; x < WIDTH; x++ ) {
        r[y][x]= kernel( v, A, x, y );
    }
}
```
12
GPU: Banded Matrix Vector Product r = Av

• Configware in Cg (high level language for GPUs)

```cpp
float kernel( array2d v, array2d Al, array2d Ac, array2d Au,
              float2 xy : WPOS ) : COLOR {
    float r= 0;
    array2d A[3]= { Al, Ac, Au };
    for( int yo= -1; yo <= 1; yo++ ) {
        for( int xo= -1; xo <= 1; xo++ ) {
            r+= arr2d(A[yo+1],xy)[xo+1] * arr2d(v,xy+float2(xo,yo));
        }
    }
    return r;
}
```

• Flowware in C++

```cpp
// load configware to the GPU, define names for arrays, then initialize
// enum EnumArr { ARR_r, ARR_v, ARR_Al, ARR_Ac, ARR_Au, ARR_NUM };
for( int i= 0; i < ARR_NUM; i++ ) {
    GPUArr* arr= new GPUArr( "Array name", (i<=ARR_v)? 1 : 3 );
    arr->Initialize(WIDTH, HEIGHT);
    arrP.push_back(arr);
}
// ...
SciGPU::op( ARR_r, VP_ID, FP_MAT_VEC, ARR_v, ARR_Al, ARR_Ac, ARR_Au );
```
13
FPGA: Banded Matrix Vector Product r = Av

• Configware in ASC (high level language for FPGAs)

```cpp
void kernel() {
    HWfloatFormat(32, 24, SIGNMAGNITUDE);
    Arch(OUT); IOtype<float> r_out;      Arch(TMP); HWfloat r;
    Arch(IN);  IOtype<float> v_in;       Arch(TMP); HWfloat v;
    Arch(IN);  IOtype<float> A_in[3][3]; Arch(TMP); HWfloat A[3][3];
    v= v_in; r= 0;
    UNROLL_LOOP( int yo= 0; yo < 3; yo++ ) {
        UNROLL_LOOP( int xo= 0; xo < 3; xo++ ) {
            A[yo][xo]= A_in[yo][xo];
            r+= A[yo][xo] * prev(v, yo*WIDTH+xo);
        }
    }
    r_out= r;
}
```

• Flowware in C++
  – The FPGA will use the same framework as the GPU.
  – Object-orientation: one interface, different implementations.
  – In development.
[Oskar Mencer: ASC, A Stream Compiler for Computing with FPGAs, IEEE Trans. CAD, 2006]
14
GPU Programming

[Diagram: the GPU software stack]
• Application, e.g. in C/C++, Java, Fortran, Perl
• GPU library: hides the graphics-specific details
• Window manager, e.g. GLUT, Qt, Win32, Motif
• Graphics API, e.g. OpenGL, DirectX
• Operating system, e.g. Windows, Unix, Linux, MacOS
• Shader programs, e.g. in HLSL, GLSL, Cg
• Graphics hardware, e.g. Radeon (ATI), GeForce (NV)

The shader programs form the configware; the host-side layers above them form the flowware.
FPGA Programming

ASC bridges the VLSI CAD Productivity Gap with a Software Approach to Hardware Generation (slide courtesy of Oskar Mencer)

• The traditional hardware design process (System Level Model, Behavioral Synthesis, RTL / Libraries, Logic Synthesis) is vertically fragmented across many companies, file formats, etc.; this is the major culprit for the productivity gap.
• Very high performance: ASC gives the programmer easy access to the design on all levels of abstraction, from architecture generation through module generation down to the gate level (PamDC), via a parallelizing compiler or manual optimization.
• Easy to use: C++ syntax with custom types. E.g. the most comprehensive floating-point library available today (>200 different units) was created in 2 months.
16
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
17
The Erratic Roundoff Error

[Figure: roundoff error for $0 = f(a) := |(1+a)^3 - (1+3a^2) - (3a+a^3)|$, plotted as $y = \log_2(f(a))$ (with 0 mapped to $2^{-100}$) against $x = \log_2(1/a)$, $a = 1/2^x$, for single and double precision. Smaller is better.]
18
Precision and Accuracy
• There is no monotonic relation between the computational precision and the accuracy of the final result.
• Increasing precision can decrease accuracy!
• The increase or decrease of precision in different parts of a computation can have very different impact on the accuracy.
• The above can be exploited to significantly reduce the precision in parts of a computation without a loss in accuracy.
• We obtain a mixed precision method.
19
Resource Consumption for Integer Operations

| Operation | Area | Latency |
| --- | --- | --- |
| min(r,0), max(r,0) | b+1 | 2 |
| add(r1,r2), sub(r1,r2) | 2b | b |
| add(r1,r2,r3) → add(r4,r5) | 2b | 1 |
| mult(r1,r2), sqr(r) | b(b-2) | b log(b) |
| sqrt(r) | 2c(c-5) | c(c+3) |

b: bitlength of argument, c: bitlength of result
20
Resource Consumption on a FPGA

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG): number of slices vs. bits of mantissa (20 to 50); curves Adder, Multiplier, CG kernel normalized (1/30). Smaller is better.]
21
Generalized Iterative Refinement

For a function $F: \mathbb{R}^N \to \mathbb{R}^N$ with parameters $Q_0 \in \mathbb{R}^M$, find $X \in \mathbb{R}^N$ with $F(X; Q_0) = 0$.

As we cannot solve exactly, starting with some $X_0 \in \mathbb{R}^N$ we iterate:
$Q_{k+1} := H(X_0, Q_0, \ldots, X_k, Q_k), \quad F(\tilde X_{k+1}; Q_{k+1}) = 0, \quad X_{k+1} := X_k + \tilde X_{k+1}$,
i.e. we repeatedly solve $F$ with different parameters $Q_k$.

This is typically used to solve a linear system of equations $AX = B$:
$B_{k+1} := B - A X_k, \quad A \tilde X_{k+1} = B_{k+1}, \quad X_{k+1} := X_k + \tilde X_{k+1}$

Now we distinguish two cases:
1) We can find an approximate solution directly.
2) The approximate solution itself requires an iterative process.
22
CPU Results: LU Solver

[Chart courtesy of Jack Dongarra. Larger is better.]
[Langou et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), SC 2006, to appear]
23
GPU Results: Conjugate Gradient (CG) and Multigrid (MG)

[Figure: Performance of double precision CPU and mixed precision CPU-GPU solvers: seconds per grid node (5e-7 to 5e-4) vs. data level (6 to 10); curves CG CPU, CG GPU, MG2+2 CPU, MG2+2 GPU. Smaller is better.]
24
FPGA Results: Conjugate Gradient with MUL18x18

[Figure: Area of Conjugate Gradient s??e11 float kernels on the xc2v8000: number of slices vs. bits of mantissa (20 to 50); curves Number of Slices, Quadratic fit, Number of 4 input LUTs, Number of Slice Flip Flops, Number of MULT18X18s * 500. Smaller is better.]
25
FPGA Results: Conjugate Gradient with MUL18x18

[Figure: Frequency/IO of Conjugate Gradient s??e11 float kernels on the xc2v8000 vs. bits of mantissa (20 to 50); curves Maximal Frequency in MHz, Number of bonded IOBs in 10s. Larger is better.]
26
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
27
Arithmetic Intensity in Matrix-Vector Products

• Analysis of banded MatVec r = Av, pre-assembled
  – Reads per component of r: 9 reads from v, one from each of the 9 bands of A, i.e. 18 reads
  – Operations per component of r: 9 multiply-adds, i.e. 18 ops
  – Arithmetic intensity: 18/18 = 1
• Arithmetic intensity: operations per memory access, i.e. computation / bandwidth
• Rule of thumb for CPU/GPU: arithmetic intensity on floats should be > 8; on doubles twice as high
28
Trading Computation for Bandwidth

• Three possibilities for a matrix vector product A·v if A depends on some data and must be computed itself
  – On-the-fly: compute entries of A for each A·v application
    • Lowest memory requirement
    • Good for simple entries or seldom use of A
  – Partial assembly: precompute only some intermediate results
    • Allows balancing computation and bandwidth requirements
    • A good choice of precomputed results requires little memory
  – Full assembly: precompute all entries of A, use these in A·v
    • Good if other computations hide the bandwidth problem in A·v
    • Otherwise try to use partial assembly
• For example, pre-compute only $G[U_k]$ when solving $A[U_k]\, U_{k+1} = U_k$ with $A[U_k]\, U := U - \tau\, \mathrm{div}_h\!\left( G[U_k]\, \nabla_h U \right)$
29
Standard Conjugate Gradient

Two passes over the vectors per iteration:

Vector operations 1 (using $\beta_k$):
$P_k = R_k + \beta_k P_{k-1}$
$Q_k = A P_k$

Dot product 1: $P_k \cdot Q_k$, yielding $\alpha_k$

Vector operations 2:
$U_{k+1} = U_k + \alpha_k P_k$
$R_{k+1} = R_k - \alpha_k Q_k$

Dot product 2: $R_{k+1} \cdot R_{k+1}$, yielding $\beta_{k+1}$

State per iteration: $U_k, R_k, P_k, Q_k, \alpha_k$
30
Pipelined Conjugate Gradient

One fused pass over the vectors per iteration:

Vector operations:
$U_{k+1} = U_k + \alpha_k P_k$
$R_{k+1} = R_k - \alpha_k Q_k$
$P_{k+1} = R_{k+1} + \beta_{k+1} P_k$
$Q_{k+1} = A P_{k+1}$

Dot products, evaluated in the same pass: $R_{k+1} \cdot R_{k+1}$, $P_{k+1} \cdot Q_{k+1}$, ...

Scalar operations yield $\alpha_{k+1}$ and $\beta_{k+1}$

State per iteration: $U_k, R_k, P_k, Q_k, A, \alpha_k, \beta_k$
31
The Challenges
• Computing Paradigms
• Parallel Programming
• Precision and Accuracy
• Algorithmic Optimization
• Large Range Scaling
32
Discretization Grids

• Equidistant grid: easy to implement; one array holds all values
• Deformed tensor-product grid: parallel dynamic updates; one array for the values, a second for the deformation
33
Discretization Grids

• Unstructured grid: good performance for static, poor for dynamic grid topology; an index array is needed
• Adaptive grid: can handle coherently changing dynamic grid topology; a hash, tree or page table is needed
34
Glift: Generic, Efficient, Random-Access GPU Data Structures

STL-like abstraction of data containers from algorithms for GPUs

The Glift slides are based on Aaron Lefohn's presentation at the GPGPU Vis05 tutorial
35
Glift: Virtual Memory

• Virtual N-D address space
  – Defined by physical memory and an address translator
  – The address translator can be a simple analytical mapping or a complex mapping based on a page table, tree or hash
  – The same user interface irrespective of the actual physical storage

[Diagram: a virtual 3D grid abstraction translated to different physical layouts: 3D native memory, 2D slices, or a flat 3D array]
36
Glift Components

[Diagram: the Application uses VirtMem and Container Adaptors, which are implemented on PhysMem and AddrTrans in C++ / Cg / OpenGL]

Algorithms based on VirtMem do not depend on the physical memory capabilities: data layout optimization, code reuse, portability
37
FEAST: Generalized Tensor-Product Grids

• Sufficient flexibility in domain discretization
  – Global unstructured macro mesh, domain decomposition
  – (An)isotropic refinement into local tensor-product grids
• Efficient computation
  – High data locality, large problems map well to clusters
  – Problem specific solvers depending on anisotropy level
  – Hardware accelerated solvers on regular sub-problems

[Stefan Turek et al. Hardware-oriented numerics and concepts for PDE software, 2006]
38
FEAST: Deformation Adaptivity

• This grid is a tensor-product!
• Easier to accelerate in hardware than resolution adaptive grids
• The anisotropy level determines the optimal solver
39
FEAST: Ad-hoc GPU Cluster Performance

[Figure: CPU, GPU performance study for 1x16p, 2x16p (Threshold=20K): seconds per macro grid node (0.0006 to 0.0022) vs. level (6 to 9); curves 1x16p CPU MGCPU2, 1x16p GPU FX1400, 2x16p CPU MGCPU2, 2x16p GPU FX1400. Smaller is better.]
40
Conclusions
• Flowware/configware distinction is important for efficiency; abstract interfaces facilitate programming
• Mixed precision methods often allow reducing the computational precision without a loss of final accuracy
• Balancing arithmetic intensity is more effective than one-sided bandwidth or computation optimizations
• Clever discretizations combine high flexibility with very efficient parallel data layout for PDE solvers
41
Collaborators and Associated Projects
• FPGAs, ASC
  – Lee Howes, Oliver Pell, Oskar Mencer (Imperial College)
• Mixed Precision Methods, FEAST
  – Dominik Göddeke, Stefan Turek (University of Dortmund)
• Cluster Computing, Scout
  – Patrick McCormick, Advanced Computing Lab (LANL)
• Parallel Adaptive Grids, Glift
  – Aaron Lefohn (Neoptica), Joe Kniss (University of Utah), Shubhabrata Sengupta, John Owens (University of California, Davis)
• Application Integration, PhysBAM
  – Ron Fedkiw's group, physical simulation and computer graphics (Stanford University)