Mapping the FDTD Application to Many-Core
Chip Architectures
Computer Architecture and Parallel Systems Laboratory
Electrical and Computer Engineering Department
University of Delaware
Daniel Orozco, Guang Gao
Mapping FDTD to Many-Cores ------- Daniel Orozco 2
Outline
What is the problem?
What did others do?
What did we do?
Is it really better?
So, what's next?
Questions
[Figure label: Time Spent Explaining Stuff]
What is FDTD?
FDTD = Finite Difference Time Domain
FDTD simulates the propagation of electromagnetic waves through materials.
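[Figure: three panels. Scientific Formulation: Maxwell's equations, relating the fields E, B and D and the current J through derivatives in time. Discretization: the fields E(x, y, z) and H(x, y, z) sampled on a grid. Iteration: the grid of elements below.]
E(0,0) E(0,1) E(0,2) E(0,3)
E(1,0) E(1,1) E(1,2) E(1,3)
E(2,0) E(2,1) E(2,2) E(2,3)
E(3,0) E(3,1) E(3,2) E(3,3)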
A Simple FDTD Computation
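The slide's own code is not in this transcript. As a rough sketch of what a simple one-dimensional FDTD loop can look like, the Python below updates a magnetic field array from the electric field and then the electric field from the magnetic field, once per time step. The function name `fdtd_1d`, the coefficient `coeff`, the initial pulse and the fixed boundaries are all assumptions made for illustration, not the authors' code.

```python
def fdtd_1d(n_cells, n_steps, coeff=0.5):
    """Run a toy 1D FDTD loop and return the final E and H arrays."""
    e = [0.0] * n_cells          # electric field samples
    h = [0.0] * (n_cells - 1)    # magnetic field, staggered between E samples
    e[n_cells // 2] = 1.0        # initial pulse in the middle (assumption)
    for _ in range(n_steps):
        # Update H from the spatial difference of E.
        for i in range(n_cells - 1):
            h[i] += coeff * (e[i + 1] - e[i])
        # Update E from the spatial difference of H (end points held at 0).
        for i in range(1, n_cells - 1):
            e[i] += coeff * (h[i] - h[i - 1])
    return e, h
```

Every time step reads and rewrites both arrays in full, which is why an untiled version of this loop is limited by memory bandwidth rather than by arithmetic.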
Memory Wall and Many Core Architectures
[Figure: a many-core chip drawn as a grid of repeated blocks, each made of processors (P), memories (M) and an FPU, labelled On Chip; a separate memory (M) is labelled Off Chip.]
What is the point of this presentation?
What can be done about the off-chip memory bandwidth?
Use the on-chip memory!
[Figure: processors (P) and FPUs, drawn without and with on-chip memory (M).]
Background: What are Data Dependencies?
Data dependencies show the values needed to calculate a particular value.
This is a Data Dependency Graph or DDG
DDGs are useful for checking whether code transformations are valid.
If a particular transformation computes E(1,1) before E(0,2), we know that it is not a valid transformation.
E(0,0) E(0,1) E(0,2) E(0,3)
E(1,0) E(1,1) E(1,2) E(1,3)
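The validity check described above can be sketched in a few lines of Python. The dependency pattern is an assumption for illustration: E(t, i) is taken to need its three nearest neighbours in the previous row, which is consistent with E(1,1) needing E(0,2). The helper names `dependencies` and `is_valid_order` are hypothetical.

```python
def dependencies(t, i, width):
    """Return the elements that E(t, i) needs: its neighbours in row t-1."""
    if t == 0:
        return []                      # row 0 holds the initial values
    return [(t - 1, j) for j in (i - 1, i, i + 1) if 0 <= j < width]


def is_valid_order(order, width):
    """Check that every element is computed after all of its dependencies."""
    done = set()
    for (t, i) in order:
        if any(dep not in done for dep in dependencies(t, i, width)):
            return False
        done.add((t, i))
    return True


row_major = [(t, i) for t in range(2) for i in range(4)]
too_early = [(0, 0), (0, 1), (1, 1), (0, 2), (0, 3), (1, 0), (1, 2), (1, 3)]
print(is_valid_order(row_major, 4))   # row by row respects the DDG
print(is_valid_order(too_early, 4))   # E(1,1) is scheduled before E(0,2)
```

Any tiling scheme can be tested the same way: list the order in which it visits the elements and check that order against the graph.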
Stencil Computations
Image Processing
Solution of Partial Differential Equations
What do they have in common?
Read, Create New, Overwrite
What do their Data Dependency Graphs look like?
A lot of Memory Bandwidth is required!
E(0,0) E(0,1) E(0,2) E(0,3)
E(1,0) E(1,1) E(1,2) E(1,3)
Tiling
Tiling is the process of computing only a part of the problem at a time, to reduce the pressure on memory.
No Tiling: memory loads per element computed: 9
Tiling: memory loads per element computed: 1.44
[Figure: processors (P) and FPUs, drawn without and with on-chip memory (M).]
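The two figures on this slide can be reproduced with a small calculation. The sketch below assumes a two-dimensional stencil that reads a 3 x 3 neighbourhood and a square tile of 10 x 10 elements loaded together with a one-element border. These parameters are not stated on the slide; they are chosen because they give 9 and 1.44. The function name `loads_per_element` is hypothetical.

```python
def loads_per_element(tile_side, halo=1, stencil_points=9):
    """Memory loads per computed element for a 2D stencil.

    tile_side=None models the untiled case, where every stencil input
    is fetched from memory again for every element.
    """
    if tile_side is None:
        return float(stencil_points)
    loaded = (tile_side + 2 * halo) ** 2   # the tile plus its border
    computed = tile_side ** 2
    return loaded / computed


print(loads_per_element(None))   # untiled
print(loads_per_element(10))     # 10 x 10 tile: 144 loads for 100 elements
```

Larger tiles push the ratio closer to 1, since the border grows more slowly than the interior.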
Tiling and Parallel Execution
Tiling in a 1-Dimensional Algorithm
Rows represent successive loads and stores to memory.
Tiles cannot span more than one row because of mutual data dependencies.
[Figure: computed tiles (T1, T2) and invalid tiles.]
Time Skewing
Tiling after skewing: tiles are parallel AND bigger.
The DDG has been redrawn to show how tiles can span several rows.
This kind of parallelism is called Wavefront Parallelism and is harder to program than regular tiles.
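A minimal sketch of the skewing idea, assuming a three-point averaging stencil with fixed end points: each tile shifts one element to the left per row, so a tile can cover several rows without violating any dependency. The names `update`, `row_by_row` and `time_skewed` are hypothetical. Tiles are visited one after another here; executing them as a parallel wavefront is not shown. Rows start out filled with `None`, so reading a value before it has been computed would raise an error.

```python
def update(grid, t, i):
    """Compute element (t, i) from row t-1; the two end points stay fixed."""
    n = len(grid[t])
    if i == 0 or i == n - 1:
        grid[t][i] = grid[t - 1][i]
    else:
        grid[t][i] = (grid[t - 1][i - 1] + grid[t - 1][i]
                      + grid[t - 1][i + 1]) / 3.0


def make_grid(initial, steps):
    """Row 0 holds the input; later rows are not yet computed (None)."""
    return [list(initial)] + [[None] * len(initial) for _ in range(steps)]


def row_by_row(initial, steps):
    """Reference order: finish a whole row before starting the next."""
    grid = make_grid(initial, steps)
    for t in range(1, steps + 1):
        for i in range(len(initial)):
            update(grid, t, i)
    return grid[steps]


def time_skewed(initial, steps, tile_width):
    """Skewed order: each tile covers all rows, shifted left by one per row."""
    n = len(initial)
    grid = make_grid(initial, steps)
    n_tiles = (n + steps) // tile_width + 1
    for k in range(n_tiles):                 # tiles, left to right
        for t in range(1, steps + 1):        # several rows inside one tile
            lo = k * tile_width - t          # the skew
            for i in range(max(lo, 0), min(lo + tile_width, n)):
                update(grid, t, i)
    return grid[steps]
```

Both orders perform the same arithmetic on the same inputs, so they return identical results. The skewed order simply keeps a tile's working set small across several rows.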
Other Parallel Tiling Approaches: Overlapped Tiling
Better tiling, but there are redundant computations.
Tiles are fully parallel.
Only 50% of the computations are used!
[Figure: logical view (lost computations not shown) and tile shape. Legend: lost computations, useful computations, memory load, memory store.]
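Overlapped tiling can be sketched as follows, again assuming a three-point averaging stencil. Each tile loads its share of the input plus a border wide enough for all time steps, and then runs on its own without exchanging data with its neighbours. The function names and the way useful work is counted (distinct elements divided by all elements computed) are assumptions for illustration.

```python
def step(values):
    """One sweep of a 3-point stencil; the output is 2 elements shorter."""
    return [(values[i - 1] + values[i] + values[i + 1]) / 3.0
            for i in range(1, len(values) - 1)]


def reference(initial, steps):
    """Untiled computation: sweep the whole array, step by step."""
    values = list(initial)
    for _ in range(steps):
        values = step(values)
    return values


def overlapped(initial, steps, tile_width):
    """Return the tiled result and the fraction of computations that are
    not duplicated by a neighbouring tile."""
    n_out = len(initial) - 2 * steps
    result, total = [], 0
    for start in range(0, n_out, tile_width):
        width = min(tile_width, n_out - start)
        local = initial[start:start + width + 2 * steps]  # one memory load
        for _ in range(steps):
            local = step(local)
            total += len(local)        # everything computed inside this tile
        result.extend(local)           # one memory store
    distinct = sum(len(initial) - 2 * k for k in range(1, steps + 1))
    return result, distinct / total


data = [float(x) for x in range(20)]
tiled, useful_fraction = overlapped(data, steps=3, tile_width=4)
print(tiled == reference(data, 3), useful_fraction)
```

The useful fraction drops as tiles get narrower or as more time steps are fused, which is the cost the slide refers to.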
Other Parallel Tiling Approaches: Split Tiling
Tiles are fully parallel.
No lost computations.
This is the state of the art.
[Figure: logical view and tile shape. Legend: useful computations, memory load, memory store.]
Our Contribution: Diamond Tiling
Tiles are fully parallel.
No lost computations.
Maximum reuse.
[Figure: logical view and tile shape. Legend: useful computations, memory load, memory store.]
Is there a Trick?
Well, we have tile borders across time iterations…
And we do have to load and store TWO arrays to meet the dependencies.
But it's all for a good cause.
[Figure: panels a) and b), each with axes i and t, marking the start and the end of a tile.]
We also tried: Triangle Tiling
Tiles are fully parallel.
No lost computations.
Very simple programming.
[Figure: logical view and tile shape. Legend: useful computations, memory load, memory store.]
We also tried: Parametric Tiling
Tiles are fully parallel.
No lost computations.
Useful to understand the problem.
[Figure: logical view, with panels for p = 0.16, p = 0.5 and p = 1. Legend: useful computations, memory load, memory store.]
Reuse
Reuse is "the key concept" for on-chip memory.
Reuse = (Number of elements computed) / (Number of memory operations)
Why is reuse important?
20 cores like this need a connection like this:
[Figure: a core made of a processor (P), memory (M) and FPU, and the memory connection required when Reuse = 40 and when Reuse = 5.]
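The definition translates directly into code. The sketch below also estimates the off-chip traffic a group of cores generates for a given reuse. The function names and the per-core compute rate are hypothetical. The two worked values assume a three-point stencil in which each input is loaded once and each output stored once.

```python
def reuse(elements_computed, memory_operations):
    """Reuse as defined on the slide."""
    return elements_computed / memory_operations


def required_bandwidth(cores, elements_per_second_per_core, reuse_value):
    """Memory operations per second that the connection must sustain."""
    return cores * elements_per_second_per_core / reuse_value


# Untiled 3-point stencil: 3 loads and 1 store for every element computed.
print(reuse(1, 3 + 1))
# One-row tile of 100 elements: 102 loads (tile plus border) and 100 stores.
print(reuse(100, 102 + 100))
# 20 cores at an assumed 1e6 elements per second each.
print(required_bandwidth(20, 1e6, 5) / required_bandwidth(20, 1e6, 40))
```

With these assumptions, moving from a reuse of 5 to a reuse of 40 cuts the required bandwidth by a factor of 8 for the same 20 cores.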
How good are Tiles at Reuse?
Reuse for a tile of size 100:
  No Tiling                     0.25
  Simple Tiling                 0.49
  Skewed Tiling                 8.33
  Overlapped Tiling             5.45
  Split Tiling                  6.67
  Triangle Tiling               6.25
  Parametric Tiling, p = 0.5    9.38
  Diamond Tiling               12.50
Chart annotations: "Not Embarrassingly Parallel", "Developed at CAPSL".
The Fine Print: Values are for a tile size of 100. Reuse values change with the size of the tile. Results apply to 1-Dimensional Stencil Computation with dependencies similar to those of the examples.
But, Does it Really Work?
Speedup:
  No Tiling              1.00
  Triangle, Size = 16    3.19
  Triangle, Size = 64    7.94
  Diamond, Size = 16     6.04
  Diamond, Size = 64    13.51
The Fine Print: Simulated Speedup Results for FDTD 1D running on Cyclops-64 using FAST simulator. Problem size varies for each test, and was selected as big as possible. Only the computation time was measured. Problem data located in DRAM. Tiling done manually. GCC 3.4, -O3 used.
Other Considerations
Reuse = (Number of elements computed) / (Number of memory operations) = Area / Perimeter
The area is O(N^2) and the perimeter is O(N), so the reuse is O(N).
The best tile is the BIGGEST tile.
If two tiles have the same width, the one with the MOST AREA has the best reuse.
[Figure: a diamond tile of size N and a parametric tile of size N, with regions labelled Low Reuse and High Reuse.]
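The area-over-perimeter argument can be checked numerically. In the sketch below, a tile of width N is assumed to have an area of `area_coeff * N**2` and a perimeter of `perimeter_coeff * N`. Both coefficients are hypothetical shape constants, not values taken from the slides. The pair 0.5 and 4 happens to give 12.5 at width 100, the reuse reported for diamond tiling, but the slides do not state how that value was derived.

```python
def reuse_estimate(width, area_coeff, perimeter_coeff):
    """Area / perimeter for a tile whose area is area_coeff * width**2
    and whose perimeter is perimeter_coeff * width."""
    area = area_coeff * width ** 2
    perimeter = perimeter_coeff * width
    return area / perimeter


# Doubling the width doubles the reuse: the estimate is linear in N.
print(reuse_estimate(100, 0.5, 4), reuse_estimate(200, 0.5, 4))
# Same width and perimeter, half the area: half the reuse.
print(reuse_estimate(100, 0.25, 4))
```

The second call illustrates the last point of the slide: at equal width, the shape that encloses more area for the same boundary traffic has the better reuse.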
So, Lead Us!
Bandwidth is the limiting factor for FDTD.
Reuse lowers the required bandwidth.
Compute several TIMESTEPS at the same time.
And get better performance!
Future Work: Multidimensional Diamonds?
????
How are we going to partition THAT???
Future Work: Dataflow Diamonds
Barrier
It's bad waiting for the slow tile…
And then they all compete for bandwidth at the same time…
Dataflow will solve that.
Implementation is still a research topic.
Multiple Diamond Hierarchies
Diamonds work… They use little bandwidth.
But we still send the memory back after each diamond…
We have a strong on-chip bus. Maybe we can work with a Super Diamond!
[Figure: groups of memories (M) connected by an on-chip bus.]
Questions?