MEMOCode 2007 Design Contest – MIT Submission
N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan
Resources
• Five “insufficiently busy” grad students
• Three weeks
  – Nine man-weeks used
• Bluespec expertise
  – Easy parameterization / fast concurrency
• The promise of food
Basic Facts
• Matrix multiply is embarrassingly parallel
  – More multipliers and adders should help
• Matrices are too large to be stored in FPGA memory
• Time was short; the design had to be partitioned so that all designers could work in parallel
  – Latency-insensitive methodology
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
The Standard N³ Algorithm
for(int i = 0; i < N; i++)
  for(int j = 0; j < N; j++)
    for(int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];
and blocking is well understood…
for(int ib = 0; ib < N; ib += K)      // split i into ib + io
  for(int io = 0; io < K; io++)
    for(int jb = 0; jb < N; jb += K)  // split j into jb + jo
      for(int jo = 0; jo < K; jo++)
        for(int k = 0; k < N; k++)
          c[ib+io][jb+jo] += a[ib+io][k] * b[k][jb+jo];

Swapping the io and jb loops reduces memory traffic:

for(int ib = 0; ib < N; ib += K)
  for(int jb = 0; jb < N; jb += K)
    // kernel: computes one K-by-K block of C
    for(int io = 0; io < K; io++)
      for(int jo = 0; jo < K; jo++)
        for(int k = 0; k < N; k++)
          c[ib+io][jb+jo] += a[ib+io][k] * b[k][jb+jo];
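For concreteness, here is a minimal runnable C version of the blocked loop nest with the kernel factored into a helper. The name block_mac and the sizes N = 8, K = 4 are our choices for illustration, not values from the contest:

#include <stdio.h>

#define N 8   /* full matrix dimension (illustrative) */
#define K 4   /* block dimension; K must divide N     */

static double a[N][N], b[N][N], c[N][N];

/* Kernel: accumulate one K-by-K block of C. */
static void block_mac(int ib, int jb) {
    for (int io = 0; io < K; io++)
        for (int jo = 0; jo < K; jo++)
            for (int k = 0; k < N; k++)
                c[ib+io][jb+jo] += a[ib+io][k] * b[k][jb+jo];
}

int main(void) {
    /* Fill A with the identity and B with recognizable values,
     * so that C = A x B should reproduce B. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (i == j);
            b[i][j] = i * N + j;
        }

    /* Blocked traversal: one kernel call per K-by-K block of C. */
    for (int ib = 0; ib < N; ib += K)
        for (int jb = 0; jb < N; jb += K)
            block_mac(ib, jb);

    printf("c[3][5] = %g (expect %g)\n", c[3][5], b[3][5]);
    return 0;
}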
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
Hardware Facts
• If we accelerate the computation, DRAM access becomes the bottleneck
• The CPU has slow access to DRAM
  – HW can directly access DRAM via the PLB (Processor Local Bus)
Hardware Facts
• CPU-to-HW memory bandwidth is bounded at 150 MB/sec
  – With software overhead in data orchestration, probably only 50% of this bandwidth can be used
• The memory bus supports 800 MB/sec
  – A direct interface can provide up to a 5x improvement over software transfer (800 vs. 150 MB/sec)
• The special hardware need not be complicated, because the memory access patterns are simple
High-Level Architecture
[Figure: several functional units attached to interconnection logic; the CPU, the interconnection logic, and DRAM all sit on the PLB]
Architecture
[Figure: the functional units hang off a switch; a controller/feeder connects the CPU to the switch, and a PLB master connects the switch to DRAM over the PLB]
Software Example (C = A x B)
[Figure: the CPU sends the command sequence "Ld A 0, Ld B 0, MAC 0, St C 0" through the feeder; the A and B blocks travel from DRAM through the PLB master and switch into a functional unit, which computes and stores the C block]
In reality, the execution of several blocks will be overlapped.
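A minimal sketch of what that command stream could look like on the software side. The mnemonics follow the figure, but the types and the emit_block helper are hypothetical; the slides do not show the real feeder interface:

/* Hypothetical per-block command stream (mnemonics from the figure).
 * fu selects a functional unit; blk indexes a sub-block. */
typedef enum { LD_A, LD_B, ZERO_C, MAC, ST_C } Op;

typedef struct { Op op; int fu; int blk; } Cmd;

/* One C block needs: zero the accumulator, then for each k-block
 * load A and B and MAC, then store the result. Because loads and
 * compute run on separate FSMs inside the FU, the feeder can issue
 * the loads for block n+1 while the MAC for block n is running. */
void emit_block(Cmd *q, int *n, int fu, int cblk, int nkblocks) {
    q[(*n)++] = (Cmd){ZERO_C, fu, cblk};
    for (int k = 0; k < nkblocks; k++) {
        q[(*n)++] = (Cmd){LD_A, fu, k};
        q[(*n)++] = (Cmd){LD_B, fu, k};
        q[(*n)++] = (Cmd){MAC, fu, cblk};
    }
    q[(*n)++] = (Cmd){ST_C, fu, cblk};
}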
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
Functional Unit – Design
• Instructions:
  – Load operand (memory)
  – Store operand (memory)
  – Zero (C = 0)
  – Multiply-Add-Accumulate (C += A*B)
• Two FSMs (Read/Write and Compute)
  – Allows overlapping of instructions (see the sketch below)
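A rough software model of the two-FSM split, under our own naming; the actual design is written in BSV, which is not reproduced here:

/* Toy model of one functional unit: four instructions, two FSMs.
 * Loads/stores occupy the read/write FSM; Zero/MAC occupy the
 * compute FSM, so a transfer can overlap a computation. */
typedef enum { I_LOAD, I_STORE, I_ZERO, I_MAC } FuOp;
typedef enum { FSM_IDLE, FSM_BUSY } FsmState;

typedef struct {
    FsmState rw;        /* read/write FSM */
    FsmState compute;   /* compute FSM    */
} Fu;

static int is_memory_op(FuOp op) { return op == I_LOAD || op == I_STORE; }

/* An instruction may issue as soon as its own FSM is idle, even if
 * the other FSM is busy -- this is what lets instructions overlap. */
static int can_issue(const Fu *fu, FuOp op) {
    return is_memory_op(op) ? fu->rw == FSM_IDLE
                            : fu->compute == FSM_IDLE;
}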
Functional Unit – Algorithm
• Take the algorithm and unroll the k loop by P iterations
• Adder tree of P inputs
  – Critical path grows logarithmically
• Can be pipelined
  – Complicated because of parameterization

for(int i = 0; i < K; i++)
  for(int j = 0; j < K; j++)
    for(int k = 0; k < K; k++)
      c[i][j] += a[i][k] * b[k][j];
[Figure: unrolled datapath for P = 8. Eight multipliers compute the products A[i][k+p] * B[k+p][j] for p = 0..7; a three-level adder tree sums the eight products, and a final adder accumulates the sum into C[i][j]]
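In C terms, the unrolled kernel looks roughly like this for P = 8. The explicit pairwise parenthesization mirrors the figure's log-depth adder tree; kernel_unrolled8 is our illustration and assumes 8 divides K:

/* K^3 kernel with the k loop unrolled by P = 8 (assumes 8 | K).
 * The parenthesized pairwise sums mirror the figure's adder tree:
 * eight products, then 4+2+1 adders, then one accumulate. */
void kernel_unrolled8(int K, double a[K][K], double b[K][K], double c[K][K]) {
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            for (int k = 0; k < K; k += 8) {
                double p0 = a[i][k+0]*b[k+0][j], p1 = a[i][k+1]*b[k+1][j];
                double p2 = a[i][k+2]*b[k+2][j], p3 = a[i][k+3]*b[k+3][j];
                double p4 = a[i][k+4]*b[k+4][j], p5 = a[i][k+5]*b[k+5][j];
                double p6 = a[i][k+6]*b[k+6][j], p7 = a[i][k+7]*b[k+7][j];
                c[i][j] += ((p0+p1) + (p2+p3)) + ((p4+p5) + (p6+p7));
            }
}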
Functional Unit – Algorithm
• Different algorithm – reorder the multiplies
  – Writes c[j][k] multiple times
• Unroll by P
  – Same number of adders and multipliers
  – Shorter critical path
• Pipelining is easy
  – 2 stages

for(int i = 0; i < K; i++)
  for(int j = 0; j < K; j++)
    for(int k = 0; k < K; k++)
      c[j][k] += a[i][k] * b[j][i];
[Figure: unrolled datapath for P = 6. The single operand B[j][i] is broadcast to six multiply-accumulate units; unit p computes C[j][k+p] += A[i][k+p] * B[j][i], reading and writing one element of row j of C each cycle]
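The corresponding C sketch for the reordered algorithm, unrolled by P = 6 to match the figure; kernel_reordered6 is our illustration and assumes 6 divides K:

/* Reordered kernel with the k loop unrolled by P = 6 (assumes 6 | K).
 * b[j][i] is read once and broadcast to all six multiply-accumulates,
 * each updating one element of row j of C. No adder tree is needed,
 * which is why the pipeline is a short 2 stages. */
void kernel_reordered6(int K, double a[K][K], double b[K][K], double c[K][K]) {
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++) {
            double bji = b[j][i];              /* broadcast operand */
            for (int k = 0; k < K; k += 6) {
                c[j][k+0] += a[i][k+0] * bji;
                c[j][k+1] += a[i][k+1] * bji;
                c[j][k+2] += a[i][k+2] * bji;
                c[j][k+3] += a[i][k+3] * bji;
                c[j][k+4] += a[i][k+4] * bji;
                c[j][k+5] += a[i][k+5] * bji;
            }
        }
}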
FU Microarchitecture
[Figure: three BRAMs hold the A, B, and C operands. A LOAD/STORE FSM moves operands between the bus and the BRAMs, while a COMPUTE FSM drives the multiply-add datapath]
Memory Bus Master (PLB)
• 32-bit bus interface
• 16-word burst transfers
  – Amortize bus setup costs
• DRAM may refresh during a transfer
  – Added burst buffers for rapid recovery
[Figure: a bus-control FSM issues PLB commands; a Load FSM fills the input burst buffer with load data from the PLB, and a Store FSM drains store data from the output burst buffer onto the PLB]
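A toy software model of why the burst buffer gives rapid recovery, under our assumption that an interrupted burst resumes where it left off; the real PLB handshake is more involved:

/* Toy model of a 16-word burst load. If a DRAM refresh interrupts
 * the burst partway through, the words already received sit in the
 * burst buffer, so the master resumes from word `filled` rather than
 * restarting and re-paying the bus setup cost. */
#define BURST_WORDS 16

typedef struct {
    unsigned int buf[BURST_WORDS];  /* input burst buffer    */
    int filled;                     /* words received so far */
} BurstBuffer;

/* try_read is a stand-in for the real bus handshake; it returns 0
 * when the transfer is interrupted (e.g. by a refresh). */
int burst_load(BurstBuffer *bb, int (*try_read)(unsigned int *word)) {
    while (bb->filled < BURST_WORDS) {
        unsigned int w;
        if (!try_read(&w))
            return 0;               /* interrupted: resume later  */
        bb->buf[bb->filled++] = w;  /* progress is never lost     */
    }
    return 1;                       /* burst complete             */
}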
Memory Bus Master (PLB)
• Half of the critical path goes through the bus arbiter
  – Beyond our control
• Substantial retiming needed
  – Register pushing
  – State decoupling
• Need fine-grained control over scheduling
Outline
• The Problem
• Partitioning the Computation
• Architectural Overview
• Implementation
• Results
• Things We Wish We Could Do
Design Parameters
• Architecture: number of functional units
• Functional Unit: degree of parallelism, matrix size
• Memory Bus (PLB) Master: matrix memory layout, matrix size
• Switch: number of functional units
• Algorithm Generator: block size
All of these are collected in the sketch below.
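For reference, the parameter set as a C struct; the field names are ours, and the real design presumably expresses these as Bluespec module parameters:

/* The design parameters from this slide, gathered in one place. */
typedef struct {
    int num_fus;         /* architecture & switch: number of FUs   */
    int fu_parallelism;  /* functional unit: degree of parallelism */
    int matrix_size;     /* FU and PLB master: matrix dimension    */
    int block_size;      /* algorithm generator: sub-block size    */
    /* PLB master additionally fixes the matrix memory layout. */
} DesignParams;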
Final Results
• 100 MHz
• 1 Functional Unit
  – 64² sub-blocks
  – 8 complex multiplies
• Lines of code – 10K total
  – Unit testing framework – 1.5K
  – C code – 2K
  – BSV – 5.5K
  – Multiple FU implementations – 1K
  – Additional unused hardware – 1K
• More than 3 GOps/sec
Performance
Size     Time (µs)
64²      799
128²     5120
256²     45300
512²     332000
1024²    2710000   (125x)
Things we would have done with more time
• We believe we could have obtained 10 billion ops per second
• 32-bit PLB -> 64-bit PLB
  – Double the memory bandwidth
  – Fairly simple improvement
• Multiple clock domains
  – Implemented, but had trouble synthesizing in EDK
• Play with the number of FUs / registers per FU
  – HW is parameterized for this
• Explore alternative machine organizations
• Algorithmic exploration
Fin