on data forwarding in deeply pipelined soft processors 5/3_fajhmy_fpga2015.pdf · on data...
TRANSCRIPT
On Data Forwarding in Deeply Pipelined Soft Processors
H. Y. Cheah, S. A. Fahmy, N. Kapre School of Computer Engineering
Nanyang Technological University
FPGA 2015
Introduction
• Hard macros allow us to build high frequency soft processors (Octavo, iDEA, etc.)
• To achieve high frequency, deep pipelining is required
• Results in significant idle cycles to resolve dependencies
• Full data forwarding is too complex for a lean processor
• Exploit loopback path in DSP block to provide forwarding in iDEA soft processor
2
iDEA Soft Processor
• Fundamental novelty is exploiting DSP block dynamic control signals
3
A1
B1
A2
B2
M P
D
A
B
C
= patterndetect
C
AD
D
P
30
25
18
48
48
18
48
48
25
INMODE ALUMODEOPMODE
iDEA Soft Processor
• A DSP-block based soft processor • Maximises use of the DSP48E1 block • 450MHz+ achievable frequency • Architecture described in detail in FPT 2012 and
TRETS 2014
4
Fetch Decode Execute WriteBack
DSP
IMEM RF DMEM
loopback
iDEA Soft Processor
• We deeply pipeline iDEA by adding extra cycles to the different processor stages
• DSP48E1 primitive requires 3 cycles to achieve maximum speed
• Multiple ways of distributing extra cycles among stages
5
Fetch Decode Execute WriteBack
DSP
IMEM RF DMEM
loopback
Deep Pipelining and NOPs
• High number of empty cycles required to pad dependent instructions, increasing with depth
6
4 5 6 7 8 9 10 11 12 13 14 150
10 000
20 000
30 000
40 000
Pipeline Depth
Number
ofNOPs
crc fib firmed mmul qsort
Padding for Dependencies
• Instruction cannot enter decode stage until its input operand has been written back
• Insert empty instructions (NOPs) between dependent instructions
7
Padding for Dependencies
• Instruction cannot enter decode stage until its input operand has been written back
• Insert empty instructions (NOPs) between dependent instructions
7
IF ID ID EX EX WB
IF ID ID EX EX WB4 NOPs
Padding for Dependencies
• Instruction cannot enter decode stage until its input operand has been written back
• Insert empty instructions (NOPs) between dependent instructions
7
IF IF ID ID EX EX EX WB
IF IF ID ID EX EX EX WB5 NOPs
IF ID ID EX EX WB
IF ID ID EX EX WB4 NOPs
Padding for Dependencies
• A few routes to overcoming this issue: • Dynamic in hardware – cost • Static in software
• External Loopback – add a forwarding path around the execute stage
• Internal Loopback – proposed approach using DSP block feature
• Hardware support through modified opcodes
8
Assembly with Loopback• Original Assembly li $1, x li $2, a li $3, b li $4, c mul $5, $1, $2 nop nop mul $6, $5, $1 mul $5, $1, $3 nop nop add $7, $5, $6 nop nop add $8, $7, $4 nop nop sw $8, 0($y)
9
• With loopback instructions li $1, x li $2, a li $3, b li $4, c mul loop0, $1, $2 mul $6, loop0, $1 mul loop1, $1, $3 add loop2, loop1, $6 add $8, loop2, $4 nop nop sw $8, 0($y)
→ save 6 instructions
External Forwarding
• Add a path around the execute stage • Allows a dependent instruction to start its
execute stage once the previous one has completed
10
A
B
P
DSP Block
External Forwarding
• Add a path around the execute stage • Allows a dependent instruction to start its
execute stage once the previous one has completed
10
A
B
P
DSP Block
External Forwarding
• Add a path around the execute stage • Allows a dependent instruction to start its
execute stage once the previous one has completed
10
A
B
P
DSP Block
IF IF ID ID EX EX EX WB
IF IF ID ID EX EX EX WB2 NOPs
Internal Forwarding
• DSP block primarily used in filtering • Multiply-accumulate is functional primitive • Loopback path allows ALU output to be used for
accumulation • Dynamic control signal
determines whetherthis path is used
11
A
B
P
DSP Block
Internal Forwarding
• DSP block primarily used in filtering • Multiply-accumulate is functional primitive • Loopback path allows ALU output to be used for
accumulation • Dynamic control signal
determines whetherthis path is used
11
A
B
P
DSP Block
Internal Forwarding
• DSP block primarily used in filtering • Multiply-accumulate is functional primitive • Loopback path allows ALU output to be used for
accumulation • Dynamic control signal
determines whetherthis path is used
11
A
B
P
DSP Block
IF IF ID ID EX EX EX WB
IF IF ID ID EX EX EX WB
Loopback Analysis
• Identify loopback opportunities in assembly • Loopback instruction — a subsequent dependent
arithmetic operation • For external loopback: sufficient NOPs to pad the
execute stage • For internal loopback: no NOPs required • NOPs are inserted for dependencies that cannot
be resolved by this forwarding path
12
Parametric Verilog
Constraints
XilinxISE
BenchmarkC code
LLVMMIPS
Area
Freq.
Cycles
LoopbackAnalysis
Functional Simulator
RTL SimulatorAssembler
FPGADriver
ML605Platform
In-System Execution
Loopback Analysis
13
Processor Design Space
• First explore the impact of pipeline depth on area and frequency
• Parameterised register stages in each processor stage (1–5 cycles for Fetch, Decode, Execute)
• Overall pipeline depths of 4–15 cycles • Multiple possible combinations for each overall
depth
14
Processor Design Space• Consider all combinations of stage depths for each
overall processor depth • Reach close to 500MHz from 10 stages onwards
154 5 6 7 8 9 10 11 12 13 14 15
200
250
300
350
400
450
500
Pipeline Depth
Frequen
cy(M
Hz)
Processor Design Space
• Clusters near where DSP block throughput limits performance
• Significantly more registers than LUTs
16150 200 250 300 350 400 450 500
200
400
600
800
1,000
Frequency (MHz)
AreaUnit
RegistersLUTs
Processor Design Space
• Generally minimal impact for internal forwarding and a little more for external forwarding
17
0
500
1,000
237 371
370
407 479 543
542 602
764
756
754
992
232 372
375
416 482 543
535 611
795 877
866 990
179
377
373
362 5
18 585
591 662
823
793
793
1,000
Registers
None Internal Loopback External Forwarding
4 5 6 7 8 9 10 11 12 13 14 15
0
500
280
281
283
319
336
349
362
369
384
385
383
430
288
301
298
335
348
360
374
378
422
425
408
431
331
314
310 362
367
380
393
406
420
418
418
457
Pipeline Depth
LUTs
IPC Improvement
• External Loopback • Padding NOPs not
totally eliminated • For depths of 4–6
(with 1-cycle EX), no NOPs needed
• As EX depth increases, savings are curtailed due to need for some NOPs
18
4 5 6 7 8 9 10 11 12 13 14 150
5
10
15
20
25
30
35
Pipeline Depth
%IP
CSavings
crc
fib fir
median mmult
qsort
IPC Improvement
• Internal Loopback • Sustained savings
over no forwarding • Benchmarks with long
dependency chains • A 5–30%
improvement
19
4 5 6 7 8 9 10 11 12 13 14 150
5
10
15
20
25
30
35
Pipeline Depth
%IP
CSavings
crc
fib fir
median mmult
qsort
Execution Time
• Frequency and geomean wall clock time for all benchmarks • Frequency of external loopback implementation lags initially • At higher depths, differences disappear as the extra cycles
are added to stages other than execute • A 25% improvement in runtime compared to no forwarding
20
14
16
18
20
22
24
26
28
30Tim
e(us)
4 5 6 7 8 9 10 11 12 13 14 15
200
250
300
350
400
450
500
Pipeline Depth
Frequen
cy(M
Hz)
Frequency (Int)
Frequency (Ext)
Internal Loopback
External Forwarding
Execution Time
• External loopback • Depth 4 — 6 increase in execution time due to lower
operating frequency • Gap is significant in shallower depths, but closes as depth
increases • Reduced frequency is fundamental barrier 21
14
16
18
20
22
24
26
28
30Tim
e(us)
4 5 6 7 8 9 10 11 12 13 14 15
200
250
300
350
400
450
500
Pipeline Depth
Frequen
cy(M
Hz)
Frequency (Int)
Frequency (Ext)
Internal Loopback
External Forwarding
Execution Time
• Internal loopback • Minimum runtime is at 10 cycle processor depth • Anomalous peak at depth 9, due to a significant EX cycle
increase • Results in 11–20% improvement at low depths
22
14
16
18
20
22
24
26
28
30Tim
e(us)
4 5 6 7 8 9 10 11 12 13 14 15
200
250
300
350
400
450
500
Pipeline Depth
Frequen
cy(M
Hz)
Frequency (Int)
Frequency (Ext)
Internal Loopback
External Forwarding
Limitations
• As a post-assembly step, limited by the instruction order chosen by compiler
• Only ALU operations (add, subtract) can benefit from internal forwarding
• Ideally, a compiler that could ensure such instructions are kept together would improve results
23
Potential
• Aside from small benchmarks, explored potential for CHStone benchmarks
24
Table 5: LLVM IR Profiling Results for CHStone
Benchmark
Static Dynamic
Instr. Occur. % Instr. Occur. %
adpcm 1367 184 13 71,105 8,300 11aes 2259 51 2 30,596 3,716 12blowfish 1184 314 26 711,718 180,396 25gsm 1205 82 6 27,141 1,660 6jpeg 2388 95 4 1,903,085 131,092 6mips 378 15 3 31,919 123 0.3mpeg 782 80 10 17,032 60 0.3sha 405 64 15 990,907 238,424 24
and just-in-time (JIT) compiler to obtain dynamic counts ofpossible loopback occurrences. The results in Table 5 showa mean occurrence of over 10% within these benchmarks.We cannot currently support CHStone completely due tomissing support for 32b instructions and other developmentissues.
Program Size Sensitivity: We also synthesized RTLfor iDEA with increased instruction memory sizes to holdlarger programs. We observed, that iDEA maintains its op-timal frequency for up to 8 BRAMs. Beyond this, frequencydegrades by 10–30% to support routing delays and place-ment e↵ects of these larger memories. However, we envi-sion tiling multiple smaller soft processors with fewer than8 BRAMs to retain frequency advantages for a larger multi-processor system, and to reflect more closely the resource ra-tio on modern FPGAs. Each processor in this system wouldhold only a small portion of the complete system binary.
6. CONCLUSIONS AND FUTURE WORKWe have shown an e�cient way of incorporating data
forwarding in DSP block based soft processors like iDEA.By taking advantage of the internal loopback path typicallyused for multiply accumulate operations, it is possible to al-low dependent ALU instructions to immediately follow eachother, eliminating the need for padding NOPs. The resultis an increase in e↵ective IPC, and 5– 30% (mean 25%) im-provement in wall clock time for a series of benchmarks whencompared to no forwarding and a 5% improvement whencompared to external forwarding. We have also undertakenan initial study to explore the potential for such forwardingin more complex benchmarks by analysing LLVM interme-diate representation, and found that such forwarding is sup-ported in a significant proportion of dependent instructions.We aim to finalise full support for the CHStone benchmarksuite as well as open-sourcing the updated version of iDEAand the toolchain we have described.
7. REFERENCES[1] Aeroflex Gaisler. GRLIB IP Library User’s Manual,
2012.[2] Altera Corpration. Nios II Processor Design, 2011.[3] G. M. Amdahl. Validity of the single processor
approach to achieving large scale computingcapabilities. In Proceedings of the Spring JointComputer Conference, pages 483–485, 1967.
[4] ARM Ltd. Cortex-M1 Processor, 2011.
[5] S. Beyer, C. Jacobi, D. Kroning, D. Leinenbach, andW. J. Paul. Putting it all together - formal verificationof the VAMP. International Journal on Software Toolsfor Technology Transfer, 8:411–430, 2006.
[6] H. Y. Cheah, F. Brosser, S. A. Fahmy, and D. L.Maskell. The iDEA DSP block based soft processor forFPGAs. ACM Transactions on ReconfigurableTechnology and Systems, 7(3):19, 2014.
[7] H. Y. Cheah, S. A. Fahmy, and N. Kapre. Analysisand optimization of a deeply pipelined FPGA softprocessor. In Proceedings of the InternationalConference on Field Programmable Technology (FPT),pages 235–238, 2014.
[8] H. Y. Cheah, S. A. Fahmy, and D. L. Maskell. iDEA:A DSP block based FPGA soft processor. InProceedings of the International Conference on FieldProgrammable Technology (FPT), pages 151–158, Dec.2012.
[9] P. G. Emma and E. S. Davidson. Characterization ofbranch and data dependencies in programs forevaluating pipeline performance. IEEE Transactionson Computers, 36:859–875, 1987.
[10] Y. Hara, H. Tomiyama, S. Honda, and H. Takada.Proposal and quantitative analysis of the CHStonebenchmark program suite for practical C-basedhigh-level synthesis. In Journal of InformationProcessing, 2009.
[11] A. Hartstein and T. R. Puzak. The optimum pipelinedepth for a microprocessor. ACM Sigarch ComputerArchitecture News, 30:7–13, 2002.
[12] R. Jayaseelan, H. Liu, and T. Mitra. Exploitingforwarding to improve data bandwidth ofinstruction-set extensions. In Proceedings of theDesign Automation Conference, pages 43–48, 2006.
[13] N. Kapre and A. DeHon. VLIW-SCORE: Beyond Cfor sequential control of SPICE FPGA acceleration. InProceedings of the International Conference on FieldProgrammable Technology (FPT), Dec. 2011.
[14] C. E. LaForest and J. G. Ste↵an. Octavo: anFPGA-centric processor family. In Proceedings of theACM/SIGDA International Symposium on FieldProgrammable Gate Arrays (FPGA), pages 219–228,Feb. 2012.
[15] Lattice Semiconductor Corp. LatticeMico32 ProcessorReference Manual, 2009.
[16] C. Lattner and V. Adve. LLVM: A compilationframework for lifelong program analysis andtransformation. In International Symposium on CodeGeneration and Optimization, pages 75–86, 2004.
[17] M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, andR. Zafalom. Exploiting data forwarding to reduce thepower budget of VLIW embedded processors. InProceedings of Design, Automation and Test inEurope, 2001, pages 252–257, 2001.
[18] K. Vipin, S. Shreejith, D. Gunasekara, S. A. Fahmy,and N. Kapre. System-level FPGA device driver withhigh-level synthesis support. In Proceedings of theInternational Conference on Field ProgrammableTechnology (FPT), pages 128–135, Dec. 2013.
[19] Xilinx Inc. UG081: MicroBlaze Processor ReferenceGuide, 2011.
Conclusion
• Exploiting low-level DSP block feature enabled efficient data forwarding to overcome significant number of padding NOPs
• Maintain frequency of iDEA at close to 500MHz • Up to 25% improvement across range of
benchmarks • Demonstrated applicability to larger programs
25
On Data Forwarding in Deeply Pipelined Soft Processors
Thank You