TMbox: A Configurable 16-core Hybrid TM FPGA Prototype
Osman Unsal
The people
• Nehir Sonmez (BSC)
• Oriol Arcas (BSC)
• Osman Unsal (BSC)
• Adrian Cristal (BSC)
• Satnam Singh (MSR Cambridge)
BeeFarm
• Software simulators are poorly parallelized
• An FPGA can be significantly faster for multicore emulation: an FPGA emulator at 25 MHz can be faster than a software simulator on a 2 GHz host
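To make the 25 MHz versus 2 GHz comparison concrete, here is a rough budget calculation. It is only a sketch: it assumes (this is not stated on the slides) that the software simulator runs on a single host thread and steps the simulated cores one after another. The function name `host_cycle_budget` is illustrative.

```python
# Back-of-the-envelope budget: how many host cycles a software simulator may
# spend per simulated core-cycle before a 25 MHz FPGA emulator overtakes it.
# Assumption (not from the slides): one host thread steps all simulated cores.

HOST_HZ = 2_000_000_000   # 2 GHz host, from the slide
FPGA_HZ = 25_000_000      # 25 MHz FPGA emulator, from the slide


def host_cycle_budget(num_cores: int) -> float:
    """Host cycles available per simulated core-cycle to match FPGA speed."""
    return HOST_HZ / (FPGA_HZ * num_cores)


for cores in (1, 2, 4, 8):
    budget = host_cycle_budget(cores)
    print(f"{cores} core(s): {budget:.0f} host cycles per simulated core-cycle")
```

Under that assumption the budget shrinks linearly with the core count, which is why a slow-clocked FPGA can still win for multicore targets.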
From Plasma to BeeFarm: Design Experience of an FPGA-based Multicore Prototype. Nehir Sonmez, Oriol Arcas, Gokhan Sayilar, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. In 7th International Symposium on Applied Reconfigurable Computing (ARC 2011), March 2011.
BeeFarm
• 8-core, FPGA-based multiprocessor
• Completely modifiable from top to bottom
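[Figure: BeeFarm block diagram. Eight processors (P), each with its own L1 cache, share a bus with an arbiter, a DDR2 controller, boot memory and I/O. Clock labels: 25 MHz and 125 MHz.]
• Honeycomb: MIPS R3000 compatible
• Shared bus: 128-bit split bus
• L1 cache: unified 8 KB cache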
The Honeycomb core
R3000-compatible Honeycomb with flexible HTM support
= Original Plasma (MIPS R2000-compatible)
+ MMU, FPU
+ exceptions support
+ synchronization primitives: LL/SC
+ snooping, coherent caches (MSI)
+ debugging, performance counters
+ system libraries to support string, I/O, TM
BeeFarm performance
[Figure: ScalParC simulation. Speedup (y-axis, 0 to 5) at 1, 2 and 4 cores for BeeFarm, M5 and M5-timing. Plot labels: functional simulation, detailed simulation.]
Results normalized to M5 with 1 thread.
TMbox
• HTM multiprocessor on FPGA
  – Inspired by AMD’s Advanced Synchronization Facility
• BeeFarm improved:
  – Ring bus instead of shared bus (which doesn’t fit well on FPGA)
  – 2x the frequency (50 MHz)
TMbox: A Flexible and Reconfigurable 16-core Hybrid Transactional Memory System. Nehir Sonmez, Oriol Arcas, Otto Pflucker, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. In 19th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2011), May 2011.
HTM ISA extensions
• Inspired by AMD ASF
• 10 new MIPS instructions
  – XBEGIN (addr)
  – XLB, XLH, XLW, XSB, XSH, XSW
  – XCOMMIT, XABORT (code)
  – MFTM
• 4 new special registers
  – Can only be read with the MFTM (move from TM) instruction
  – $TM0 register contains the abort address (XCOMMIT)
  – $TM1 has a copy of the stack pointer (XCOMMIT)
  – $TM2 contains the abort cause (overflow, contention or explicit)
  – $TM3 stores a 20-bit software abort code (XABORT)
HTM example
• atomic {a++} example in MIPS assembler:

        LI    $11, 5            # maximum number of retries
        LI    $13, HW_OFLOW     # abort cause to compare against
        J     $TX
$ABORT: MFTM  $12, $TM2         # read the abort cause
        BEQ   $12, $13, $ERR    # HW capacity exceeded
        ADDIU $10, $10, 1       # count the retry
        SLTU  $12, $10, $11
        BEQZ  $12, $ERR2        # retry limit reached
        J     $TX               # abort due to conflict, retry...
$TX:    XBEGIN $ABORT           # on abort, control goes to $ABORT
        XLW   $8, 0($a0)
        ADDI  $8, $8, 1
        XSW   $8, 0($a0)
        XCOMMIT                 # transaction committed
        next code...
$ERR:   ...
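The retry policy in the assembler example can be modelled in a few lines of Python. This is only a sketch of the control flow, not the hardware: it assumes the retry counter ($10) starts at zero, which the slide does not show, and the names `AbortCause`, `run_with_retries` and `scripted` are illustrative.

```python
import enum


class AbortCause(enum.Enum):
    # The three causes the slides list for $TM2.
    OVERFLOW = "overflow"
    CONTENTION = "contention"
    EXPLICIT = "explicit"


MAX_RETRIES = 5  # mirrors "LI $11, 5"


def run_with_retries(attempt, max_retries=MAX_RETRIES):
    """Run attempt() until it commits or the policy gives up.

    attempt() returns None on commit, or an AbortCause on abort.
    Returns a (status, number_of_attempts) tuple.
    """
    retries = 0   # assumption: $10 starts at zero
    attempts = 0
    while True:
        attempts += 1
        cause = attempt()                   # XBEGIN ... XCOMMIT
        if cause is None:
            return ("committed", attempts)
        if cause is AbortCause.OVERFLOW:    # BEQ $12, $13, $ERR
            return ("overflow", attempts)
        retries += 1                        # ADDIU $10, $10, 1
        if not retries < max_retries:       # SLTU + BEQZ -> $ERR2
            return ("retries_exhausted", attempts)
        # otherwise: J $TX, try again


def scripted(outcomes):
    """Build an attempt() that replays a fixed list of outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


print(run_with_retries(scripted([AbortCause.CONTENTION, None])))
```

As in the assembler, only a capacity overflow gives up immediately; every other abort cause is retried until the limit is reached.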
TinySTM – ASF integration
• atomic {a++} example with TinySTM hybrid TM:

        tm_thread_init();
        tm_start();
        t = tm_read(a);
        tm_write(a, t + 1);
        tm_commit();
        next code...

[Figure: control flow around the transaction. Labels: "Abort"; "Abort due to conflict, retry..."; "HW capacity exceeded, explicit SW abort" leading to "Switch to software"; "TinySTM conflict management"; "Transaction committed".]
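The hybrid flow on this slide (retry in hardware on a conflict, switch to software on a capacity overflow or an explicit abort) can be sketched as follows. This is a toy model, not TinySTM code: the function names, the string abort causes and the `max_hw_retries` bound are all assumptions made for illustration, and falling back to software once the bound is reached is likewise an assumption.

```python
def run_hybrid(hw_attempt, sw_transaction, max_hw_retries=5):
    """Try the hardware path first, then fall back to software.

    hw_attempt() returns None on commit, or one of the strings
    "contention", "overflow", "explicit" on abort.
    Returns "hw" or "sw" depending on which path committed.
    """
    for _ in range(max_hw_retries):
        cause = hw_attempt()
        if cause is None:
            return "hw"                      # transaction committed in HW
        if cause in ("overflow", "explicit"):
            break                            # switch to software
        # "contention": abort due to conflict, retry in hardware
    sw_transaction()                         # software path (STM)
    return "sw"


def make_hw_attempt(memory, outcomes):
    """Hardware attempt that replays scripted outcomes; commits do a++."""
    it = iter(outcomes)

    def attempt():
        cause = next(it)
        if cause is None:
            memory["a"] += 1
        return cause

    return attempt


def make_sw_transaction(memory):
    """Software transaction that always commits a++ in this toy model."""
    def transaction():
        memory["a"] += 1

    return transaction


memory = {"a": 0}
path = run_hybrid(make_hw_attempt(memory, ["contention", "overflow"]),
                  make_sw_transaction(memory))
print(path, memory["a"])
```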
Compilation
• Standard GCC-MIPS cross-compiler + HyTM extensions (to use 10 new tx instr.)
• 4 new TM registers, read with MFTM instr.
• Also extend the cache FSM to support TM
TMbox architecture
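[Figure: TMbox ring bus. Cores C0 to C7 are connected in a ring together with the DDR; the ring carries requests, responses and invalidations. Each node is drawn with a Honeycomb CPU and its L1, a TM unit (labels: CAM, RAM, addr, data, hit), a bus node and a bus controller.]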
To commit (serialized):
1. Lock ring (to prevent other writes and commits). Will destroy ongoing write/commit requests.
2. Commit the TX writes through channel. Will abort conflicting TXs snooping the ring.
3. Unlock ring.
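The three commit steps above can be sketched as a small sequential model. It is a simplification under stated assumptions: the classes `Ring` and `Core` and their methods are illustrative names, conflict detection is done per address against both read and write sets, and the destruction of in-flight write/commit requests on lock is not modelled.

```python
class Ring:
    """Shared state reachable over the ring: memory plus a commit lock."""

    def __init__(self):
        self.locked_by = None
        self.memory = {}
        self.cores = []


class Core:
    def __init__(self, cid, ring):
        self.cid = cid
        self.ring = ring
        self.read_set = set()
        self.write_buf = {}      # speculative writes, invisible until commit
        self.aborted = False
        ring.cores.append(self)

    def tx_read(self, addr):
        self.read_set.add(addr)
        return self.write_buf.get(addr, self.ring.memory.get(addr, 0))

    def tx_write(self, addr, value):
        self.write_buf[addr] = value

    def snoop(self, addr):
        """Called for every committed write seen on the ring."""
        if addr in self.read_set or addr in self.write_buf:
            self.aborted = True
            self.read_set.clear()
            self.write_buf.clear()

    def commit(self):
        if self.aborted:
            return False
        ring = self.ring
        if ring.locked_by is not None:
            raise RuntimeError("ring already locked")
        ring.locked_by = self.cid                     # 1. lock ring
        for addr, value in self.write_buf.items():    # 2. commit TX writes
            ring.memory[addr] = value
            for other in ring.cores:
                if other is not self:
                    other.snoop(addr)                 # conflicting TXs abort
        ring.locked_by = None                         # 3. unlock ring
        self.read_set.clear()
        self.write_buf.clear()
        return True


ring = Ring()
c0, c1 = Core(0, ring), Core(1, ring)
for core in (c0, c1):
    core.tx_write(0x10, core.tx_read(0x10) + 1)       # both run a++
print(c0.commit(), c1.commit(), ring.memory[0x10])
```

In this model the first committer wins and the second core's transaction is aborted by the snooped write, so it has to re-execute before it can commit.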
Performance
• Eigenbench synthetic TM benchmark on 16 cores (lower is better):
  – Left: 10 element r/w set: overflows the TM cache
  – Right: 8 element r/w set: fits in the TM cache
[Figure: two Eigenbench plots, left and right, each annotated "HyTM better".]
Performance (cont.)
• From the STAMP TM benchmark suite:
  – SSCA2: An efficient and scalable graph kernel construction algorithm.
  – Intruder: A high abort rate benchmark.
• If the program scales, so do we… (higher is better)
[Figure: SSCA2 and Intruder plots. Annotations: "5-8% better"; "48% TX in HW (HW aborts are less expensive)".]
Future Work: TMbox 2?
• Distributed memory directory
• 4 FPGAs
• 64 cores
• Maps well on FPGA
• Similar to Stanford DASH
[Figure: BEE3 board with four FPGAs (A to D). Each FPGA is drawn with DDR, a directory and a switch. Board interfaces: RS232, PCIe, Ethernet. Labels: MIPS R3000; 8 KB I$1 + 8 KB D$1; 100 MHz; low-overhead, online profiling; 4 GB DDR2; 256 KB L2 cache.]
Bluespec SystemVerilog
• Functional language for HW modeling
  – Functional, object-oriented, rule-based
  – HW functional verification is fast and easy (static rule conditions verification)
  – Compiles to Verilog source code (better for component refinement)
• First prototype: MIPS 5-stage processor
  – Faster (100 MHz) and smaller
TMbox is available at: http://www.velox-project.eu/releases
Any questions?
Contact: [email protected]