ramp gold hardware and software architecture -...
TRANSCRIPT
RAMP Gold Hardware and Software Architecture
Zhangxi Tan, Yunsup Lee, Andrew Waterman, Rimas Avizienis, David Patterson, Krste Asanovic
UC Berkeley
Jan 2009
2 2
Personal
Health
Image
Retrieval
Hearing,
Music Speech
Parallel
Browser
Design Patterns/Motifs
Sketching
Legacy
Code Schedulers
Communication &
Synch. Primitives
Efficiency Language Compilers
Par Lab Research Overview Easy to write correct programs that run efficiently on manycore
Legacy OS
Multicore/GPGPU
OS Libraries & Services
ParLab Manycore/RAMP
Hypervisor
Corr
ectn
ess
Composition & Coordination Language (C&CL)
Parallel
Libraries
Parallel
Frameworks
Static
Verification
Dynamic
Checking
Debugging
with Replay
Directed
Testing Autotuners
C&CL Compiler/Interpreter
Efficiency
Languages
Type
Systems
Dia
gnosin
g P
ow
er/
Perf
orm
ance
3
RAMP Gold : A Parlab manycore emulator
Leverage RAMP FPGA emulation infrastructure to build prototypes of proposed architectural features
Fast enough to run real apps
“tapeout” everyday
RAMP Gold
Single-socket tiled manycore target SPARC v8 -> ISA neutural
Shared memory, distributed coherent cache
Multiple on-chip networks and memory channels
Split functional/timing model, both in hardware
Host multithreading of both functional and timing models
Parlab Manycore
Functional Model
Pipeline
Arch
State
Timing Model
Pipeline
Timing
State
Host multithreading Single hardware pipeline with multiple copies of CPU state
Virtualized the pipeline with fine-grained multithreading to emulate more target CPUs with high efficiency (i.e. MIPS/FPGA)
Hide emulation latencies
Not multithreading target
4
+1
PC
1 PC
1 PC
1 PC
1
I$ IR GPR1 GPR1 GPR1 GPR1
X
Y
A
L
U
D$
6 6
DE
6
Thread
Select
CPU
1
CPU
2
CPU
63
CPU
64 Target Model
Functional CPU model on FPGA
text
Decode Timing
Timing
State
I$ timing
PC
Inst / Partial
Decode result
Execution
Timing
D$ Timing
Mem Address
Commit Timing
Functional
Status
Functional
Pipeline Controls
Memory
Timing Model
text
Instruction
Fetch
Static Thread
Selection
(Round Robin)
Special
Registers
Host I$/I-
MMUDecode
32-bit Instruction
Tag/Data read request
Regfile
Decoding
32-bit
Multithreade
d Register
File
Decode ALU
control/Exception
Detectionimm
pc
OP2 OP1
Execution Unit 1
(Simple ALU)
Execution Unit 2
(Complex ALU)
Special Register
OPs
Memory
Host D$/
DMMU
Write Back/
Exception
Tag/Data read request
128-bit dataWrite Back
/ Exception
Host memory
interface
Host memory
interface
Thread
Selection
Instruction
Fetch
Decode
Register File
Access
Execution
Memory
Functional ModelTiming Model 5
RAMP Gold v1 model
Balancing Func. And Timing Models
6
Timing Model
Timing Model
Host DRAM Cache
G
P
R
1
G
P
R
1
G
P
R
1
G
P
R
1
G
P
R
1
G
P
R
1
Functional Model
Timing Model
Single functional model supports multiple timing models on FPGAs
RAMP Gold Implementation
7
Xilinx ML505
BEE3
Single FPGA Implementation
Low cost Xilinx ML505 board
64~128 cores, 2GB DDR2, FP, timing model, 100~130 MIPS
A 64-core functional model demo
Multi-FPGA Implementation
BEE3 : 4 Xilinx Virtex 5 LX155T
1K~2K cores, 64GB DDR2, FP, timing model Higher emulation capacity and memory
bandwidth
RAMP Gold Prototyping Version 1
Single FPGA implementation on Xilinx ML505
64~128 cores, integer only, 2G byte memory, running at 100 MHz
Simple timing model
Single in-order issue CPU with an ideal shared memory (2-cycle access latency)
RAMP performance counter support
Full verification environment Software simulator (C-gold)
RTL/netlist verification
HW on-chip verification
BSD license : Everything is built from scratch!
8
CPU Functional Model (1)
64 HW threads, full 32-bit SPARC v8 CPU
The same binary runs on both SUN boxes and RAMP
Optimized for emulation throughput (MIPS/FPGA)
1 cycle access latency for most of the instructions on host
Microcode operation for complex and new instructions
E.g. trap, active messages
Design for FPGA fabric for optimal performance
“Deep” pipeline : 11 physical stages, no bypassing network
DSP based ALU
ECC/parity protected RAM/cache lines and etc.
Double clocked BRAM/LUTRAM
Fine-tuned FPGA resource mapping
9
CPU Functional Model (2)
Status
Coded in Systemverilog
Passed the verification suite donated by SPARC International
Verified against our C functional simulator
Mapped and tested on HW @ 100 MHz
Maximum frequency > 130 MHz
FPGA resource consumption (XCV5LX50T)
1 CPU + SRAM controller + memory network
10
LUT Register BRAM DSP
2,719 4,852 13 2
9% 16% 22% 4%
Verification/Testing Flow
11
ELF to
BRAM
Loader
Standard
ELF Loader
Frontend Test Server
ELF to
DRAM
Loader
App Source
Files
(.S or .C)
GNU SPARC v8
Compiler/
Linker
Customized
Linker Script
(.lds)
RTL src files /
netlist
(.sv, .v)
ELF
Binaries
Modelsim SE/Questasim 6.4a
Host dynamic
simulation
libraries (.so)
libbfd
Disassembler C
implementationFrontend Links
Xilinx Unisim Library Systemverilog DPI interface
Simulation
logs
Checker
C Functional
Simulator
Frontend Links
Reference Result
FPGA Target
Xilinx ML505, BEE3
Frontend Links
HW state
dumps
Solaris/Linux Machine
(Handle Syscall)
GCC Toolchain SPARC cross compiler with newlib
Built with (binutils-2.18, gcc-4.3.2/gmp-4.3.2/mpfr-2.3.2, newlib-1.16.0)
sparc-elf-{gcc, g++, ld, nm, objdump, ranlib, strip, …}
Link newlib statically
newlib is a C library intended for use on embedded systems
C functions in newlib are narrowed down to 19 system calls
_exit, close, environ, execve, fork, fstat, getpid, isatty, kill, link, lseek, open, read, sbrk, stat, times, unlink, wait, write
Now we can compile our C, C++ source code (w/ standard C functions) to a SPARC executable
sparc-elf-gcc –o hello hello.c –lc –lsys –mcpu=v8
12
Frontend Machine
Multiple backends
C-gold functional simulator (link: function calls)
Modelsim simulator (link: DPI)
Actual H/W (Xilinx ML505, BEE3) (link: gigabit ethernet)
Narrow interface to support a new backend
Host/Target interface
CPU reset
Memory interface
{read,write}_{signed,unsigned}_{8,16,32,64}
Execute system calls received from the backend
Signaled by the backend proxy kernel
Map a Solaris system call to a Linux system call and execute 13
Proxy Kernel How could we support I/Os?
The target doesn’t have any peripherals (e.g. disks)
It would be a pain to program a system which can’t read or write to anything…
It would be more pain to make the peripherals work with the actual H/W
A minimal kernel which acts as a proxy for system calls invoked by newlib
Proxy kernel sends the arguments and the system call number to the frontend machine
The frontend machine does the actual system call and returns the results back to the proxy kernel
Finally the PC is moved back to the application and everybody is happy
14
C-gold Functional Simulator
Baseline functional model to verify our functional model written in system verilog
Full 32-bit SPARC v8
Includes an IEEE 754 compatible FPU
New instruction introduced to support active messages
SENDAM
Written from scratch, no junk in it
Very fast, ~25 MIPS
Easy to understand
Easy to add/modify modules for experiments
Flexible parameters (Number of target threads, host threads, …)
15
ELF to
BRAM
Loader
Standard
ELF Loader
Frontend Test Server
ELF to
DRAM
Loader
App Source
Files
(.S or .C)
GNU SPARC v8
Compiler/
Linker
Customized
Linker Script
(.lds)
RTL src files /
netlist
(.sv, .v)
ELF
Binaries
Modelsim SE/Questasim 6.4a
Host dynamic
simulation
libraries (.so)
libbfd
Disassembler C
implementationFrontend Links
Xilinx Unisim Library Systemverilog DPI interface
Simulation
logs
Checker
C Functional
Simulator
Frontend Links
Reference Result
FPGA Target
Xilinx ML505, BEE3
Frontend Links
HW state
dumps
Solaris/Linux Machine
(Handle Syscall)
16
Code & Cross-Compile
Proxy Kernel
write(…)
syscall req
syscall res
On the fly DEMO
17
Plans for the Next Version
Try out several research projects
Performance Counter
Virtual Local Stores
Enhance the functional model and timing model
Add FPUs
Memory/cache timing models
18
Acknowledgement
Special thanks to Prof. Kurt Keutzer for help from Synopsys!
19
Backup Slides
20
Pipeline Architecture
21
Instruction Fetch 1(Issue address Request)
Static Thread
Selection(Round Robin)
Special Registers
(pc/npc, wim, psr,
thread control
registers)
I-Cache
(nine 18kb
BRAMs)
Microcode ROM
Instruction Fetch 2(compare tag)
32-bit
Instruction
Synthesized
InstructionTag compare result
Micro inst.
Tag/Data read
request
Decode(Resolve Branch,
Decode register file
address)
Regfile Read2 cycles (pipelined)
32-bit
Multithreaded
Register File
(four 36kb
BRAMs)
Decode ALU
control/Exception
Detectionimm
pc
OP2 OP1
MUL/DIV/SHF(4 DSPs)
Simple ALU (1 DSP)
/LDST decoding
Special register
handling(RDPSR/RDWIM)
Mem request
under cache miss
Tag
Unaligned address
detection / Store
preparation
Issue Load(issue address)
D-Cache
(nine 18kb
BRAMs)
Trap/IRQ handling Read & Select
Tag/Data read
request
Tag / 128-bit data
Generate
microcode request
Load align /
Write Back
128-bit read & modify data
128-bit memory
interface
128-bit memory
interface
Thread
Selection
Instruction
Fetch 1
Decode
Register File
Access 1 & 2*
Execution
Memory 1
Write Back
/ Exception
LUT RAM (clk x2)
LUT ROM
BRAM (clk x2)
DSP (clk x2)
Instruction
Fetch 2
Register File
Access 3
Memory 2
Single issue in order pipeline (integer only) 11 pipeline stages (no forwarding) -> 7
logical stages
Static thread scheduling, zero overhead context switch
Avoid complex operations with “microcode”
E.g. traps, ST
32-bit I/O bus (threaded) with interrupt support
Physical implementation All BRAM/LUTRAM/DSP blocks in double
clocked or DDR mode
Manually-controlled BRAM mapping
LUTRAM mapping by memory compiler
Extra pipeline stages for routing
ECC/Parity protected BRAMs
Deep submicron effect on FPGAs
Implementation Challenges
22
CPU state storage
Where?
How large? Does it fit on FPGA?
Minimize FPGA resource consumption
E.g. Mapping ALU to DSPs
Host cache & TLB
Need cache?
Architecture and capacity
Bandwidth requirement and R/W access ports
host multithreading amplifies the requirement
State storage
23
Complete 32-bit SPARC v8 ISA w. traps/exceptions
All CPU states (integer only) are stored in SRAMs on FPGA
Per context register file -- BRAM
3 register windows stored in BRAM chunks of 64
8 (global) + 3*16 (reg window) = 54
6 special registers
pc/npc -- LUTRAM
PSR (Processor state register) -- LUTRAM
WIM (Register Window Mask) -- LUTRAM
Y (High 32-bit result for MUL/DIV) -- LUTRAM
TBR (Trap based registers) -- BRAM (packed with regfile)
Buffers for host multithreading (LUTRAM)
Maximum 64 threads per pipeline on Xilinx Virtex5
Bounded by LUTRAM depth (6-input LUTs)
Mapping SPARC ALU to DSP
24
Xilinx DSP48E advantage
48-bit add/sub/logic/mux + pattern detector
Easy to generate ALU flags: < 10 LUTs for C, O
Pipelined access over 500 MHz
DSP advantage
25
Instruction coverage (two double clocked DSPs / pipeline)
1 cycle ALU (1 DSP)
LD/ST (address calculation)
Bit-wise logic (and, or, …)
SETHI (value by pass)
JMPL, RETT, CALL (address calculation)
SAVE/RESTORE (add/sub)
WRPSR, RDPSR, RDWIM (XOR op)
Long latency ALU instructions (1 DSP)
Shift/MUL (2 cycles)
5%~10% logic save for 32-bit data path
Host Cache/TLB
26
Accelerating emulation performance! Need separate model for target cache
Per thread cache (Partitioned) Split I/D direct-map write-allocate write-back cache
Block size: 32 bytes (BEE3 DDR2 controller heart beat)
64-thread configuration: 256B I$, 256B D$
Size doubled in 32-thread configuration
Non-blocking cache, 64 outstanding requests (max)
Physical tags, indexed by virtual or physical address
$ size < page size
67% BRAM usage
Per thread TLB Split I/D direct-map TLB: 8 entries ITLB, 8 entries DTLB
Dummy currently
Static translation for Solaris virtual address layout
Cache-Memory Architecture
27
Cache controller Non-blocking pipelined access (3-stages) matches CPU pipeline Decoupled access/refill: allow pipelined, OOO mem accesses Tell the pipeline to “replay” inst. on miss 128-bit refill/write back data path
fill one block at 2x clk rate
RAMB18SDP RAMB36SDP (x72) RAMB36SDP (x72) RAMB36SDP (x72) RAMB36SDP (x72)
Tag (Parity)
512 x 36
Data (ECC)
512x72x4
Prepare LD/ST
address
Memory Stage (1)
Load Select / Routing
Cache FSM(Hit, exception, etc)
Exception/Write
Back Stage
Memory Stage (2)
Read & Modify
64-bit data
Tag
replay?
Pipeline Register
Write
Back
Cache
Integer Pipeline
Pipeline Register
Pipeline State
Control
Load Align/Sign
Memory
Command FIFO
64-bit data
+ Tag
128-bit data
Refill
Memory
Controller
128-bit
data
Memory request
addressVictim data
write back
Refill
Index
Mem ops
Lookup
Index
Example: A distributed memory non-cache coherent system
28
Eight multithreaded SPARC v8 pipelines in two clusters Each thread emulates one independent node
in target system 512 nodes/FPGA Predicted emulation performance:
~1 GIPS/FPGA (10% I$ miss, 30% D$ miss, 30% LD/ST)
x2 compared to naïve manycore implementation
Memory subsystem Total memory capacity 16 GB, 32MB/node
(512 nodes) One DDR2 memory controller per cluster Per FPGA bandwidth: 7.2 GB/s Memory space is partitioned to emulate
distributed memory system 144-bit wide credit-based memory network
Inter-node communication (under development) Two-level tree network with DMA to provide
all-to-all communication
Project Status
29
Done with RTL implementation
~12,000 lines synthesizable Systemverilog code
FPGA resource utilization per pipeline on Xilinx V5 LX110T
~4% logic (LUT), ~10% BRAM
Max 10 pipelines, but back off to 8 or less
Built RTL verification infrastructure
SPARC v8 certification test suite (donated by SPARC international) + Systemverilog
Can be used to run more programs but very slow
(~0.3 KIPS)
Project Status
30
Passed all SPARC v8 integer diagnostics in pre-synthesized RTL simulation
Run single threaded Solaris apps (5 syscall supported so far)
Working on HW verification after synthesis and P&R
Synthesized with an alpha version of Synplify
Will support MentorGraphics Precision 2008a Update 2 in late Nov
Planned Release in Jan 09
64/128 emulated CPUs on Xilinx ML505 board @ $500 + 2GB DDR2 DRAM cost
Source code will be available under BSD license
Thank you
31