ramp gold hardware and software architecture -...

RAMP Gold Hardware and Software Architecture

Zhangxi Tan, Yunsup Lee, Andrew Waterman, Rimas Avizienis, David Patterson, Krste Asanovic

UC Berkeley

Jan 2009

2 2

Personal

Health

Image

Retrieval

Hearing,

Music Speech

Parallel

Browser

Design Patterns/Motifs

Sketching

Legacy

Code Schedulers

Communication &

Synch. Primitives

Efficiency Language Compilers

Par Lab Research Overview Easy to write correct programs that run efficiently on manycore

Legacy OS

Multicore/GPGPU

OS Libraries & Services

ParLab Manycore/RAMP

Hypervisor

Corr

ectn

ess

Composition & Coordination Language (C&CL)

Parallel

Libraries

Parallel

Frameworks

Static

Verification

Dynamic

Checking

Debugging

with Replay

Directed

Testing Autotuners

C&CL Compiler/Interpreter

Efficiency

Languages

Type

Systems

Dia

gnosin

g P

ow

er/

Perf

orm

ance

3

RAMP Gold : A Parlab manycore emulator

Leverage RAMP FPGA emulation infrastructure to build prototypes of proposed architectural features

Fast enough to run real apps

“tapeout” everyday

RAMP Gold

Single-socket tiled manycore target SPARC v8 -> ISA neutural

Shared memory, distributed coherent cache

Multiple on-chip networks and memory channels

Split functional/timing model, both in hardware

Host multithreading of both functional and timing models

Parlab Manycore

Functional Model

Pipeline

Arch

State

Timing Model

Pipeline

Timing

State

Host multithreading Single hardware pipeline with multiple copies of CPU state

Virtualized the pipeline with fine-grained multithreading to emulate more target CPUs with high efficiency (i.e. MIPS/FPGA)

Hide emulation latencies

Not multithreading target

4

+1

PC

1 PC

1 PC

1 PC

1

I$ IR GPR1 GPR1 GPR1 GPR1

X

Y

A

L

U

D$

6 6

DE

6

Thread

Select

CPU

1

CPU

2

CPU

63

CPU

64 Target Model

Functional CPU model on FPGA

text

Decode Timing

Timing

State

I$ timing

PC

Inst / Partial

Decode result

Execution

Timing

D$ Timing

Mem Address

Commit Timing

Functional

Status

Functional

Pipeline Controls

Memory

Timing Model

text

Instruction

Fetch

Static Thread

Selection

(Round Robin)

Special

Registers

Host I$/I-

MMUDecode

32-bit Instruction

Tag/Data read request

Regfile

Decoding

32-bit

Multithreade

d Register

File

Decode ALU

control/Exception

Detectionimm

pc

OP2 OP1

Execution Unit 1

(Simple ALU)

Execution Unit 2

(Complex ALU)

Special Register

OPs

Memory

Host D$/

DMMU

Write Back/

Exception

Tag/Data read request

128-bit dataWrite Back

/ Exception

Host memory

interface

Host memory

interface

Thread

Selection

Instruction

Fetch

Decode

Register File

Access

Execution

Memory

Functional ModelTiming Model 5

RAMP Gold v1 model

Balancing Func. And Timing Models

6

Timing Model

Timing Model

Host DRAM Cache

G

P

R

1

G

P

R

1

G

P

R

1

G

P

R

1

G

P

R

1

G

P

R

1

Functional Model

Timing Model

Single functional model supports multiple timing models on FPGAs

RAMP Gold Implementation

7

Xilinx ML505

BEE3

Single FPGA Implementation

Low cost Xilinx ML505 board

64~128 cores, 2GB DDR2, FP, timing model, 100~130 MIPS

A 64-core functional model demo

Multi-FPGA Implementation

BEE3 : 4 Xilinx Virtex 5 LX155T

1K~2K cores, 64GB DDR2, FP, timing model Higher emulation capacity and memory

bandwidth

RAMP Gold Prototyping Version 1

Single FPGA implementation on Xilinx ML505

64~128 cores, integer only, 2G byte memory, running at 100 MHz

Simple timing model

Single in-order issue CPU with an ideal shared memory (2-cycle access latency)

RAMP performance counter support

Full verification environment Software simulator (C-gold)

RTL/netlist verification

HW on-chip verification

BSD license : Everything is built from scratch!

8

CPU Functional Model (1)

64 HW threads, full 32-bit SPARC v8 CPU

The same binary runs on both SUN boxes and RAMP

Optimized for emulation throughput (MIPS/FPGA)

1 cycle access latency for most of the instructions on host

Microcode operation for complex and new instructions

E.g. trap, active messages

Design for FPGA fabric for optimal performance

“Deep” pipeline : 11 physical stages, no bypassing network

DSP based ALU

ECC/parity protected RAM/cache lines and etc.

Double clocked BRAM/LUTRAM

Fine-tuned FPGA resource mapping

9

CPU Functional Model (2)

Status

Coded in Systemverilog

Passed the verification suite donated by SPARC International

Verified against our C functional simulator

Mapped and tested on HW @ 100 MHz

Maximum frequency > 130 MHz

FPGA resource consumption (XCV5LX50T)

1 CPU + SRAM controller + memory network

10

LUT Register BRAM DSP

2,719 4,852 13 2

9% 16% 22% 4%

Verification/Testing Flow

11

ELF to

BRAM

Loader

Standard

ELF Loader

Frontend Test Server

ELF to

DRAM

Loader

App Source

Files

(.S or .C)

GNU SPARC v8

Compiler/

Linker

Customized

Linker Script

(.lds)

RTL src files /

netlist

(.sv, .v)

ELF

Binaries

Modelsim SE/Questasim 6.4a

Host dynamic

simulation

libraries (.so)

libbfd

Disassembler C

implementationFrontend Links

Xilinx Unisim Library Systemverilog DPI interface

Simulation

logs

Checker

C Functional

Simulator

Frontend Links

Reference Result

FPGA Target

Xilinx ML505, BEE3

Frontend Links

HW state

dumps

Solaris/Linux Machine

(Handle Syscall)

GCC Toolchain SPARC cross compiler with newlib

Built with (binutils-2.18, gcc-4.3.2/gmp-4.3.2/mpfr-2.3.2, newlib-1.16.0)

sparc-elf-{gcc, g++, ld, nm, objdump, ranlib, strip, …}

Link newlib statically

newlib is a C library intended for use on embedded systems

C functions in newlib are narrowed down to 19 system calls

_exit, close, environ, execve, fork, fstat, getpid, isatty, kill, link, lseek, open, read, sbrk, stat, times, unlink, wait, write

Now we can compile our C, C++ source code (w/ standard C functions) to a SPARC executable

sparc-elf-gcc –o hello hello.c –lc –lsys –mcpu=v8

12

Frontend Machine

Multiple backends

C-gold functional simulator (link: function calls)

Modelsim simulator (link: DPI)

Actual H/W (Xilinx ML505, BEE3) (link: gigabit ethernet)

Narrow interface to support a new backend

Host/Target interface

CPU reset

Memory interface

{read,write}_{signed,unsigned}_{8,16,32,64}

Execute system calls received from the backend

Signaled by the backend proxy kernel

Map a Solaris system call to a Linux system call and execute 13

Proxy Kernel How could we support I/Os?

The target doesn’t have any peripherals (e.g. disks)

It would be a pain to program a system which can’t read or write to anything…

It would be more pain to make the peripherals work with the actual H/W

A minimal kernel which acts as a proxy for system calls invoked by newlib

Proxy kernel sends the arguments and the system call number to the frontend machine

The frontend machine does the actual system call and returns the results back to the proxy kernel

Finally the PC is moved back to the application and everybody is happy

14

C-gold Functional Simulator

Baseline functional model to verify our functional model written in system verilog

Full 32-bit SPARC v8

Includes an IEEE 754 compatible FPU

New instruction introduced to support active messages

SENDAM

Written from scratch, no junk in it

Very fast, ~25 MIPS

Easy to understand

Easy to add/modify modules for experiments

Flexible parameters (Number of target threads, host threads, …)

15

ELF to

BRAM

Loader

Standard

ELF Loader

Frontend Test Server

ELF to

DRAM

Loader

App Source

Files

(.S or .C)

GNU SPARC v8

Compiler/

Linker

Customized

Linker Script

(.lds)

RTL src files /

netlist

(.sv, .v)

ELF

Binaries

Modelsim SE/Questasim 6.4a

Host dynamic

simulation

libraries (.so)

libbfd

Disassembler C

implementationFrontend Links

Xilinx Unisim Library Systemverilog DPI interface

Simulation

logs

Checker

C Functional

Simulator

Frontend Links

Reference Result

FPGA Target

Xilinx ML505, BEE3

Frontend Links

HW state

dumps

Solaris/Linux Machine

(Handle Syscall)

16

Code & Cross-Compile

Proxy Kernel

write(…)

syscall req

syscall res

On the fly DEMO

17

Plans for the Next Version

Try out several research projects

Performance Counter

Virtual Local Stores

Enhance the functional model and timing model

Add FPUs

Memory/cache timing models

18

Acknowledgement

Special thanks to Prof. Kurt Keutzer for help from Synopsys!

19

Backup Slides

20

Pipeline Architecture

21

Instruction Fetch 1(Issue address Request)

Static Thread

Selection(Round Robin)

Special Registers

(pc/npc, wim, psr,

thread control

registers)

I-Cache

(nine 18kb

BRAMs)

Microcode ROM

Instruction Fetch 2(compare tag)

32-bit

Instruction

Synthesized

InstructionTag compare result

Micro inst.

Tag/Data read

request

Decode(Resolve Branch,

Decode register file

address)

Regfile Read2 cycles (pipelined)

32-bit

Multithreaded

Register File

(four 36kb

BRAMs)

Decode ALU

control/Exception

Detectionimm

pc

OP2 OP1

MUL/DIV/SHF(4 DSPs)

Simple ALU (1 DSP)

/LDST decoding

Special register

handling(RDPSR/RDWIM)

Mem request

under cache miss

Tag

Unaligned address

detection / Store

preparation

Issue Load(issue address)

D-Cache

(nine 18kb

BRAMs)

Trap/IRQ handling Read & Select

Tag/Data read

request

Tag / 128-bit data

Generate

microcode request

Load align /

Write Back

128-bit read & modify data

128-bit memory

interface

128-bit memory

interface

Thread

Selection

Instruction

Fetch 1

Decode

Register File

Access 1 & 2*

Execution

Memory 1

Write Back

/ Exception

LUT RAM (clk x2)

LUT ROM

BRAM (clk x2)

DSP (clk x2)

Instruction

Fetch 2

Register File

Access 3

Memory 2

Single issue in order pipeline (integer only) 11 pipeline stages (no forwarding) -> 7

logical stages

Static thread scheduling, zero overhead context switch

Avoid complex operations with “microcode”

E.g. traps, ST

32-bit I/O bus (threaded) with interrupt support

Physical implementation All BRAM/LUTRAM/DSP blocks in double

clocked or DDR mode

Manually-controlled BRAM mapping

LUTRAM mapping by memory compiler

Extra pipeline stages for routing

ECC/Parity protected BRAMs

Deep submicron effect on FPGAs

Implementation Challenges

22

CPU state storage

Where?

How large? Does it fit on FPGA?

Minimize FPGA resource consumption

E.g. Mapping ALU to DSPs

Host cache & TLB

Need cache?

Architecture and capacity

Bandwidth requirement and R/W access ports

host multithreading amplifies the requirement

State storage

23

Complete 32-bit SPARC v8 ISA w. traps/exceptions

All CPU states (integer only) are stored in SRAMs on FPGA

Per context register file -- BRAM

3 register windows stored in BRAM chunks of 64

8 (global) + 3*16 (reg window) = 54

6 special registers

pc/npc -- LUTRAM

PSR (Processor state register) -- LUTRAM

WIM (Register Window Mask) -- LUTRAM

Y (High 32-bit result for MUL/DIV) -- LUTRAM

TBR (Trap based registers) -- BRAM (packed with regfile)

Buffers for host multithreading (LUTRAM)

Maximum 64 threads per pipeline on Xilinx Virtex5

Bounded by LUTRAM depth (6-input LUTs)

Mapping SPARC ALU to DSP

24

Xilinx DSP48E advantage

48-bit add/sub/logic/mux + pattern detector

Easy to generate ALU flags: < 10 LUTs for C, O

Pipelined access over 500 MHz

DSP advantage

25

Instruction coverage (two double clocked DSPs / pipeline)

1 cycle ALU (1 DSP)

LD/ST (address calculation)

Bit-wise logic (and, or, …)

SETHI (value by pass)

JMPL, RETT, CALL (address calculation)

SAVE/RESTORE (add/sub)

WRPSR, RDPSR, RDWIM (XOR op)

Long latency ALU instructions (1 DSP)

Shift/MUL (2 cycles)

5%~10% logic save for 32-bit data path

Host Cache/TLB

26

Accelerating emulation performance! Need separate model for target cache

Per thread cache (Partitioned) Split I/D direct-map write-allocate write-back cache

Block size: 32 bytes (BEE3 DDR2 controller heart beat)

64-thread configuration: 256B I$, 256B D$

Size doubled in 32-thread configuration

Non-blocking cache, 64 outstanding requests (max)

Physical tags, indexed by virtual or physical address

$ size < page size

67% BRAM usage

Per thread TLB Split I/D direct-map TLB: 8 entries ITLB, 8 entries DTLB

Dummy currently

Static translation for Solaris virtual address layout

Cache-Memory Architecture

27

Cache controller Non-blocking pipelined access (3-stages) matches CPU pipeline Decoupled access/refill: allow pipelined, OOO mem accesses Tell the pipeline to “replay” inst. on miss 128-bit refill/write back data path

fill one block at 2x clk rate

RAMB18SDP RAMB36SDP (x72) RAMB36SDP (x72) RAMB36SDP (x72) RAMB36SDP (x72)

Tag (Parity)

512 x 36

Data (ECC)

512x72x4

Prepare LD/ST

address

Memory Stage (1)

Load Select / Routing

Cache FSM(Hit, exception, etc)

Exception/Write

Back Stage

Memory Stage (2)

Read & Modify

64-bit data

Tag

replay?

Pipeline Register

Write

Back

Cache

Integer Pipeline

Pipeline Register

Pipeline State

Control

Load Align/Sign

Memory

Command FIFO

64-bit data

+ Tag

128-bit data

Refill

Memory

Controller

128-bit

data

Memory request

addressVictim data

write back

Refill

Index

Mem ops

Lookup

Index

Example: A distributed memory non-cache coherent system

28

Eight multithreaded SPARC v8 pipelines in two clusters Each thread emulates one independent node

in target system 512 nodes/FPGA Predicted emulation performance:

~1 GIPS/FPGA (10% I$ miss, 30% D$ miss, 30% LD/ST)

x2 compared to naïve manycore implementation

Memory subsystem Total memory capacity 16 GB, 32MB/node

(512 nodes) One DDR2 memory controller per cluster Per FPGA bandwidth: 7.2 GB/s Memory space is partitioned to emulate

distributed memory system 144-bit wide credit-based memory network

Inter-node communication (under development) Two-level tree network with DMA to provide

all-to-all communication

Project Status

29

Done with RTL implementation

~12,000 lines synthesizable Systemverilog code

FPGA resource utilization per pipeline on Xilinx V5 LX110T

~4% logic (LUT), ~10% BRAM

Max 10 pipelines, but back off to 8 or less

Built RTL verification infrastructure

SPARC v8 certification test suite (donated by SPARC international) + Systemverilog

Can be used to run more programs but very slow

(~0.3 KIPS)

Project Status

30

Passed all SPARC v8 integer diagnostics in pre-synthesized RTL simulation

Run single threaded Solaris apps (5 syscall supported so far)

Working on HW verification after synthesis and P&R

Synthesized with an alpha version of Synplify

Will support MentorGraphics Precision 2008a Update 2 in late Nov

Planned Release in Jan 09

64/128 emulated CPUs on Xilinx ML505 board @ $500 + 2GB DDR2 DRAM cost

Source code will be available under BSD license

Thank you

31

ramp gold hardware and software architecture -...

Documents