
Post on 21-Dec-2015


1

Computer Architecture Lab at

Usability Challenges for RAMP2

Eric Chung, James C. Hoe

2

ProtoFlex in a nutshell

• FPGA-accelerated full-system simulation by virtualization

– Hybrid Full-System Simulation

– Multiprocessor Host Interleaving

[Figure: hybrid simulation diagram (CPU, memory, devices; common-case vs. uncommon behaviors) and host interleaving diagram (multiple processors P on 4-way host pipelines, memory). Stamp: SINGLE-SLIDE REVIEW.]

3

In the beginning . . .

4

Then came . . .

[Screenshot: simulated target console window and host command line for breakpoints, introspection, modification. Callouts: inspect/modify registers; create checkpoint; undo the last instruction!]

5

Now “they” want . . .

• Using SW simulator, takes 5 lines of Python

– Body of callback runs arbitrary instrumentation code

Execute ‘exc_callback’ every time a CPU hits an exception

Print out exception name and triggering PC
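As a rough illustration of what such a callback looks like in a software simulator, the sketch below uses a tiny stand-in hook registry instead of the real Simics API. `MockSimulator`, `add_hook`, `raise_exception`, and the exception name are all hypothetical.

```python
# Minimal sketch of an exception callback in a software simulator.
# MockSimulator and its methods are hypothetical stand-ins, not the Simics API.

class MockSimulator:
    def __init__(self):
        self.hooks = {}          # event name -> list of callbacks

    def add_hook(self, event, callback):
        self.hooks.setdefault(event, []).append(callback)

    def raise_exception(self, cpu, name, pc):
        # Called by the (mock) CPU model whenever an exception is taken.
        for callback in self.hooks.get("exception", []):
            callback(cpu, name, pc)

log = []

def exc_callback(cpu, name, pc):
    # The body can run arbitrary instrumentation code.
    line = f"cpu{cpu}: {name} at pc=0x{pc:x}"
    log.append(line)
    print(line)

sim = MockSimulator()
sim.add_hook("exception", exc_callback)
sim.raise_exception(0, "illegal_instruction", 0x10004000)  # made-up event
```

The registration and the callback body are the only parts a user writes; everything else belongs to the simulator, which is why this stays at a handful of lines in software.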

How would you do this in FPGAs?

6

What else could “they” want

• Interaction with virtual display of target system

• Fully deterministic and controllable execution

• Command-line control and scripting capabilities

• API for state inspection/modification

• Modularity features for adding/changing components

• Checkpoint save/restore

• Host-target communication (e.g., for bootstrapping)

• Full-system I/O capabilities (e.g., OS)

• Target resource virtualization

7

Outline

• Introduction

• Practical Feature Development

• Case Study: ProtoFlex Monitoring

• Closing Thoughts

8

Practical Feature Development

• Porting simulator features to FPGAs is not easy

– RTL modification almost always required

– Unlike in SW, state in an FPGA is not easy to inspect/modify (yet this is required in most cases)

• Goal: make feature porting easier!

– With minimum FPGA expertise

9

Example

• Using SW simulator, takes 5 lines of Python

– Body of callback runs arbitrary instrumentation code

Execute ‘exc_callback’ every time a CPU hits an exception

Print out exception name and triggering PC

How would we implement this in FPGAs?

10

How to implement in FPGA?

• Necessary steps

– Modify RTL of FPGA soft core to monitor exceptions (add bits to pipeline stages, modify decoder)

– Collect PC register during exceptions into trace buffer

– Simulate, debug, synthesize, place + route

– Collect/compress traces from multiple CPU cores (possibly across multiple FPGAs)

– Decompress/post-process traces and print
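The host-side half of these steps (collecting, compressing, decompressing, and printing traces) can be sketched in a few lines. The record layout, exception codes, and function names below are assumptions for illustration only; the slides do not specify a trace format.

```python
import struct
import zlib

# Hypothetical record layout: 1-byte CPU id, 1-byte exception code, 8-byte PC.
RECORD = struct.Struct(">BBQ")
EXC_NAMES = {1: "illegal_instruction", 2: "tlb_miss"}  # illustrative codes

def pack_trace(events):
    """Pack (cpu, exc_code, pc) tuples into one buffer, as a trace buffer might."""
    return b"".join(RECORD.pack(cpu, code, pc) for cpu, code, pc in events)

def collect(per_core_buffers):
    """Concatenate per-core buffers and compress them for transfer to the host."""
    return zlib.compress(b"".join(per_core_buffers))

def post_process(blob):
    """Decompress on the host and turn each record into a printable line."""
    raw = zlib.decompress(blob)
    lines = []
    for cpu, code, pc in RECORD.iter_unpack(raw):
        lines.append(f"cpu{cpu}: {EXC_NAMES.get(code, 'unknown')} at pc=0x{pc:x}")
    return lines

core0 = pack_trace([(0, 1, 0x4000), (0, 2, 0x4010)])
core1 = pack_trace([(1, 2, 0x8000)])
report = post_process(collect([core0, core1]))
for line in report:
    print(line)
```

Even this software portion is more code than the Python callback, and it leaves out the RTL changes, simulation, and place-and-route steps that dominate the effort.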

Can we reduce effort for RAMP developers?

11

Justifying the hardware

• For some efforts, RTL change unavoidable

– Ex: redesign memory subsystem, change # cores

• But for other things, can we do better?

– Instrumentation example from earlier? (print the PC during exceptions)

– Testing a new instruction?

– Inspecting a few CPU registers?

12

Observation

• Only frequent uses of a given hardware modification benefit from FPGA speedup

• Can we relegate infrequent events to software?

• Examples

– Instrumenting rare events (e.g. exceptions)

– Monitoring/analyzing subset of instruction traces

– Periodic sampling of counters

– Monitor range of ‘watched’ memory addresses
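A back-of-the-envelope model makes the observation concrete: if each event costs a fixed amount of software time, throughput depends on how often the event occurs. The function and all numbers below are hypothetical, chosen only to show the shape of the tradeoff.

```python
def effective_mips(fpga_mips, events_per_million, sw_cost_us):
    """Throughput when each event is handled in software.

    fpga_mips          : simulation rate with no events (million instr/s)
    events_per_million : events per million simulated instructions
    sw_cost_us         : software handling time per event, in microseconds
    """
    base_us = 1e6 / fpga_mips                 # time for 1M instructions, in us
    total_us = base_us + events_per_million * sw_cost_us
    return 1e6 / total_us                     # million instr per second

# Illustrative inputs only: a 10 MIPS engine and 100 us per software event.
rare = effective_mips(10, 1, 100)             # one event per million instructions
frequent = effective_mips(10, 10_000, 100)    # one event per hundred instructions
print(f"rare events: {rare:.2f} MIPS, frequent events: {frequent:.2f} MIPS")
```

Under these assumed numbers the rare case loses about 0.1% of throughput while the frequent case loses over 90%, which is the argument for keeping only frequent events in hardware.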

13

Outline

• Usability Challenges

• Practical Feature Development

• Case Study: ProtoFlex Monitoring

• Closing Thoughts

14

Case Study: ProtoFlex Monitoring

• Our objective:

– Diagnose an ‘anomaly’ while running commercial apps in BlueSPARC simulator*

• Requirements:

– At runtime, get names of processes running on CPUs

– Extract/verify user- and kernel-level stack traces

– WITHOUT modifications to target workload or OS

*BlueSPARC is our 16-CPU full-system FPGA-based simulator

15

Technique Used: Whitebox Tool

• Whitebox Profiling

– Input: real-time traces from full-system simulation

– Output: human-readable stack traces and visualization

– Simics tool authored by Mike Ferdman & Brian Gold

– Fewer than 300 lines of ‘Simics’ Python

[Figure: DB2 2-CPU-1CL (CPU 0 - Server). Per-process breakdown over 20 intervals covering 0 to 2B cycles; vertical axis labeled “% user”, ticks 0 to 120000. Legend: unknown, tpcc, sh, sched, nscd, fsflush, db2sysc, db2set, db2fmp, db2fmd, db2fmcd, db2fm, db2bp, automountd.]

16

Supporting Whitebox in ProtoFlex

• Basic whitebox technique:

– Simulation runtime is periodically halted

– Various registers are first checked

– Virtual-to-physical translations used to locate key data structures

– Physical memory reads are used to extract kernel state

• Naïve solution

– Add state machine to FPGA soft core to perform the steps

– Works but inflexible; may require significant HW changes
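The basic whitebox steps listed above (check a register, translate a virtual address, read physical memory) can be sketched against a mock target. Every address, register name, and data layout here is invented for illustration and does not describe any real OS or the ProtoFlex implementation.

```python
# Hypothetical target: every address, register name, and layout below is
# made up for illustration.

PAGE = 0x1000

class MockTarget:
    def __init__(self):
        self.registers = {"cur_thread": 0x7000_0040}     # virtual pointer
        self.page_table = {0x7000_0000: 0x0002_0000}     # vpage -> ppage
        self.phys_mem = {}

    def write_phys(self, paddr, data):
        for i, byte in enumerate(data):
            self.phys_mem[paddr + i] = byte

    def read_register(self, name):
        return self.registers[name]

    def translate(self, vaddr):
        vpage = vaddr & ~(PAGE - 1)
        return self.page_table[vpage] | (vaddr & (PAGE - 1))

    def read_memory(self, paddr, length):
        return bytes(self.phys_mem.get(paddr + i, 0) for i in range(length))

def current_process_name(target):
    """Check a register, translate it, then read physical memory."""
    vaddr = target.read_register("cur_thread")
    paddr = target.translate(vaddr)
    raw = target.read_memory(paddr, 16)
    return raw.split(b"\0", 1)[0].decode("ascii")

target = MockTarget()
target.write_phys(0x0002_0040, b"db2sysc\0")   # pretend kernel data structure
print(current_process_name(target))
```

The inspection logic itself is three calls; the difficulty on an FPGA is providing those three operations, not writing the logic that uses them.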

Is there an easier way?

17

Solution: Hybrid Simulation for Monitoring

• ProtoFlex hybrid simulation

– Recall in ProtoFlex: CPU pipeline implements only subset of instructions; nearby hard core simulates ISA remainder

[Figure: Virtex II Pro 70. Labels: 16-way pipeline; PowerPC (hard core); processor bus; Ethernet interface to 2nd FPGA (memory).]

PowerPC simulates unimplemented SPARC instructions. Operation is called a ‘Transplant’.

Transplants can be used for monitoring!

18

Transplants for Monitoring

[Figure: Virtex II Pro 70. Labels: 16-way pipeline; PowerPC (hard core); processor bus; Ethernet interface to 2nd FPGA (memory).]

1) ‘Simulation’ engine periodically requests transplant to PowerPC

2) PowerPC performs ‘inspection’ by requesting register/memory state from engine

transplant() {
    …
    read_register(…)
    translate(…)
    read_memory(…)
    …
}

Inspection/monitoring code written in C language
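The two-step flow can be modeled in software to show the control transfer. This is a sketch in Python rather than the C used on the PowerPC, and `MockEngine`, its fields, and the sampling period are assumptions, not the ProtoFlex interfaces.

```python
# Hypothetical model of periodic monitoring transplants; names are illustrative.

class MockEngine:
    def __init__(self, period, monitor):
        self.period = period          # instructions between transplants
        self.monitor = monitor        # software handler invoked on a transplant
        self.retired = 0
        self.pc = 0x1000

    def read_register(self, name):
        # The engine services state requests on behalf of the monitor.
        return {"pc": self.pc, "retired": self.retired}[name]

    def run(self, instructions):
        for _ in range(instructions):
            self.pc += 4              # stand-in for executing one instruction
            self.retired += 1
            if self.retired % self.period == 0:
                self.monitor(self)    # step 1: request a transplant

samples = []

def transplant(engine):
    # Step 2: the handler inspects state by asking the engine for it.
    samples.append((engine.read_register("retired"), engine.read_register("pc")))

engine = MockEngine(period=1000, monitor=transplant)
engine.run(3500)
print(samples)
```

Changing what gets sampled means editing only `transplant`, which mirrors the appeal of keeping the inspection code in software.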

SEE DEMO!

19

Tradeoffs

• Advantages

– SW approach to flexibly monitor events of interest

– For rare events, performance impact negligible

– Validate instrumentation idea before building in HW

• Disadvantages

– If events occur too frequently, must accelerate in HW

– How to know which HW interfaces to provide?

– How to know which events to monitor?

– How to scale to multiple engines?

20

Designing the HW/SW Interface

• In our design, we needed new interfaces between ProtoFlex engine & PowerPC

– Engine can issue memory requests on behalf of PowerPC

– Engine can issue TLB translations on behalf of PowerPC

Still required HW modification!

• For general-purpose monitoring, what interfaces needed?

21

Designing the HW/SW Interface

• Existing simulators are a good place to look

– E.g., Simics provides a library of over 100 API calls used for inspection/modification/monitoring

• Example API methods:

– read_register(), write_register(), translate(), etc.

• Also over 50 unique ‘event’ types in Simics:

– Used to trigger monitoring callback functions

– Ex: exceptions, watched memory locations, etc.
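One way to picture an event-driven monitoring interface is a registry that maps event types to callbacks, with watched memory ranges as one event source. The `EventBus` class and its method names are hypothetical and are not taken from the Simics API.

```python
from collections import defaultdict

class EventBus:
    """Hypothetical event registry; not the Simics API."""

    def __init__(self):
        self.callbacks = defaultdict(list)    # event type -> callbacks
        self.watched = []                     # list of (start, end) address ranges

    def register(self, event_type, callback):
        self.callbacks[event_type].append(callback)

    def watch(self, start, length):
        self.watched.append((start, start + length))

    def fire(self, event_type, **info):
        for callback in self.callbacks[event_type]:
            callback(**info)

    def on_memory_access(self, cpu, addr):
        # Only accesses inside a watched range trigger callbacks.
        if any(start <= addr < end for start, end in self.watched):
            self.fire("watched_memory", cpu=cpu, addr=addr)

hits = []
bus = EventBus()
bus.register("watched_memory", lambda cpu, addr: hits.append((cpu, addr)))
bus.watch(0x2000, 0x100)
for address in (0x1FF8, 0x2000, 0x20FF, 0x2100):
    bus.on_memory_access(cpu=0, addr=address)
print([(cpu, hex(addr)) for cpu, addr in hits])
```

In a RAMP setting the range check would be the part worth placing in hardware, since it runs on every access, while the callbacks could stay in software.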

Build these APIs into RAMP?

22

Addressing Scalability Challenges

• In BlueSPARCv1.0, only 1 centrally located core performs monitoring

– How to scale monitoring up to tens or hundreds of cores?

• Can we disable ½ of the host cores?

– To monitor the other half

– And provide general-purpose instrumentation / monitoring?

– Compile Simics API calls into distributed kernels that run on cores in monitoring mode

[Figure: row of eight CPUs.]
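The half-and-half idea amounts to a partitioning of host cores. The sketch below shows one possible pairing (each even-numbered core simulates and its odd-numbered neighbor monitors it); this policy and the function name are assumptions, since the slides leave the assignment open.

```python
def split_cores(num_host_cores):
    """Pair each simulating core with a monitoring core (even/odd split).

    The pairing policy is an assumption made for illustration.
    """
    if num_host_cores % 2:
        raise ValueError("need an even number of host cores")
    return {sim: sim + 1 for sim in range(0, num_host_cores, 2)}

assignment = split_cores(8)
for sim, mon in assignment.items():
    print(f"core {sim} simulates, core {mon} monitors it")
```

A local pairing like this keeps monitoring traffic between neighbors, in contrast to routing every request to a single central core.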

23

Outline

• Usability Challenges

• Practical Feature Development

• Case Study: ProtoFlex Monitoring

• Closing Thoughts

24

Closing Thoughts

• Attention to user and developer usability is critical for practical RAMP adoption

– Goal: minimize FPGA expertise required when possible

• For users, provide familiar SW-simulation interface

• For developers, provide general-purpose monitoring that is programmable, comprehensive, and scalable

25

Ongoing Work at CMU

• BlueSPARC simulator (ProtoFlex)

– Currently supports subset of Simics user interface

– Supports general-purpose software programmable monitoring

– Virtual console/GFX supported via hybrid simulation

• Still many challenges left

– Not all Simics commands map easily to FPGA

– Execution is non-deterministic

– Checkpoint generation/loading works (but very slow)

– No ‘Undo-ing’ instructions

– Fine-grained ‘stepping’ for large-scale configurations

– Minor monitoring changes still require re-synthesizing

Release planned for 2009

26

Thanks! Any questions?

http://www.ece.cmu.edu/~protoflex

Acknowledgements

We would like to thank our colleagues in the RAMP and TRUSS projects.

COME SEE OUR DEMO!

27

BACKUP

28

Typical Simulator ‘Must-Haves’

• Features commonly available in simulators today:

– Interaction with virtual display of target system

– Fully deterministic and controllable execution

– Command-line control and scripting capabilities

– API for state inspection/modification

– Modularity features for adding/changing components

– Checkpoint save/restore

– Host-target communication (e.g., for bootstrapping)

– Full-system I/O capabilities (e.g., OS)

– Target resource virtualization

29

Software Usage Example

[Screenshot: simulated target console window and host command line for breakpoints, introspection, modification. Callouts: inspect/modify registers; create checkpoint; undo the last instruction!]

30

Bringing SW features to RAMP

• Can users with no knowledge of FPGAs use RAMP out-of-the-box?

• The litmus test

– A user at the simulator front-end cannot tell whether the back-end is FPGAs or software

31

Closing Thought: Unification

• Common UI to ‘unify’ simulators and FPGAs

– Ex: use ‘Simics’ front-end, back-end is either FPGAs or SW (ProtoFlex has limited form of this)

– Avoid reinventing API/interface; users already familiar

• Benefits of interoperability:

– Gentle transition of whole generation of full-system simulation users to RAMP

– Support legacy scripts, workloads, configurations

32

Other Simulation Features

• How to provide full-system checkpoints?

– Must save/restore CPU/memory/device states

– But can’t just quickly dump/load 64GB of memory!

• Supporting ‘pause’ and ‘rewind’ in HW

• Deterministic/controllable execution

• ‘Instantaneous’ inspection/modification of distributed CPU/Mem/Device state

33

Case Study: ProtoFlex WhiteBox

• Our goals

– Profile IBM DB2/TPCC

– Identify which processes are executing on each CPU at fine-grained intervals (1000s of instructions)

• Technique

– Periodically suspend simulator then access kernel data structures (in known physical memory locations)

– Extract process information from kernel

34

Tools for Visualization/Monitoring

• How to build tools that can make sense out of the behavior of 1000 concurrent threads?

• Dataflow visualization

– E.g., Data flow tomography [Sherwood08]

• Performance monitoring

– E.g., Estimate multi-core cache miss rates

• Black-box program profiling

– E.g., Invisible kernel introspection

35

How to instrument in a practical way?

• E.g., adding new counter to CPU

• Can we have our cake and eat it too?

– SW-like programming abstraction

– Without resynthesizing, while keeping FPGA speeds

36

Example 1: Real-time Cache Models

• Generate cache model performance in real time

• Applications:

– Generating cache state checkpoints

37

Example 2: Black-Box Profiling

• Used in ProtoFlex to profile black-box commercial workloads (e.g., IBM DB2, Oracle)

38

Life-cycle of simulation (approximately)

[Figure: Great Idea → Design → Implement + Instrument → Simulate → Measure → Publish]

40

Back-of-the-envelope calculation

• Let’s calculate opportunity cost of HW-simulation

• Assumptions

– Only goal is to measure given metric (e.g., IPC)

– Don’t care about prototyping

• 12 hours to design, simulate, P&R

– 12 hours at 1 KIPS = 12 x 3600 s x 1,000 instructions/s ≈ 43M instructions

– On a cluster of 100, can simulate ~4B instructions using detailed timing models in 24 hours
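The arithmetic behind these figures can be checked directly. The sketch below assumes the 1 KIPS software simulation rate from the slide; the helper name is hypothetical. It also computes the per-node rate that the ~4B figure would imply over a full 24 hours, since the slide does not state the rate used for the cluster.

```python
SECONDS_PER_HOUR = 3600
SW_RATE = 1_000                      # instructions/s, the slide's 1 KIPS figure

def instructions(hours, rate=SW_RATE, machines=1):
    """Instructions simulated in the given wall-clock time."""
    return hours * SECONDS_PER_HOUR * rate * machines

single_12h = instructions(12)                   # one machine, 12 hours
cluster_12h = instructions(12, machines=100)    # 100 machines, 12 hours
cluster_24h = instructions(24, machines=100)    # 100 machines, 24 hours

# Per-node rate implied by ~4B instructions on 100 nodes in 24 hours.
implied_rate = 4e9 / (100 * 24 * SECONDS_PER_HOUR)

print(single_12h, cluster_12h, cluster_24h, round(implied_rate))
```

At 1 KIPS per node, 100 machines reach roughly 4.3B instructions in 12 hours. Reaching ~4B in 24 hours corresponds to about half that rate per node, which would be consistent with slower detailed timing models, though that reading is an inference.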

41

The FPGA Usability Challenge

• Despite impressive proof-of-concepts, FPGAs still not widely adopted in arch community

• FPGAs are not user-friendly

– Simulators easier to modify/use

• Lack of instant gratification slows productivity

– How long to build and run ‘Hello World’ on FPGA?

42

Usability of FPGAs

User Class | Usage Description | Required FPGA expertise
Casual User | Parallel programming on new architectures | Low/None?
Serious User | Use predefined target machine; want to tweak HW parameters; requires inspection/changes to system state | Low/Med?
Casual Developer | Large changes to architecture or components; monitoring tools to inspect low-level info | Med/High?
Serious Developer | Build new components or special-purpose processing elements from scratch | High?

Required expertise should be minimized when possible

43

The FPGA Usability Challenge

• Challenges for Users

– Typical simulation features missing or hard-to-build

– Low runtime visibility into FPGA HW

• Challenges for Developers

– Even mundane tasks require RTL design/debugging

– Long synthesis turnaround times (up to hours/days)

– Must learn new languages, (buggy) tools

How to improve usability with RAMP2?

44

Closing the Usability Gap

• Ideally want to provide:

– Fast ‘SW’ simulation interface for casual/serious users

– Fast programming abstractions for casual developers without re-synthesizing designs

• Without sacrificing (too much) FPGA performance

45

The Ultimate Productivity Killer

[Figure: FPGA build flow: design capture (RTL, code for describing HW) → synthesis → map/translate + place & route → bitstream generation + download. Stage times shown: 5-45 min, 15-45 min, 5 min. Illustrated with a pipeline diagram (IF1, IF2, D, E1, E2, M1, M2, W; I-cache, RF, D-cache) and a flip-flop symbol.]

Cost of a mistake or forgetting to add something? Priceless.