
Retreat (Advance)

John Wawrzynek, UC Berkeley

January 15, 2009


Research Accelerator for Multiple Processors

Rapid design-space exploration: a new set of architecture parameters can be tried each day, leading to highly efficient (power, cost) designs.

High-confidence verification of design specifications (conventional software simulators are either too slow or not trustworthy).

An early platform for software development while waiting for the machine to be built.

REVIEW

Problem with the manycore processor design trend: compilers, operating systems, and architectures are not ready for 1000s of CPUs per chip.

How do we do research on 1000-CPU systems in architecture, OS, compilers, and apps?

Develop an infrastructure to build cycle-accurate multi-core and many-core architecture emulators using FPGAs
• Not FPGA computing
• Not a gate-level verification platform


Promised “deliverables”

1. Provide infrastructure to support collaboration among researchers
• Standardization of interfaces
• Development of reusable modules

2. Produce relatively inexpensive FPGA platforms & freely available gateware and software

BEE2 Module: Virtex 2VPro70

BEE3 Module: Virtex 5

Start with BEE2 as the “RAMP1” platform; design BEE3 as the “RAMP2” platform for broad deployment through a third party.

3. Provide a set of “target” architecture models
• Reference designs for further development
• Out-of-the-box parallel machines for software development


Partnerships

Co-PIs: Krste Asanović (UCB), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)

RAMP hardware development activity centered at Berkeley Wireless Research Center.

Three-year NSF grant for staff (awarded 3/06).

Other smaller funding sources: GSRC, Par Lab (UCB), and others.

DARPA seedling at UCB.

Major continuing commitment from Xilinx.

Collaboration with MSR (Chuck Thacker) on BEE3 FPGA platform.

Sun and IBM contributing processor designs; IBM faculty awards and collaboration.

Co-PIs continue to “meet” biweekly by teleconference, and in person 3-4 times per year.

Some History

• Spring 2005: ISCA “hallway” discussions
• Summer 2005: serious planning
• Fall 2005: NSF proposal submitted
• Jan 2006: first RAMP Workshop at UC (hands-on lab based on BEE2)
• March 2006: NSF grant awarded
• Summer 2006: MIT retreat
• Winter 2007: UCB retreat
• Summer 2007: tutorial/workshop/retreat at SD
• Winter 2008: UCB retreat
• Summer 2008: Stanford retreat


Summer 2008 Retreat

In anticipation of a new funding proposal to the NSF (larger this time), we evaluated where we were technically in the project, reevaluated our project goals, and brainstormed on what we needed to do going forward.

Summary:
a) Made a big splash in the computer architecture community
b) Had many technical successes in proving the concept and understanding how to build efficient, scalable infrastructure
c) Had not collaborated as well as we would have liked
d) Had less outside adoption than we had expected


Technical Accomplishments

Built several FPGA-based processor core and memory models, including models for novel architectural features such as transactional memory (Stanford/RAMP Red, UT/RAMP White);

Demonstrated a 1008 core parallel processor system (Berkeley/RAMP Blue);

Developed next generation FPGA emulation platform (BEE3) and transferred its design and responsibility for manufacturing to a third party (BEEcube) (MSR, UCB);

Implemented several approaches to standard interfaces for module-level interchangeability and sharing (Aports/MIT, RDL/UCB);

Demonstrated approaches to simulation “time dilation” for cycle-accurate simulation (Hasim/MIT, RDL/UCB);

Built several prototype hybrid host systems combining FPGA-based emulation with conventional processors and software (Protoflex/CMU, FAST/UT Austin);

Investigated several novel processor-modeling techniques for improved simulation efficiency and scaling: virtualization and split timing/functional modeling (FAST/UT Austin, Protoflex/CMU).


Lessons Motivating New Work (1/2)

"Simulation" is better than "emulation": the original idea was to build direct RTL-level implementations of target architectures, starting with available RTL of full processors and peripherals.

We did some of this but discovered that suitable RTL models rarely exist and that, once completed, such a system is inflexible and difficult to scale up. The space of possibilities in emulation techniques (essentially building “simulators” in FPGAs) is quite broad, and it has emerged as a first-class intellectual pursuit in the computer architecture research community.

In this proposed next phase of the project we plan to aggressively pursue this space.


Lessons Motivating New Work (2/2)

Generally, most parallel systems researchers are well practiced in using and modifying software simulators but have little experience at the hardware level, and they are reluctant to dig into the low-level hardware details needed to build efficient FPGA-based simulators.

Furthermore, most researchers are reluctant to acquire and maintain large FPGA computing platforms and the associated FPGA development tools.

To be successful, we believe that we must provide ready-to-go, easy-to-use simulation models with easy shared access in the form of online services.


Proposed Work

1. Create a set of “standards” and reference implementations: module interfaces and communication; simulator user interface; monitoring/debug; power/thermal modeling.

2. Use these specifications and reference implementations to build several full simulators, including ones that model 1K+ processor nodes.

3. Offer RAMP infrastructure (both HW and development software) as an online service.

4. Aggressive outreach to hardware architects, RAMP developers, and software developers, with online and live tutorials, all necessary resources freely available, and full-time staff support services.


Project Timeline (2/2)


Retreat Agenda, day 1


Retreat Agenda, day 2


Breakout Topics

• Simulator interoperability: Which are the important simulators with which to be compatible? How should RAMP infrastructure play with other simulators?
• RAMP-supported FPGA platforms: Nallatech (FSB), BEE3, … Xilinx / Altera. How do we deal with all the choices? What about lower-cost alternatives, and what is the impact of platform cost (does it affect the value prop.)? What are the price points?
• RAMP and heterogeneous architectures
• Value prop. of RAMP capabilities: What should you be able to do only on RAMP, i.e., what can't be done any other way?
• Functional versus RTL versus simulation: When to do which? How to transition from one to the other?

