CMPE750 - Shaaban #1 lec # 9 Spring 2015 4-7-2015
Computing System Element Choices

(spectrum figure: Programmability/Flexibility increases toward the software end; Specialization, development cost/time, and Performance/Chip Area/Watt (computational efficiency) increase toward the hardware end)

Software ←→ Hardware:
• General-Purpose Processors (GPPs): Superscalar, VLIW
• Application-Specific Processors: DSPs, Network Processors, Graphics Processors, …
• Co-Processors
• Re-configurable Hardware
• ASICs

Reconfigurable Computing: also known as Custom Computing Machines (CCMs). Utilize hardware devices customized to match the computation, using FPGAs (fine grain) or micro-coded arrays of simple processors (coarse grain).

Hardware customization/reconfigurability, how? Change both the functionality of hardware cells (elements) and their spatial connectivity to match the requirements of the computation/application on the fly (at runtime).

Software side: + longer useful life cycle. Hardware side: + shorter useful life cycle.
Spatial vs. Temporal Computing

• Spatial (using hardware): defined by fixed functionality and connectivity of hardware elements.
• Temporal (using software/program): a processor running programs written using a pre-defined, fixed set of instructions (ISA).

Space vs. Time Trade-off
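To make the spatial/temporal contrast concrete, here is a small hypothetical sketch (the function and numbers are illustrative, not from the slides): the same computation, y = a·x + b over a block of data, expressed temporally by time-multiplexing one processor-like ALU, and spatially as a fixed two-stage datapath.

```python
def temporal(a, b, xs):
    # Temporal computing: one "ALU" reused over time, stepping through
    # a fixed instruction sequence for each data element.
    ys = []
    for x in xs:            # time-multiplexed: one operation per step
        t = a * x           # "MUL" instruction
        t = t + b           # "ADD" instruction
        ys.append(t)
    return ys

def spatial(a, b, xs):
    # Spatial computing: a dedicated multiplier and a dedicated adder
    # wired as a fixed datapath; all data flows through both stages.
    multiplied = [a * x for x in xs]    # dedicated multiplier stage
    return [m + b for m in multiplied]  # dedicated adder stage

assert temporal(3, 1, [0, 1, 2]) == spatial(3, 1, [0, 1, 2]) == [1, 4, 7]
```

Both produce the same results; the trade-off is that the spatial version dedicates one hardware resource per operation (space), while the temporal version reuses one resource across many steps (time).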
Computing Element Programmability: Defining Terms

• Fixed Functionality (fixed hardware):
  – Computes one function (e.g. FP-multiply, divider, DCT).
  – Function defined at fabrication time, e.g. ASICs.
• Programmable (functionality not fixed):
  – Computes “any” computable function:
    – Processors: GPPs, ASPs
    – Configurable Hardware: e.g. FPGAs
  – Function defined/changed after fabrication (e.g. at compilation or runtime): late binding.
• Parameterizable Hardware:
  – Performs a limited “set” of functions, e.g. co-processors.
Computing Element Choices: Observation

• Generality and computational efficiency are in some sense inversely related to one another:
  – The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient it will be in performing any one of those specific tasks.
  – Design decisions are therefore almost always compromises; designers identify key features or requirements of the application that must be met and make compromises on other, less important features.
• To counter the problem of specialized and computationally intense problems for which general-purpose machines cannot achieve the necessary performance:
  – Special-purpose processors (ASPs), attached processors, and coprocessors have been built for many years, especially in areas such as image or signal processing (for which many of the computational tasks can be very well defined).
  – The problem with such machines (i.e. ASPs) is that they are special-purpose (fixed ASP ISA); as problems change or new ideas and techniques develop, their lack of flexibility, due to fixed ISA limitations, makes them problematic as long-term solutions. Solution?
• Reconfigurable computing or Custom Computing Machines (CCMs), using FPGAs (Field Programmable Gate Arrays, first introduced in 1986 by Xilinx) or other reconfigurable (customizable) hardware, can offer an attractive alternative to the other computing element choices.
  – FPGAs were originally developed for: 1- hardware design verification, 2- rapid prototyping, and 3- potential ASIC replacement.
What is Reconfigurable Computing (RC)?

• Utilize reconfigurable hardware devices (spatially-programmed connections of hardware processing elements) tailored to the application.
• Customize hardware to match the computations needed/present in a particular application by changing hardware functionality on the fly (at runtime).
  – Hardware customization/reconfigurability, how? Change both the functionality of hardware cells (elements) and their spatial connectivity to match the requirements of the computation/application on the fly (at runtime).
  – Still spatial computing, but both functionality and connectivity of the hardware elements are not fixed.
• Reconfigurable Computing Goal: use reconfigurable hardware devices to build systems with advantages over conventional computing solutions in terms of:
  – Flexibility (vs. ASICs)
  – Performance: computational efficiency (vs. processors)
  – Power
  – Time-to-market (vs. ASICs)
  – Life cycle cost (vs. ASICs)
• “Hardware” customized to the specifics of the problem.
• Direct map of problem-specific dataflow and control.
• Circuits “adapted” as problem requirements change.
Conventional Programmable Processors vs. Configurable Devices

Conventional programmable processors:
• Moderately wide datapaths, which have been growing larger over time (e.g. 16, 32, 64, 128 bits).
• Support for large on-chip instruction/data caches, which have also been growing larger over time and can now hold thousands of instructions.
• High-bandwidth instruction distribution so that several instructions may be issued per cycle, at the cost of dedicating considerable die area to instruction fetch/distribution/issue/scheduling.
• A single thread of computation control per processor core (SMT changes this).

Configurable devices (such as FPGAs):
• Narrow datapath (e.g. almost always one bit).
• On-chip space for only one instruction per compute element, i.e. the single instruction which tells the FPGA array cell (Configurable Logic Block, CLB) what function to perform and how to route its inputs and outputs (connectivity to other cells).
• Minimal die area dedicated to instruction distribution, such that it takes hundreds or thousands of compute cycles to change the active set of array instructions (e.g. from one FPGA configuration to another). Issue: potentially long reconfiguration latency.
• Can handle regular and bit-level computations more efficiently than processors.
Why Reconfigurable Computing?

• To improve performance (including predictability) and computational energy efficiency over a software implementation (vs. processors: GPPs, ASPs).
  – e.g. signal processing applications in configurable hardware.
• To provide powerful, application-specific operations in hardware (ASIC-like).
• To improve product flexibility and lower development cost/time compared to hardware (vs. ASICs).
  – e.g. encryption, compression, or network protocol handling in configurable hardware.
• To use the same hardware for different purposes at different points in time in the computation (lowers cost vs. ASICs).
  – Given sufficient use of each configuration to tolerate potentially long reconfiguration latency/overheads.
Benefits of Reconfigurable Logic Devices

• Non-permanent customization and application development after fabrication: “late binding”.
  – Customization achieved by changing both the function of hardware elements and their connectivity to match the requirements of the application.
• Economies of scale (amortize large, fixed design costs).
• Shorter time-to-market than ASICs (dealing with evolving requirements and standards, new ideas): lower development time/cost than ASICs.

Potential disadvantages:
• Efficiency penalty (area, performance, power) compared to ASICs, i.e. lower computational efficiency than ASICs.
• Need for correctness verification (common to all hardware-based solutions).
Spatial/Configurable Hardware Benefits/Drawbacks

Benefits:
• Potentially an order of magnitude (10x) or higher raw computational density advantage over processors.
• Potential for fine-grained (bit-level) control/parallelism, which can offer another order of magnitude benefit.
• Locality.

Spatial/configurable drawbacks (common to all hardware-based solutions):
• Each compute/interconnect resource is dedicated to a single function.
• Must dedicate resources to every computational subtask.
• Infrequently needed portions of a computation sit idle --> inefficient use of resources vs. ASICs (but much better than processors).
Configurable Computing Application Areas

• Digital signal processing
• Encryption
• Image processing
• Telemetry data processing (remote sensing)
• Data/image/video compression/decompression
• Low power (through hardware "sharing")
• Scientific/engineering physical system modeling (e.g. finite-element computations)
• Network applications (e.g. reconfigurable routers)
• Variable-precision arithmetic
• Logic-intensive applications
• In-the-field hardware enhancements
• Adaptive (learning) hardware elements
• Rapid system prototyping (an original application of FPGAs)
• Verification of processor and ASIC designs (an original application of FPGAs)
• …

In general, many types of applications with a few computationally intensive “kernels” (inner loops?) that can be done more efficiently in hardware.
Technology Trends Driving Configurable Computing

• Increasing gap between "peak" performance of general-purpose processors and "average actually achieved" performance.
  – Most programmers don't write code that gets anywhere near the peak performance of current superscalar CPUs.
• Improvements in FPGA hardware, capacity and speed:
  – FPGAs use standard SRAM processes and "ride the commodity technology" curve (e.g. VLSI technology).
  – Volume pricing even though the solution is customized.
• Improvements in synthesis and FPGA mapping/routing software.
• Increasing number of transistors on a (processor) chip (one billion+): how to use them efficiently?
  – Bigger caches (most popular)?
  – Multiple processor cores (Chip Multiprocessors, CMPs)?
  – SMT support?
  – IRAM-style vector/memory?
  – DSP cores or other application-specific processors?
  – Reconfigurable logic (FPGA or other reconfigurable logic)?
  – A combination of the above choices? A heterogeneous computing system on a chip? Micro-Heterogeneous Computing (MHC)?
Configurable Computing Architectures

• Configurable computing architectures combine elements of general-purpose computing and application-specific integrated circuits (ASICs).
  – The general-purpose processor operates with fixed circuits that perform multiple tasks under the control of software.
  – An ASIC contains circuits specialized to a particular task and thus needs little or no software to instruct it.
• The configurable computer can execute software commands (i.e. a configuration bit stream) that alter its configurable devices (e.g. FPGA circuits) as needed to perform a variety of jobs, i.e. to change both the functionality and the connectivity of the hardware elements (cells) on the fly (i.e. at runtime).
Levels of the Reconfigurable Computational Elements (according to grain size of implemented components)

(figure: example structures at each level — arrays of CLBs; datapaths built from adders, buffers, registers (reg0, reg1), and muxes; MAC units with address generators and memories; and a full processor with instruction decoder & controller, program memory, data memory, and datapath)

From finer grain to coarser grain:
• Reconfigurable Logic (e.g. FPGAs): bit-level operations, e.g. encoding.
• Reconfigurable Datapaths: dedicated data paths, e.g. filters, AGU.
• Reconfigurable Arithmetic: arithmetic kernels, e.g. convolution.
• Reconfigurable Control: configurable processors; Real-Time Operating Systems (RTOS): process management.
Hybrid-Architecture Computer

• Combines general-purpose processors (GPPs) and reconfigurable devices, commonly:
  – FPGA chips (fine-grain reconfigurable hardware), or
  – Micro-coded arrays of simple processors (coarse-grain reconfigurable hardware).
• A controller FPGA may load circuit configurations stored in memory onto the processor FPGA in response to requests from the operating program.
• If the memory does not contain a requested circuit, the processor FPGA sends a request to the PC host, which then loads the configuration for the desired circuit.
• (1) Common hybrid configurable architecture today:
  – One or more FPGAs on a board connected to the host via an I/O bus (e.g. PCI-Express).
• (2) Possible future hybrid configurable architecture:
  – Integrate a region of configurable hardware (FPGA or something else) onto the processor chip itself as reconfigurable functional units or coprocessors.
  – Integrate configurable hardware onto a DRAM chip => flexible computing without the memory bottleneck.
• Current hybrid architecture on a chip — hybrid FPGAs: integrate one or more hard-wired GPPs with an FPGA on the same chip. Example: Xilinx Virtex-II Pro, Virtex-4/5 FX (FPGA with one or two PowerPC cores).
Hybrid-Reconfigurable Computer: Levels of Coupling

Different levels of coupling in a hybrid reconfigurable system (reconfigurable logic shaded in the figure), from tight coupling to loose coupling:
1. Reconfigurable functional units (on chip) — ISA support. (Future direction.)
2. Reconfigurable coprocessor (on or off chip) — function calls.
3. Attached (e.g. via PCI) reconfigurable processing unit. (Most common today.)
4. External standalone processing unit (e.g. via network/IO interface).

Tighter coupling means lower communication time/overheads (higher bandwidth / lower latency).
Sample Configurable Computing Application: Prototype Video Communications System

• Uses a single FPGA to perform four functions that typically require separate chips.
• A memory chip stores the four circuit configurations and loads them sequentially into the FPGA as needed.
• Initially, the FPGA's circuits are configured to acquire digitized video data.
• The chip is then rapidly reconfigured to transform the video information into a compressed form, and reconfigured again to prepare it for transmission.
• Finally, the FPGA circuits are reconfigured to modulate and transmit the video information.
• At the receiver, the four configurations are applied in reverse order to demodulate the data, uncompress the image, and then send it to a digital-to-analog converter so it can be displayed on a television screen.
Early Configurable (or Custom) Computing Successes

• DEC Programmable Active Memories, PAM (1992):
  – A universal hardware FPGA-based co-processor closely coupled to a standard host computer, developed at DEC's Paris Research Laboratory.
  – Fast RSA decryption implementation on a reconfigurable machine (10x faster than the fastest ASIC at the time).
• Splash 2 (1993):
  – Attached processor system using Xilinx FPGAs as processing elements, developed at the Center for Computing Sciences. (More on Splash 2 in the lecture handout.)
  – Performs DNA sequence matching at 300x Cray-2 speed, and 200x the speed of a 16K Thinking Machines CM-2.
• Many modern processors and ASICs are verified using FPGA-based hardware emulation systems.
• For many digital signal processing/filtering (e.g. FIR, IIR) algorithms (fixed-point, not FP), single-chip FPGAs outperform DSPs by 10-100x.
Fine-grain Reconfigurable Hardware Devices: Programmable Circuitry: FPGAs

• Field-Programmable Gate Array (FPGA) introduced by Xilinx (1986).
• Original target applications of FPGAs: 1- hardware design verification, 2- rapid prototyping, and 3- potential ASIC replacement.
• Programmable circuits can be created or removed by sending signals to gates in the logic elements (configuration bit stream), to change both the functionality and the connectivity of the logic blocks.
• A built-in grid of circuits arranged in columns and rows allows the designer to connect a logic element to other logic elements or to an external memory or microprocessor.
• The logic elements are grouped in Configurable Logic Blocks (CLBs) that perform basic binary operations such as AND, OR, and NOT.
• Firms including Xilinx and Altera have developed devices with the capability of 4,000,000 or more equivalent gates.
• Recently, in addition to “general-purpose” or generic FPGAs, more specialized FPGA families targeting specific areas such as DSP applications have been developed with hard-wired functional units (e.g. hard-wired MAC units, processors, …), i.e. hybrid FPGAs.
Fine-grain Reconfigurable Hardware Devices: Field Programmable Gate Arrays (FPGAs)

• The chip contains many small building blocks that can be configured to implement different functions.
  – These building blocks are known as CLBs (Configurable Logic Blocks).
• FPGAs are typically "programmed" by having them read in a stream of configuration information from off-chip (a configuration bit stream):
  – Typically in-circuit programmable (as opposed to EPLDs — Electrically Programmable Logic Devices — which are typically programmed by removing them from the circuit and using a PROM programmer).
• About 25% of an FPGA's gates are application-usable.
  – The rest control the configurability, interconnects, etc.
• As much as 10X clock rate degradation compared to fully custom hardware implementations (ASICs).
• Typically built using SRAM fabrication technology.
• Since FPGAs "act" like SRAM or logic, they lose their program (i.e. configuration) when they lose power.
  – Thus configuration bits need to be reloaded on power-up, usually from a PROM, or downloaded from memory via an I/O bus.
Fine-grain Reconfigurable Hardware Devices (FPGAs): Look-Up Table (LUT)

• K-LUT: a K-input lookup table.
• Realizes any function of K inputs by programming the table (its configuration).
• Example: a 2-LUT — inputs In1, In2 address a small memory (Mem) whose output is Out — here programmed as XOR:

  In | Out
  00 |  0
  01 |  1
  10 |  1
  11 |  0

(The figure labels both a 2-LUT and a 4-LUT.)
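The table-lookup idea is easy to model in software. A minimal sketch (illustrative only, not any vendor's API): "programming" the table is what configures the function the LUT computes.

```python
def make_lut(table):
    """table: list of 2^K output bits; returns a function of K input bits."""
    def lut(*inputs):
        # The K input bits form the table address, first input as the MSB.
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | bit
        return table[addr]
    return lut

# Program a 2-LUT with the truth table from the slide (XOR):
# In=00 -> 0, 01 -> 1, 10 -> 1, 11 -> 0
xor2 = make_lut([0, 1, 1, 0])
assert [xor2(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]

# Any function of K inputs can be realized by reprogramming the table,
# e.g. a 3-input majority function on a 3-LUT:
maj3 = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
assert maj3(1, 0, 1) == 1
```

Reconfiguring the "hardware" is just rewriting the table contents; the structure (a 2^K-entry memory addressed by the inputs) never changes.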
Fine-grain Reconfigurable Hardware Devices (FPGAs): Conventional FPGA Tile

• One tile: a K-LUT (typically K=4) with an optional output flip-flop, i.e. a Configurable Logic Block (CLB), plus its share of the routing.
• Routing/interconnect: ~75% of FPGA area.
• CLBs (4-LUTs): ~25% of FPGA area.
Fine-grain Reconfigurable Hardware Devices (FPGAs): A Generic Island-Style FPGA Routing Architecture

(figure: 64 CLBs in an 8x8 array; one tile = a CLB plus its surrounding routing)

Customization (configurability) is achieved by changing both the functionality of the hardware elements (CLBs here) and their spatial connectivity to match the requirements of the computation on the fly, using the configuration bit stream.
Fine-grain Reconfigurable Hardware Devices (FPGAs): Xilinx XC4000 Interconnect

Customization (configurability) is achieved by changing both the functionality of the hardware elements (CLBs here) and their spatial connectivity to match the requirements of the computation on the fly, using the configuration bit stream.
Fine-grain Reconfigurable Hardware Devices (FPGAs): Xilinx XC4000 Configurable Logic Block (CLB)

Cascaded LUTs: two 4-LUTs feed one 3-LUT.
Fine-grain Reconfigurable Hardware Devices (FPGAs): FPGAs vs. RISC Processors — Computational Density Comparison

(figure: FPGAs show roughly a 10X raw computational density advantage over RISC processors)
Fine-grain Reconfigurable Hardware Devices (FPGAs): Processor vs. FPGA Area

(figure: die area comparison of an FPGA vs. a processor, with the processor's cache area highlighted — cache?)
Programming/Configuring FPGAs

Step 0?: Determine what portion of the computation is migrated to hardware, i.e. hardware-software partitioning (co-design).
• (1) Hardware design specification: a hardware design to realize the selected hardware-bound, computationally-intensive portion of the application (the result of hardware-software partitioning) is specified using RTL/HDL/logic diagrams.
• Synthesis & layout: vendor-supplied, device-specific software tools are used to convert the hardware design to netlist format.
  – (2) Partition the design into logic blocks (CLBs): LUT mapping.
  – Then find a good (3) placement for each block and (4) routing between them.
• Then the serial configuration bitstream is generated (5) and fed down to the FPGAs themselves.
  – The configuration bits are loaded into a "long shift register" on the FPGA.
  – The output lines from this shift register are control wires that control the behavior of all the CLBs on the chip.
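The serial shift-register loading mechanism described in the last two bullets can be modeled in a few lines. A hedged sketch, assuming a simple bit-at-a-time load (an illustrative model only; `ConfigShiftRegister` is a made-up name, not any vendor's actual bitstream mechanism):

```python
class ConfigShiftRegister:
    def __init__(self, length):
        self.bits = [0] * length          # one control bit per output wire

    def shift_in(self, bit):
        # A new bit enters at one end; all others move down one position.
        self.bits = [bit] + self.bits[:-1]

    def load(self, bitstream):
        # The configuration is streamed in serially, last bit first, so
        # that after len(bitstream) clocks the register holds it exactly.
        for b in reversed(bitstream):
            self.shift_in(b)
        return self.bits                  # parallel outputs to the CLBs

sr = ConfigShiftRegister(8)
config = [1, 0, 1, 1, 0, 0, 1, 0]
assert sr.load(config) == config
```

This also illustrates why reconfiguration is slow: loading n configuration bits serially takes n clocks, which for real devices (millions of bits) is the "hundreds or thousands of compute cycles" (or more) mentioned earlier.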
Programming/Configuring FPGAs (i.e. Synthesis & Layout)

Tool flow: RTL → technology-independent optimization → LUT mapping → placement → routing → bitstream generation → configuration data.

Step 0?: Determine what portion of the computation is migrated to hardware (the FPGA in this case).
(1) Hardware design, specified in RTL/HDL/logic diagrams …
(2) Partition the design into CLBs.
(3) Placement for each CLB.
(4) Routing between CLBs.
(5) Configuration bitstream generation.
Reconfigurable Processor Tools Flow (Hardware/Software Co-design Process Flow)

(figure: the customer application / IP, in C code, is partitioned for the hybrid system.
• The portion to be done in software goes through the C compiler to ARC object code, then the linker, producing an executable that runs on a C model simulator, under the C debugger, or on the development board.
• The portion to be done in reconfigurable hardware (e.g. FPGA) goes through (1) hardware design specification in RTL HDL, then synthesis & layout — (2) partitioning, (3) placement, (4) routing — and (5) configuration bitstream generation, producing the configuration bits.)
Programming/Configuring FPGAs

Starting point: (1) Hardware Design Specification (RTL/HDL/logic diagrams)

• RTL:
  – t = A + B
  – Reg(t, C, clk);
• Logic (a full-adder bit slice):
  – O_i = A_i ⊕ B_i ⊕ C_i
  – C_{i+1} = A_i·B_i + B_i·C_i + A_i·C_i
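The slide's logic example is a full-adder bit slice (sum O_i = A_i ⊕ B_i ⊕ C_i, carry-out C_{i+1} = A_i·B_i + B_i·C_i + A_i·C_i). As a sanity check, here is a minimal bit-level model (illustrative only; the names follow the slide's symbols):

```python
def full_adder(a, b, c):
    o = a ^ b ^ c                          # sum bit O_i
    c_next = (a & b) | (b & c) | (a & c)   # carry-out C_{i+1} (majority)
    return o, c_next

def ripple_add(a_bits, b_bits):
    """Add two equal-length bit lists (LSB first); returns sum bits + carry."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# 3 + 5 = 8: LSB-first [1,1,0] + [1,0,1] -> [0,0,0,1]
assert ripple_add([1, 1, 0], [1, 0, 1]) == [0, 0, 0, 1]
```

Chaining the bit slice into a ripple-carry adder is exactly what the RTL line `t = A + B` describes at the word level.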
Programming/Configuring FPGAs

(2) Partition the design into logic blocks (CLBs): LUT mapping. (figure: the logic network mapped onto CLBs)
Programming/Configuring FPGAs

(3) Placement of CLBs

• Maximize locality:
  – Minimize the number of wires in each channel.
  – Minimize the length of wires.
  – (But cannot put everything close.)
• Often start by partitioning/clustering.
• State-of-the-art tools finish via simulated annealing.
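The placement recipe above (start from an initial assignment, then refine with simulated annealing to shorten wires) can be sketched on a toy problem: place named blocks on a small grid to minimize the total Manhattan wirelength of a netlist. This is a hypothetical toy example under simplified assumptions (two-pin nets, swap moves, linear cooling); real CAD tools are far more elaborate.

```python
import math
import random

def wirelength(placement, nets):
    # Cost: Manhattan distance summed over all two-pin nets.
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)

def anneal_place(blocks, nets, grid, steps=20000, t0=5.0):
    random.seed(0)                              # deterministic for the demo
    spots = [(x, y) for x in range(grid) for y in range(grid)]
    placement = dict(zip(blocks, spots))        # arbitrary initial placement
    cost = wirelength(placement, nets)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9      # linear cooling schedule
        a, b = random.sample(blocks, 2)
        placement[a], placement[b] = placement[b], placement[a]  # swap move
        new = wirelength(placement, nets)
        # Always accept improvements; accept worse moves with prob e^(-d/t).
        if new > cost and random.random() >= math.exp((cost - new) / t):
            placement[a], placement[b] = placement[b], placement[a]  # undo
        else:
            cost = new
    return placement, cost

blocks = list("ABCDEFGHI")
nets = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("E", "I")]
final, cost = anneal_place(blocks, nets, grid=3)
assert len(set(final.values())) == len(blocks)  # still a valid placement
assert cost == wirelength(final, nets)
```

Accepting occasional uphill swaps early (high temperature) is what lets the search escape local minima; as the temperature falls, it degenerates into pure hill-climbing.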
Programming/Configuring FPGAs

(3) Placement of CLBs — goal: maximize locality. (figure)
Programming/Configuring FPGAs

(4) Routing between CLBs

• Often done in two passes:
  – Global routing to determine the channel.
  – Detailed routing to determine the actual wires and switches.
• The difficulty is:
  – Limited available channels.
  – Switchbox connectivity restrictions.
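The detailed pass above can be sketched as a Lee-style breadth-first maze router: find a shortest path of free routing cells between two pins, avoiding cells already claimed by other nets. A toy model under simplified assumptions (unit-cost grid, no switchbox restrictions), not a real switchbox-aware router.

```python
from collections import deque

def maze_route(grid_w, grid_h, src, dst, blocked):
    """BFS from src to dst over free grid cells; returns a path or None."""
    parent = {src: None}
    q = deque([src])
    while q:
        x, y = q.popleft()
        if (x, y) == dst:
            path, cell = [], dst
            while cell is not None:        # walk parents back to the source
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < grid_w and 0 <= ny < grid_h
                    and (nx, ny) not in blocked and (nx, ny) not in parent):
                parent[(nx, ny)] = (x, y)
                q.append((nx, ny))
    return None                            # congestion: no free channel left

# Route around a "wall" of segments already used by another net:
blocked = {(1, 0), (1, 1)}
path = maze_route(3, 3, (0, 0), (2, 0), blocked)
assert path is not None and path[0] == (0, 0) and path[-1] == (2, 0)
assert all(cell not in blocked for cell in path)
```

Returning `None` models the congestion problem the slide mentions: when earlier nets consume the channels, later nets may find no legal detour, which is why routers rip up and retry.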
Programming/Configuring FPGAs

(4) Routing between CLBs. (figure)
Overall Configurable Hardware Approach

• Select critical portions or phases of the application where hardware customization will offer an advantage, e.g. the computationally intensive "kernel(s)" of the application (hardware-software partitioning).
• Map those application phases to FPGA hardware:
  – Hand hardware design / RTL / VHDL.
  – VHDL => synthesis & layout.
• If it doesn't fit in the FPGA, re-select a (smaller) application phase and try again.
• Perform timing analysis to determine the rate at which the configurable design can be clocked.
• Write interface software for communication between the main processor (GPP) and the configurable hardware:
  – Determine where the input/output data communicated between software and configurable hardware will be stored.
  – Write code to manage its transfer (like a procedure-call interface in standard software).
  – Write code to invoke the configurable hardware (e.g. memory-mapped I/O).
• Compile the software (including the interface code).
• Send the configuration bits to the configurable hardware.
• Run the program.
Configurable Hardware Application Challenges

• This process turns applications programmers into part-time hardware designers.
• Performance analysis problems => what should we put in hardware?
  – The hardware-software co-design (partitioning) problem.
• Choice and granularity of computational elements.
• Choice and granularity of the interconnect network / degree of coupling.
• Long reconfiguration latency.
• Synthesis problems.
• Testing/reliability problems.

Highly desirable to ease the transition: silicon compilers (C → Si? e.g. SystemC?).
Issues in Using FPGAs for Reconfigurable Computing

• Hardware-software partitioning (co-design).
• Run-time reconfiguration latency/overhead:
  – Time to load the configuration bitstream — may take seconds (improving).
• Reconfiguration latency hiding techniques.
• I/O bandwidth limitations: need for tight coupling with software (processors), e.g. hybrid FPGAs.
• Speed, power, cost, density (improving).
• High-level language support (C → Si, improving).
• Performance and space estimators.
• Design verification.
• Partitioning and mapping across several FPGAs.
• Partial reconfiguration (supported in some/most recent high-end FPGAs).
• Configuration caching (to reduce reconfiguration latency).
Example Reconfigurable Computing Research Efforts

• PRISM (Brown)
• PRISC (Harvard) — RC-1
• DPGA-coupled uP (MIT)
• GARP (RC-3), Pleiades, … (UCB)
• OneChip (Toronto) — RC-2
• RAW (MIT) — RC-4
• REMARC (Stanford) — RC-5
• CHIMAERA (Northwestern) — RC-6
• DEC PAM
• Splash 2
• NAPA (NSC)
• E5 etc. (Triscend)
Hybrid-Architecture RC Compute ModelsHybrid-Architecture RC Compute Models1. Unaffected by array logic: Interfacing
– Triscend E5
2. Dedicated IO Processor.– NAPA 1000NAPA 1000
3. Instruction Augmentation: (Tight Coupling)
– Special Instructions / Coprocessor Ops- - PRISM (Brown, 1991) - PRISC (Harvard, 1994) PRISM (Brown, 1991) - PRISC (Harvard, 1994) - Chimaera (Northwestern, 1997) - GARP (Berkeley, 1997)- Chimaera (Northwestern, 1997) - GARP (Berkeley, 1997)- Virtex-4 FX (Xilinx)
– VLIW/microcoded arrays extension to processor - REMARC (Stanford, 1998) - Raw (MIT, 1997) - -REMARC (Stanford, 1998) - Raw (MIT, 1997) - - - MorphoSys (UC Irvine, 2000) - MATRIX (MIT, 1997)- MorphoSys (UC Irvine, 2000) - MATRIX (MIT, 1997) - RaPiD (Reconfigurable Pipelined Datapaths) - RaPiD (Reconfigurable Pipelined Datapaths) (University of Washington, 1996)(University of Washington, 1996)
- PipeRench (Carnegie Mellon, 1999) - PipeRench (Carnegie Mellon, 1999) - DAPDNA-2 (IPFlex Inc., 2004?) ………- DAPDNA-2 (IPFlex Inc., 2004?) ………
4. Autonomous co/stream processor– OneChip (Toronto , 1998)OneChip (Toronto , 1998)
UsuallyUsuallyFPGA-basedFPGA-based(Fine grain RC)(Fine grain RC)
Usually arrays ofUsually arrays ofSimple processorsSimple processors
See DAPDNA HandoutSee DAPDNA Handout
May be considered as ASIC-May be considered as ASIC-replacement efforts and not RC?replacement efforts and not RC?
RC Array runs a separate threadRC Array runs a separate thread
CMPE750 - Shaaban #41 lec # 9 Spring 2015 4-7-2015
Hybrid-Architecture RC Compute Models: Interfacing
• Configurable logic used in place of:
– ASIC environment customization
– External FPGA/PLD devices
• Examples:
– bus protocols
– peripherals
– sensors, actuators
• Case for:
– Always have some system adaptation to do
– Modern chips have capacity to hold processor + glue logic
– Reduce part count
– Glue logic varies; value added must now be accommodated on chip (formerly board level)
May be considered as ASIC-replacement effort and not RC
CMPE750 - Shaaban #42 lec # 9 Spring 2015 4-7-2015
Example: Interface/Peripherals: Triscend E5
May be considered as ASIC-replacement effort and not RC
CMPE750 - Shaaban #43 lec # 9 Spring 2015 4-7-2015
Hybrid-Architecture RC Compute Models: IO Processor
• Configurable array dedicated to servicing IO channel(s):
– sensor, LAN, WAN, peripheral, …
• Provides:
– Flexible protocol handling
– Flexible stream computation
 • compression, encryption (in-place by RC hardware)
• Looks like an IO peripheral to the processor
• Case for:
– Many protocols, services supported
– Only need a few at a time
– Dedicate attention, offload work to IO processor
Why?
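The offload idea above can be sketched in a few lines of Python (an illustrative model, not NAPA's actual interface: class and function names here are hypothetical): the IO processor holds a loaded "configuration" and applies it to the IO stream in place of the host.

```python
# Hypothetical sketch of an RC IO processor: a loaded "configuration" (a
# Python callable standing in for configured hardware) transforms each
# incoming block, so the host CPU never touches the raw stream.

class RCIOProcessor:
    def __init__(self):
        self.config = None          # currently loaded stream transform

    def load_config(self, fn):      # reconfigure for a new protocol/transform
        self.config = fn

    def service(self, blocks):      # process the IO stream for the host
        return [self.config(b) for b in blocks]

# Example "configuration": a trivial stand-in for encryption logic.
xor_encrypt = lambda block: bytes(b ^ 0x5A for b in block)

iop = RCIOProcessor()
iop.load_config(xor_encrypt)
ciphertext = iop.service([b"hello", b"world"])
plaintext = iop.service(ciphertext)   # XOR is its own inverse
assert plaintext == [b"hello", b"world"]
```

Swapping `load_config` calls models reconfiguring the array for a different protocol without involving the host in per-block work.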
CMPE750 - Shaaban #44 lec # 9 Spring 2015 4-7-2015
NAPA 1000 Block Diagram
Reconfigurable IO Processor Example: NAPA 1000
Block diagram components:
– RPC: Reconfigurable Pipeline Cntr
– ALP: Adaptive Logic Processor
– System Port
– TBT: ToggleBus Transceiver
– PMA: Pipeline Memory Array
– CR32: CompactRISC 32 Bit Processor
– BIU: Bus Interface Unit
– CR32 Peripheral Devices
– External Memory Interface
– SMA: Scratchpad Memory Array
– CIO: Configurable I/O
CMPE750 - Shaaban #45 lec # 9 Spring 2015 4-7-2015
NAPA 1000 as IO Processor
Reconfigurable IO Processor Example: NAPA 1000
Diagram: the system host attaches through the System Port; ROM & DRAM through the Memory Interface; application-specific sensors, actuators, or other circuits through CIO
CMPE750 - Shaaban #46 lec # 9 Spring 2015 4-7-2015
Hybrid-Architecture RC Compute Models: Instruction Augmentation
• Observation: Instruction Bandwidth
– Processor can only describe a small number of basic computations in a cycle:
 • I bits → 2^I operations (I = opcode size)
– This is a small fraction of the operations one could do, even in terms of w × w → w ops (w = operand word size):
 • (2^w)^(2^(2w)) = 2^(w·2^(2w)) such operations exist
– Processor could have to issue on the order of 2^(w·2^(2w) − I) operations (instructions) just to describe some computations (i.e. the instruction expression limit, per instruction)
– An a priori selected base set of functions (via a fixed ISA's instructions) could be very bad for some applications
 • Motivation for application-specific processors/ISAs (ASPs)
Computation/ISA Semantic Gap
i.e. instruction expression limit
i.e. ISA/Application Mismatch
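The numbers behind this counting argument can be checked directly (my arithmetic with small illustrative sizes, not figures from the slides): even tiny opcode and word sizes make the gap between encodable and possible operations astronomical.

```python
# Worked numbers for the instruction-bandwidth argument: a fixed ISA with I
# opcode bits names at most 2**I operations, while the number of distinct
# functions from two w-bit operands to one w-bit result is
# (2**w) ** (2**(2*w)) = 2**(w * 2**(2*w)).

I, w = 8, 4                          # small illustrative sizes
encodable = 2 ** I                   # operations a fixed ISA can name
possible = 2 ** (w * 2 ** (2 * w))   # all w x w -> w functions

print(encodable)                     # 256
print(possible.bit_length())         # 1025: possible = 2**1024, vastly larger
```

Even at w = 4, the ISA can name only 2^8 of 2^1024 candidate operations, which is the semantic gap the slides describe.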
CMPE750 - Shaaban #47 lec # 9 Spring 2015 4-7-2015
Instruction Augmentation
Hybrid-Architecture RC Compute Models (shown in last slide)
• Idea:
– Provide a way to augment the processor's instruction set (base ISA) with operations needed by a particular application and realized by RC hardware
– Close semantic gap / avoid mismatch between fixed ISA and the application's needed computational operations
• What's required:
– Some way to fit augmented instructions into the instruction stream (decode/execute)
– Execution engine for augmented instructions (a configurable hardware array to realize and execute the new augmented instructions):
 • If programmable, has its own instructions
 • FPGA or array of simple micro-coded processors
– Interconnect to augmented instructions
Why? Augment base CPU ISA with additional instructions realized by the RC array
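A minimal sketch of the augmentation idea (hypothetical, not any real ISA; all names here are illustrative): a base interpreter falls through to a table of application-defined operations that stand in for functions realized on the RC array.

```python
# Sketch of instruction augmentation: a fixed base ISA plus a table of
# application-specific operations "configured" at run time. Plain Python
# callables stand in for hardware realizations.

BASE_ISA = {"add": lambda a, b: a + b,
            "sub": lambda a, b: a - b}
AUGMENTED = {}                       # ops realized by the RC array

def define_op(name, fn):
    """'Configure' the array with a new application-specific op."""
    AUGMENTED[name] = fn

def execute(op, a, b):
    if op in BASE_ISA:               # fixed-function datapath
        return BASE_ISA[op](a, b)
    return AUGMENTED[op](a, b)       # augmented instruction

# An op the base ISA cannot express in one instruction:
define_op("popcount_xor", lambda a, b: bin(a ^ b).count("1"))

print(execute("add", 2, 3))                      # 5
print(execute("popcount_xor", 0b1100, 0b1010))   # 2
```

The point of the sketch is the dispatch structure: one extra lookup closes the gap between the fixed instruction set and what the application actually needs per cycle.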
CMPE750 - Shaaban #48 lec # 9 Spring 2015 4-7-2015
First Effort In Instruction Augmentation: PRISM (Brown, 1991)
• Processor Reconfiguration through Instruction Set Metamorphosis (PRISM)
• FPGA on bus (similar to Splash 2)
• Access as memory-mapped peripheral
• Explicit context management
• PRISM-1:
– 68010 (10 MHz) + XC3090
– can reconfigure FPGA in one second
– 50-75 clocks for operations
Instruction Augmentation
CMPE750 - Shaaban #49 lec # 9 Spring 2015 4-7-2015
PRISM-1 Results
Raw kernel speedups (IO configuration time not included?)
CMPE750 - Shaaban #50 lec # 9 Spring 2015 4-7-2015
PRISC (Harvard, 1994)
• Takes next step:
– What if we put the configurable array on chip?
– How to integrate into the processor ISA?
• Architecture:
– RC array tightly coupled into processor register file as a "superscalar" Programmable Functional Unit (PFU)
– Flow-through array (no state: combinational logic PFU)
Instruction Augmentation
(paper RC-1)
PRISC = PRogrammable Instruction Set Computers
PFU = Programmable Functional Unit
Tight coupling; here the base ISA selected is MIPS
FPGA-like configurable array; single-cycle execution
CMPE750 - Shaaban #51 lec # 9 Spring 2015 4-7-2015
PRISC ISA Integration
– Add expfu instruction (execute programmable functional unit) to the MIPS ISA (an R-type MIPS instruction)
– 11-bit address space for user-defined expfu instructions (the requested logical PFU function number)
– Fault on PFU instruction mismatch, i.e. the needed array configuration is not loaded in the PFU (a configuration miss):
 • trap code services the instruction miss
– All operations occur in one clock cycle
– Easily works with processor context switch:
 • No state in PFU (combinational logic) + fault on mismatched pfu instruction
(paper RC-1)
PFU = Programmable Functional Unit; here the base ISA selected is MIPS
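The expfu miss/trap behavior above can be modeled in a short sketch (a simplification of PRISC, not its actual implementation; the function numbers and bodies are hypothetical stand-ins for PFU configurations).

```python
# Sketch of PRISC-style expfu dispatch: an 11-bit function number selects a
# combinational PFU configuration. If the requested configuration is not
# resident, execution "faults" and a trap handler loads it before retrying.

loaded = {}                          # the single resident PFU configuration
config_store = {                     # user-defined functions, keyed by
    0x001: lambda a, b: a & ~b,      # 11-bit function number (illustrative)
    0x002: lambda a, b: (a << 1) | (b >> 31),
}

def expfu(fnum, rs1, rs2):
    assert 0 <= fnum < 2 ** 11       # 11-bit logical PFU function number
    if fnum not in loaded:           # configuration miss -> trap
        loaded.clear()               # PFU holds one configuration; evict
        loaded[fnum] = config_store[fnum]
    return loaded[fnum](rs1, rs2)    # single-cycle, stateless evaluation

print(expfu(0x001, 0b1111, 0b0101))  # 10 (bit-clear: 0b1111 & ~0b0101)
```

Because the PFU is stateless, a context switch needs no save/restore: the next expfu from another process simply faults and reloads, exactly the property the slide notes.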
CMPE750 - Shaaban #52 lec # 9 Spring 2015 4-7-2015
PRISC Results
• All compiled, working from MIPS binary
• <200 4-LUTs? (64 x 3)
• 200 MHz MIPS base
SPEC92
(paper RC-1)
May not be a good target app.?
CMPE750 - Shaaban #53 lec # 9 Spring 2015 4-7-2015
Chimaera (Northwestern, 1997)
• Start from PRISC idea:
– Integrate as functional unit
– No state
– RFUOPs (like expfu): Reconfigurable FU Operation
– Stall processor on instruction miss (i.e. needed array configuration not loaded), reload
• Adds:
– Manage multiple instructions loaded (i.e. configuration caching)
– More than 2 inputs possible (i.e. > 2 operand registers)
Instruction Augmentation
(paper RC-6)
Similar to PRISC: single-cycle, combinational-logic configurable array
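The "manage multiple instructions loaded" point is essentially a configuration cache. A minimal sketch (my simplification; Chimaera's real replacement policy and slot count may differ) shows the stall-and-reload behavior with LRU eviction:

```python
# Sketch of configuration caching for RFUOPs: several configurations stay
# resident; a miss "stalls" the processor while the needed configuration is
# reloaded, evicting the least recently used entry.
from collections import OrderedDict

class ConfigCache:
    def __init__(self, slots):
        self.slots, self.cache, self.misses = slots, OrderedDict(), 0

    def fetch(self, rfuop, backing_store):
        if rfuop in self.cache:
            self.cache.move_to_end(rfuop)        # hit: refresh LRU order
        else:
            self.misses += 1                     # miss: stall + reload
            if len(self.cache) >= self.slots:
                self.cache.popitem(last=False)   # evict LRU configuration
            self.cache[rfuop] = backing_store[rfuop]
        return self.cache[rfuop]

store = {n: (lambda n=n: n) for n in range(8)}   # stand-in configurations
cc = ConfigCache(slots=2)
for op in [0, 1, 0, 2, 1]:       # 0,1 miss; 0 hits; 2 evicts 1; 1 re-misses
    cc.fetch(op, store)
print(cc.misses)                 # 4
```

The contrast with PRISC is the whole point: with more than one resident slot, alternating between a working set of RFUOPs no longer faults on every other instruction.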
CMPE750 - Shaaban #54 lec # 9 Spring 2015 4-7-2015
Chimaera Architecture
• "Live" copy of register file values feeds into the array
• Each row of the array may compute from register values or intermediates (other rows)
• Tag on array to indicate RFUOP
(paper RC-6)
Major difference from PRISC: more than two register inputs possible
FPGA-like array; host processor
CMPE750 - Shaaban #55 lec # 9 Spring 2015 4-7-2015
Chimaera Architecture
• Array can compute on values as soon as they are placed in the register file
• Logic is combinational (no state in array)
• When RFUOP matches:
– stall until result ready
 • critical path only from late inputs
– drive result from matching row
(paper RC-6)
CMPE750 - Shaaban #56 lec # 9 Spring 2015 4-7-2015
GARP (Berkeley, 1997)
• Integrates configurable array (FPGA-like array) as coprocessor (not FU), in one chip:
– Similar bandwidth/coupling to processor as an FU
– Array has its own access to memory
• Supports multi-cycle operation:
– Allows state
– Cycle counter to track operation
• Fast operation selection:
– Cache for multiple configurations
– Dense encodings, wide path to memory
Instruction Augmentation
(paper RC-3)
MIPS is also the base ISA selected for GARP
GARP = Global Array Reconfigurable Processor
CMPE750 - Shaaban #57 lec # 9 Spring 2015 4-7-2015
GARP
• Augmented MIPS ISA (base ISA): coprocessor operations
– Issue gaconfig to make a particular configuration resident (may be active or cached)
– Explicitly move data to/from array:
 • 2 writes, 1 read (like an FU, but not 2W + 1R)
– Processor suspends during coprocessor operation:
 • Cycle count tracks operation
– Array may directly access memory:
 • Processor and array share memory space
 • cache/MMU keeps processor and array consistent
– Can exploit streaming and data-parallel operations
(paper RC-3)
State/multi-cycle operation allowed in GARP
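The gaconfig / mtga / mfga calling sequence can be sketched as follows (a simplified model in Python; the real operations are MIPS coprocessor instructions, and the "configuration" here is a stand-in callable):

```python
# Sketch of the GARP coprocessor calling sequence: load a configuration,
# move operands in, suspend the processor for a counted number of cycles,
# then move the result out. Unlike PRISC/Chimaera, the array keeps state.

class GarpArray:
    def __init__(self):
        self.config, self.regs, self.state = None, {}, 0

    def gaconfig(self, cfg):          # make a configuration resident
        self.config = cfg

    def mtga(self, name, value):      # move data to the array (writes)
        self.regs[name] = value

    def run(self, cycles):            # processor suspends; counter ticks
        for _ in range(cycles):
            self.state = self.config(self.regs, self.state)

    def mfga(self):                   # move result from the array (read)
        return self.state

arr = GarpArray()
arr.gaconfig(lambda regs, acc: acc + regs["a"] * regs["b"])  # stateful MAC
arr.mtga("a", 3)
arr.mtga("b", 4)
arr.run(cycles=5)                     # accumulates 3*4 each cycle
print(arr.mfga())                     # 60
```

The multiply-accumulate configuration is only an illustration, but it shows why state matters: the same PRISC-style stateless PFU could not hold the running sum across cycles.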
CMPE750 - Shaaban #58 lec # 9 Spring 2015 4-7-2015
GARP Processor Instructions
Augmented to MIPS ISA
(paper RC-3)
Load a configuration; pass operands to array; get results from array
CMPE750 - Shaaban #59 lec # 9 Spring 2015 4-7-2015
GARP Array
• Row-oriented logic:
– Denser for datapath operations
• Dedicated path for processor/memory data:
– Processor does not have to be involved in the array-memory path (i.e. DMA used by array)
(paper RC-3)
FPGA-like array
CMPE750 - Shaaban #60 lec # 9 Spring 2015 4-7-2015
GARP Results
• General results:
– 10-20x on stream, feed-forward operation
– 2-3x when data dependencies limit pipelining
(paper RC-3)
FPGA-like array
CMPE750 - Shaaban #61 lec # 9 Spring 2015 4-7-2015
PRISC/Chimaera vs. GARP
• PRISC/Chimaera:
– Basic op is single cycle: expfu (rfuop)
– No state
– Could conceivably have multiple PFUs?
– Discover parallelism => run in parallel?
– Can't run deep pipelines
– Configurable array has no direct access to memory
• GARP:
– Basic op is multicycle: gaconfig, mtga, mfga
– Can have state / deep pipelining
– Multiple arrays viable?
– Identify mtga/mfga with corresponding gaconfig?
– Configurable array has access to memory
CMPE750 - Shaaban #62 lec # 9 Spring 2015 4-7-2015
Common Instruction Augmentation Features
• To get around instruction expression limits (and the resulting computation/ISA semantic gap):
– Define new instruction in configurable array (usually an FPGA):
 • Many bits of config … broad expressibility
 • Many parallel operations possible
– Give the array configuration a short "name" which the processor can call out (augmented instructions):
 • … effectively the address of the operation
Close semantic gap
CMPE750 - Shaaban #63 lec # 9 Spring 2015 4-7-2015
Hybrid-Architecture RC Compute Models: VLIW/microcoded Model
• Similar to the instruction augmentation model but…
• Usually utilizes micro-coded arrays of simple processors (coarse-grain reconfigurable hardware, not FPGAs)
• Single tag (address, instruction):
– controls a number of more basic operations
• Some difference in expectation:
– can sequence a number of different tags/operations together
Examples (coarse grain RC): REMARC (Stanford, 1998), RAW (MIT, 1997), MorphoSys (UC Irvine, 2000), MATRIX (MIT, 1997), RaPiD (Reconfigurable Pipelined Datapaths) (University of Washington, 1996), PipeRench (Carnegie Mellon, 1999), DAPDNA-2 (IPFlex Inc., 2004?), …
See DAPDNA-2 Handout
CMPE750 - Shaaban #64 lec # 9 Spring 2015 4-7-2015
REMARC (Stanford, 1998)
• Array of "nano-processors" (RC array):
– 16b, 32 instructions each
– VLIW-like execution, global sequencer (global control unit)
• Coprocessor interface (similar to GARP)
– But: no direct array-memory access
VLIW/microcoded Model
(paper RC-5)
CMPE750 - Shaaban #65 lec # 9 Spring 2015 4-7-2015
REMARC Architecture
• Issue coprocessor rex instructions:
– Global controller sequences the nano-processors
– Multiple cycles (microcode)
• Each nano-processor has its own I-store (VLIW, micro-code RAM)
Here the array has 8 x 8 = 64 nano-processors (micro-coded)
(paper RC-5)
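REMARC's execution model can be sketched in miniature (a heavy simplification: real nano-processors have their own datapaths, neighbor links, and per-cell I-stores, while this toy gives every cell the same two-step program).

```python
# Sketch of the REMARC model: a global sequencer broadcasts one microcode PC
# per cycle, and every nano-processor executes the instruction at that PC
# from its own instruction store against its own local state.

ROWS, COLS = 8, 8                    # 8 x 8 = 64 nano-processors

# One shared I-store for simplicity: a list of register-update steps.
istore = [lambda r: r + 1,           # step 0: increment local register
          lambda r: r * 2]           # step 1: double it

nano_regs = [[0] * COLS for _ in range(ROWS)]   # each cell's local register

for pc in range(len(istore)):        # global sequencer: one PC for all cells
    for i in range(ROWS):
        for j in range(COLS):
            nano_regs[i][j] = istore[pc](nano_regs[i][j])

print(nano_regs[0][0])               # 2: (0 + 1) * 2 in every cell
```

The key structural point survives the simplification: control (the PC) is global and cheap, while computation is replicated 64 ways, which is what makes the model VLIW-like rather than 64 independent threads.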
CMPE750 - Shaaban #66 lec # 9 Spring 2015 4-7-2015
REMARC Results
MPEG2
DES
(paper RC-5)
CMPE750 - Shaaban #67 lec # 9 Spring 2015 4-7-2015
Hybrid-Architecture RC Compute Models: Observation
• All single threaded:
– Limited to parallelism at:
 • Instruction level (VLIW, bit-level)
 • Data level (vector/stream/SIMD)
– No task/thread level parallelism (TLP)
 • Except for IO: dedicated task parallel with processor task
CMPE750 - Shaaban #68 lec # 9 Spring 2015 4-7-2015
Hybrid-Architecture RC Compute Models: Autonomous Coroutine
• RC array task is decoupled from the processor (a separate thread):
– Fork operation / join upon completion
• Array has its own:
– Internal state
– Access to shared state (memory)
• NAPA supports this to some extent:
– Task level, at least, with multiple devices
Example: OneChip (Toronto, 1998)
(paper RC-2)
CMPE750 - Shaaban #69 lec # 9 Spring 2015 4-7-2015
OneChip (Toronto, 1998)
• Want the array to have direct memory-to-memory operations
• Want to fit into the programming model/ISA:
– without forcing exclusive processor/FPGA operation
– allowing decoupled processor/array execution
• Key Idea:
– FPGA operates on memory-to-memory regions
– Make regions explicit to processor issue
– Scoreboard memory blocks (to check for dependency violations)
Hybrid-Architecture RC Compute Models: Autonomous Coroutine
(paper RC-2)
CMPE750 - Shaaban #70 lec # 9 Spring 2015 4-7-2015
OneChip Pipeline
(paper RC-2)
Processor; FPGA (has its own configured simple processor, a "limited ISA"); main memory
CMPE750 - Shaaban #71 lec # 9 Spring 2015 4-7-2015
OneChip Coherency
(paper RC-2)
Memory-block hazards handled: RAW, WAR, WAW
CMPE750 - Shaaban #72 lec # 9 Spring 2015 4-7-2015
OneChip Instructions
• Basic operation is:
– FPGA: MEM[Rsource] → MEM[Rdst] (memory to memory)
– Rsource = source memory block base address; Rdst = destination memory block base address
 • block sizes are powers of 2
• Supports 14 "loaded" functions:
– DPGA/contexts, so 4 can be cached
(paper RC-2)
CMPE750 - Shaaban #73 lec # 9 Spring 2015 4-7-2015
OneChip
• Basic op is: FPGA MEM → MEM
• No state between these ops
• Coherence is that ops appear sequential
• Could have multiple/parallel FPGA compute units:
– scoreboard with processor and each other
• Can't chain FPGA operations?
(paper RC-2)
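The memory-block scoreboarding that makes these operations appear sequential can be sketched as a hazard check (my reading of the model, not OneChip's actual logic; the addresses and helper names are illustrative):

```python
# Sketch of memory-block scoreboarding: each pending FPGA op declares its
# source and destination blocks, and a new op may issue only if its blocks
# create no RAW, WAR, or WAW hazard against pending ops.

def overlaps(a, b):
    """Blocks as (base, size); sizes powers of 2 in the OneChip model."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

def hazard(new_src, new_dst, pending):
    for src, dst in pending:
        if overlaps(new_src, dst):   # RAW: reads a block being written
            return "RAW"
        if overlaps(new_dst, src):   # WAR: writes a block being read
            return "WAR"
        if overlaps(new_dst, dst):   # WAW: two writers to the same block
            return "WAW"
    return None                      # safe to issue concurrently

# One pending FPGA op: MEM[0x1000] -> MEM[0x2000], 64-byte blocks.
pending = [((0x1000, 64), (0x2000, 64))]

print(hazard((0x2000, 64), (0x3000, 64), pending))  # RAW
print(hazard((0x4000, 64), (0x1000, 64), pending))  # WAR
print(hazard((0x4000, 64), (0x5000, 64), pending))  # None
```

Making the regions explicit at issue time is what lets the processor and multiple FPGA compute units run decoupled while still appearing sequential.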
CMPE750 - Shaaban #74 lec # 9 Spring 2015 4-7-2015
Summary
• Several different models and uses for a "Reconfigurable Processor":
– On computational kernels:
 • seen the benefits of coarse-grain interaction
  – GARP, REMARC, OneChip
– Missing:
 • More full application (multi-application) benefits of these architectures…
• Exploit density and expressiveness of fine-grained, spatial operations
• Number of ways to integrate cleanly into processor architecture… and their limitations