cs250b: modern computer systemsswjun/courses/2020s-cs250b... · o supports fpga programming,...
TRANSCRIPT
![Page 1: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/1.jpg)
CS250B: Modern Computer Systems
What Are FPGAsAnd Why Should You Care
Sang-Woo Jun
![Page 2: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/2.jpg)
What Are FPGAs
Field-Programmable Gate Array
Can be configured to act like any circuit – More later!
Can do many things, but we focus on computation acceleration
![Page 3: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/3.jpg)
FPGAs Come In Many Forms
PCIe-Attached
CPU Integrated In-Network
In-Storage
![Page 4: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/4.jpg)
How Is It Different From CPU/GPUs
GPU – The other major accelerator
CPU/GPU hardware is fixedo “General purpose”
o we write programs (sequence of instructions) for them
FPGA hardware is not fixedo “Special purpose”
o Hardware can be whatever we want
o Will our hardware require/support software? Maybe!
Optimized hardware is very efficiento GPU-level performance**
o 10x power efficiency (300 W vs 30 W)
![Page 5: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/5.jpg)
Analogy
“The Z-Berry”“Experimental Investigations on Radiation Characteristics of IC Chips”benryves.com “Z80 Computer”
CPU/GPU comes with fixed circuits FPGA gives you a big bag of components
To build whatever
Shadi Soundation: Homebrew 4 bit CPU
Could be a CPU/GPU!
![Page 6: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/6.jpg)
Fine-Grained Parallelism of Special-Purpose Circuits
Example -- Calculating gravitational force: 𝐺×𝑚1×𝑚2
(𝑥1−𝑥2)2+(𝑦1−𝑦2)
2
8 instructions on a CPU → 8 cycles**
Much fewer cycles on a special purpose circuit
A = G × m1
B = A × m2
C = x1 - x2
D = C2
E = y1 - y2
F = E2
G = D + F
Ret = B / G
A = G × m1 × m2 B = (x1 - x2)2 C = (y1 - y2)2
D = B + C
Ret = B / G
4 cycles with basic operations
3 cycles with compound operations
Ret = (G × m1 × m2) / ((x1 - x2)2 + (y1 - y2)2)
1 cycle with even further compound operations
May slow down clock
![Page 7: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/7.jpg)
Also, Remember:
![Page 8: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/8.jpg)
Coarse-Grained Parallelism ofSpecial-Purpose Circuits
Typical unit of parallelism for general-purpose units are threads ~= cores
Special-purpose processing units can also be replicated for parallelismo Large, complex processing units: Few can fit in chip
o Small, simple processing units: Many can fit in chip
Only generates hardware useful for the applicationo Instruction? Decoding? Cache? Coherence?
![Page 9: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/9.jpg)
Hurdles of FPGA Programming
Very close to actually designing a chip, with all associated difficultieso Signal propagation delays, clock speeds
o Circuit design with acceptable propagation delay
o Explicit pipeline design for performance
o On-chip resource management (Number of available transistors, SRAM, …)
High-level tools exist, but not yet very effectiveo Often trade-off between ease of programming and performance/utilization
![Page 10: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/10.jpg)
How Is It Different From ASICs
ASIC (Application-Specific Integrated Circuit)o Special chip purpose-built for an application
o E.g., ASIC bitcoin miner, Intel neural network accelerator
o Function cannot be changed once expensively built
+ FPGAs can be field-programmedo Function can be changed completely whenever
o FPGA fabric emulates custom circuits
- Emulated circuits are not as efficient as bare-metalo ~10x performance (larger circuits, faster clock)
o ~10x power efficiency
![Page 11: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/11.jpg)
Prominent Technology For Performance Scaling
Efficient use of increasing transistor budgeto Platform for specialized architectures
Reconfigurable based on application/algorithm changeso Kind of like software running on a hardware processor
o Single FPGA hardware can be shared between different applications
Common operations often embedded using ASIC blockso High-performance, low-power, small size
o Arithmetic, shifters, neural network operations, high-speed communications, etc
![Page 12: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/12.jpg)
FPGA Architecture Goals
Reconfigure a physical chip to act like a different circuit
Transistors within an integrated circuit are fixed during fabrication!o How can FPGAs change themselves?
Shadi Soundation: Homebrew 4 bit CPU
![Page 13: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/13.jpg)
Basic FPGA Architecture“Configurable logic block (CLB)”
Programmable interconnect
I/O block 6-InputLook-Up
Table
FF
Latch
Programmable
Input 1 Input 2 Output
0 0 0
0 1 0
1 0 0
1 1 1
Ex) 2-LUT for “AND”
~
Stores state forsequential circuit
construction
![Page 14: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/14.jpg)
Basic FPGA Architecture – DSP Blocks
CLBs act as gates – Many needed to implement high-level logic
Arithmetic operation provided as efficient ALU blockso “Digital Signal Processing (DSP) blocks”
o Each block provides an adder + multiplier
“DSP block”
× +/-
![Page 15: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/15.jpg)
Basic FPGA Architecture – Block RAM
CLB can act as flip-flopso (~1 bit/block) – tiny!
Some on-chip SRAM provided as blockso ~18/36 Kbit/block, MBs per chip
o Massively parallel access to data → multi-TB/s bandwidth
“Block RAM”
![Page 16: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/16.jpg)
Basic FPGA Architecture – Hard Cores
Some functions are provided as efficient, non-configurable “hard cores”o Multi-core ARM cores (“Zynq” series)
o Multi-Gigabit Transceivers
o PCIe/Ethernet PHY
o Memory controllers
o …
ARM PCIe
Ethernet
Memory
![Page 17: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/17.jpg)
Example Accelerator Card Architecture
PCIe
FPGA
DRAM
DRAM
1GbE
FMC
40GbE
“FPGA Mezzanine Card” Expansiono Network Ports, Memory, Storage, PCIe, …
General-Purpose I/O Pins Multi-Gigabit Transceivers
![Page 18: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/18.jpg)
Example Development Board (Xilinx VCU108)
![Page 19: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/19.jpg)
Example Development Board(Intel Stratix 10)
![Page 20: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/20.jpg)
Example Accelerator Card (Xilinx Alveo U250)
![Page 21: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/21.jpg)
Programming FPGAs
Languages and tools overlap with ASIC/VLSI design
FPGAs for acceleration typically done with eithero Hardware Description Languages (HDL): Register-Transfer Level (RTL) languages
o High-Level Synthesis: Compiler translates software programming languages to RTL
![Page 22: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/22.jpg)
Major Hardware Description Languages
Verilog: Most widely used in industryo Relatively low-level language supported by everyone
Chisel – Compiles to Verilogo Relatively high-level language from Berkeley
o Embedded in the Scala programming language
o Prominently used in RISC-V development (Rocket core, etc)
Bluespec – Compiles to Verilogo Relatively high-level language from MIT
o Supports types, interfaces, etc
o Also active RISC-V development (Piccolo, etc)
![Page 23: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/23.jpg)
Register-Transfer Level
RTL models a circuit using:o Registers (state), and
o Combinational logic (computation)
o Typically everything is clock-synchronous
Unfamiliar constraint: Transfer must finish within a clock cycleo Logic must have a short enough critical path, or
o Clock must be slow enough
Register A
Register B
Transfer (Logic)Register A
Register C
![Page 24: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/24.jpg)
Reminder: Critical Path
A chain of logic components has additive delayo The “depth” of combinational circuits is important
The “critical path” defines the overall propagation delay of a circuit
Example: A full adderSource: en:User:Cburnett @ Wikimedia
Critical path of three componentstPD = tPD(xor2)+tPD (and2)+tPD (or2)
![Page 25: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/25.jpg)
Timing Behavior of State Elements
Meeting the setup time constrainto “Processing must fit in clock cycle”
o After rising clock edge,
o tPD(State element 1) + tPD(Combinational logic) + tSETUP(State element 2)
o must be smaller than the clock period
Data from here…must reach here
…before the next clock Otherwise, “timing violation”
![Page 26: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/26.jpg)
Complexities of RTL
Example RTL logic:o Reg#(Bit#(64)) A, B; // Two 64-bit registers
o A <= (A>>B); // Somewhere, do a variable-width shift
o This is very inefficient on an FPGA! Very long critical path• Long critical path -> Slow clock
• Aside: Reg#(Bit#(2)) B; then A>>(B*16); Generates much better hardware
Kind of have to know what kind of circuits are generated by what logico Typically covered by a few rule of thumbs
o Will be covered later!
![Page 27: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/27.jpg)
High-Level Synthesis
Compiler translates software programming languages to RTL
High-Level Synthesis compiler from Xilinx, Altera/Intelo Compiles C/C++, annotated with #pragma’s into RTL
o Theory/history behind it is a complex can of worms we won’t go into
o Personal experience: needs to be HEAVILY annotated to get performance
o Anecdote: Naïve RISC-V in Vivado HLS achieves IPC of 0.0002 [1], 0.04 after optimizations [2]
OpenCLo Inherently parallel language more efficiently translated to hardware
o Stable software interface
[1] http://msyksphinz.hatenablog.com/entry/2019/02/20/040000[2] http://msyksphinz.hatenablog.com/entry/2019/02/27/040000
![Page 28: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/28.jpg)
FPGA Compilation Toolchain
High-Level HDL Code
Language Compiler
Verilog/VHDL
Bitfile
High-level language vendor tool
Synthesize NetlistMap/Place/Route
FPGA Vendor toolchain (Few open source)
Constraint File
“Which transceiver instance shouldtop_transceiver_01 map to?”And so, so much more…
FunctionalSimulation
Cycle-levelSimulation
![Page 29: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/29.jpg)
Example System Abstraction For Accelerators
Vendor-Provided “Shell”
Ethernet DRAM…
User Code
Abstract Interface
Hardware
User Software
Software
PCIe Vendor PCIe Driver
Abstract Interface
![Page 30: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/30.jpg)
Programming/Using an FPGA Accelerator
Bitfile is programmed to FPGA over “JTAG” interfaceo Typically used over USB cable
o Supports FPGA programming, limited debugging access, etc
o Kind of slow…
o Bitfile often stored in on-board flash for persistence
Modern FPGAs provide faster programming methods as wello On-chip accelerator to load from local memory
• e.g., Xilinx ICAP (Internal Configuration Access Port)
o Milliseconds to program a new design
![Page 31: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/31.jpg)
Programming/Using an FPGA Accelerator
PCIe-attached FPGA accelerator card is typically used similarly to GPUso Program FPGA, execute software
o Software copies data to FPGA board, notify FPGA-> FPGA logic performs computations-> Software copies data back from FPGA
FPGA flexibility gives immense freedom of usage patternso Streaming, coherent memory, …
o Copy to on-board DRAM, or directly to on-chip memory, …
o Decompress before loading to DRAM, etc!
![Page 32: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/32.jpg)
Partial Reconfiguration
FPGA
Sub-components Parts of the FPGA can be swapped out dynamically without turning off FPGAo Physical area is drawn on chip
Used in Amazon F1, etc
Toolchain support for isolation
![Page 33: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/33.jpg)
FPGAs In The Cloud
Amazon EC2 F1 instance (1 – 4 FPGAs)
![Page 34: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/34.jpg)
FPGAs In The Cloud
Microsoft Azure with FPGAs embedded in the networko Powering Bing, etc
![Page 35: CS250B: Modern Computer Systemsswjun/courses/2020S-CS250B... · o Supports FPGA programming, limited debugging access, etc o Kind of slow… o Bitfile often stored in on-board flash](https://reader035.vdocument.in/reader035/viewer/2022071300/60892e280a011d43e0362e63/html5/thumbnails/35.jpg)
Questions?