cs294-48: hardware design patterns berkeley hardware
TRANSCRIPT
![Page 1: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/1.jpg)
CS294-48: Hardware Design PatternsBerkeley Hardware Pattern Language Version 0.4
Krste AsanovicUC Berkeley
Fall 2009
![Page 2: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/2.jpg)
Overall Problem Statement
Application(s)
MP3 bit string Audio
Hardware (RTL)
MP3 bit stringAudio
(Berkeley) Hardware Pattern Language
![Page 3: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/3.jpg)
BHPL Goals
BHPL captures problem-solution pairs for creating hardware designs (machines) to execute applications
BHPL Non-GoalsDoesn’t describe applications themselves, only machines that execute applications and strategies for mapping applications onto machines
![Page 4: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/4.jpg)
BHPL Overview
Dense Linear Algebra
Sparse Linear Algebra
Spectral Methods FSMsGraph
Algorithms
Circuits
N-Body Methods
Dynamic Programming
Computational Patterns
Graph Traversal
Structured Grids
Unstructured Grids Graphical
Models
Pipelines
Model-View-Controller
Event Based Process Control
Agent&Repository
Structural Patterns
Map-Reduce
Iteration Layered Systems
Task Graphs
Applications (including OPL patterns)
Machines
BHPLMapping Patterns
![Page 5: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/5.jpg)
Machine Vocabulary
![Page 6: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/6.jpg)
Machine VocabularyMachines described using a hierarchical structural decomposition
Units (processing engines)MemoriesNetworks (connect multiple entities)Channels (point-to-point connections)(Memories, Networks, and Channels are just specialized Units)
![Page 7: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/7.jpg)
Hierarchy within Unit
Input Port
Output Port
Input/Output Port
![Page 8: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/8.jpg)
Hierarchy within Memory
![Page 9: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/9.jpg)
Hierarchy within Network (2)
![Page 10: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/10.jpg)
Hierarchy within Network
![Page 11: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/11.jpg)
Hierarchy within Channel
![Page 12: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/12.jpg)
Units are FSMsAll units are digital hardware, i.e., describable as a finite-state machine (FSM)
Different ways of factoring out the FSM description of a unit
Structural decomposition into hierarchical sub-units
Decompose functionality into control + datapath
Can further decompose control into inter-transaction scheduling plus intra-transaction sequencing
All factorings are equivalent, so pick factoring that best explains what unit does
![Page 13: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/13.jpg)
Structural Decomposition
![Page 14: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/14.jpg)
Control + Datapath
![Page 15: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/15.jpg)
Controller TypesState Machine Controller
control lines generated by state machine
Microcoded Controllersingle-cycle datapath, control lines in ROM/RAM
In-Order Pipeline Controllerpipelined control, dynamic interaction between stages
Out-of-Order Pipeline Controlleroperations within a control stream might be reordered internally
Threaded Pipeline Controllermultiple control streams one execution pipelinecan be either in-order (PPU) or out-of-order
![Page 16: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/16.jpg)
Leaf-Level HardwareRegister
Memory
Combinational Logic Wires
Tristate driverMultiplexer/ALU
FIFO
• Conventional schematic notation• (Need additional notation for asynchronous logic?)
![Page 17: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/17.jpg)
Hardware Patterns
![Page 18: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/18.jpg)
Decoupled UnitsProblem: Difficult to design a large unit with a single controller, especially when components have variable processing rates. Large controllers have long combinational paths.
Solution: Break large unit into smaller sub-units where each sub-unit has a separate controller and all channels between sub-units have some form of decoupling (i.e., no combinational path between units on each side of channel).
Applicability: Larger units where area and performance overhead of decoupling is small compared to benefits of simpler design and shorter controller critical paths.
Consequences: Decoupled channels generally have greater communication latency and area/power cost. Sub-unit controllers must cope with unknown arrival time of inputs and unknown time of availability of space on outputs. Sub-units must be synchronized explicitly.
![Page 19: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/19.jpg)
Decoupled Units
SharedMemory
Network
Unless shared memory is truly multiported, channels to memory must be decoupled
Channels to network are
always decoupled in any case
![Page 20: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/20.jpg)
Pipelined OperatorProblem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is required.
Solution: Divide combinational function using pipeline registers such that logic in each stage has critical path below desired cycle time. Improve throughput by initiating new operation every clock cycle overlapped with propagation of earlier operations down pipeline.
Applicability: Operators that require high throughput but where latency is not critical.
Consequences: Latency of function increases due to propagation through pipeline registers, adds energy/op. Any associated controller might have to track execution of operation across multiple cycles.
![Page 21: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/21.jpg)
Pipelined Operator
Clock Clock Clock
Clock Clock
f(g(in))
g(in) f(in)
![Page 22: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/22.jpg)
Multicycle Operator
Problem: Combinational function of operator has long critical path that would reduce system clock frequency. High throughput of this function is not required.
Solution: Hold input registers stable for multiple clock cycles of main system, and capture output after combinational function has settled.
Applicability: Operators where high throughput is not required, or if latency is critical (in which case, replicate to increase throughput).
Consequences: Associated controller has to track execution of operation across multiple cycles. CAD tools might detect false critical path in block.
![Page 23: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/23.jpg)
Multicycle Operator
Clock/2 Clock/2
Clock Clock
f(g(in))
f(g(in))
![Page 24: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/24.jpg)
Memory PatternsTrue Multiport Memory
Banked MemoryInterleave lesser-ported banks to provide higher bandwidth
Cached MemoryMemory hierarchy to provide higher-bandwidth, lower latency for predictable accesses
Bypassed MemoryReduce latency of pipelined dependent memory accesses
![Page 25: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/25.jpg)
Network Patterns
Connects multiple units using shared resources
BusLow-cost, ordered
CrossbarHigh-performance
Multi-stage networkTrade cost/performance
![Page 26: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/26.jpg)
Control+Datapath
Problem:
Solution:
Applicability:
Consequences:
![Page 27: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/27.jpg)
Machine Types
If SCSD, SCMD, MCMD machines are patterns, what is the problem-solution?
If they’re solutions, what’re the problems?
![Page 28: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/28.jpg)
SCMD Distributed Memory
Examples: MPP, ICL DAP, CM-1, CM-2, MasPar, Sony Playstation-2 Graphics Engine, Vision processing chips
C
N
D
M
D
M
D
M
D
M
D
M
![Page 29: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/29.jpg)
SCMD Shared Memory
Examples: STARAN, BSP, TI ASC, CDC Star-100, Multi-Lane Vector Machines
C
M
D D D D D
![Page 30: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/30.jpg)
MCMD Shared Memory
Examples: Burroughs B5x00 series, Network Packet Routers
M
D
C
D
C
D
C
D
C
D
C
![Page 31: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/31.jpg)
Homogeneous MCMD Distributed Memory
Examples: Caltech Cosmic Cube, Transputer, nCube, Clusters
Message Network
D
M
C
D
M
C
D
M
C
D
M
C
D
M
C
![Page 32: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/32.jpg)
Heterogeneous MCMD Distributed Memory
Examples: Signal Processing Pipelines,
P
M
P
M P
M
P
M
P
M
P = C + D
![Page 33: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/33.jpg)
Systolic
Examples: Warp, Raw, Motion Estimation Engines,
P
M
P
M
P
M
P
M
P
M
P
M
P
M
P = C + D
![Page 34: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/34.jpg)
Channels
Control->Datapathdirectpipelined? (maybe don’t need pipelined controller?)
Datapath<->Memoryfixed latency
cannot have shared memory without true multiportdecoupled in-orderout-of-order
Control<->Network<->Controlfixed latencyFIFOsaddressable messaging
![Page 35: CS294-48: Hardware Design Patterns Berkeley Hardware](https://reader031.vdocument.in/reader031/viewer/2022012413/616cdffba1652d16682aaac1/html5/thumbnails/35.jpg)
BHPL Version 0.3
Systolic
Application Patterns
(from OPL)
SCMD Distributed Memory Heterogeneous MCMD Distributed Memory
Machine Organizations
SCMD Shared Memory
MCMD Shared Memory
Hardware Building Blocks
FIFO Multiport Memory
CAM
Arbiter
ProcessingIn-Order Pipeline
FSM
Microcoded Engine Out-of-
Order Pipeline
Threaded Pipeline
Communication Channel
Crossbar
Memory NetworksPMNC Layer
Banked Memory
Cached Memory
Bypassed Memory
Bus
Multi-Stage Networks
Dense Linear Algebra
Sparse Linear Algebra
Spectral Methods FSMs
Graph Traversal
Graph Algorithms
Circuits
N-Body Methods
Dynamic Programming
Structured Grids
Unstructured Grids Graphical
Models
Computational PatternsPipelines
Model-View-Controller
Event Based
Map-Reduce
Process Control
Iteration
Agent&Repository
Layered Systems
Task Graphs
Structural Patterns
Homogeneous MCMD Distributed Memory
Channels