Lecture 10: Hardware Accelerators, Ingo Sander
Introduction: Hardware Accelerator
April 20, 2023 IL2206 Embedded Systems 3
Design Constraint Propagation
A design constraint on the system level leads to new design constraints on the subsystem level.
Constraint on the system (P1, P2, P3): t < 500 ms
Constraints on the subsystems:
  P1: t < 100 ms
  P2: t < 250 ms
  P3: t < 150 ms
Design Constraint Propagation
An estimation tool can give the execution time of a subsystem.
What happens if a subsystem is too slow?
Constraint on the system (P1, P2, P3): t < 500 ms
Subsystem   Constraint    Execution time
P1          t < 100 ms    95 ms
P2          t < 250 ms    280 ms (too slow!)
P3          t < 150 ms    145 ms
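The check on this slide can be written down in a few lines. The following is only a sketch: the variable names are made up for illustration, and it assumes that P1, P2, and P3 run one after another, so that their execution times add up to the system time.

```python
# Constraint values and estimated execution times from the slide (in ms).
SYSTEM_LIMIT_MS = 500
budgets_ms = {"P1": 100, "P2": 250, "P3": 150}
estimates_ms = {"P1": 95, "P2": 280, "P3": 145}

# A subsystem violates its constraint if the estimate is not below its budget.
violations = [name for name, limit in budgets_ms.items()
              if estimates_ms[name] >= limit]

# Assumption: the subsystems run sequentially, so the times add up.
total_ms = sum(estimates_ms.values())

print("subsystems over budget:", violations)
print("estimated total:", total_ms, "ms")
print("system constraint met:", total_ms < SYSTEM_LIMIT_MS)
```

Under this assumption, a single slow subsystem is enough to break the system-level constraint, even though the other two subsystems are within budget.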
How to improve the performance of a microprocessor system?
- Improve your code
- Choose a faster version of your microprocessor
- Add additional computational units that perform special functions:
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator
Hardware Accelerators
If the overall performance of a uniprocessor system is too slow, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator!
The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.
© 2000 Wolf (Morgan Kaufmann)
Accelerated System Architecture
[Figure: CPU, accelerator, memory, and I/O connected by a bus. Numbered steps: (1) Request, (2) Data, (3) Result.]
Request and Result may also require access to memory
An Accelerator is not a Co-Processor
A co-processor is connected to the CPU and executes special instructions. Instructions are dispatched by the CPU.
An accelerator appears as a device on the bus
Amdahl’s Law
Amdahl’s law states that the performance improvement of an improved unit is limited by the fraction of time the unit is in use!
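$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Old}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{1}{(1 - \text{Fraction}_{\text{Enhanced}}) + \dfrac{\text{Fraction}_{\text{Enhanced}}}{\text{Speedup}_{\text{Enhanced}}}}$$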
Fraction_Enhanced denotes the fraction of the original execution time in which the enhancement can be used!
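The formula follows in two steps. In this sketch, T_old, T_new, F, and S are shorthand introduced here for ExecutionTime_Old, ExecutionTime_Enhanced, Fraction_Enhanced, and Speedup_Enhanced: the part of the time that cannot use the enhancement stays as it is, and the part that can use it shrinks by the factor S.

```latex
\begin{align*}
T_{\text{new}} &= (1 - F)\,T_{\text{old}} + \frac{F\,T_{\text{old}}}{S} \\
\text{Speedup} &= \frac{T_{\text{old}}}{T_{\text{new}}} = \frac{1}{(1 - F) + F/S}
\end{align*}
```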
Example (Hennessy & Patterson)
An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Which is better?
- Implement a square root unit that speeds up this operation by a factor of 10, or
- Improve the floating-point instructions in general so that they run 2 times faster.
Example (Hennessy & Patterson)
Square root: Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1/0.82 = 1.22
Floating-point: Speedup = 1 / ((1 - 0.5) + 0.5/2) = 1/0.75 = 1.33
Improving the floating-point instructions in general gives the larger overall speedup.
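The two calculations can be reproduced with a small helper. This is a minimal sketch; the function names `amdahl_speedup` and `max_speedup` are chosen here for illustration and are not part of the lecture material.

```python
def amdahl_speedup(fraction: float, speedup_enhanced: float) -> float:
    """Overall speedup if `fraction` of the time is sped up by `speedup_enhanced`."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup_enhanced)


def max_speedup(fraction: float) -> float:
    """Upper bound on the overall speedup for an infinitely fast enhanced unit."""
    return 1.0 / (1.0 - fraction)


# Square root unit: 20% of the time, 10 times faster.
sqrt_unit = amdahl_speedup(0.2, 10)
# Floating-point instructions in general: 50% of the time, 2 times faster.
fp_general = amdahl_speedup(0.5, 2)

print(f"square root unit:         {sqrt_unit:.2f}")
print(f"faster floating point:    {fp_general:.2f}")
print(f"upper bound for F = 0.2:  {max_speedup(0.2):.2f}")
```

Even an infinitely fast square root unit could not exceed a speedup of 1.25 here, which is still below the 1.33 of the general floating-point improvement.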
Amdahl’s Law: Lessons to be learned
The maximum possible speedup is limited by the fraction! Assume an infinite speedup of the enhanced unit:
Speedup = 1 / ((1 - F) + F/∞) = 1 / (1 - F)
Fraction F      0.1    0.3    0.5    0.9
Max. speedup    1.11   1.43   2      10
Improve the common cases!
Amdahl’s Law for Parallel Architectures
Amdahl’s law can even be used for parallel architectures, where sequential code is parallelized and runs on identical parallel units!
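$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Serial}}}{\text{ExecutionTime}_{\text{Parallel}}} = \frac{1}{(1 - \text{Fraction}_{\text{Parallel}}) + \dfrac{\text{Fraction}_{\text{Parallel}}}{\text{ParallelUnits}}}$$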
Fraction_Parallel denotes the fraction of the code in which parallelism can be used!
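To see how quickly the returns diminish, the parallel form can be evaluated for a growing number of units. This is only a sketch; the parallel fraction of 0.9 and the unit counts are example values chosen here, not figures from the lecture.

```python
def parallel_speedup(fraction_parallel: float, parallel_units: int) -> float:
    """Amdahl's law for code running on identical parallel units."""
    return 1.0 / ((1.0 - fraction_parallel) + fraction_parallel / parallel_units)


# Example values (assumption): 90% of the code can be parallelized.
speedups = {n: parallel_speedup(0.9, n) for n in (2, 4, 8, 16)}

for units, speedup in speedups.items():
    print(f"{units:2d} units: speedup {speedup:.2f}")
```

With 16 units the speedup is only 6.4, and no number of units can push it past 1 / (1 - 0.9) = 10.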
Design: Hardware Accelerator
Design of a hardware accelerator
- Which functions shall be implemented in hardware and which functions in software?
- Hardware/software co-design: joint design of hardware and software architectures
- The hardware accelerator can be implemented as
  - an application-specific integrated circuit (ASIC)
  - a field-programmable gate array (FPGA)
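Hardware/Software Co-Design
[Figure: co-design flow. A system model of the original program (concurrent processes) goes through partitioning and mapping, supported by estimation and a library. This yields a HW model (VHDL), which goes through HW synthesis to a netlist, and a SW model (C/C++), which goes through SW compilation to an executable program. Both models are verified.]
Which functions shall go to HW and which to SW?
Good estimates are needed for good partitioning.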
Hardware/Software Co-Design
Hardware/software co-design covers the following problems:
- Co-specification: the creation of specifications that describe both the hardware and the software of a system
- Co-synthesis: the automatic or semi-automatic design of hardware and software to meet a specification
- Co-simulation: the simultaneous simulation of hardware and software elements on different levels of abstraction
Co-Synthesis
Four tasks are included in co-synthesis:
- Partitioning: the functionality of the system is divided into smaller, interacting computation units
- Allocation: the decision which computational resources are used to implement the functionality of the system
- Scheduling: if several system functions have to share the same resource, the usage of the resource must be scheduled in time
- Mapping: the selection of a particular allocated computational unit for each computation unit
All these tasks depend on each other!
Partitioning
- During partitioning, the functionality of the system is partitioned into several parts (corresponding to the allocated/available components)
- Many possible partitions exist
- Analysis is done by evaluating the costs of different partitions
[Figure: two alternative partitions of a graph with the nodes A, B, C, D, and E.]
Estimation
To get a good partitioning, good figures are needed for
- the performance (execution time) of a function on different components
- the communication time
Estimation: Accuracy and Fidelity
- The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation
- The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations
Fidelity
Though accuracy is much higher in (2) than in (1), the estimates in (2) are not very useful for the partitioning process because of the low fidelity!
This can cause bad design decisions!
[Figure: two plots of a quality metric for the designs A, B, and C, each comparing estimate and measurement. (1) Fidelity = 100%. (2) Fidelity = 33% (only A > C is predicted correctly).]
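Fidelity can be computed by comparing every pair of designs. The sketch below uses made-up numbers for three designs A, B, and C (they are not read off the figure); the helper names are illustrative as well. It reproduces the two situations: estimates that are far off but rank the designs correctly, and estimates that are close but rank them wrongly.

```python
from itertools import combinations


def sign(value: float) -> int:
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fidelity(estimates: dict, measurements: dict) -> float:
    """Fraction of design pairs whose ordering is predicted correctly."""
    pairs = list(combinations(sorted(estimates), 2))
    correct = sum(
        sign(estimates[a] - estimates[b]) == sign(measurements[a] - measurements[b])
        for a, b in pairs
    )
    return correct / len(pairs)


# Hypothetical measured quality metric: B > A > C.
measured = {"A": 5.0, "B": 5.5, "C": 4.8}
# Case 1: inaccurate estimates that still preserve the ordering.
far_but_ordered = {"A": 2.0, "B": 3.0, "C": 1.0}
# Case 2: accurate estimates that get two of three comparisons wrong.
close_but_misordered = {"A": 5.2, "B": 4.9, "C": 5.0}

print(f"case 1 fidelity: {fidelity(far_but_ordered, measured):.0%}")
print(f"case 2 fidelity: {fidelity(close_but_misordered, measured):.0%}")
```

In case 2 only the comparison A > C is predicted correctly, so a partitioning tool that relies on these estimates would prefer the wrong design in two out of three comparisons.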
Hardware/Software Co-Design
Strategies:
1. Start with an "all-software" configuration
   While (constraints are not satisfied)
     Move the SW function that gives the best improvement to HW
   (implemented in COSYMA [Ernst, Henkel, Brenner 1993])
2. Start with an "all-hardware" configuration
   While (constraints are satisfied)
     Move the most costly HW component to SW
   (implemented in Vulcan [Gupta, DeMicheli 1995])
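The first strategy can be sketched as a greedy loop. This is a simplified illustration, not the actual COSYMA algorithm: the function name, the execution times, and the deadline are hypothetical, the functions are assumed to run one after another, and communication overhead and hardware cost are ignored.

```python
def partition_all_software_first(functions: dict, deadline: int):
    """Greedy sketch of strategy 1.

    functions maps a name to (software_time, hardware_time).
    Returns the set of functions moved to hardware and the resulting total time.
    """
    in_hw = set()

    def total_time() -> int:
        return sum(hw if name in in_hw else sw
                   for name, (sw, hw) in functions.items())

    while total_time() > deadline:
        # Improvement of each remaining software function if moved to hardware.
        gains = {name: sw - hw for name, (sw, hw) in functions.items()
                 if name not in in_hw and sw > hw}
        if not gains:
            break  # nothing left to move, the constraint cannot be met
        in_hw.add(max(gains, key=gains.get))

    return in_hw, total_time()


# Hypothetical execution times: name -> (software, hardware).
functions = {"f": (50, 20), "g": (120, 30), "h": (80, 70)}
hw_functions, total = partition_all_software_first(functions, deadline=200)
print("moved to hardware:", sorted(hw_functions))
print("total time:", total)
```

Starting from 250 time units in software, moving g alone already meets the deadline, so f and h stay in software. The second strategy would run the same kind of loop in the opposite direction.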
Papers on HW/SW Co-Design
- R. Ernst et al. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers. December 1993.
- R. K. Gupta and G. de Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers. December 1993.
- G. de Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE. March 1997.
- … (and many more)
Electronic versions of these and other papers can be accessed through the KTH Library (www.lib.kth.se)
System design tasks
- Design a heterogeneous multiprocessor architecture
  - Processing element (PE): CPU, accelerator, etc.
- Divide the tasks among the processing elements
- Verify that
  - the functionality of the system is correct
  - the system meets the performance constraints
Why accelerators?
- Better cost/performance
  - Custom logic may be able to perform an operation faster than a CPU of equivalent cost
  - CPU cost is a non-linear function of performance
- Improving performance by choosing a faster CPU may be very expensive!
[Figure: CPU cost plotted against performance.]
Accelerated system design
- First, determine that the system really needs to be accelerated
  - Which core function(s) shall be accelerated? (Partitioning)
  - How much faster is the accelerator on the core function?
  - How much is the data transfer overhead?
- Design tasks
  - Performance analysis
  - Scheduling and allocation
- Design the accelerator itself
- Design the CPU interface to the accelerator
Performance analysis
- The critical parameter is speedup: how much faster is the system with the accelerator?
- Must take into account:
  - Accelerator execution time
  - Data transfer time
  - Synchronization with the master CPU
    - The accelerator needs to know when it can start its computation
    - The CPU needs to know when the results are ready
Single- vs. multi-threaded
- One critical factor is available parallelism:
  - Single-threaded/blocking: the CPU waits for the accelerator
  - Multi-threaded/non-blocking: the CPU continues to execute along with the accelerator
- To multithread, the CPU must have useful work to do
  - But the software must also support multithreading
Sources of parallelism
- Overlap I/O and accelerator computation
  - Perform operations in batches: read in the second batch of data while computing on the first batch
- Find other work to do on the CPU
  - May reschedule operations to move work after accelerator initiation
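Total execution time
[Figure: timelines of the processes P1 to P4 on the CPU and A1 on the accelerator. Single-threaded: the CPU waits while A1 runs. Multi-threaded: execution splits so that the CPU continues while A1 runs, and joins again afterwards.]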
Communication Overhead: Data input/output times
Bus transactions include:
- flushing register/cache values to main memory
- time required for the CPU to set up the transaction
- overhead of data transfers by bus packets, handshaking, etc.
Accelerator execution time
Total accelerator execution time:
$$t_{accel} = t_{in} + t_x + t_{out}$$
where t_in is the data input time, t_x the time of the accelerated computation, and t_out the data output time.
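A short calculation shows why the transfer times matter. The numbers below are hypothetical example values and the function name is made up for illustration: the core computation is 2.5 times faster on the accelerator, but the effective gain is much smaller once input and output are counted.

```python
def accelerator_time(t_in: float, t_x: float, t_out: float) -> float:
    """Total accelerator execution time: data input + computation + data output."""
    return t_in + t_x + t_out


# Hypothetical values in time units.
t_cpu = 5.0  # time of the function when executed on the CPU
t_accel = accelerator_time(t_in=1.0, t_x=2.0, t_out=1.0)

speedup_core = t_cpu / 2.0           # counts the accelerated computation only
speedup_effective = t_cpu / t_accel  # counts data transfers as well

print(f"t_accel = {t_accel}")
print(f"speedup of the core computation: {speedup_core:.2f}")
print(f"effective speedup:               {speedup_effective:.2f}")
```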
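Execution time analysis
- Single-threaded: count the execution time of all component processes
- Multi-threaded: find the longest path through the execution
[Figure: timelines for the CPU and the accelerator (Acc.) in both cases. The execution time of A1 consists of t_in, t_x, and t_out, where t_in and t_out are communication overhead. In the single-threaded case P1, A1, P2, P3, and P4 run one after another; in the multi-threaded case A1 overlaps with processes on the CPU.]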
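Both analyses can be sketched on a small process graph. The durations and the dependency structure below are assumptions made for illustration (the figure gives no numbers): after P1, the CPU process P2 and the accelerator process A1 run in parallel, and P3 waits for both. The duration of A1 is meant to include t_in and t_out.

```python
from functools import lru_cache

# Hypothetical durations in time units; A1 includes t_in + t_x + t_out.
durations = {"P1": 3, "P2": 4, "A1": 5, "P3": 2, "P4": 3}

# Hypothetical dependencies: split after P1, join before P3.
predecessors = {
    "P1": [],
    "P2": ["P1"],
    "A1": ["P1"],
    "P3": ["P2", "A1"],
    "P4": ["P3"],
}

# Single-threaded: the CPU waits for the accelerator, so all times add up.
single_threaded = sum(durations.values())


@lru_cache(maxsize=None)
def finish_time(process: str) -> int:
    """Earliest finish time of a process, i.e. the longest path ending in it."""
    start = max((finish_time(p) for p in predecessors[process]), default=0)
    return start + durations[process]


# Multi-threaded: the execution time is the longest path through the graph.
multi_threaded = max(finish_time(p) for p in durations)

print("single-threaded:", single_threaded)
print("multi-threaded: ", multi_threaded)
```

With these values the longest path runs through A1, so the time of P2 is hidden completely behind the accelerator.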
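Example for Accelerator Architecture
[Figure: CPU, memory (Mem), and DMA on a bus, together with the accelerator. The accelerator consists of a bus interface, a read unit, a write unit, registers, and the core.]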
Accelerator/CPU interface
- Accelerator registers provide control registers for the CPU
- Data registers can be used for small data objects
- The accelerator may include special-purpose read/write logic
  - Especially valuable for large data transfers
Caching problems
Main memory provides the primary data transfer mechanism to the accelerator.
Programs must ensure that caching does not invalidate main memory data (Assume a cache in CPU).
Possible Problems with Caches
1. CPU reads location S.
2. Accelerator writes location S.
3. CPU reads location S.
[Figure: CPU with cache, memory, and accelerator. After steps 1 and 2, the read in step 3 is served from the cache and returns the wrong value!]
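The three steps can be replayed with a toy model. This is a behavioural sketch only: the classes `Memory` and `CachedCPU` and their methods are invented for illustration and do not correspond to any real hardware or driver API.

```python
class Memory:
    """Main memory shared by the CPU and the accelerator."""

    def __init__(self):
        self.data = {}


class CachedCPU:
    """CPU with a simple cache that has no coherence mechanism."""

    def __init__(self, memory: Memory):
        self.memory = memory
        self.cache = {}

    def read(self, address):
        # A miss fetches from memory, a hit returns the cached copy.
        if address not in self.cache:
            self.cache[address] = self.memory.data[address]
        return self.cache[address]

    def read_uncached(self, address):
        # Bypass the cache and read main memory directly.
        return self.memory.data[address]

    def invalidate(self, address):
        # Drop the cached copy so that the next read goes to memory.
        self.cache.pop(address, None)


memory = Memory()
memory.data["S"] = 1
cpu = CachedCPU(memory)

first = cpu.read("S")             # step 1: CPU reads S, S is now cached
memory.data["S"] = 42             # step 2: accelerator writes S in memory
stale = cpu.read("S")             # step 3: CPU reads S from the cache: wrong value
bypassed = cpu.read_uncached("S")  # remedy A: bypass the cache
cpu.invalidate("S")               # remedy B: invalidate, then read again
fresh = cpu.read("S")

print(first, stale, bypassed, fresh)
```

Both remedies require the programmer to know which locations the accelerator writes, which is why disciplined programming is needed when the hardware offers no coherence support.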
Cache Coherence Problem
- Cache coherence problems also appear in multiprocessor systems
  - Cache and main memory do not have the same contents
- The Avalon bus, like most on-chip buses, does not have a built-in mechanism to avoid these problems
[Figure: processors P1 to Pn, each with its own cache, connected by a bus to the main memory.]
Cache Coherence with Write-Through Caches
How to tackle cache coherence?
- Idea: caches must be aware of the transactions on the bus!
- Add extra hardware and define a protocol to be able to detect invalid data in the caches
- Take actions if cache or memory (in case of write-back caches) is invalid
[Figure: processors P1 to Pn with caches on a bus to the main memory. Labels: Cache-Memory Transition, Bus Snooping, and a cache coherence protocol with the states V and I.]
More about cache coherence protocols in IL2207 SoC Architectures
What to do if no cache coherence protocol exists?
- The designer has to be aware of possible cache coherence problems
- Disciplined programming is needed
  - Use commands to explicitly bypass the cache if there is a risk of a cache coherence problem
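Example: Accelerator
[Figure: data-flow graph computing h(f(x), g(y)): f takes x, g takes y, and h takes the results of f and g. Architecture: processor P, accelerator A, and memory M.]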
Execution Times
- Both P and A have sufficient registers
- P and A cannot access the bus simultaneously
- A memory access (load or store) takes 1 time unit
Function   P   A
f          5   2
g          5   2
h          5   -
Single-Processor Solution
[Figure: data-flow graph for h(f(x), g(y)) with f, g, and h all mapped to P.]
Operation      P
Load x         1
Load y         1
f              5
g              5
h              5
Store h(...)   1
Total          18
Processor-Accelerator Solution I
[Figure: data-flow graph for h(f(x), g(y)) with f and g mapped to A and h mapped to P.]
Operation   Unit   Time
Load x      A      1
Load y      A      1
f           A      2
g           A      2
Store f     A      1
Store g     A      1
Load f      P      1
Load g      P      1
h           P      5
Store h     P      1
Total              16
Still single-threaded!
Processor-Accelerator Solution II
[Figure: data-flow graph for h(f(x), g(y)) with f mapped to A, and g and h mapped to P.]
P (time)        A (time)
Load y (1)
g (5)           Load x (1), f (2), Store f (1)
Load f (1)
h (5)
Store h (1)
Total: 13
Exploitation of parallelism leads to a fast solution!
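The three totals (18, 16, and 13) can be checked with a small in-order scheduler. This is a sketch under assumptions: the `Task` class and `makespan` function are invented here, each task starts as soon as its unit, its predecessors, and (for loads and stores) the bus are free, and tasks are placed in the order in which they are listed.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    unit: str                 # "P" (processor) or "A" (accelerator)
    duration: int
    uses_bus: bool = False    # loads and stores occupy the shared bus
    deps: list = field(default_factory=list)


def makespan(tasks) -> int:
    """Schedule tasks in list order and return the total execution time."""
    unit_free, finish, bus_free = {}, {}, 0
    for task in tasks:
        start = max([unit_free.get(task.unit, 0)] + [finish[d] for d in task.deps])
        if task.uses_bus:
            start = max(start, bus_free)  # P and A cannot use the bus together
        end = start + task.duration
        finish[task.name] = unit_free[task.unit] = end
        if task.uses_bus:
            bus_free = end
    return max(finish.values())


single = [
    Task("load_x", "P", 1, True), Task("load_y", "P", 1, True),
    Task("f", "P", 5, deps=["load_x"]), Task("g", "P", 5, deps=["load_y"]),
    Task("h", "P", 5, deps=["f", "g"]), Task("store_h", "P", 1, True, ["h"]),
]
solution_1 = [
    Task("load_x", "A", 1, True), Task("load_y", "A", 1, True),
    Task("f", "A", 2, deps=["load_x"]), Task("g", "A", 2, deps=["load_y"]),
    Task("store_f", "A", 1, True, ["f"]), Task("store_g", "A", 1, True, ["g"]),
    Task("load_f", "P", 1, True, ["store_f"]), Task("load_g", "P", 1, True, ["store_g"]),
    Task("h", "P", 5, deps=["load_f", "load_g"]), Task("store_h", "P", 1, True, ["h"]),
]
solution_2 = [
    Task("load_y", "P", 1, True), Task("load_x", "A", 1, True),
    Task("g", "P", 5, deps=["load_y"]), Task("f", "A", 2, deps=["load_x"]),
    Task("store_f", "A", 1, True, ["f"]), Task("load_f", "P", 1, True, ["store_f"]),
    Task("h", "P", 5, deps=["load_f"]), Task("store_h", "P", 1, True, ["h"]),
]

print(makespan(single), makespan(solution_1), makespan(solution_2))
```

In Solution II the accelerator's load, computation, and store all fit inside the 5 time units that P spends on g, which is where the saving over Solution I comes from.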
System integration and debugging
- Try to debug the CPU/accelerator interface separately from the accelerator core
- Build equipment to test the accelerator
- Hardware/software co-simulation can be useful
Summary
- The use of a hardware accelerator can lead to a more efficient solution
  - In particular when the parallelism in the functionality can be exploited
- Hardware/software co-design techniques can be used for the design of an accelerator
- You have to be aware of cache coherence problems if the processor or accelerator uses a cache
Motivation for Configurable Processor Cores
Observations:
- Time-to-market is critical
- Development time for software is much shorter than for hardware
- Hardware can be customized and has much better performance than a software solution
Why Configurable Processor Cores?
- Idea: combine the advantages of hardware and software in the form of a customizable processor to achieve
  - a clearly shorter time-to-market than hardware
  - clearly better performance than software
- Provide a processor platform with a basic architecture that can be extended by additional optimized units (MAC, floating-point unit)
- Own instructions together with own customized hardware can be defined for the processor
Example for a configurable processor: Xtensa (Tensilica)
The Xtensa processor core
- targets system-on-chip applications
- is configurable, extensible, and synthesizable
- has
  - a base instruction set architecture
  - configurable functions (parametrised)
  - optional functions
  - designer-defined functions and registers (for acceleration of specific algorithms)
Xtensa Processor Core
Basic Xtensa Core
- 32-bit architecture
- Base configuration:
  - 32-bit ALU
  - Up to 64 general-purpose registers
  - 6 special-purpose registers
  - 80 base instructions
  - Improved 16- and 24-bit RISC instruction encoding
Optional Architecture
- Execution units
  - Multipliers, 16 and 32 bits
  - MAC unit, floating-point unit
- Interface options
- Memory subsystem options
  - Memory management options
  - Local data and instruction caches
  - Separate RAM and ROM areas for data and instructions
Tensilica Extension Language
The Tensilica extension language is used to describe new instructions, registers, and execution units that are then automatically added to the Xtensa processor
Xtensa Processor: Design Process
Design Flow
1. Choose basic Xtensa processor
2. Specify algorithm in C
3. Compile to target processor
4. Profile and check if design constraints are met
5. If constraints are met, everything is fine, otherwise
6. Choose optional functions (e.g. multiplier) or design new instructions for the critical part => improved architecture
7. Adjust your code for the new architecture
8. Go back to 3.
Summary
- The Xtensa concept provides
  - not only a configurable architecture
  - but also a design methodology
- The idea is to take the best of both the hardware and the software world in order to have
  - good performance
  - short time-to-market
- Xtensa processors can be used as parts of a system-on-chip architecture
- Other extendable cores exist, like the Nios II from Altera