lecture 10 hardware accelerators ingo sander [email protected]

Lecture 10: Hardware Accelerators

Ingo Sander

[email protected]

Introduction: Hardware Accelerator

April 20, 2023, IL2206 Embedded Systems

Design Constraint Propagation

A design constraint on system level leads to new design constraints on subsystem level

Constraint on system: t < 500 ms

Constraints on subsystems:
- P1: t < 100 ms
- P2: t < 250 ms
- P3: t < 150 ms


Design Constraint Propagation

An estimation tool can give the execution time of a subsystem

What happens, if a subsystem is too slow?

Constraint on system: t < 500 ms

Subsystem   Constraint    Execution time
P1          t < 100 ms    95 ms
P2          t < 250 ms    280 ms   (Too slow!)
P3          t < 150 ms    145 ms
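The check on this slide can be sketched in a few lines of Python. This is only an illustration: the numbers are the ones from the slide, the names `budgets_ms`, `estimates_ms` and `violations` are made up for this sketch, and summing the estimates assumes that P1, P2 and P3 execute one after another.

```python
# Budgets and estimated execution times in ms, taken from the slide.
budgets_ms = {"P1": 100, "P2": 250, "P3": 150}
estimates_ms = {"P1": 95, "P2": 280, "P3": 145}
SYSTEM_BUDGET_MS = 500


def violations(budgets, estimates):
    """Return the subsystems whose estimate does not satisfy t < budget."""
    return [name for name in budgets if estimates[name] >= budgets[name]]


too_slow = violations(budgets_ms, estimates_ms)

# Assumption: the subsystems run sequentially, so their times add up.
total_ms = sum(estimates_ms.values())

print("Too slow:", too_slow)
print("System time:", total_ms, "ms, constraint met:", total_ms < SYSTEM_BUDGET_MS)
```

Under that assumption, the single violation in P2 is enough to break the system-level constraint as well.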


How to improve the performance of a microprocessor system?

- Improve your code
- Choose a faster version of your microprocessor
- Add additional computational units that perform special functions
  - Standard component (graphics processor)
  - Coprocessor (floating-point processor)
  - Additional microprocessor
  - Hardware accelerator


Hardware Accelerators

If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator!

The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.

© 2000 Wolf (Morgan Kaufmann)


Accelerated System Architecture

[Figure: CPU, accelerator, memory, and I/O. Numbered steps: 1. Request, 2. Data, 3. Result]

Request and Result may also require access to memory



An Accelerator is not a Co-Processor

A co-processor is connected to the CPU and executes special instructions. Instructions are dispatched by the CPU.

An accelerator appears as a device on the bus



Amdahl’s Law

Amdahl’s law states that the performance improvement of an improved unit is limited by the fraction of time the unit is in use!

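\[
\mathrm{Speedup}
= \frac{\mathrm{ExecutionTime}_{\mathrm{Old}}}{\mathrm{ExecutionTime}_{\mathrm{Enhanced}}}
= \frac{1}{(1 - \mathrm{Fraction}_{\mathrm{Enhanced}}) + \dfrac{\mathrm{Fraction}_{\mathrm{Enhanced}}}{\mathrm{Speedup}_{\mathrm{Enhanced}}}}
\]

Fraction_Enhanced denotes the fraction of the original execution time in which the enhancement can be used!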


Example (Hennessy & Patterson)

An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Is it better to
- implement a square root unit that speeds up this operation by a factor of 10, or to
- improve the floating-point instructions in general so that they run 2 times faster?


Example (Hennessy & Patterson)

Square Root: Speedup = 1 / ((1-0.2)+0.2/10) = 1/0.82 = 1.22

Floating-Point: Speedup = 1 / ((1-0.5)+0.5/2) = 1/0.75 = 1.33


Amdahl’s Law: Lessons to be learned

The maximum speedup that is possible is limited by the fraction!

Assume infinite speedup: Speedup = 1 / ((1-F)+F/Infinity) = 1/(1-F)

Fraction F      0.1    0.3    0.5    0.9
Max. Speedup    1.11   1.43   2      10

Improve the common cases!
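The square-root example and the maximum-speedup table can be reproduced with a small Python sketch. The function name `amdahl_speedup` is chosen here for illustration only; the formula is the one stated for Amdahl's law in this lecture.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the time is sped up by a given factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hennessy & Patterson example from the slides.
sqrt_unit = amdahl_speedup(0.2, 10)   # dedicated square root unit
fp_general = amdahl_speedup(0.5, 2)   # faster floating-point in general
print(f"square root unit: {sqrt_unit:.2f}, floating point: {fp_general:.2f}")

# Upper bound: an infinitely fast enhancement leaves 1 / (1 - F).
for fraction in (0.1, 0.3, 0.5, 0.9):
    bound = amdahl_speedup(fraction, float("inf"))
    print(f"F = {fraction}: max. speedup = {bound:.2f}")
```

Even though the square root unit is five times faster than the general floating-point improvement, the latter wins because it applies to a larger fraction of the execution time.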


Amdahl’s Law for Parallel Architectures

Amdahl’s law can even be used for parallel architectures, where sequential code is parallelized and runs on identical parallel units!

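\[
\mathrm{Speedup}
= \frac{\mathrm{ExecutionTime}_{\mathrm{Serial}}}{\mathrm{ExecutionTime}_{\mathrm{Parallel}}}
= \frac{1}{(1 - \mathrm{Fraction}_{\mathrm{Parallel}}) + \dfrac{\mathrm{Fraction}_{\mathrm{Parallel}}}{\mathrm{ParallelUnits}}}
\]

Fraction_Parallel denotes the fraction of the code that can be parallelized!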
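As a rough illustration of the parallel form, the following Python sketch evaluates the speedup for a growing number of identical units. The parallel fraction of 0.9 and the unit counts are made-up values, not figures from the lecture.

```python
def parallel_speedup(fraction_parallel, parallel_units):
    """Amdahl's law for code of which a given fraction runs on identical parallel units."""
    return 1.0 / ((1.0 - fraction_parallel) + fraction_parallel / parallel_units)


# Hypothetical program: 90% of the code can be parallelized.
for units in (2, 4, 8, 16):
    print(f"{units:2d} units: speedup = {parallel_speedup(0.9, units):.2f}")
```

The speedup grows more and more slowly with the number of units, because the serial 10% of the code is not accelerated at all.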

Design: Hardware Accelerator


Design of a hardware accelerator

Which functions shall be implemented in hardware and which functions in software?

Hardware/software co-design: joint design of hardware and software architectures

The hardware accelerator can be implemented as an
- application-specific integrated circuit (ASIC), or in a
- field-programmable gate array (FPGA).


Hardware Software Co-Design

[Figure: co-design flow. The system model (original program, concurrent processes) is the input to partitioning and mapping, which is supported by an estimation library. The SW model (C/C++) goes through verification and SW compilation to an executable program. The HW model (VHDL) goes through verification and HW synthesis to a netlist.]

Which functions shall go to HW and SW?

Good estimates are needed for good partitioning


Hardware/Software Co-Design

Hardware/software co-design covers the following problems:
- Co-specification: the creation of specifications that describe both the hardware and software of a system
- Co-synthesis: the automatic or semi-automatic design of hardware and software to meet a specification
- Co-simulation: the simultaneous simulation of hardware and software elements on different levels of abstraction


Co-Synthesis

Four tasks are included in co-synthesis:
- Partitioning: The functionality of the system is divided into smaller, interacting computation units
- Allocation: The decision which computational resources are used to implement the functionality of the system
- Scheduling: If several system functions have to share the same resource, the usage of the resource must be scheduled in time
- Mapping: The selection of a particular allocated computational unit for each computation unit

All these tasks depend on each other!


Partitioning

- During partitioning, the functionality of the system is partitioned into several parts (corresponding to the allocated/available components)
- Many possible partitions exist
- Analysis is done by evaluating the costs of different partitions

[Figure: two different partitions of the same graph with nodes A, B, C, D, and E]


Estimation

In order to get a good partitioning, there is a need for good figures about the performance of a function on different components:
- execution time
- communication time


Estimation: Accuracy and Fidelity

The accuracy of an estimate is a measure of how close the estimate is to the actual value on the real implementation

The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations


Fidelity

Though accuracy is much higher in (2) than in (1), the estimates are not very useful for the partitioning process because of the low fidelity!

This can cause bad design decisions!

[Figure: two plots of a quality metric for the designs A, B, and C, each showing estimate and measurement. Plot 1: Fidelity = 100%. Plot 2: Fidelity = 33% (only A > C correct).]
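The definition of fidelity as the percentage of correctly predicted pairwise comparisons can be sketched in Python as follows. All numbers are invented for illustration and are not read off the figure: `rough_estimate` is far from the measured values but ranks the designs correctly, while `close_estimate` is near the measured values but gets two of the three comparisons wrong.

```python
from itertools import combinations


def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fidelity(estimates, measurements):
    """Percentage of design pairs whose ordering is predicted correctly."""
    pairs = list(combinations(estimates, 2))
    correct = sum(
        1
        for a, b in pairs
        if sign(estimates[a] - estimates[b]) == sign(measurements[a] - measurements[b])
    )
    return 100.0 * correct / len(pairs)


# Hypothetical quality metric values for three designs.
measured = {"A": 10.0, "B": 10.4, "C": 9.8}
rough_estimate = {"A": 6.0, "B": 6.5, "C": 5.5}    # low accuracy, correct ranking
close_estimate = {"A": 10.1, "B": 9.7, "C": 9.9}   # high accuracy, wrong ranking

print(f"rough estimate: fidelity = {fidelity(rough_estimate, measured):.0f}%")
print(f"close estimate: fidelity = {fidelity(close_estimate, measured):.0f}%")
```

For partitioning, the ranking of alternatives matters more than the absolute error, which is why the rough but order-preserving estimate is the more useful one here.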


Hardware/Software Co-Design

Strategies:

1. Start with an "all-software" configuration
   While (constraints are not satisfied)
       Move the SW function that gives the best improvement to HW
   (implemented in COSYMA [Ernst, Henkel, Benner 1993])

2. Start with an "all-hardware" configuration
   While (constraints are satisfied)
       Move the most costly HW component to SW
   (implemented in Vulcan [Gupta, De Micheli 1995])
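The first strategy can be sketched as a simple greedy loop in Python. This is a minimal illustration of the idea only, not the COSYMA algorithm: the function names, the execution times and the deadline are hypothetical, the only constraint is a deadline on the summed execution time, and communication overhead and hardware cost are ignored.

```python
def partition_software_first(functions, deadline):
    """Greedy partitioning that starts with everything in software.

    functions maps a name to (sw_time, hw_time). Returns the set of
    functions moved to hardware and the resulting total execution time.
    """
    in_hw = set()

    def total_time():
        return sum(hw if name in in_hw else sw for name, (sw, hw) in functions.items())

    while total_time() > deadline:
        candidates = [name for name in functions if name not in in_hw]
        if not candidates:
            raise ValueError("deadline cannot be met even with all functions in HW")
        # Move the function with the largest time saving to hardware.
        best = max(candidates, key=lambda name: functions[name][0] - functions[name][1])
        in_hw.add(best)

    return in_hw, total_time()


# Hypothetical (software time, hardware time) pairs in ms.
functions = {"f1": (95, 40), "f2": (280, 120), "f3": (145, 100)}
hw_functions, time_ms = partition_software_first(functions, deadline=500)
print(sorted(hw_functions), time_ms)
```

The second strategy works the other way around: it would start with all functions in `in_hw` and move the most costly one back to software for as long as the deadline is still met.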


Papers on HW/SW Co-Design

R. Ernst et al. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers. December 1993.

R. K. Gupta and G. de Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers. December 1993.

G. de Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE. March 1997.

… (and much much more)

Electronic versions of these and other papers can be accessed by the KTH Library (www.lib.kth.se)


System design tasks

- Design a heterogeneous multiprocessor architecture.
  - Processing element (PE): CPU, accelerator, etc.
- Divide tasks among processing elements
- Verify that
  - Functionality of the system is correct
  - System meets the performance constraints


Why accelerators?

- Better cost/performance.
  - Custom logic may be able to perform an operation faster than a CPU of equivalent cost.
  - CPU cost is a non-linear function of performance.
- Improving performance by choosing a faster CPU may be very expensive!

[Figure: plot of cost versus performance]



Accelerated system design

- First, determine that the system really needs to be accelerated.
  - Which core function(s) shall be accelerated? (Partitioning)
  - How much faster is the accelerator on the core function?
  - How much is the data transfer overhead?
- Design tasks
  - performance analysis;
  - scheduling and allocation.
- Design the accelerator itself.
- Design CPU interface to accelerator.



Performance analysis

Critical parameter is speedup: how much faster is the system with the accelerator?

Must take into account:
- Accelerator execution time.
- Data transfer time.
- Synchronization with the master CPU.
  - The accelerator needs to know when it can start its computation
  - The CPU needs to know when the results are ready


Single- vs. multi-threaded

- One critical factor is available parallelism:
  - single-threaded/blocking: CPU waits for accelerator;
  - multithreaded/non-blocking: CPU continues to execute along with accelerator.
- To multithread, CPU must have useful work to do.
  - But software must also support multithreading.



Sources of parallelism

- Overlap I/O and accelerator computation.
  - Perform operations in batches, read in second batch of data while computing on first batch.
- Find other work to do on the CPU.
  - May reschedule operations to move work after accelerator initiation.



Total execution time

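[Figure: process graphs with the CPU processes P1 to P4 and the accelerator process A1. Single-threaded: the processes form one sequential chain. Multi-threaded: the graph splits so that CPU processes and A1 execute concurrently, and joins again afterwards.]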



Communication Overhead: Data input/output times

Bus transactions include:
- flushing register/cache values to main memory;
- time required for CPU to set up transaction;
- overhead of data transfers by bus packets, handshaking, etc.



Accelerator execution time

Total accelerator execution time:

\[
t_{accel} = t_{in} + t_{x} + t_{out}
\]

where t_in is the data input time, t_x the time for the accelerated computation, and t_out the data output time.



Execution time analysis

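- Single-threaded: Count execution time of all component processes.
- Multi-threaded: Find longest path through execution.

[Figure: two timelines with one row for the CPU (P1 to P4) and one row for the accelerator (A1). Single-threaded: all processes execute one after another; A1 consists of the communication overhead t_in and t_out around the computation t_x. Multi-threaded: A1 overlaps with CPU processes, which shortens the execution time.]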
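Both analyses can be sketched in Python. The process graph and all durations below are hypothetical and only loosely modelled on the figure: P1 is followed by a split into P2, P3 on the CPU and A1 on the accelerator, which join again in P4. The longest-path calculation assumes that processes on different branches never compete for the same resource.

```python
from functools import lru_cache

# Hypothetical durations in time units; A1 = t_in + t_x + t_out.
T_IN, T_X, T_OUT = 1, 2, 1
durations = {"P1": 2, "P2": 3, "P3": 2, "P4": 1, "A1": T_IN + T_X + T_OUT}

# Hypothetical process graph: P1 splits into (P2 -> P3) and A1, joined by P4.
successors = {"P1": ["P2", "A1"], "P2": ["P3"], "P3": ["P4"], "A1": ["P4"], "P4": []}


def single_threaded_time(durations):
    """Blocking execution: every process runs after the previous one."""
    return sum(durations.values())


def multi_threaded_time(durations, successors):
    """Non-blocking execution: length of the longest path through the graph."""

    @lru_cache(maxsize=None)
    def longest_from(node):
        tail = max((longest_from(nxt) for nxt in successors[node]), default=0)
        return durations[node] + tail

    return max(longest_from(node) for node in durations)


print("single-threaded:", single_threaded_time(durations))
print("multi-threaded: ", multi_threaded_time(durations, successors))
```

In this made-up example the accelerator branch (4 time units) is hidden behind the CPU branch P2, P3 (5 time units), so the accelerator time does not appear in the multi-threaded total at all.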


Example for Accelerator Architecture

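[Figure: CPU, memory (Mem), and DMA connected by a bus. The accelerator is attached to the bus and consists of a bus interface, a read unit, a write unit, registers, and the core.]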



Accelerator/CPU interface

Accelerator registers provide control registers for CPU.

Data registers can be used for small data objects.

Accelerator may include special-purpose read/write logic. Especially valuable for large data transfers.



Caching problems

Main memory provides the primary data transfer mechanism to the accelerator.

Programs must ensure that caching does not invalidate main memory data (Assume a cache in CPU).



Possible Problems with Caches

1. CPU reads location S.

2. Accelerator writes location S.

3. CPU reads location S.

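[Figure: CPU with a cache holding S, memory, and accelerator. The arrows 1, 2, and 3 correspond to the steps above; in step 3 the CPU gets the wrong value!]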
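The three steps can be replayed with a toy model in Python. This is a deliberately simplified sketch: the class `CachedCPU`, the dictionary used as main memory and the values stored in S are all invented for illustration, and the explicit `invalidate` call is just one assumed way in which software could avoid the stale read.

```python
class CachedCPU:
    """Toy CPU with a cache that is not kept coherent with main memory."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}

    def read(self, address):
        # On a miss, fetch from main memory; on a hit, use the cached copy.
        if address not in self.cache:
            self.cache[address] = self.memory[address]
        return self.cache[address]

    def invalidate(self, address):
        # Drop the cached copy so that the next read goes to main memory.
        self.cache.pop(address, None)


memory = {"S": 1}
cpu = CachedCPU(memory)

first = cpu.read("S")    # 1. CPU reads location S (now cached)
memory["S"] = 42         # 2. Accelerator writes location S in main memory
stale = cpu.read("S")    # 3. CPU reads location S and hits the old cached copy

cpu.invalidate("S")      # assumed software remedy, not part of the slide
fresh = cpu.read("S")

print(first, stale, fresh)
```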



Cache Coherence Problem

- Cache coherence problems also appear in multiprocessor systems
- Cache and main memory do not have the same contents
- The Avalon bus, like most on-chip buses, does not have a built-in mechanism to avoid these problems

[Figure: processors P1 to Pn, each with its own cache, connected via a bus to main memory]


Cache Coherence with Write-Through Caches

How to tackle cache coherence?
- Idea: Caches must be aware of the transactions on the bus!
- Add extra hardware and define a protocol to be able to detect invalid data in the caches
- Take actions if cache or memory (in case of write-back caches) is invalid

[Figure: processors P1 to Pn, each with its own cache, connected via a bus to main memory. Besides the cache-memory transitions, each cache observes the bus (bus snooping); a cache coherence protocol switches cache lines between the states V (valid) and I (invalid).]

More about cache coherence protocols in IL2207 SoC Architectures

What to do if no cache coherence protocol exists?

- Designer has to be aware of possible cache coherence problems
- Disciplined programming is needed
- Use commands to explicitly bypass the cache if there is a risk of a cache coherence problem



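Example: Accelerator

Data-flow graph: x is the input of f, y is the input of g, and h combines both results: h(f(x),g(y))

Architecture: processor P, accelerator A, and memory M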


Execution Times

Both P and A have sufficient registers

P and A cannot access the bus simultaneously

A memory access (load or store) takes 1 time unit

Execution times in time units:

Function   P   A
f          5   2
g          5   2
h          5   -


Single-Processor Solution

Data-flow graph: h(f(x),g(y)), with f, g, and h all mapped to P

Operation      Time
Load x         1
Load y         1
f              5
g              5
h              5
Store h(...)   1
Total          18


Processor-Accelerator Solution I

Data-flow graph: h(f(x),g(y)), with f and g mapped to A and h mapped to P

Operation   Unit   Time
Load x      A      1
Load y      A      1
f           A      2
g           A      2
Store f     A      1
Store g     A      1
Load f      P      1
Load g      P      1
h           P      5
Store h     P      1
Total              16

Still single-threaded!


Processor-Accelerator Solution II

Data-flow graph: h(f(x),g(y)), with f mapped to A and g and h mapped to P

P                A
Load y    1
g         5      Load x    1
                 f         2
                 Store f   1
Load f    1
h         5
Store h   1

(The operations of A execute while P computes g.)

Total 13

Exploitation of parallelism leads to fast solution!
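The three totals can be recomputed with a short Python sketch from the execution-time table (1 time unit per memory access, 5 units per function on P, 2 units for f and g on A). The variable names are chosen for this illustration; for Solution II the sketch assumes, as in the schedule above, that all bus accesses of A fall into the time in which P computes g, so that the bus is never requested by both units at once.

```python
MEM = 1  # one load or store takes 1 time unit
TIME = {"P": {"f": 5, "g": 5, "h": 5}, "A": {"f": 2, "g": 2}}

# Single-processor solution: everything runs on P, strictly in sequence.
single = 2 * MEM + TIME["P"]["f"] + TIME["P"]["g"] + TIME["P"]["h"] + MEM

# Solution I: A computes f and g, P computes h, still one step after another.
solution_1 = (
    2 * MEM                          # A loads x and y
    + TIME["A"]["f"] + TIME["A"]["g"]
    + 2 * MEM                        # A stores f and g
    + 2 * MEM                        # P loads f and g
    + TIME["P"]["h"]
    + MEM                            # P stores h
)

# Solution II: after P has loaded y, P computes g while A handles f.
p_branch = TIME["P"]["g"]
a_branch = MEM + TIME["A"]["f"] + MEM  # load x, f, store f
solution_2 = (
    MEM                              # P loads y
    + max(p_branch, a_branch)        # parallel part, limited by the slower branch
    + MEM                            # P loads f
    + TIME["P"]["h"]
    + MEM                            # P stores h
)

print(single, solution_1, solution_2)
```

The accelerator branch takes 4 time units and is completely hidden behind the 5 time units that P needs for g, which is where the gain of Solution II over Solution I comes from.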


System integration and debugging

Try to debug the CPU/accelerator interface separately from the accelerator core.

Build equipment to test the accelerator.

Hardware/software co-simulation can be useful.



Summary

- The use of a hardware accelerator can lead to a more efficient solution
  - In particular when the parallelism in the functionality can be exploited
- Hardware/software co-design techniques can be used for the design of an accelerator
- You have to be aware of cache coherence problems if the processor or accelerator uses a cache

Configurable Processor Cores

Ingo Sander

[email protected]


Motivation for Configurable Processor Cores

Observations
- Time-to-market is critical
- Development time for software is much shorter than for hardware
- Hardware can be customized and has much better performance than a software solution


Why Configurable Processor Cores?

Idea
- Combine the advantages of hardware and software in the form of a customizable processor to achieve
  - Clearly shorter time-to-market than hardware
  - Clearly better performance than software
- Provide a processor platform with a basic architecture that can be extended by additional optimized units (MAC, floating-point unit)
- Own instructions together with own customized hardware can be defined for the processor


Example for a configurable processor: Xtensa (Tensilica)

The Xtensa processor core
- targets system-on-chip applications
- is configurable, extensible and synthesizable
- has
  - Base instruction set architecture
  - Configurable functions (parametrised)
  - Optional functions
  - Designer-defined functions and registers (for acceleration of specific algorithms)


Xtensa Processor Core


Basic Xtensa Core

- 32-bit architecture
- Base configuration:
  - 32-bit ALU
  - Up to 64 general purpose registers
  - 6 special purpose registers
  - 80 base instructions
  - Improved 16- and 24-bit RISC instruction encoding


Optional Architecture

- Execution units
  - Multipliers, 16 and 32 bits
  - MAC unit, floating-point unit
- Interface options
- Memory subsystem options
- Memory management options
- Local data and instruction caches
- Separate RAM, ROM areas for data and instruction


Tensilica Extension Language

The Tensilica extension language is used to describe new instructions, registers and execution units that are then automatically added to the Xtensa processor


Xtensa Processor: Design Process


Design Flow

1. Choose basic Xtensa processor
2. Specify algorithm in C
3. Compile to target processor
4. Profile and check if design constraints are met
5. If constraints are met, everything is fine, otherwise
6. Choose optional functions (e.g. multiplier) or design new instructions for the critical part => improved architecture
7. Adjust your code for the new architecture
8. Go back to 3.


Summary

- The Xtensa concept provides
  - Not only a configurable architecture
  - But also a design methodology
- The idea is to take the best of both the hardware and the software world in order to
  - Have good performance
  - Have a short time-to-market

Xtensa processors can be used as parts of a system-on-chip architecture

Other extensible cores exist, such as the Nios II from Altera
