Post on 06-Oct-2020
1
1UP Hard-Core Projects
Dirk Koch (dirk.koch@manchester.ac.uk)
2
[Figure: design flow. System specification (communication architecture & floorplan) → generate static system / generate module repository → bitlink module.bit -pos X,Y static.bit -outfile initial.bit. Example modules: join, sort, mean. Static design: PCIe, memory, filesystem, management, reconfiguration.]
Let me introduce myself
Electrical Engineering
(automation technology)
Neural Networks
(autonomous systems)
Self-adaptive
fault-tolerant
distributed
reconfigurable
embedded systems
FPGA design tools for
run-time reconfiguration
Parallel systems,
FPGA virtualization
High performance
stream processing
Before university projects:
6502 Computer,
CNC machine, ...
Since September:
Lecturer in the
APT Group
3
Hardcore Projects: Goals and Content
• Fun
• Creativity
• Build hardware-software systems (machine vision, robotics, computing, ...)
• Compute without processors (there is more than CPUs and GPUs)
• Get access to hardware (FPGA boards)
• Get support for your own projects, or
• Learn hardware design through prepared projects
https://sites.google.com/site/1uphardcore/
4
What is an FPGA?
• An FPGA (Field Programmable Gate Array) is a digital chip whose function is defined after production (“in the field”)
• Useful for small volumes: one FPGA fits many different designs
  → reduces initial mask set cost (several million USD)
  Example AIBO: generations 1, 2 and 3: ~60K units each
• Updatable hardware (a requirement for telecommunication equipment & prototyping)
CPU, DSP, hardware accelerator, microcontroller, peripheral, network processor:
FPGA technology can implement all of them!
5
What does an FPGA look like?
Logic, arithmetic and memory blocks + interconnection network
[Figure: FPGA fabric with labelled blocks]
• 18x25 bit multiplier
• 72x512 dual-ported memory (4 KB + ECC)
• logic cell (4 x 6-input look-up table, LUT)
• switch matrix (~200 multiplexers)
• wires
6
What does an FPGA look like?
Big FPGAs:
• > 1M LUTs
• > 2000 multipliers
• > 2000 memories
• billions of transistors
7
What does an FPGA look like?
Altera Stratix IV
I/O (Gigabit transceivers), Logic, Memory & Multipliers (DSP)
8
Look-up tables (LUTs) as the basic building block

Index  A3 A2 A1 A0   LUT value (OR gate)   LUT value (AND gate)
0      0  0  0  0    0                     0
1      0  0  0  1    1                     0
2      0  0  1  0    1                     0
3      0  0  1  1    1                     0
4      0  1  0  0    1                     0
5      0  1  0  1    1                     0
6      0  1  1  0    1                     0
7      0  1  1  1    1                     0
8      1  0  0  0    1                     0
9      1  0  0  1    1                     0
A      1  0  1  0    1                     0
B      1  0  1  1    1                     0
C      1  1  0  0    1                     0
D      1  1  0  1    1                     0
E      1  1  1  0    1                     0
F      1  1  1  1    1                     1

How does it work?
[Figure: the inputs A0 ... A3 address one of the 16 stored LUT values (part of the configuration data), which becomes the output F; a slice flip-flop (FF) after the LUT can store state data.]
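The LUT idea above can be sketched in a few lines of Python. This is only a software model under simple assumptions: the names `make_lut` and `lut_read` are made up for illustration, and the 16-entry list stands in for the configuration bits of one 4-input LUT.

```python
def make_lut(func, n_inputs=4):
    """Precompute the LUT contents (the 'configuration data') for a Boolean function."""
    table = []
    for index in range(2 ** n_inputs):
        # bits[0] is A0, bits[3] is A3
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(int(func(bits)))
    return table


def lut_read(table, a3, a2, a1, a0):
    """The inputs form an address; the stored bit at that address is the output."""
    index = (a3 << 3) | (a2 << 2) | (a1 << 1) | a0
    return table[index]


# Hypothetical configurations for the two gates shown on the slide.
AND_LUT = make_lut(all)
OR_LUT = make_lut(any)
```

Changing the gate means changing only the stored table, not the read logic, which is the point of the slide: the same physical cell implements any 4-input function.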
9
FPGAs versus CPUs
• Slow (~300 MHz), but highly parallel execution (>1000 operations at once)
• Moderate I/O throughput, but >1 MB of on-chip memory at >1 TB/s
• Difficult VHDL programming, but C++-based design is coming up
Data flow oriented vs. control flow dominated:
  data flow oriented:
    for i=1 to ∞ loop
      numbercrunching;
  control flow dominated:
    if (old_position) then
      case position is
        42: if free then
10
FPGAs versus CPUs
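for i = 0 to 7 do {
  tmp = A[i] & x"F";
  tmp = tmp + 42;
  Q[i] = tmp * 24;
}

von Neumann computer (instruction stream and data stream):
      LDI  reg_i,0
  L1: ANDI r_tmp,$i,xF
      ...
      BLI  reg_i,L1

Reconfigurable computing (loop unrolling and pipelining):
  A[0][3..0] → + 42 → * 24 → Q[0]
  A[1][3..0] → + 42 → * 24 → Q[1]
  ...
  A[7][3..0] → + 42 → * 24 → Q[7]

[Figure: the von Neumann machine processes A[0] ... A[7] one after another through a single datapath; the reconfigurable version instantiates one pipelined datapath per array element.]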
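The loop from this slide can be written out in Python to make the two views concrete. This is a software sketch only; the function names are invented here, and the list comprehension merely models eight independent hardware datapaths, it does not run in parallel.

```python
def kernel(a):
    """One datapath: keep the low 4 bits, add 42, multiply by 24."""
    return ((a & 0xF) + 42) * 24


def run_sequential(A):
    """von Neumann view: one instruction stream walks over the data."""
    Q = [0] * len(A)
    for i in range(len(A)):
        tmp = A[i] & 0xF
        tmp = tmp + 42
        Q[i] = tmp * 24
    return Q


def run_unrolled(A):
    """Unrolled view: each element gets its own copy of the datapath.
    In hardware these copies would all operate at the same time."""
    return [kernel(a) for a in A]
```

Both functions return the same values; the difference on an FPGA is that the unrolled form needs no instruction fetch and no loop counter.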
12
FPGAs versus CPUs
• Data parallel processing on CPUs:
  → use SIMD and/or multiple threads for different index points (e.g. GPUs)
  “1 worker builds a complete car – multiple workers operate in parallel on different cars”
• Data parallel processing on FPGAs:
  → build a processing pipeline
  “car assembly line with various specialized workers – multiple cars are in the production line at different processing stages”
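The assembly-line picture can be checked with a small cycle-by-cycle model. This is a sketch under stated assumptions: `pipeline_run` is a made-up helper, every stage takes exactly one cycle, and stage results are never `None` (which is used here to mark an empty pipeline register).

```python
def pipeline_run(items, stages):
    """Model a pipeline with one register per stage; all stages work in every cycle."""
    items = list(items)
    regs = [None] * len(stages)      # regs[k] holds the result of stage k
    out, cycles, fed = [], 0, 0
    while len(out) < len(items):
        # Advance from the back so each stage consumes last cycle's value.
        for k in range(len(stages) - 1, 0, -1):
            prev = regs[k - 1]
            regs[k] = stages[k](prev) if prev is not None else None
        if fed < len(items):
            regs[0] = stages[0](items[fed])
            fed += 1
        else:
            regs[0] = None
        if regs[-1] is not None:
            out.append(regs[-1])
        cycles += 1
    return out, cycles


# The three operations of the example loop as pipeline stages.
stages = [lambda a: a & 0xF, lambda t: t + 42, lambda t: t * 24]
results, cycles = pipeline_run(range(8), stages)
```

With 3 stages and 8 items the model needs 3 + 8 - 1 = 10 cycles, whereas a single worker doing all three steps per item would need 8 x 3 = 24 steps. After the pipeline has filled, one result leaves per cycle.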
13
FPGAs versus CPUs
[Figure: the same operation sequence (LD, MUL, ADD, ROR, ACC, ST) applied to the data items a, b, c, k, l, x, y under three execution models]
• Scalar processing
• SIMD processing (e.g. GPU) / multithreading
• Dataflow processing / pipelining (FPGAs): DMA-fed chain of LD, MUL, ADD, ROR, ACC (accumulator), ST
Pipelining is often better for global states
14
How does it help?
Example: skin color detection
Algorithm: [figure not reproduced]
Message:
• >20 instr. per pixel
• >1 GOPS @ 50 MHz pixel clock
• >2 GOPS @ 100 MHz pixel clock
• >>200 GOPS device potential (Xilinx Spartan-6 LX45, ~50 USD)
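The GOPS figures follow directly from the instruction count and the pixel clock. A minimal sketch of that arithmetic, assuming one pixel arrives per pixel-clock cycle (the helper name is made up):

```python
def required_gops(ops_per_pixel, pixel_clock_hz):
    """Operations per second needed to keep up with the pixel stream, in GOPS."""
    return ops_per_pixel * pixel_clock_hz / 1e9


gops_50mhz = required_gops(20, 50e6)    # 20 ops/pixel at 50 MHz
gops_100mhz = required_gops(20, 100e6)  # 20 ops/pixel at 100 MHz
```

The slide quotes these as lower bounds (">") because the algorithm needs more than 20 instructions per pixel.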
15
How does it help?
16
Where is the Problem?
Gap between reconfigurable FPGAs and dedicated ASICs
FPGA versus ASIC @ 90 nm process*:
  ~ 18x larger chip area
  ~ 14x more dynamic power
  ~ 3-5x slower clock speed
• Note that the gap towards a programmable von Neumann machine could be even orders of magnitude higher!
• Also: programming / debugging is sometimes harder
Solution: partial run-time reconfiguration (PR): reusing the resources of a reconfigurable architecture by multiple modules over time. Only parts of a system might be updated while the remaining system continues operating.
*Kuon & Rose: Measuring the Gap Between FPGAs and ASICs, IEEE Trans. on CAD, 2007.
17
What is partial reconfiguration?
Changing modules during operation while
keeping the rest of the system active
(Like plugging a new network card into
your PC while the machine is working)
18
Example: Large Problem Sorting
• Large problem sorting arises in particular in databases
• Problems in the GB range (→ multiple runs)
• Many famous algorithms:
  • Bubble sort: O(n^2)
  • Quicksort: O(n log n)
  • Merge sort: O(n log n)
• What about taking log n sorting cells in order to get linear time complexity O(n)?
19
Sorting: Algorithm Discussion
FIFO-based Merge Sorter:
• Suitable for large problems: memory bound, but not logic bound
• Difficult to pipeline
[Figure: three merge sorter arrangements a), b), c), built from comparator (>) cells connected by FIFOs]
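The "log n sorting cells" idea can be sketched as a cascade of merger stages, where stage k turns sorted runs of length 2^k into runs of length 2^(k+1). This is a software model only: the function names are invented, the stages run one after another here, and the linear-time claim relies on the hardware stages all streaming concurrently.

```python
from collections import deque


def merge_fifos(a, b):
    """Merger cell: repeatedly emit the smaller head of two sorted FIFOs."""
    out = []
    while a and b:
        out.append(a.popleft() if a[0] <= b[0] else b.popleft())
    out.extend(a)   # one FIFO is empty; drain the other
    out.extend(b)
    return out


def merge_stage(stream, run_len):
    """Merge adjacent sorted runs of length run_len into runs of 2 * run_len."""
    out = []
    for start in range(0, len(stream), 2 * run_len):
        a = deque(stream[start:start + run_len])
        b = deque(stream[start + run_len:start + 2 * run_len])
        out.extend(merge_fifos(a, b))
    return out


def cascade_sort(data):
    """Pass the stream through as many stages as needed; returns (sorted, stage count)."""
    stream, run_len, n_stages = list(data), 1, 0
    while run_len < len(stream):
        stream = merge_stage(stream, run_len)
        run_len *= 2
        n_stages += 1
    return stream, n_stages
```

For 8 elements the cascade uses 3 stages, i.e. log2(8). Each element passes through each stage once, so a hardware cascade that accepts one element per cycle finishes in a number of cycles proportional to n plus the fill latency. The later stages need FIFOs as long as half the problem, which is why the slide calls the design memory bound.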
20
Partial Reconfiguration and Sorting
[Figure: sorting system on an FPGA. Labels: PCIe 8x, 2 GB/s, FPGA, memory controller, DDR3, prefetcher (max burst size, max latency), merger inputs A B C D, sorted output, context switching between the initial step and the final step, MEM, unsorted stream.]
input: unsorted data stream
initial step: fully sorted sequences
[intermediate steps]: merging
final step: merge and emit result
50% area saving or 4 times larger problems as compared to a static design
• Next step: hierarchical reconfiguration: swap comparator cells for different data types (integer, text, …)
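The three steps listed above can be sketched in software as run generation followed by merge passes. Several things here are assumptions not stated on the slide: the run length, the fan-in of 4 (suggested only by the labels A to D in the figure), and the use of Python's `heapq.merge` in place of the hardware merger. The reconfiguration between steps is not modelled.

```python
import heapq


def initial_step(stream, run_len):
    """Initial step: cut the unsorted stream into fully sorted sequences."""
    return [sorted(stream[i:i + run_len]) for i in range(0, len(stream), run_len)]


def merge_pass(runs, fan_in):
    """One merging step: combine groups of fan_in sorted runs into longer runs."""
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]


def sort_with_passes(stream, run_len=4, fan_in=4):
    """Repeat merge passes until one run remains; fan_in must be at least 2."""
    runs = initial_step(list(stream), run_len)
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, fan_in)
        passes += 1
    return (runs[0] if runs else []), passes


data = list(range(63, -1, -1))          # 64 values in descending order
result, passes = sort_with_passes(data)
```

With 64 values, runs of 4 and a fan-in of 4, the data goes 16 runs → 4 runs → 1 run, so two merge passes follow the initial step. Between passes the intermediate runs would sit in external memory, which is where the prefetcher in the figure comes in.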
21
1UP Hard-Core Projects
https://sites.google.com/site/1uphardcore/
22
Build your own CPU P1
RTL Example of a baseline MIPS CPU
23
… and customize it
• For example with reconfigurable custom instructions
[Figure: CPU datapath with register file, operands OP_A and OP_B, instruction, instruction configuration and result]
• Examples:

instruction       slices     slots  bitstream  latency (max/av)
64-bit XOR gate   19 (40%)   1      2.64 KB    7.04 / 5.95 ns
CCITT CRC         33 (34%)   2      5.28 KB    5.32 / 3.98 ns
sat. add/sub      70 (73%)   2      5.28 KB    9.89 / 7.81 ns
barrel shifter    90 (94%)   2      5.28 KB    11.07 / 7.88 ns
'1'-bit counter   214 (89%)  5      13.2 KB    11.37 / 8.25 ns
mask & permute    16 (33%)   1      2.64 KB    5.94 / 4.05 ns
24
Mandelbrot on FPGAs P2
25
Described by the following algorithm:
for each pixel (representing a complex point c) compute
  z_{n+1} = z_n^2 + c
while (|z_n| < 2 and n < limit)
pixel_color = n
Looks easy, but:
• For deep zoom levels, we need hundreds of bits
  → special techniques like Karatsuba multiplication
• Next while-loop iteration needs the result of the previous one
• Execution time of the whole while-loop is data dependent
• Goal: build a fast Mandelbrot set computation machine using hundreds or even thousands of multipliers
  → deep real-time zooms
(Image source: Wikipedia)
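A minimal software sketch of the per-pixel loop, assuming the usual start value z_0 = 0 and Python's built-in complex type (which has only double precision, so it does not cover the deep zoom levels mentioned above). The function name and the default limit are chosen here for illustration.

```python
def mandelbrot_iterations(c, limit=100):
    """Iterate z <- z*z + c from z = 0 until |z| >= 2 or the limit is reached."""
    z = 0j
    n = 0
    while abs(z) < 2 and n < limit:
        z = z * z + c
        n += 1
    return n   # used as the pixel color


inside = mandelbrot_iterations(-1 + 0j, limit=30)   # c = -1 cycles 0, -1, 0, ...
outside = mandelbrot_iterations(1 + 0j)             # c = 1 escapes quickly
```

The loop continues only while both conditions hold, so points inside the set cost the full `limit` iterations while points far outside finish after one or two. That is the data-dependent execution time noted in the bullets, and each iteration needs the previous `z`, so a single pixel cannot be pipelined on its own.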
26
Robot Vision P3
• FPGA for preprocessing (noise reduction)
• Segmentation and feature detection
• Stereo vision
• Speed measurement (motion estimation)
• Use a CPU for making decisions and for intelligent behavior
• Build human-machine interfaces (gesture control)
Goal: use cameras as powerful sensors
27
Robot Vision P3
Possible hardware:
• MT9D112 2-megapixel stereo camera module
• 1600x1200 maximum resolution at 15 FPS
Fits the Atlys board with a Xilinx Spartan-6 FPGA.
The hardware would work out of the box.
All algorithms are still to be implemented.
28
FPGA and Robotics
• Example: Microstepping for stepper motors using PWM (pulse-width modulation)
29
FPGA and Robotics
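• Example: Microstepping for stepper motors using PWM
• How to implement PWM (counter-based):
[Figure: output waveform alternating on/off over time t. A modulo counter (modcount) feeds a comparator (CMP) that compares the count with the value Aout; the counter's zero signal and the comparator output drive the set (S) and reset (R) inputs of the output flip-flop.]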
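A cycle-by-cycle sketch of the counter-based scheme. It assumes one plausible reading of the figure: the counter's zero signal sets the output and the comparator match resets it, with reset taking priority. The function and parameter names are made up for illustration.

```python
def pwm_counter(a_out, modulus, n_cycles):
    """Counter-based PWM: high for a_out out of every modulus clock cycles."""
    count, q, wave = 0, 0, []
    for _ in range(n_cycles):
        if count == 0:
            q = 1                 # S: start of a new period
        if count == a_out:
            q = 0                 # R: comparator match (wins over S, so a_out = 0 stays low)
        wave.append(q)
        count = (count + 1) % modulus
    return wave


wave = pwm_counter(a_out=3, modulus=8, n_cycles=16)
```

With `a_out=3` and `modulus=8` each period is three cycles high and five cycles low, a duty cycle of 3/8. Setting `a_out` equal to `modulus` means the comparator never matches and the output stays high.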
30
FPGA and Robotics
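• How to implement PWM (accumulator-based)
[Figure: an N-bit accumulator and the N-bit value Aout feed an adder (+); the sum is N+1 bits wide. The MSB of the sum is the output (high/low), and the lower N bits are stored back in the accumulator.]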
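The accumulator-based variant can be modelled the same way. The sketch assumes the reading of the figure in which the carry bit (MSB of the N+1-bit sum) is the output and the low N bits feed back; names are invented for illustration.

```python
def pwm_accumulator(a_out, n_bits, n_cycles):
    """Accumulator-based PWM: add a_out every cycle; the carry-out is the output bit."""
    mask = (1 << n_bits) - 1
    acc, wave = 0, []
    for _ in range(n_cycles):
        total = acc + a_out            # (N+1)-bit sum
        wave.append(total >> n_bits)   # MSB of the sum = carry = output
        acc = total & mask             # low N bits go back into the accumulator
    return wave


wave = pwm_accumulator(a_out=3, n_bits=3, n_cycles=8)
```

Over 2^N cycles the output is high exactly `a_out` times, the same average as the counter-based version, but the pulses are spread across the period instead of being grouped at its start. The logic is just an adder and a register per channel, which fits the remark that many analog outputs can be built this way.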
• Can be used to implement a hundred analog outputs.
• In a nutshell:
  FPGAs can be used for accelerators and peripherals
• Now it's your turn:
  Think about projects and enrol for Hardcore Projects
31
Any Questions? dirk.koch@manchester.ac.uk
https://sites.google.com/site/1uphardcore/