
1UP Hard-Core Projects

Dirk Koch ([email protected], manchester.ac.uk)

Let me introduce myself

Electrical Engineering (automation technology)
Neural Networks (autonomous systems)
Self-adaptive, fault-tolerant, distributed, reconfigurable embedded systems
FPGA design tools for run-time reconfiguration
Parallel systems, FPGA virtualization
High-performance stream processing

Before university projects: 6502 computer, CNC machine, ...
Since September: Lecturer in the APT Group

[Figure: design flow. A system specification (communication architecture & floorplan) is used to generate the static system and to generate the module repository. Modules shown: join, sort, mean. Modules are linked into the static design with:

    bitlink module.bit -pos X,Y static.bit -outfile initial.bit

Static design: PCIe, memory, filesystem, management, reconfiguration.]

Hardcore Projects: Goals and Content

• Fun
• Creativity
• Build hardware-software systems (machine vision, robotics, computing, ...)
• Compute without processors (there is more than CPUs and GPUs)
• Get access to hardware (FPGA boards)
• Get support for your own projects, or
• Learn hardware design through prepared projects

https://sites.google.com/site/1uphardcore/

What is an FPGA?

• An FPGA (Field-Programmable Gate Array) is a digital chip whose function is defined after production ("in the field")
• Useful for small volumes: one FPGA fits many different designs
  • reduces the initial mask-set cost (several million USD)
  Example AIBO: generations 1, 2 and 3: ~60K units each
• Updatable hardware (a requirement for telecommunication equipment and prototyping)

CPU, DSP, hardware accelerator, microcontroller, peripheral, network processor:
FPGA technology can implement all of them!

What does an FPGA look like?

Logic, arithmetic and memory blocks + interconnection network

[Figure: FPGA fabric. Labels: logic cell (4 x 6-input look-up table, LUT); switch matrix (~200 multiplexers); wires; 18x25-bit multiplier; 72x512 dual-ported memory (4 KB + ECC).]

Big FPGAs:
• > 1M LUTs
• > 2000 multipliers
• > 2000 memories
• billions of transistors

[Figure: Altera Stratix IV. I/O (Gigabit transceivers), logic, memory & multipliers (DSP).]

Look-up tables (LUTs) as the basic building block

Addr  A3 A2 A1 A0   LUT value (AND gate)   LUT value (OR gate)
0     0  0  0  0    0                      0
1     0  0  0  1    0                      1
2     0  0  1  0    0                      1
3     0  0  1  1    0                      1
4     0  1  0  0    0                      1
5     0  1  0  1    0                      1
6     0  1  1  0    0                      1
7     0  1  1  1    0                      1
8     1  0  0  0    0                      1
9     1  0  0  1    0                      1
A     1  0  1  0    0                      1
B     1  0  1  1    0                      1
C     1  1  0  0    0                      1
D     1  1  0  1    0                      1
E     1  1  1  0    0                      1
F     1  1  1  1    1                      1

How does it work?

The inputs A3..A0 form a 4-bit address that selects one of the 16 stored LUT values (entries 0 to F).

• LUT values: part of the configuration data
• Slice FF: can store state data

[Figure: a 16-entry LUT (entries 0..F) addressed by A0..A3, followed by a slice flip-flop (FF).]
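The table above can be mimicked in software. The following is a minimal Python sketch, not part of the original slides: the helper name `make_lut` and the list-based storage are illustrative assumptions. It shows that one and the same structure becomes an AND gate or an OR gate purely by changing the 16 stored values.

```python
def make_lut(values):
    """Build a 4-input LUT from 16 stored bits (hypothetical helper).

    values[i] is the output for the input address i, where the address
    is formed from the inputs as A3 A2 A1 A0 (A3 is the most significant bit).
    """
    if len(values) != 16:
        raise ValueError("a 4-input LUT stores exactly 16 values")

    def lut(a3, a2, a1, a0):
        # The inputs only select an entry; no logic gates are evaluated.
        address = (a3 << 3) | (a2 << 2) | (a1 << 1) | a0
        return values[address]

    return lut


# Same structure, different configuration data:
and4 = make_lut([0] * 15 + [1])   # only address F (1111) stores a 1
or4 = make_lut([0] + [1] * 15)    # only address 0 (0000) stores a 0
```

Reconfiguring the "hardware" here means nothing more than passing a different list of 16 values.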

FPGAs versus CPUs

• Slow clock (~300 MHz), but highly parallel execution (>1000 operations)
• Moderate I/O throughput, but >1 MB of on-chip memory at >1 TB/s
• Difficult VHDL programming, but C++ is coming up

Data flow oriented vs. control flow dominated:

    Control flow dominated:
        if (old_position) then
          case position is
            42: if free then

    Data flow oriented:
        for i=1 to ∞ loop
          numbercrunching;

FPGAs versus CPUs

    for i = 0 to 7 do {
      tmp = A[i] & x"F";
      tmp = tmp + 42;
      Q[i] = tmp * 24;
    }

von Neumann computer: the loop is executed as an instruction stream operating on a data stream.

    LDI reg_i,0
    L1: ANDI r_tmp,$i,xF
    ...
    BLI reg_i,L1

Reconfigurable computing: the loop body is built as a circuit.

[Figure: loop unrolling gives eight parallel datapaths, A[i][3..0] → (+ 42) → (* 24) → Q[i] for i = 0..7; pipelining streams A[0]..A[7] through a single datapath (+ 42, * 24) that produces Q[0]..Q[7].]

• Data parallel processing on CPUs:
  • use SIMD and/or multiple threads for different index points (e.g. GPUs)
  "1 worker builds a complete car; multiple workers operate in parallel on different cars"
• Data parallel processing on FPGAs:
  • build a processing pipeline
  "car assembly line with various specialized workers; multiple cars are in the production line at different processing stages"
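To make the assembly-line picture concrete, here is a small Python sketch of the example loop (mask with 0xF, add 42, multiply by 24) executed in both styles. It is a software model only: the function names, the use of `None` for empty pipeline registers, and the three-stage split are assumptions made for illustration, not something taken from the slides.

```python
def cpu_style(a):
    """One 'worker' finishes each element completely before the next one."""
    q = []
    for x in a:
        tmp = x & 0xF
        tmp = tmp + 42
        q.append(tmp * 24)
    return q


def pipeline_style(a):
    """Three specialized stages (AND, ADD, MUL) with a register between them.

    Each loop iteration models one clock cycle in which all stages work
    at once on different elements. None marks an empty pipeline register.
    """
    reg_and = None   # output register of the AND stage
    reg_add = None   # output register of the ADD stage
    q = []
    cycles = 0
    # Two extra empty inputs drain the last elements out of the pipeline.
    for x in list(a) + [None, None]:
        # Stage 3 (MUL) consumes what stage 2 registered in the previous cycle.
        if reg_add is not None:
            q.append(reg_add * 24)
        # Stage 2 (ADD) consumes what stage 1 registered in the previous cycle.
        reg_add = reg_and + 42 if reg_and is not None else None
        # Stage 1 (AND) takes the next input element.
        reg_and = x & 0xF if x is not None else None
        cycles += 1
    return q, cycles
```

Both functions compute the same results. The pipeline needs `len(a) + 2` cycles in this model: two cycles to fill, then one result per cycle.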

FPGAs versus CPUs

[Figure: the operation sequence LD, MUL, ADD, ROR, ACC, MUL, ADD, ST applied to data sets 0..3 (values a, b, c, k, l, x, y), shown for four execution styles: scalar processing, SIMD processing (e.g. GPU), multithreading, and dataflow processing / pipelining (FPGAs).]

Pipelining is often better for global states.

[Figure: pipelined datapath with DMA on input and output: LD, MUL, ADD, ROR, ACC, ST operating on x with the constants 42 and 24 and an accumulator.]

How does it help?

Example: skin color detection

Algorithm: [figure]

Message:
• >20 instructions per pixel
• >1 GOPS @ 50 MHz pixel clock
• >2 GOPS @ 100 MHz pixel clock
• >>200 GOPS device potential (Xilinx Spartan-6 LX45, ~50 USD)
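The GOPS figures follow from multiplying the instruction count per pixel by the pixel clock, since a pipelined design accepts one pixel per clock cycle. A short Python check of that arithmetic (the function name is a made-up helper, and 20 is used as the lower bound quoted above):

```python
def giga_ops_per_second(instr_per_pixel, pixel_clock_hz):
    """Operations per second, in units of 10^9, for one pixel per clock cycle."""
    return instr_per_pixel * pixel_clock_hz / 1e9


gops_50mhz = giga_ops_per_second(20, 50_000_000)    # lower bound at 50 MHz
gops_100mhz = giga_ops_per_second(20, 100_000_000)  # lower bound at 100 MHz
```

With at least 20 operations per pixel this gives at least 1 GOPS at 50 MHz and at least 2 GOPS at 100 MHz, matching the numbers listed.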


Where is the Problem?

Gap between reconfigurable FPGAs and dedicated ASICs

FPGA versus ASIC @ 90 nm process*:
• ~3-5x slower clock speed
• ~14x more dynamic power
• ~18x larger chip area

• Note that the gap towards a programmable von Neumann machine could be even orders of magnitude higher!
• Also: programming / debugging is sometimes harder

Solution: partial run-time reconfiguration (PR): reusing the resources of a reconfigurable architecture by multiple modules over time. Only parts of a system might be updated while the remaining system continues operation.

*Kuon & Rose: Measuring the Gap Between FPGAs and ASICs, IEEE Trans. on CAD, 2007.

What is partial reconfiguration?

Changing modules during operation while keeping the rest of the system active
(like plugging a new network card into your PC while the machine is running)

Example: Large Problem Sorting

• Large problem sorting arises in particular in databases
• Problems in the GB range (→ multiple runs)
• Many famous algorithms:
  • Bubble sort: O(n^2)
  • Quicksort: O(n log n) on average
  • Merge sort: O(n log n)
• What about taking log n sorting cells in order to get linear time complexity O(n)?

Sorting: Algorithm Discussion

FIFO-based Merge Sorter:
• Suitable for large problems: memory bound, but not logic bound
• Difficult to pipeline

[Figure: merge sorter variants a), b) and c), built from comparator cells (>).]
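The idea of using about log n merge cells can be sketched in software. The Python model below is an illustration under stated assumptions, not the hardware design from the slides: each stage is assumed to merge adjacent sorted runs of length 2^k into runs of length 2^(k+1), with `collections.deque` standing in for the hardware FIFOs, and the names `merge_stage` and `fifo_merge_sort` are made up here.

```python
from collections import deque


def merge_stage(stream, run_len):
    """Merge adjacent sorted runs of length run_len into runs of 2 * run_len."""
    out = []
    for start in range(0, len(stream), 2 * run_len):
        # Two FIFOs hold the two runs; the comparator pops the smaller head.
        fifo_a = deque(stream[start:start + run_len])
        fifo_b = deque(stream[start + run_len:start + 2 * run_len])
        while fifo_a and fifo_b:
            if fifo_a[0] <= fifo_b[0]:
                out.append(fifo_a.popleft())
            else:
                out.append(fifo_b.popleft())
        # One FIFO is empty; flush whatever is left in the other.
        out.extend(fifo_a)
        out.extend(fifo_b)
    return out


def fifo_merge_sort(data):
    """Sort by chaining ceil(log2(n)) merge stages; returns (sorted, stages)."""
    stream = list(data)
    stages = (len(stream) - 1).bit_length() if stream else 0
    for k in range(stages):
        stream = merge_stage(stream, 1 << k)
    return stream, stages
```

In this model every element passes through each stage exactly once. The software runs the stages one after another, so it still does O(n log n) work; the linear-time argument applies when the stages are separate hardware cells working concurrently, so that one element can enter and leave the chain per clock cycle.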

Partial Reconfiguration and Sorting

[Figure: sorting system. Labels: PCIe 8x, FPGA, memory controller, 2 GB/s DDR3, prefetcher (max burst size, max latency), comparator cells (>), unsorted stream, sorted output; context switching of the FPGA between configurations A, B, C and D with data kept in MEM, from the initial step to the final step.]

Processing steps:
• Input: unsorted data stream
• Initial step: fully sorted sequences
• [Intermediate steps]: merging
• Final step: merge and emit result

50% area saving, or 4 times larger problems, compared to a static design

• Next step: hierarchical reconfiguration: swap comparator cells for different data types (integer, text, …)

1UP Hard-Core Projects

https://sites.google.com/site/1uphardcore/

Build your own CPU (P1)

RTL example of a baseline MIPS CPU [figure]

… and customize it

• For example with reconfigurable custom instructions

[Figure: custom instruction unit. Labels: register file, OP_A, OP_B, instruction, instr. conf., result.]

• Examples

instruction        slices      slots   bitstream   latency (max/av)
64-bit XOR gate    19 (40%)    1       2.64 KB     7.04 / 5.95 ns
CCITT CRC          33 (34%)    2       5.28 KB     5.32 / 3.98 ns
sat. add/sub       70 (73%)    2       5.28 KB     9.89 / 7.81 ns
barrel shifter     90 (94%)    2       5.28 KB     11.07 / 7.88 ns
'1'-bit counter    214 (89%)   5       13.2 KB     11.37 / 8.25 ns
mask & permute     16 (33%)    1       2.64 KB     5.94 / 4.05 ns
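For readers unfamiliar with the listed operations, here are plain Python reference models of two of them, the saturating add and the '1'-bit counter. These are behavioural sketches only: the 32-bit unsigned operand width and the function names are assumptions, since the slides do not specify operand widths or signedness.

```python
def sat_add(a, b, bits=32):
    """Unsigned saturating add: clamp at the maximum instead of wrapping.

    The 32-bit unsigned format is an assumption made for this sketch.
    """
    limit = (1 << bits) - 1
    return min(a + b, limit)


def ones_count(x):
    """'1'-bit counter: number of set bits in a non-negative integer."""
    return bin(x).count("1")
```

On a standard CPU each of these takes several instructions; as a custom instruction each becomes a single operation in the reconfigurable slot.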

Mandelbrot on FPGAs (P2)

Described by the following algorithm:

    for each pixel (representing a complex point c) compute
      z_0 = 0
      while (|z_n| < 2 and n < limit)
        z_{n+1} = z_n^2 + c
      pixel_color = n

Looks easy, but:
• For deep zoom levels, we need hundreds of bits
  • special techniques like Karatsuba multiplication
• The next while-loop iteration needs the result of the previous one
• The execution time of the whole while-loop is data dependent
• Goal: build a fast Mandelbrot set computation machine using hundreds or even thousands of multipliers
  • deep real-time zooms

[Figure: Mandelbrot set. Source: Wikipedia]
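As a software reference for the algorithm above, here is a minimal Python version of the per-pixel loop. It uses Python's built-in `complex` type, which only has double precision, so it does not address the hundreds of bits needed for deep zooms; the function name and the default `limit` are illustrative choices.

```python
def mandelbrot_iterations(c, limit=100):
    """Return n, the number of iterations before |z| reaches 2 (capped at limit)."""
    z = 0j
    n = 0
    # Both conditions must hold to continue: z has not escaped AND the
    # iteration budget is not used up.
    while abs(z) < 2 and n < limit:
        z = z * z + c   # z_{n+1} = z_n^2 + c
        n += 1
    return n
```

The loop makes the two difficulties visible: each iteration depends on the previous `z`, and the number of iterations differs from pixel to pixel.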

Robot Vision (P3)

Goal: use cameras as powerful sensors

• FPGA for preprocessing (noise reduction)
• Segmentation and feature detection
• Stereo vision
• Speed measurement (motion estimation)
• Use a CPU for making decisions and for intelligent behavior
• Build human-machine interfaces (gesture control)

Possible hardware:
• MT9D112 2-megapixel stereo camera module
• 1600x1200 maximum resolution at 15 FPS

Fits the Atlys board (Xilinx Spartan-6 FPGA).
The hardware would work out of the box. All algorithms are still to be implemented.

FPGA and Robotics

• Example: microstepping for stepper motors using PWM (pulse-width modulation)
• How to implement PWM (counter-based):

[Figure: output waveform alternating on/off over time t. A modulo counter (modcount) and the value A feed a comparator (CMP); the comparator output and the counter's zero signal drive the set (S) and reset (R) inputs of a flip-flop that produces out.]
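A counter-based PWM of this kind can be modelled cycle by cycle. The Python sketch below is one plausible reading of the figure, stated as an assumption: the output is set when the modulo counter is zero and reset when the counter equals the duty value A. The function and parameter names are made up for illustration.

```python
def counter_pwm(duty, period, cycles):
    """Simulate a counter-based PWM for a number of clock cycles.

    duty   -- compare value A (number of 'on' cycles per period)
    period -- modulus of the counter
    Returns a list with the output level (0 or 1) for each cycle.
    """
    level = 0
    out = []
    for t in range(cycles):
        count = t % period        # modulo counter
        if count == 0:            # zero detect sets the flip-flop
            level = 1
        if count == duty:         # comparator match resets it (reset wins)
            level = 0
        out.append(level)
    return out
```

With `duty=3` and `period=8` the output is high for 3 of every 8 cycles, and changing `duty` changes the average output level, which is what microstepping uses to set the coil currents.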

FPGA and Robotics

• How to implement PWM (accumulator-based):

[Figure: accumulator-based PWM. An adder (+) combines the N-bit accumulator with the N-bit value A into an (N+1)-bit sum; out is taken from the MSB and switches between low and high.]

• Can be used to implement a hundred analog outputs.
• In a nutshell: FPGAs can be used for accelerators and peripherals
• Now it's your turn: think about projects and enrol in Hardcore Projects
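Returning to the accumulator-based PWM described above, the following Python sketch models it under one assumption about the figure: an N-bit accumulator adds the N-bit value A every clock cycle, the carry (the MSB of the (N+1)-bit sum) is the output, and the low N bits are stored back. The names are illustrative.

```python
def accumulator_pwm(a, n_bits, cycles):
    """Simulate an accumulator-based PWM; returns the output bit per cycle."""
    mask = (1 << n_bits) - 1
    acc = 0
    out = []
    for _ in range(cycles):
        total = acc + a               # (N+1)-bit sum
        out.append(total >> n_bits)   # MSB of the sum is the output
        acc = total & mask            # keep the low N bits
    return out
```

Over 2^N cycles the output is high exactly `a` times, so the average level is a / 2^N, and the pulses are spread out over the period instead of forming one block as in the counter-based variant. Each output needs only an adder and a register, which fits the remark that a hundred analog outputs are feasible.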


Any Questions? [email protected]

https://sites.google.com/site/1uphardcore/