simple gpu - oakland university · gpu architecture axi_lite 4x4 mvp matrix cam2pix / (optional)...

26
Simple GPU Anthony Bogedin Michael Lohrer

Upload: others

Post on 09-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Simple GPU

Anthony BogedinMichael Lohrer

Page 2: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

GPU Architecture

AXI_LITE

4X4

MVP

Matrix

CAM2PIX

/

(optional)

Line Draw

PIX2MA

Converter

M_AXI

Clip

Detectionenable

address

data

Divide

(x,y,z)/wAXI_FULL

16 bit color

32 to

64

W_H

reg

frame

addr

Page 3: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Vector Matrix Multiply 4D

Contains 4 MADDs

C represents the

column

S1

S2

S3

S5

S4

start

init=’1’ store=’1’ ctrl=0

store=’1’ ctrl=1

1

0

store=’1’ ctrl=2

store=’1’ ctrl=3

done=’1’

X

+

accum_reg_[C]

Matrix Row

MUX

Vertex

Component MUX

x y z 1.0 ctrlr1[C] r2[C] r3[C] r4[C] ctrl

0.0

init

store

[C]_out

Page 4: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Matrix Multiply Simulation

1.5469 -0.0195 -1.0078 -0.0156

-0.0156 -1.9414 -1.7227 -1.4297

0.0 -1.9414 -0.2773 1.4297

0.0 0.0 -3.8867 -5.7148

Matrix Under Test:

X Y Z W

Vertex: 1.5313 -3.9023 -6.8945 -5.7305

Matlab Output Comparison:

Page 5: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Divider

Used the pipelined divider provided by Llamoca, but placed a wrapper around

it to handle negative numbers and overflowTo get a fractional output, W was truncated by 12 bits ([32 16]/[20 4] = [32 12])

The X, Y, & Z components of the vertex need to be normalized before being

mapped to screen space

Dividing all three by W ensures that all values are between -1 and 1 if they are on screen

Since the matrix multiply takes 4 clock cycles, X, Y, & Z are feed into the divider one at a time

After division, they are stored in 3 registers to await processing in the next stage

Page 6: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Divider Tests

Example value with : 1.53125 [32 16] / -5.6875 [20 4] = -0.26904296875 [32 12]

= FFFFFBB2 [32 12]

Result from divider: -0.26611328125 [32 12] = FFFFFBBE [32 12]

Page 7: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Camera to Screen Space

Since Xcam and Ycam are from -1 to 1

Just multiply by half of width and height

Truncate fractional bits

Add half of width and height

X

+

reg

reg

>> 1

>> 12

Ypix

Ycamheight

h_heig

ht

X

+

reg

reg

>> 1

>> 12

Xpix

Xcamwidth

h_

wid

th

Page 8: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Clip Detection

If the input X and Y

coordinates are within

the screen dimensions

and there is a valid

vertex, set visible high

vis

ible

<

Xp

ix

wid

th

>=

Xp

ix

0.0

<

Yp

ix

he

igh

t

>=

Yp

ix

0.0

<

Zca

m

1.0

>

Zca

m

-1.0

Va

lid

0

reg

Page 9: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Screen to Array Index

frame_address = Xpix + width * Ypix

Multiply Ypix by width to jump to the correct

row

Add Xpix to jump to the correct column

Add 2 zeros for AXI word alignment

X

+

<< 2

Xpix Ypix width

reg

frame_address

Page 10: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

SGPU Simulation

Page 11: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Integer Line Drawing

Bresenham's line algorithm:

plotLine(x0,y0, x1,y1)

dx=x1-x0

dy=y1-y0

DIF = 4*dy - dx

y=y0

for x from x0 to x1

plot(x,y)

DIF = DIF + (2*dy)

if DIF > 0

y = y+1

DIF = DIF - (2*dx)

Page 12: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Line Drawing State Machine

Using two coordinates in Screen Space,

trace line between them

Takes two clock cycles

This can saturate the AXI bus

START

PRIME

DONE

RUN_DEC

E

dif=’1’ xstore=1

ystore=1 mux=0

1

0

dif=’1’ xstore=’1’

mux = 1 valid=’1’

dif=’1’ ystore=’1’

mux=2

RUN_INC

done=’1’

P

0

1

D

Z

Page 13: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Line Drawing Data Path

Every pixel cycle +1 to x

+1 to y dependant on dif

DIF_REGE

mux

dy&”00” dx

- -+

dx&’0’dy&’0’

>0

Z

dif

X_REGE

+1

mux

x0m

xstore

x_out

==x1m

D

Y_REGE

+1

mux

y0m

ystore

y_out

plotLine(x0,y0, x1,y1)

dx=x1-x0

dy=y1-y0

DIF = 4*dy - dx

y=y0

for x from x0 to x1

plot(x,y)

DIF = DIF + (2*dy)

if DIF > 0

y = y+1

DIF = DIF - (2*dx)

Page 14: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Limits of Simple Line Drawing

Only increments

Y increments at most once per X

increment

Only draws lines in the first 45° of the

first quadrant

+Y

+X

Page 15: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Folding Logic

Uses inputs and their

negations

MUXs folds the inputs to

draw line to always be

positive and ensure that

x is larger than y

Final MUX folds values

back into appropriate

quadrant slice

Page 16: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Additional details

When line drawing is active, two fifos are added into the data flow.

Drawing a line can take much longer than transforming a vertex into pixel space.

Other simple constructs and state machine glue logic from the labs and class.

32 to 64 bit temporal multiplexing for input

SM to give the next set of vertices to the line drawer when it is ready and there are two vertices

available in the fifo’s

Page 17: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Line Drawing Simulation

Page 18: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Now that we’ve got a GPU, we add AXI Interfaces to talk to the CPU and directly

to memory and then wrap it up as an IP core.

GPU in the Big Picture

AXI_LITE

4X4

MVP

Matrix

CAM2PIXPIX2MA

Converter

M_AXI

Clip

Detectionenable

address

data

Divide

(x,y,z)/wAXI_FULL

16 bit color

32 to

64

W_H

reg

frame

addr

Page 19: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

System Architecture

PS

VDMA

VDMA

Disp Ctrl

Disp Ctrl

VGA

HDMI TX HDMI

HP0

HP1

SGPU

SGPU

GP1

Pixels

Vertices

Config

Pixels

Vertices

Config

GP0 SyncINT

Page 20: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

From Zynq TRM

UG585 (v1.10)

p. 64

Page 21: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

AXI Lite Example Simulation

The AXI Lite interface is used for configuring the video frame’s height and width, the base frame address,

and the PV matrix.

Page 22: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

AXI Slave Full Example Simulation

First write loads X and Y Second write loads Z and Color

Page 23: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

AXI Master Full Example Simulation

Page 24: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Vivado ILA (Chipscope Pro)

The exact processor GP and HP AXI interfaces are hard to simulate since we don’t have a testbench for

them. Using Vivado Integrated Logic Analyzer (ILA) we can debug hardware while it is running.

Page 25: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Problems

AXI has two separate handshake signals

1 for the address

1 for the data

When the system first initializes, it seems that the address “ready” goes high but

the data does not causing the two output fifos to get out of sync

Line Drawing still has issues when a line leaves the screen edges

The screen clip module ensures that a pixel is never sent to a memory location outside of the

frame buffer, but the drawing starts to flicker as if vertices are being lost

Page 26: Simple GPU - Oakland University · GPU Architecture AXI_LITE 4X4 MVP Matrix CAM2PIX / (optional) Line Draw PIX2MA Converter M_AXI Clip Detection enable address data Divide AXI_FULL

Demo and Questions