Page 1: A SIMD-efficient 14 Instruction Shader Program for High …attila.ac.upc.edu/wiki/images/9/95/CGI10_microtriangles_presentation.pdf · A SIMD-efficient 14 Instruction Shader Program

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Jordi Roca · Victor Moya · Carlos Gonzalez · Vicente Escandell · Albert Murciego

Agustin Fernandez, Computer Architecture Department, UPC

Roger Espasa, Intel


Micropolygons & Tessellation: The future trend in interactive 3D rendering for improved Level-Of-Detail


I'm presenting today…

• An alternative GPU rasterization pipeline to efficiently process microtriangles.

• Our approach processes several microtriangles in parallel using GPU shader threads:

– Scalable throughput is guaranteed in the next GPU generations.


Outline

• The Rasterization of Microtriangles.

• Parallel Rasterization in GPU Shaders.

• Problems & Solutions.

• Performance Results.

• Conclusion.


The triangle rasterizer job

[Figure: a triangle with vertices V0 (1,0), V1 (13,4) and V2 (2,7) drawn on a pixel grid with X and Y axes]

Input: screen-projected vertex coordinates: {(1,0),(13,4),(2,7)}

Output: covered fragments: {(1,0),(1,1),(2,1),(3,1), …}


Three Triangle Rasterization Approaches

• Scan Lines (Pomegranate '00)
  – Hard to parallelize
  – Software renderers

• Edge Equations (setup + traversal), E(x,y) : ax + by + c = 0
  – Traversal variants: Tile Scan, Recursive
  – Intel Larrabee (software-based, SIMD-16)
  – NVIDIA GPUs, '96 - today

• X-Products (Fatahalian K., '09 and THIS WORK)
  – More efficient for very small triangles.
  – Independent per-pixel computation.
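As a rough illustration of the X-products idea, the sketch below tests every pixel of a triangle's bounding box independently with three cross products. It is not the 14-instruction shader program from the talk: the function names are hypothetical, floating point is used for brevity, and two assumptions are made that the slides do not state, namely that the sample position is the pixel center and that samples lying exactly on an edge are not counted.

```python
import math

def cross(a, b, p):
    """2D cross product (b - a) x (p - a); its sign says on which side of edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covered_fragments(v0, v1, v2):
    """Return the pixels whose centers lie strictly inside the triangle, row by row."""
    if cross(v0, v1, v2) < 0:
        v1, v2 = v2, v1              # fix the winding so that "inside" means all cross products > 0
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    out = []
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):
        for x in range(math.floor(min(xs)), math.ceil(max(xs))):
            p = (x + 0.5, y + 0.5)   # assumed sample position: the pixel center
            # Every pixel is tested on its own; no per-triangle setup state is reused.
            if cross(v0, v1, p) > 0 and cross(v1, v2, p) > 0 and cross(v2, v0, p) > 0:
                out.append((x, y))
    return out

fragments = covered_fragments((1, 0), (13, 4), (2, 7))
print(fragments[:4])  # [(1, 0), (1, 1), (2, 1), (3, 1)]
```

Under these assumptions the first four fragments for the triangle {(1,0),(13,4),(2,7)} are the same ones listed on the rasterizer-job slide.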


Setup equations or X-products?

[Figure: Rasterizer efficiency (ops per pixel, lower = better). Y axis: floating-point operations (0 to 80); X axis: triangle size (1 to 12); series: Edge Equations, X-products]

• The high cost of triangle setup is not amortized for ≤ 2-pixel triangles.

• Cross-products are more efficient for very small triangles.


The GPU's bottleneck in microtriangles: a single Setup unit!

• Typical 2009-GPU rates: 1 tri : 32 pix / clock

• But the microtriangle ratio is 1 tri : ≈1 pix

• The single Setup unit starves the Pixel Pipeline (Shader/ZStencil/Color)

• Need more microtriangle throughput … Can shader units help?

[Figure: Utilization of the different GPU units rendering a stream of 1-pixel-size microtriangles. Y axis: unit utilization (0% to 100%); X axis: time (Kcycles, 1 to 11); series: rstz/setup, memory, color, Z/stencil, shader]
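The rates quoted on this slide can be turned into a back-of-the-envelope estimate of how busy the pixel pipeline stays. The model below is a deliberately simple assumption (setup is the only source of fragments, everything else is ignored) and the function name is hypothetical; it is not the simulation behind the utilization chart.

```python
def pixel_pipeline_utilization(tris_per_clock, pixels_per_tri, pixel_rate_per_clock):
    """Fraction of the pixel pipeline kept busy when setup is the only producer of fragments."""
    produced = tris_per_clock * pixels_per_tri      # fragments emitted per clock
    return min(1.0, produced / pixel_rate_per_clock)

print(pixel_pipeline_utilization(1, 32, 32))  # 32-pixel triangles: 1.0
print(pixel_pipeline_utilization(1, 1, 32))   # 1-pixel microtriangles: 0.03125
```

With one triangle set up per clock and about one pixel per microtriangle, this model leaves roughly 97% of a 32 pix/clock pipeline idle.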


How could we increase the throughput for microtriangles?

• Option 1: Replicate N times the Triangle Setup unit
  – Increases area
  – Does not scale to very large number of microtriangles

• Option 2: Use the shader units to render microtriangles
  – THIS WORK.
  – No area cost
  – Large triangles still use the existing triangle setup unit
  – Scales in the future as
    • Microtriangles are more frequent
    • Future GPUs offer more shader cores


Proposed Microtriangle pipeline

• Selectable by the API user.

[Figure: two pipelines side by side with the same front end (Vertex Buffer and Index Buffer, Input Assembler, Vertex Shader, Geometry Shader, Stream Output, with Texture inputs).
Standard DX10 Pipeline (for normal triangles): Rasterizer/Interpolator, Pixel Shader, Output Merger; resources: Texture, Depth/Stencil, Render Target.
Microtriangle Pipeline: Triangle Bound, then Rasterize & Shade Pixels (Rasterize, Interpolate, Pixel Shader), Output Merger; resources: Texture, Depth/Stencil, Render Target.]


Parallel Rasterization in GPU Shaders

1. Fill shader vector groups with fragments within the bounding boxes of n input microtriangles.

2. Run the rasterization program on multiple fragments followed by the original API fragment shader.

3. Reorder shaded fragments and do Z Test.

[Figure: shader thread flow: Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader, Thread Exit. Example lane values after Z interpolation: Z = 3, 5, 7, 9, 1, 1, 2. After attribute interpolation, (Z, S, T) = (3,1,0), (5,0,1), (7,0,0), (9,1,1), (1,0,0), (1,1,1), (2,1,0).]
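Step 1 can be pictured as flattening the bounding boxes of several microtriangles into one stream of candidate fragments and cutting that stream into vector-sized groups. The sketch below is only an illustration under assumptions: `pack_fragments` is a hypothetical name, bounding boxes are taken as inclusive pixel rectangles, and the vector width of 4 is arbitrary.

```python
def pack_fragments(bboxes, width=4):
    """Pack candidate fragments from several bounding boxes into vector groups.

    bboxes: one (x0, y0, x1, y1) inclusive pixel rectangle per microtriangle.
    width:  lanes per shader vector group (hypothetical value).
    """
    lanes = []
    for tri_id, (x0, y0, x1, y1) in enumerate(bboxes):
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                lanes.append((tri_id, x, y))  # each lane remembers which triangle it must test
    return [lanes[i:i + width] for i in range(0, len(lanes), width)]

groups = pack_fragments([(0, 0, 1, 0), (5, 5, 5, 5), (2, 2, 3, 3)], width=4)
for group in groups:
    print(group)
```

Because one group can mix fragments of different triangles, every lane carries its triangle identifier; lanes whose fragment later fails the coverage test are the "sparse vector" slots discussed under Problems & Solutions.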


The required features of our rasterization program:

• Consistent rasterization (no cracks or repeated pixels):
  – Fixed-point arithmetic.
  – Tie break rule for adjacent edges.

• Full support of modern GPU aspects:
  – Z interpolation:
    • Perspective
    • Orthogonal
  – Attribute interpolation:
    • Flat
    • Non-perspective correct
    • Perspective correct
    • Centroid
  – Face culling:
    • Front/Back/Front&Back
  – MSAA:
    • x2, x4, x6, x8
    • Customizable patterns


Shading of sparse vectors: Bounding Box optimization pre-pass

• Increases the density of microtriangle vectors by 20 to 45%.

• Culls entirely subpixel microtriangles (55% culling ratio).

• Simple hardware (four comparators, four adders) performs this optimization inside the Triangle Bound unit.

[Figure: pixel-accurate BB compared with subpixel-accurate BB. Callouts: "Can shrink these BB sides"; "The gap tells that those pixels will never really be hit!"]
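One way to read the difference between the two bounding boxes is sketched below. It assumes that a pixel can only be hit if its center falls inside the triangle's extent, which the slide does not state explicitly, and it uses floor/ceil arithmetic rather than the comparator-and-adder hardware mentioned above. All function names are hypothetical.

```python
import math

def pixel_accurate_bb(verts):
    """Every pixel touched by the triangle's extent, as inclusive (x0, y0, x1, y1)."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.floor(max(xs)), math.floor(max(ys)))

def subpixel_accurate_bb(verts):
    """Only pixels whose center (x + 0.5, y + 0.5) lies within the triangle's extent."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    return (math.ceil(min(xs) - 0.5), math.ceil(min(ys) - 0.5),
            math.floor(max(xs) - 0.5), math.floor(max(ys) - 0.5))

def fragment_count(bb):
    """Number of candidate fragments in a box; an empty box means the triangle can be culled."""
    x0, y0, x1, y1 = bb
    return max(0, x1 - x0 + 1) * max(0, y1 - y0 + 1)

tri_a = [(0.25, 0.75), (2.25, 1.25), (1.0, 1.75)]
tri_b = [(1.625, 3.625), (2.375, 3.75), (2.0, 4.25)]   # lies between pixel centers
print(fragment_count(pixel_accurate_bb(tri_a)), fragment_count(subpixel_accurate_bb(tri_a)))  # 6 2
print(fragment_count(pixel_accurate_bb(tri_b)), fragment_count(subpixel_accurate_bb(tri_b)))  # 4 0
```

In this example the tighter box shrinks the first triangle from six candidate fragments to two and removes the second triangle altogether, which is the kind of vector-density gain and culling the bullets describe.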


Avoid cracks or repeated pixels: Use of Fixed-Point arithmetic

• The rasterization program must ensure that each single pixel is hit by exactly one microtriangle in the mesh (no cracks, no repeated pixels).
  – Extended the shader ISA with FXMUL and FXMAD fixed-point instructions which provide consistent cross-product results across microtriangles.

[Figure: A lit mesh of adjacent microtriangles, rendered with 32-bit floating point and with 24.8-bit fixed point]
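The sketch below illustrates why exact integer cross products plus a tie-break rule give each pixel to exactly one of two triangles that share an edge. It is an assumption-laden illustration: plain Python integers stand in for the FXMUL and FXMAD instructions (whose exact semantics the slides do not give), the tie-break rule shown is one common edge-direction rule rather than necessarily the one used in the talk, and all function names are hypothetical.

```python
FRAC_BITS = 8                 # 24.8 fixed point: 8 fractional bits
ONE = 1 << FRAC_BITS

def to_fixed(v):
    """Snap a coordinate to the 1/256 grid."""
    return int(round(v * ONE))

def edge(a, b, p):
    """Exact integer cross product (b - a) x (p - a)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def owns_boundary(a, b):
    """Tie break: of the two directions of a shared edge, exactly one owns the points on it."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return dy > 0 or (dy == 0 and dx < 0)

def covers(tri, px, py):
    """True if the center of pixel (px, py) belongs to the triangle."""
    a, b, c = [(to_fixed(x), to_fixed(y)) for x, y in tri]
    area = edge(a, b, c)
    if area == 0:
        return False          # degenerate triangle covers nothing
    if area < 0:
        b, c = c, b           # make the winding consistent
    p = (px * ONE + ONE // 2, py * ONE + ONE // 2)
    for s, e in ((a, b), (b, c), (c, a)):
        w = edge(s, e, p)
        if w < 0 or (w == 0 and not owns_boundary(s, e)):
            return False
    return True

# Two triangles sharing the diagonal of a 4x4 quad; the diagonal passes through pixel centers.
t1 = [(0, 0), (4, 0), (4, 4)]
t2 = [(0, 0), (4, 4), (0, 4)]
hits = [[covers(t1, x, y) + covers(t2, x, y) for x in range(4)] for y in range(4)]
print(hits)  # every entry is 1: no cracks, no repeated pixels
```

The pixel centers on the shared diagonal evaluate to exactly zero for both triangles, so only the tie-break rule decides their owner; with rounded floating-point products that zero is not guaranteed on both sides.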


Great microtriangle throughput scaling

• Rendering of 1/2-pixel and 1/8-pixel-size microtriangle meshes speeds up 1.3X to 4X with 16 shader cores, with respect to the traditional GPU rasterizer unit.

• The better scaling of the 1/8 size (blue) is due to the effectiveness of the Bounding Box optimization.


Conclusions

• Near-term 3D rendering demands a microtriangle pipeline to efficiently process tessellated surfaces.

• Current GPU rasterizers are not intended for microtriangles:
  – Designed for high pixel rates on triangles larger than ~10 pixels.
  – Poor microtriangle throughput to feed the pixel pipeline.
  – Replication inefficiently increases area: Bad scalability.

• We propose to rasterize microtriangles in GPU shaders.
  – The largest and most scalable resource in today's GPUs.
  – Using the more efficient X-products instead of edge setup.
  – As an alternative pipeline selectable by the API user.

• Problems and solutions:
  – Shading of sparse vectors: Bounding Box optimization pre-pass.
  – No cracks or repeated pixels by using Fixed-Point operations.


Thank you!

Q&A


BACKUP


Enable Early Z Optimization

[Figure: three variants of the shader thread flow, all built from the stages Thread Entry, Rasterization, Z Interpolation, Attribute Interpolation, Original DirectX Fragment Shader and Thread Exit: one followed by a Late Z Test; one split into two threads with an Early Z Test between them; and one in which a single thread issues a Z test request, sleeps (Sleep Thread), and resumes on the Z test result (Early Z Test).]