† saarland university, germany ‡ university of utah, usa estimating performance of a ray-tracing...

† Saarland University, Germany‡ University of Utah, USA

Estimating Performance of aRay-Tracing ASIC Design

Sven Woop† Erik Brunvand‡

Philipp Slusallek†

Ray Tracing in Car IndustryRay Tracing in Car Industry

Ray Tracing GamesRay Tracing Games

Previous WorkPrevious Work

• Ray Tracers for Static Scenes

• CPU based: [OpenRT], [MLRT SIGGRAPH05]

• GPU based: Purcell (Grids) [SIGGRAPH02], Foley et al. (KD Trees) [GH05]

• Custom Hardware: Commercial Hardware (ART-VPS) Schmittler (KD Trees) [GH04] RPU (KD Trees) [SIGGRAPH05]

• Ray Tracers for Dynamic Scenes

• CPU based: Wald (Grids) [SIGGRAPH06] Wald (AABVHs) [TOG / Tech. Rep. 2006]

• Custom Hardware: Woop (B-KD Trees) [GH06]

OutlineOutline

• Previous Work

• DRPU Architecture

• B-KD Trees

• Traversal Processor

• Prototype Implementations

• DRPU-FPGA

• DRPU-ASICs

• Conclusion

Definition of B-KD TreesDefinition of B-KD Trees

B-KD Tree (Bounded KD-Tree)

• Binary Tree

• 1D bounding intervalls for each child

• Leaf nodes point to a single primitive

split axisT1

B-KD Tree SubdivisionB-KD Tree Subdivision• Bounding Volume Hierarchy (partially unbounded)

• Each node can be associated with a full bounding box

• Bounds may overlap

Primitives in single leaf nodes

More traversal steps as for KD Tree

Support for dynamic scenes

T01T00 T11T10

Update of B-KD TreesUpdate of B-KD Trees

Update Procedure

• Bounds updated on changed geometry

• B-KD tree structure remains constant

Linear updating complexity

split axisT1

DRPU ArchitectureDRPU Architecture

TraversalProcessing

Geometry Unit

Node Cache128 Bit wide

Vertex Cache128 Bit wide

from memory

Shading Unit

to framebuffer

from memory

Shader Cache128 Bit wide

from memory

Skinning Processor

instructions from memory

nodes tomemory

vertices tomemory

Update Processor

vertices from memory

• Rendering Units

• Highly multi-threaded

• Higher hardware usage

• Synchronous execution of packets of 4 rays

• Memory bandwidth reduction

• First level caches

• Memory bandwidth reduction

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

• Programmable Shading Processor

• Design similar to fragment processors on GPUs

• Improved Programming Model

• Add highly efficient recursion

• Add flexible memory access

• Programming Model

• Ray generation tasks

• Material shading

• Calls Ray Casting Units to cast rays

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

• Programmable Shading Unit

• Ray Casting Units

• High-performance traversal and intersection

• Support for continous dynamic scenes

• B-KD Trees approach

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

• Efficient traversal of B-KD trees

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

• Efficient traversal of B-KD trees

• Geometry Unit

• Ray transformations

• Vertex-based ray/triangle intersection [Möller Trumbore]

• Shared vertices save memory 6x

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

• Scene Changes

• Skinning Processor

• Skeleton Subspace Deformation

• Re-uses Geometry Unit

• Pure stream architecture

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

• Scene Changes

• Skinning Processor (see paper)

• Skeleton Subspace Deformation

• Re-uses Geometry Unit

• Pure stream architecture

• Update Processor

• Stream-like architecture

• Partial breadth-first execution

• One B-KD node update per clock cycle peak

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

TraversalProcessing

Geometry Unit

from memory

Shading Unit

to framebuffer

from memory

Skinning Processor

nodes tomemory

vertices tomemory

Update Processor

Traversal of B-KD TreesTraversal of B-KD Trees

Traversal of B-KD Trees

• Early ray termination

• Clipping of near/far interval against both bounding intervalls

• Take closer child, push farther child to stack

• Traversal order does not affect correctness

Complexity

• 4x computational cost of KD tree traversal step

• 2x stack memory

Tcloser ch ild

Tfarther ch ild

Traversal ProcessorTraversal Processor

• Stack control computes next address

from memory

Stack Control Unit

Memory Access Unit

Slice 0 Slice 1 Slice 2 Slice 3

Decide0 Decide1 Decide2 Decide3

Packet Decision Unit

to Geometry Unit

start traversal

finished if stack empty

Traversal Processing UnitTraversal Processing Unit

• Next node is fetched from cache

from memory

Stack Control Unit

Memory Access Unit

to Geometry Unit

start traversal

• 4 traversal slices compute 4x4 distances to bounding planes

from memory

Stack Control Unit

Memory Access Unit

to Geometry Unit

start traversal

• 4 Decision Units compute per ray traversal decision

from memory

Stack Control Unit

Memory Access Unit

to Geometry Unit

start traversal

• Packet Decision Unit computes packet traversal decision

• Packet goes left if exists a that ray goes left

• Packet goes right if exists a ray that goes right

• Packet goes from left to right if exists a ray that goes into both children from left to right

from memory

Stack Control Unit

Memory Access Unit

to Geometry Unit

start traversal

• Packet Decision Unit computes packet traversal decision

• Packet goes left if exists a that ray goes left

• Packet goes right if exists a ray that goes right

• Packet goes from left to right if exists a ray that goes into both children from left to right

Incoherent packets possible

from memory

Stack Control Unit

Memory Access Unit

to Geometry Unit

start traversal

FPGA ImplementationFPGA Implementation

Hardware

• Xilinx Virtex4 LX160

• 66 MHz

• 1.0 GB/s (limited to 0.5 GB/s)

• 7.5 Gflops

• 2,3 Gflops programmable

• 5,2 Gflops fixed function

Implementation

• Packets of 4 rays

• 32 packets of rays

• 3x 8 KB caches, direct mapped

• 24 bit floating point

Virtex4 Board

ASIC DesignASIC Design

• Synthesis

• Synopsys Synthesis

• UMC 130nm CMOS process

• Place & Route

• Cadence Encounter

• Some manual placements to achieve good results

• Only DRPU Core

• No chip interface designed (PCI Express, DRAM, ...)

• No power estimation

DRPU-ASIC

DRPU-ASICDRPU-ASIC

Hardware

• UMC 130nm process

• Die size: 49 mm2

• 266 MHz clock

• 2.1 GB/s bandwidth

• 30 Gflops

Implementation Differences

• Larger caches (3x 16 KB, 4-way associative)

GPU ComplexityGPU Complexity

ATI R520 (October, 2005)

• 90nm process

• 288 mm2 die

• 600 MHz clock speed

• 170 GFlops programmable?

• 44,8 GB/s memory bandwidth

Implementation

• Packets of 4 fragments

• 16 fragment pipelines

• 8 vertex piplines

On-Chip ParallelizationOn-Chip Parallelization

• Thread Scheduler schedules packets

• High bandwidth memory interface to Rendering Units

Update Processor

Traversal Processor

Geometry Unit

Shader Processor

Rendering Unit 2

Traversal Processor

Geometry Unit

Shader Processor

Rendering Unit 1

Thread Scheduler

Host Bus Controller

to Host Bus

Memory Interface

External DRAM

Traversal Processor

Geometry Unit

Shader Processor

Rendering Unit N

DRPU4 ASICDRPU4 ASIC

Hardware

• UMC 130nm process

• 196 mm2 die (4 x 49 mm2)

• 266 MHz clock

• 8,5 GB/s

• 120 GFlops

• 4x DRPU ASIC

• No high level control

DRPU8-ASICDRPU8-ASIC

Hardware

• 90nm process (extrapolated using constant field scaling)

• 186 mm2 die

• 400 MHz clock speed

• 25,6 GB/s bandwidth

• 361 Gflops

• 110 Gflops programmable

• 471 Gflops fixed function

• 8x DRPU-ASIC

9,6 mm

19,3 mm

ResultsResults 1024x768, shadows

Results for DRPU8Results for DRPU8

Performance sufficient for game play

Room for improving image quality

Gael 91.2 fps DynGael 96.0 fps

Conclusions and Future WorkConclusions and Future Work

• Ray Tracing Hardware Design

• Support for programmable recursive shading

• Coherent scene changes

• Working Prototype Implementation

• Post layout ASIC Results

• Still no power results

• No direct performance comparison against GPU

Questions?Questions?

• Project Homepage:http://www.saarcor.de

• Computer Graphics Lab at Saarland University:http://graphics.cs.uni-sb.de

† saarland university, germany ‡ university of utah, usa estimating performance of a ray-tracing...

kd tree support

bkd tree subdivision

memory slide

kd trees vertices

woop bkd trees gh06

geometry bkd tree structure

single primitive slide

kd trees gh05 custom

Documents

video film and television production soundtracks ... ·...

international visitors - פסטיבל הסרטים...

vol. 150 - no. 24 sidney, new york — thursday, june 16,...

stefan popov space subdivision for bvhs stefan popov, iliyan...

realistic image synthesis - universität des...

afrigraph 2004 interactive ray-tracing of free-form surfaces...

woop...start with a clear mind. woop is an imagery technique...

simtrax: simulation infrastructure for exploring thousands...

network-integrated multimedia middleware - kde ·...

andreas dietrich ingo wald markus wagner philipp...

digital system design - utah...

library collections for sustainability amy brunvand, joshua...

design thinking meets - school of...

woop: rewards redemption process

saarland university, germany b-kd trees for hardware...

case study “big woop” hbs israel bermudez aleysjah...

realtime caustics using distributed photon mapping johannes...

kids in space - music room€¦ · woop-woop-woop warning...

way out woop woop

- history of computer graphics - philipp slusallek