Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware
Tim Foley, Mike Houston, Pat Hanrahan
Computer Graphics Lab, Stanford University
Motivation
GPU Programming
  Interactive shading
  Offline rendering
  Computation
    physical simulations
    numerical methods
    BrookGPU [Buck et al. 2004]
Shouldn’t be constrained by hardware limits, but demands high runtime performance
Motivation – Multipass Partitioning
Divide GPU program (shader) into a partition
  set of rendering passes
  each pass satisfies all resource constraints
  save/restore intermediate values in textures
Many possible partitions exist
The problem: given a program, find the best partition
Related Work
SGI’s ISL [Peercy et al. 2000]
  treat OpenGL machine as SIMD processor
Recursive Dominator Split (RDS) [Chan et al. 2002]
  graph partitioning of shader dag
Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004]
  partition around flow control and schedule passes
Mio [Riffel et al. 2004]
  instruction scheduling with backtracking
Contribution
Merging Recursive Dominator Split (MRDS)
MRDS – Extends RDS
  support shaders with multiple outputs
  support hardware with multiple render targets
  generate better partitions
  same asymptotic running time as RDS
Outline
Motivation Related Work RDS Algorithm MRDS Algorithm Results Future Work
RDS - Overview
Input: dag of n nodes
  shader ops
  inputs: interpolants, constants, textures
Goal: mark subset of nodes as splits
  split nodes define pass boundaries
  2^n possible subsets
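As a rough illustration of how a set of split nodes defines pass boundaries, the sketch below walks a small, made-up shader dag. The node names, the dag itself, and the helper `passes_from_splits` are hypothetical and not taken from the RDS implementation; the only idea borrowed from the slide is that a split node's value is saved and read back rather than computed inside the consuming pass.

```python
# Hypothetical shader dag: each node maps to the list of its operand nodes.
dag = {
    "out": ["mul"],
    "mul": ["add", "tex0"],
    "add": ["tex0", "const"],
    "tex0": [],
    "const": [],
}

def passes_from_splits(dag, output, splits):
    """Return {pass root: set of nodes evaluated in that pass}.

    One pass is rooted at the final output and one at each split node.
    A pass stops descending when it reaches a split node, because that
    value is assumed to be read back from a texture.
    """
    passes = {}
    for root in [output, *sorted(splits)]:
        nodes, stack = set(), [root]
        while stack:
            node = stack.pop()
            if node in nodes:
                continue
            nodes.add(node)
            for child in dag[node]:
                if child not in splits:
                    stack.append(child)
        passes[root] = nodes
    return passes

result = passes_from_splits(dag, "out", {"add"})
subset_count = 2 ** len(dag)  # every node is either a split or not
```

With `add` marked as a split, the sketch yields two passes: one rooted at `add` and one rooted at `out` that no longer evaluates `add` or `const` itself.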
RDS - Overview
Combination of approaches to limit search space
Save/recompute decisions primary performance tradeoff
Dominator tree used to avoid save/recompute tradeoffs
RDS – Save / Recompute
M – multiply referenced node
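The save/recompute tradeoff for a multiply referenced node can be sketched with a toy cost model. Everything here is an assumption made for illustration: the cost constants, the function names, and the linear model itself are not the cost function used by RDS.

```python
def save_cost(references, pass_overhead=10.0, write_cost=1.0, read_cost=1.0):
    """Assumed cost of saving M: one extra pass, one texture write,
    and one texture read per reference."""
    return pass_overhead + write_cost + read_cost * references

def recompute_cost(subtree_instructions, references):
    """Assumed cost of recomputing M: its subtree is evaluated once per
    reference, so all evaluations after the first are extra work."""
    return subtree_instructions * (references - 1)

def decide(subtree_instructions, references):
    """Pick the cheaper alternative under the toy model above."""
    save = save_cost(references)
    recompute = recompute_cost(subtree_instructions, references)
    return "save" if save < recompute else "recompute"

small = decide(subtree_instructions=3, references=2)
large = decide(subtree_instructions=40, references=2)
```

Under these assumed constants a small subtree is cheaper to recompute, while a large one is cheaper to save to a texture.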
Dominator
B dom G: all paths to G go through B
Dominator Tree
Key Insight
if B, G in same pass and B dom G, then no save/recompute costs for G
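One simple way to test the dominator relation is to block B and check whether G is still reachable from the shader output. The sketch below assumes paths start at the output node and follow operand edges; the dag and function names are hypothetical, and a real implementation would more likely build the full dominator tree once instead of testing pairs.

```python
def reachable(dag, start, target, blocked=frozenset()):
    """Depth-first search from start to target that never enters blocked nodes."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node in blocked or node in seen:
            continue
        if node == target:
            return True
        seen.add(node)
        stack.extend(dag[node])
    return False

def dominates(dag, root, b, g):
    """True if every path from root to g passes through b."""
    if b == g:
        return True
    return not reachable(dag, root, g, blocked={b})

# Hypothetical dag: G is referenced twice, but only from below B.
dag = {
    "out": ["B"],
    "B": ["L", "R"],
    "L": ["G"],
    "R": ["G"],
    "G": [],
}

b_dom_g = dominates(dag, "out", "B", "G")
l_dom_g = dominates(dag, "out", "L", "G")
```

Here B dominates G, so keeping both in one pass means G is computed once and used twice with no texture traffic; L does not dominate G because the path through R bypasses it.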
MRDS – Multiple-Output Shaders
MRDS – Multiple-Output Hardware
float4 x, y;
...
for( i=0; i<N; i++ ){
    x' = x*x - y*y;
    y' = 2*x*y;
    x = x'; y = y';
}
...
MRDS – Multiple-Output Hardware
float4 x, y;
...
for( i=0; i<N; i++ ){
    x' = f( x, y );
    y' = g( x, y );
    x = x'; y = y';
}
...
MRDS – Multiple-Output Hardware
State cannot fit in single output
float4 x, y;
...
for( i=0; i<N; i++ ){
    x' = f( x, y );
    y' = g( x, y );
    x = x'; y = y';
}
...
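To make the pass-count consequence concrete, the sketch below simulates the loop above under two assumed hardware models: one where each pass can write a single value, and one where a pass can write both. Scalars stand in for the float4 state, the update rule is the x*x - y*y / 2*x*y example from the earlier slide, and the function names are hypothetical.

```python
def iterate_single_output(x, y, n):
    """Each pass writes one value, so every iteration takes two passes."""
    passes = 0
    for _ in range(n):
        new_x = x * x - y * y   # pass 1: reads x and y, writes x'
        passes += 1
        new_y = 2 * x * y       # pass 2: reads x and y again, writes y'
        passes += 1
        x, y = new_x, new_y
    return x, y, passes

def iterate_multi_output(x, y, n):
    """One pass writes both x' and y', so every iteration takes one pass."""
    passes = 0
    for _ in range(n):
        x, y = x * x - y * y, 2 * x * y
        passes += 1
    return x, y, passes

single = iterate_single_output(0.5, 0.25, 3)
multi = iterate_multi_output(0.5, 0.25, 3)
```

Both versions produce the same state, but the single-output model needs twice as many passes and reads the inputs twice per iteration. When f and g share subexpressions, the single-output version would also duplicate that shared work.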
MRDS – Dominating Sets
Dominating Set S = {A,D}
S dom G
  all paths to G go through an element of S
S, G in same pass
  avoid save/recompute for G
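The dominating-set condition can be checked the same way as a single dominator: block every element of S and see whether G is still reachable from any shader output. The dag below is a made-up stand-in for the slide's figure, which did not survive extraction, and `set_dominates` is a hypothetical helper name.

```python
def set_dominates(dag, roots, dom_set, g):
    """True if every path from any root to g passes through a node in dom_set."""
    blocked = set(dom_set)
    if g in blocked:
        return True
    stack, seen = list(roots), set()
    while stack:
        node = stack.pop()
        if node in blocked or node in seen:
            continue
        if node == g:
            return False  # found a path to g that avoids dom_set
        seen.add(node)
        stack.extend(dag[node])
    return True

# Hypothetical two-output shader: both outputs use G, via A and via D.
dag = {
    "out0": ["A"],
    "out1": ["D"],
    "A": ["B", "G"],
    "D": ["G", "E"],
    "B": [],
    "E": [],
    "G": [],
}
roots = ["out0", "out1"]

both = set_dominates(dag, roots, {"A", "D"}, "G")
only_a = set_dominates(dag, roots, {"A"}, "G")
```

In this example neither A nor D dominates G alone, but the set {A, D} does, so a pass that contains A, D, and G needs no save or recompute for G.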
MRDS – Pass Merging
Generate initial passes with RDS
Find potential merges
  check if valid
  evaluate change in cost
Execute from best to worst
  revalidate
Stop when no more beneficial merges
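The merging steps above can be sketched as a greedy loop. This is a simplified variant, not the MRDS implementation: it re-scores all pairs after every merge instead of sorting candidates once and revalidating, it ignores dependencies between passes, and the validity limits and scoring constants (`max_outputs`, `max_nodes`, `pass_overhead`) are assumed values.

```python
from itertools import combinations

def merge_gain(a, b, max_outputs, max_nodes, pass_overhead):
    """Return the assumed cost saved by merging a and b, or None if invalid."""
    nodes = a["nodes"] | b["nodes"]
    outputs = a["outputs"] | b["outputs"]
    if len(outputs) > max_outputs or len(nodes) > max_nodes:
        return None  # merged pass would violate a resource limit
    duplicated = len(a["nodes"] & b["nodes"])  # instructions no longer repeated
    return pass_overhead + duplicated

def merge_passes(passes, max_outputs=4, max_nodes=64, pass_overhead=10):
    """Greedily apply the best valid merge until none is beneficial."""
    passes = [{"nodes": set(p["nodes"]), "outputs": set(p["outputs"])}
              for p in passes]
    while True:
        best = None
        for i, j in combinations(range(len(passes)), 2):
            gain = merge_gain(passes[i], passes[j],
                              max_outputs, max_nodes, pass_overhead)
            if gain is not None and gain > 0 and (best is None or gain > best[0]):
                best = (gain, i, j)
        if best is None:
            return passes
        _, i, j = best
        merged = {"nodes": passes[i]["nodes"] | passes[j]["nodes"],
                  "outputs": passes[i]["outputs"] | passes[j]["outputs"]}
        passes = [p for k, p in enumerate(passes) if k not in (i, j)] + [merged]

# Hypothetical initial passes: A and D both recompute node g.
initial = [
    {"nodes": {"a1", "a2", "g"}, "outputs": {"A"}},
    {"nodes": {"d1", "g"}, "outputs": {"D"}},
    {"nodes": {"c1"}, "outputs": {"C"}},
]
result = merge_passes(initial, max_outputs=2)
```

With two outputs allowed per pass, the A/D merge scores highest because it removes the duplicated `g`, and the third pass is left alone because a further merge would need three outputs.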
MRDS – Pass Merging
What if RDS chose to recompute G?
Merge between passes A and D
  eliminates duplicate instructions
  gets high score
MRDS – Time Complexity
Cost of merging dominated by initial search
  iterates over s^2 pairs of splits
  each pair requires size-s set operations and 1 compiler call
  O(s^2(s+n))
s = O(n) in worst case
  MRDS = O(n^3) in worst case
  in practice we expect s << n
Assumes compiler calls are linear
  not true for fxc
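Written out, the bound above follows from substituting the worst-case number of splits. The symbol T_merge is introduced here only for illustration, and the derivation keeps the slide's assumption that a compiler call is linear in n.

```latex
\begin{align*}
T_{\text{merge}} &= O\bigl(s^2\,(s + n)\bigr)
  && \text{$s^2$ pairs, each with $O(s)$ set operations and one $O(n)$ compiler call} \\
s = O(n) \;&\Longrightarrow\; T_{\text{merge}} = O\bigl(n^2\,(n + n)\bigr) = O(n^3)
  && \text{worst case} \\
s \ll n \;&\Longrightarrow\; T_{\text{merge}} \approx O(s^2\, n)
  && \text{expected case}
\end{align*}
```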
MRDS'
RDS uses linear search for save/recompute
  evaluates cost of both alternatives with RDSh
  RDS = O(n * RDSh) = O(n^3)
MRDS merges after RDS has made these decisions
  MRDS = O(RDS + n^3) = O(n^3)
MRDS' merges during cost evaluation
  adds linear factor in worst case
  MRDS' = O(n * (RDSh + n^3)) = O(n^4)
Results
3 Brook Programs
  Procedural Fire
  Mandelbrot Fractal
  Matrix Multiply
Compiled for ATI Radeon 9800 XT with
  RDS
  MRDS
  MRDS'
Results – Procedural Fire
MRDS' better than MRDS and RDS
  better save/recompute decisions
  results in less bandwidth used
[Figure: bar chart of time (ns) for RDS, MRDS, MRDS'; axis from 0 to 3500]
Results – Compile Times
[Figure: bar chart of compile times for Fire, Fractal, and Matrix under RDS, MRDS, MRDS'; axis from 0 to 3.5]
Results – Mandelbrot Fractal
MRDS', MRDS better than RDS
  iterative computation – state in 2 variables
  RDS duplicates computation
[Figure: bar chart of time (ns) for RDS, MRDS, MRDS'; axis from 0 to 140]
Results – Matrix Multiply
Matrix-matrix multiply benefits from blocking
  blocking cuts computation by a factor of ~2
Blocking requires multiple outputs
  performance limited by MRT performance
[Figure: bar chart of time (ns) for RDS, MRDS, MRDS'; axis from 0 to 400]
Summary
Modified RDS algorithm, MRDS
  supports multiple-output shaders
  generates code for multiple render targets
  easy to implement, same running time
  generates better-performing partitions
Future Work
Implementations
  Ashli
  combine with Mio
Exploit new hardware
  data-dependent flow control
  large numbers of outputs
Acknowledgements
Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot
  RDS implementation, design discussions
Kayvon Fatahalian, Ian Buck
  GPUBench results
ATI
  hardware
DARPA, ATI, IBM, NVIDIA, SONY
  funding