Using FPGAs to Supplement Ray-Tracing Computations on the Cray XD-1
Charles B. Cameron
United States Naval Academy
Department of Electrical Engineering
105 Maryland Avenue, Stop 14B
Annapolis, Maryland 21402-5025
Research supported by:
• NASA Goddard Space Flight Center (Code 586)
• NRL Applied Optics Branch (Code 5630)
• DoD High Performance Computing Modernization Program at NRL (Code 5593)
• United States Naval Academy
• Xilinx, Inc.
Topics
• Ray tracing
• Conventional parallel processing
• Modulo scheduling
• Coordination of sequential and parallel processing
• Expected Performance
Ray tracing
• MODIS (Moderate-resolution Imaging Spectroradiometer)
• The Intersection Problem
• Finding the Perpendicular
• Refraction
• Reflection
MODIS Optical System (Moderate-resolution Imaging Spectroradiometer)
• 485 pinholes
• 400 rays per pinhole
• 241 × 121 rays reflected from the diffuser
• 5.66 × 10^9 rays
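The total quoted on this slide is consistent with multiplying the individual counts. A quick check, assuming (the slide does not say so explicitly) that each of the 400 rays per pinhole fans out into a 241 × 121 grid of rays at the diffuser:

```python
# Ray count for the MODIS model, using the figures on the slide.
# Assumption: the counts multiply, i.e. every pinhole ray fans out
# into a 241 x 121 grid of rays reflected from the diffuser.
pinholes = 485
rays_per_pinhole = 400
diffuser_rays = 241 * 121

total_rays = pinholes * rays_per_pinhole * diffuser_rays
print(f"{total_rays:,} rays, about {total_rays:.3g}")
```

The product rounds to the 5.66 × 10^9 rays used in the later performance figures.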
Ray Directed to a Surface
• MODIS (Moderate-resolution Imaging Spectroradiometer)
• The Intersection Problem
• Finding the Perpendicular
• Refraction
• Reflection
• Coordinate Transformation
Calculate the Intercept Point
Find the Normal
Find the Refracted Ray
Find the Reflected Ray
Coordinate Transformation
(Hard to visualize this!)
Topics
• Ray tracing
• Conventional parallel processing
• Modulo scheduling
• Coordination of sequential and parallel processing
• Expected Performance
Parallelism
Performance (5.66 × 10^9 rays)

             DEC Alpha 3000 Series Model 800     Cray XD-1
Processor    200 MHz                             839 AMD Opteron 275 processors, 2.2 GHz
Duration     1.2 × 10^6 s (two weeks)            27 s
Rate         0.112 × 10^6 rays · surfaces / s    6.6 × 10^6 rays · surfaces / (s · processor) *

Reduction in time consumed: 99.998 %
Improvement in ray-tracing rate: 5,857 %
* Rate based on a linear regression of results obtained using varying numbers of processors.
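The duration figures can be cross-checked with a few lines of arithmetic. This is only a sanity check on the rounded numbers shown on the slide, not a reconstruction of how the author derived them:

```python
# Durations from the slide.
alpha_seconds = 1.2e6   # DEC Alpha 3000 Series Model 800
cray_seconds = 27.0     # Cray XD-1 with 839 Opteron processors

time_reduction_pct = (1 - cray_seconds / alpha_seconds) * 100
speedup = alpha_seconds / cray_seconds
alpha_days = alpha_seconds / 86_400   # seconds per day

print(f"time reduction: {time_reduction_pct:.3f} %")
print(f"wall-clock speedup: {speedup:,.0f}x")
print(f"Alpha run time: {alpha_days:.1f} days")
```

The Alpha run time works out to just under 14 days, matching the "two weeks" annotation, and the time reduction matches the quoted 99.998 %.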
Performance (5.66 × 10^9 rays)
[Figure: bar chart comparing the DEC Alpha 3000 Series Model 800 with the Opteron alone; vertical axis from 0.00 to 7.00]
Efficiency
Topics
• Ray tracing
• Conventional parallel processing
• Modulo scheduling
• Coordination of sequential and parallel processing
• Expected Performance
Operations Required as a Function of Surface, Aperture, and Interaction Types
[Figure: bar chart of the number of operations (axis from 0 to 60) for cases 1 to 12, annotated "Lots of these" and "Not too many of these"]

Surface     Circular aperture                Rectangular aperture
Plane       1. Refraction, 7. Reflection     4. Refraction, 10. Reflection
Sphere      2. Refraction, 8. Reflection     5. Refraction, 11. Reflection
Conicoid    3. Refraction, 9. Reflection     6. Refraction, 12. Reflection
Quadratic Equation
$\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$
[Figure: data-flow graph of the quadratic formula, with nodes labelled by latency: 6 cycles (multiply), 11 cycles (add), 27 cycles (divide, square root)]
Critical path (data-flow limit): 88 cycles

Latency
Unit                     # of cycles
Adder                    11
Multiplier               6
Divider                  27
Square root extractor    27
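The 88-cycle data-flow limit can be reproduced from the latency table by finding the longest path through the computation. The sketch below uses a hypothetical data-flow graph for one root of the quadratic; the node names and the choice to perform the multiplications by 4 and 2 on the multiplier are assumptions, not taken from the original figure:

```python
from functools import lru_cache

# Unit latencies in cycles, from the latency table.
LATENCY = {"mul": 6, "add": 11, "div": 27, "sqrt": 27}

# Hypothetical data-flow graph for one root of a*x^2 + b*x + c = 0.
# Each node maps to (unit type, predecessor nodes). The inputs a, b, c
# and the constants are available at cycle 0 and are not listed.
GRAPH = {
    "b*b":     ("mul", ()),
    "a*c":     ("mul", ()),
    "4*(a*c)": ("mul", ("a*c",)),
    "2*a":     ("mul", ()),
    "disc":    ("add", ("b*b", "4*(a*c)")),   # b^2 - 4ac
    "root":    ("sqrt", ("disc",)),
    "numer":   ("add", ("root",)),            # -b + sqrt(disc)
    "x":       ("div", ("numer", "2*a")),
}

@lru_cache(maxsize=None)
def finish_time(node):
    """Earliest cycle at which `node` is complete (longest path to it)."""
    unit, preds = GRAPH[node]
    start = max((finish_time(p) for p in preds), default=0)
    return start + LATENCY[unit]

critical_path = finish_time("x")
print(critical_path)
```

Under these assumptions the longest chain is multiply, multiply, add, square root, add, divide: 6 + 6 + 11 + 27 + 11 + 27 = 88 cycles.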
Modulo Scheduling: One Multiplier
Equal to the Data-Flow Limit
One collective computation
Modulo Scheduling: Filling the Pipeline
[Figure: pipeline schedule plotted against cycle number, from 0c to 90c in steps of 10c]
Multipliers are 100 % utilized
No schedule conflicts
Modulo Scheduling: Two Multipliers
Two multipliers with two multiplications each
Two cycles
One adder with two additions
Maximum efficiency
Improved efficiency: up from 25 %
Less than the data-flow limit, but double the throughput.
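The effect of adding a second multiplier can be illustrated with the resource-constrained lower bound on the initiation interval used in modulo scheduling. This is a minimal sketch under two assumptions that are not stated on the slides: each evaluation needs 4 multiplications, 2 additions, 1 division and 1 square root, and every unit is fully pipelined (it can accept a new operation each cycle):

```python
from math import ceil

# Assumed operation counts per quadratic-formula evaluation:
# b*b, a*c, 4*(a*c), 2*a; one subtraction and one addition.
OPS = {"mul": 4, "add": 2, "div": 1, "sqrt": 1}

def initiation_interval(units):
    """Resource-constrained lower bound on cycles between new inputs."""
    return max(ceil(OPS[kind] / units[kind]) for kind in OPS)

def utilization(units):
    """Fraction of issue slots each unit type uses at that interval."""
    ii = initiation_interval(units)
    return {kind: OPS[kind] / (units[kind] * ii) for kind in OPS}

one_mul = {"mul": 1, "add": 1, "div": 1, "sqrt": 1}
two_mul = {"mul": 2, "add": 1, "div": 1, "sqrt": 1}

for name, units in (("one multiplier", one_mul), ("two multipliers", two_mul)):
    print(name, initiation_interval(units), utilization(units))
```

With one multiplier the interval is 4 cycles and the divider is busy 25 % of the time. With two multipliers the interval drops to 2 cycles, the adder is fully used, and throughput doubles, which is consistent with the annotations on these slides.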
Topics
• Ray tracing
• Conventional parallel processing
• Modulo scheduling
• Coordination of sequential and parallel processing
• Expected Performance
Cray XD-1
• MPI (Message Passing Interface)
• Master node
  • Reads file
  • Distributes file
  • Collates results
• 220 nodes
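The master-node pattern on this slide (read, distribute, collate) can be sketched without MPI. In the example below, threads stand in for MPI ranks and `trace_chunk` is a hypothetical placeholder for the real ray tracer, so this shows only the shape of the coordination, not the author's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def trace_chunk(ray_ids):
    """Hypothetical stand-in for tracing a batch of rays; returns the count."""
    return len(ray_ids)

def run_master(num_rays, num_workers):
    rays = list(range(num_rays))                                   # "reads file"
    chunks = [rays[i::num_workers] for i in range(num_workers)]    # "distributes"
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(trace_chunk, chunks))
    return sum(partial_counts)                                     # "collates results"

total = run_master(num_rays=1_000, num_workers=4)
print(total)
```

Because the rays are independent, the work divides evenly with no communication between workers until the results are collated.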
One Node of the Cray XD-1
• OpenMP (Open Multi-Processing)
• 144 of 220 nodes have a Xilinx Virtex II Pro FPGA
• Opteron processors
  • Sequential program
  • Depth first
• FPGA
  • Pipelined hardware
  • Breadth first
[Figure: one node with AMD Opteron processors 0 to 3 and an FPGA; four RT threads and one FPGA thread]
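The depth-first and breadth-first orderings named on this slide visit the same ray-surface interactions in a different order. A small sketch, in which `apply_surface` is a hypothetical placeholder for one interaction, shows that both orderings give the same results:

```python
def apply_surface(ray, surface):
    """Hypothetical placeholder for one ray-surface interaction."""
    return ray * surface + 1.0

def trace_depth_first(rays, surfaces):
    # Sequential style: follow one ray through every surface, then the next ray.
    results = []
    for ray in rays:
        for surface in surfaces:
            ray = apply_surface(ray, surface)
        results.append(ray)
    return results

def trace_breadth_first(rays, surfaces):
    # Pipelined style: push every ray through one surface before moving on,
    # so consecutive operations are independent of each other.
    current = list(rays)
    for surface in surfaces:
        current = [apply_surface(ray, surface) for ray in current]
    return current

rays = [0.0, 1.0, 2.0]
surfaces = [2.0, 3.0, 0.5]
assert trace_depth_first(rays, surfaces) == trace_breadth_first(rays, surfaces)
```

The breadth-first order suits a deep hardware pipeline because each operation in flight belongs to a different ray, so no operation has to wait for the result of the one issued just before it.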
Topics
• Ray tracing
• Conventional parallel processing
• Modulo scheduling
• Coordination of sequential and parallel processing
• Expected Performance
Performance
Rates in 10^6 rays · surfaces / (s · processor)

Configuration        Rate   Basis      Change in speed   Floating-point units                                      Share of FPGA
Opteron alone         6.6   measured
FPGA alone            5.4   estimated  −20 %
Opteron with FPGA    12.0   estimated  +80 %             1 adder, 1 multiplier, 1 divider, 1 square-root unit      11 %
Opteron with FPGA    25.2   estimated  +285 %            3 adders, 4 multipliers, 1 divider, 1 square-root unit    25 %
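The percentage changes can be recomputed from the rounded rates on the slide. The results land within a few points of the quoted figures; the small differences presumably come from rounding in the published rates, which is an assumption here:

```python
# Rates from the slide, in 10^6 rays * surfaces / (s * processor).
baseline = 6.6  # Opteron alone (measured)
estimates = {
    "FPGA alone": 5.4,
    "Opteron with FPGA (11 % of FPGA)": 12.0,
    "Opteron with FPGA (25 % of FPGA)": 25.2,
}

def percent_change(rate, reference=baseline):
    """Change relative to the Opteron-alone rate, in percent."""
    return (rate / reference - 1.0) * 100.0

for label, rate in estimates.items():
    print(f"{label}: {percent_change(rate):+.0f} %")
```

Note also that 12.0 is the sum of the Opteron-alone and FPGA-alone rates, which suggests the first combined estimate assumes both run concurrently at full speed.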
Performance
[Figure: bar chart of ray-tracing rate, vertical axis from 0.00 to 30.00, with bars for Opteron alone (measured), FPGA alone (estimate, Note 1), Opteron with FPGA (estimate, Note 1), and Opteron with FPGA (estimate, Note 2)]
Note 1: 1 adder, 1 multiplier, 1 divider, 1 square-root taker
Note 2: 3 adders, 4 multipliers, 1 divider, 1 square-root taker
Summary
• Modulo scheduling produces 100 % efficiency of critical resources.
• Sequential processors get a boost from supplemental FPGA processing.
• Deep pipelines are efficient only if filled much of the time.
• FPGAs beat ASICs only if they can take advantage of special problem knowledge.
• Opteron uses 55 W.
• Virtex II Pro FPGA uses 4 W to 45 W.
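The point about deep pipelines can be made concrete with a simple model: the first result appears after the full pipeline latency, and each further result follows one initiation interval later. The sketch below uses the 88-cycle quadratic-equation latency quoted earlier; the 4-cycle interval is an assumed value for illustration:

```python
def pipeline_cycles(num_items, latency, interval):
    """Cycles to finish num_items when a new one starts every `interval` cycles."""
    return latency + (num_items - 1) * interval

def efficiency(num_items, latency, interval):
    """Useful issue cycles divided by total elapsed cycles."""
    return num_items * interval / pipeline_cycles(num_items, latency, interval)

# latency=88 is the critical path quoted earlier; interval=4 is assumed.
for n in (1, 10, 1_000, 1_000_000):
    print(n, round(efficiency(n, latency=88, interval=4), 4))
```

With a single item the pipeline is almost entirely idle, while with millions of items the fill and drain time becomes negligible.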
Equations
• Intersection of a Ray with a Plane
• Intersection of a Ray with a Sphere
• Intersection of a Ray with a Conicoid
• Finding the Perpendicular
• Interaction of a Ray with an Optical Surface
• Coordinate Transformations
Intersection of a Ray with a Plane
List of equations
Initial direction
Normal to the plane
Point in the plane
Initial point
Final point
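The equations for this slide did not survive in the transcript. As an illustration only, the standard ray-plane intersection can be written in terms of the quantities labelled above. The function and parameter names are hypothetical, and this is the textbook formula rather than necessarily the author's formulation:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def intersect_plane(p0, d, q, n, eps=1e-12):
    """Final point where the ray p0 + t*d meets the plane through q with normal n.

    p0: initial point, d: initial direction, q: point in the plane,
    n: normal to the plane. Returns None if the ray is parallel to the
    plane or the intersection lies behind the initial point.
    """
    denom = dot(n, d)
    if abs(denom) < eps:
        return None
    t = dot(n, [qi - pi for qi, pi in zip(q, p0)]) / denom
    if t < 0:
        return None
    return [pi + t * di for pi, di in zip(p0, d)]

final_point = intersect_plane(p0=[0, 0, 0], d=[0, 0, 1], q=[1, 2, 5], n=[0, 0, 1])
print(final_point)
```

This case needs one division and no square root, which is why the plane is the cheapest of the three surface types.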
Intersection of a Ray with a Sphere
List of equations
Initial point
Final point
Initial direction
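For a sphere, substituting the ray into the sphere equation gives a quadratic in the ray parameter, which is where the quadratic-equation hardware comes in. A minimal sketch of the textbook method follows; the explicit `center` and `radius` parameters are assumptions, since the transcript does not show how the author parameterizes the sphere:

```python
from math import sqrt

def intersect_sphere(p0, d, center, radius):
    """Nearest forward intersection of the ray p0 + t*d with a sphere.

    Returns the final point, or None if the ray misses. The direction d
    need not be normalized.
    """
    oc = [pi - ci for pi, ci in zip(p0, center)]
    a = sum(di * di for di in d)
    b = 2.0 * sum(di * oi for di, oi in zip(d, oc))
    c = sum(oi * oi for oi in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None                      # no real root: the ray misses
    root = sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0:
            return [pi + t * di for pi, di in zip(p0, d)]
    return None                          # both hits lie behind the initial point

hit = intersect_sphere(p0=[0, 0, -5], d=[0, 0, 1], center=[0, 0, 0], radius=1.0)
print(hit)
```

Which of the two roots is physically correct depends on the optical layout; taking the nearest forward root is a simplification.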
Intersection of a Ray with a Conicoid
List of equations
Initial point
Final point
Initial direction
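The conicoid case also reduces to a quadratic. The sketch below assumes the common optics convention of a surface of revolution with its vertex at the origin, curvature c and conic constant k, satisfying c(x² + y² + (1 + k)z²) − 2z = 0. That convention and the function name are assumptions, since the slide's own equations are not in the transcript:

```python
from math import sqrt

def intersect_conicoid(p0, d, curvature, conic, eps=1e-12):
    """Smallest non-negative t where p0 + t*d meets the surface
    c*(x^2 + y^2 + (1 + k)*z^2) - 2*z = 0, or None on a miss."""
    (x0, y0, z0), (dx, dy, dz) = p0, d
    c, k1 = curvature, 1.0 + conic
    a = c * (dx * dx + dy * dy + k1 * dz * dz)
    b = 2.0 * c * (x0 * dx + y0 * dy + k1 * z0 * dz) - 2.0 * dz
    cc = c * (x0 * x0 + y0 * y0 + k1 * z0 * z0) - 2.0 * z0
    if abs(a) < eps:                     # equation degenerates to linear in t
        if abs(b) < eps:
            return None
        roots = [-cc / b]
    else:
        disc = b * b - 4.0 * a * cc
        if disc < 0:
            return None
        roots = sorted([(-b - sqrt(disc)) / (2.0 * a),
                        (-b + sqrt(disc)) / (2.0 * a)])
    hits = [t for t in roots if t >= 0]
    return hits[0] if hits else None

# Sphere of radius 1 (c = 1, k = 0) with its vertex at the origin.
t_sphere = intersect_conicoid((0, 0, -1), (0, 0, 1), curvature=1.0, conic=0.0)
# Paraboloid z = r^2 / 4 (c = 0.5, k = -1), ray parallel to the axis at x = 2.
t_parab = intersect_conicoid((2, 0, -3), (0, 0, 1), curvature=0.5, conic=-1.0)
print(t_sphere, t_parab)
```

Setting k = 0 recovers a sphere and k = −1 a paraboloid, so one routine covers the whole family.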
Finding the Perpendicular
Unit Vector Normal to a Sphere
Unit Vector Normal to a Conicoid
List of equations
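As an illustration of the two cases named on this slide, a unit normal can be obtained by normalizing the gradient of the surface equation. The sphere is described here by a hypothetical `center` parameter, and the conicoid uses the assumed vertex-at-origin form c(x² + y² + (1 + k)z²) − 2z = 0; neither convention is confirmed by the transcript:

```python
from math import sqrt

def normalize(v):
    length = sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def sphere_normal(point, center):
    """Unit normal at `point` on a sphere, pointing away from the center."""
    return normalize([p - c for p, c in zip(point, center)])

def conicoid_normal(point, curvature, conic):
    """Unit normal from the gradient of c*(x^2 + y^2 + (1 + k)*z^2) - 2*z."""
    x, y, z = point
    c = curvature
    return normalize([2 * c * x, 2 * c * y, 2 * c * (1 + conic) * z - 2])

print(sphere_normal([0, 0, 2], [0, 0, 0]))
print(conicoid_normal((0, 0, 0), curvature=1.0, conic=0.0))
```

The sign of the normal is a convention; refraction and reflection routines must agree on whether it points toward or away from the incoming ray.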
Interaction of a Ray with an Optical Surface
Refraction Reflection
List of equations
Initial index of refraction
Final index of refraction
Normal to the plane
Initial direction
Final direction
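The quantities listed above are those of the vector forms of the law of reflection and Snell's law. A minimal sketch of the textbook formulas follows, assuming unit vectors and a normal oriented against the incoming ray; the function names are hypothetical:

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reflect(d, n):
    """Final direction after mirroring the unit direction d about the unit normal n."""
    k = 2.0 * dot(d, n)
    return [di - k * ni for di, ni in zip(d, n)]

def refract(d, n, n1, n2):
    """Final direction from Snell's law in vector form.

    d: unit initial direction, n: unit normal opposing d (dot(d, n) < 0),
    n1: initial index of refraction, n2: final index of refraction.
    Returns None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -dot(d, n)
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None
    cos_t = sqrt(1.0 - sin2_t)
    return [eta * di + (eta * cos_i - cos_t) * ni for di, ni in zip(d, n)]

incoming = [0.5, 0.0, sqrt(3.0) / 2.0]        # 30 degrees from the normal
print(refract(incoming, [0.0, 0.0, -1.0], n1=1.0, n2=1.5))
print(reflect(incoming, [0.0, 0.0, -1.0]))
```

Refraction needs a square root and reflection does not, which is one reason the operation counts differ between the two interaction types.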
Coordinate Transformations
Rotation and Translation
Rotation
List of equations
Translation Vector
Rotation Matrix
Direction in Frame of Reference k
Direction in Frame of Reference k+1
Position in Frame of Reference k
Position in Frame of Reference k+1
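The slide distinguishes positions, which need rotation and translation, from directions, which need rotation only. A small sketch of that distinction follows. The order of operations (rotate, then add the translation vector) is an assumed convention, and the helper names are hypothetical:

```python
from math import cos, sin, pi

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def transform_position(rotation, translation, position_k):
    """Position in frame k+1 (assumed convention: rotate, then translate)."""
    return [a + b for a, b in zip(mat_vec(rotation, position_k), translation)]

def transform_direction(rotation, direction_k):
    """Direction in frame k+1: directions are free vectors, so rotation only."""
    return mat_vec(rotation, direction_k)

def rotation_about_z(angle):
    c, s = cos(angle), sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

r = rotation_about_z(pi / 2)
print(transform_position(r, [0.0, 0.0, 5.0], [1.0, 0.0, 0.0]))
print(transform_direction(r, [1.0, 0.0, 0.0]))
```

Each transformation costs nine multiplications and six additions for the rotation, plus three more additions for the translation of a position.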