a bit about me: rendering systems
TRANSCRIPT
![Page 1: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/1.jpg)
A Bit about Me: Rendering Systems
![Page 2: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/2.jpg)
A Bit about Me: Rendering Systems
Hardware Pixar Image Computer & CHAP REYES Machine & FLAP
Software RenderMan Real-Time Shading Language Spark
![Page 3: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/3.jpg)
A Bit about Me: & Beyond
Brook: Stream computing on graphics processors
Larrabee: An x86 architecture for visual computing
![Page 4: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/4.jpg)
A Graphics Perspective on Co-Design
Pat Hanrahan
DOE Stanford PSAAP Center Stanford Pervasive Parallelism Laboratory
(Supported by Sun/Oracle, AMD, NVIDIA, Intel, NEC)
Salishan Conference on High Speed Computing April 25, 2011
![Page 5: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/5.jpg)
Application
![Page 6: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/6.jpg)
The Road to Point Reyes Lucasfilm
![Page 7: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/7.jpg)
R.E.Y.E.S. = Renders Everything You Ever Saw
Simulating the everyday world
![Page 8: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/8.jpg)
Hardware
![Page 9: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/9.jpg)
REYES Machine Goals (1986)
Pixels 3000 x 1667 (5 MP) Depth complexity 4 Pixel area of a micropolygon 0.25 Number of micropolygons 80,000,000 FLOPs per micropolygon (minimum) 300 Total calculation 24 GFs 24 frames per second .576 TFs
Goal ~ 1 frame in 2 minutes – real-time was inconceivable
![Page 10: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/10.jpg)
CPUs Waste Resources
19%/year 30:1
1,000:1
30,000:1
Graph courtesy of Bill Dally
The Capability Gap
![Page 11: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/11.jpg)
GPUs Use Many Forms of Parallelism
Multi-Core
Multi-Thread
SIMD Vector
![Page 12: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/12.jpg)
“Extreme” Graphics Chip
16 cores x 32 SIMD functional units x 2 flops/cycle x 1 GHz = 1 TFLOP
![Page 13: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/13.jpg)
Texture mapping must run at 100% efficiency
Challenging Short inner loop (lots of branches) Random memory access (texture map) Very little temporal locality
Application-Hardware Co-Design
DCL t0.xy # Interpolate t0.xy DCL v0.xyzw # Interpolate v0.xyzw DCL_2D s0 # Declaration – no code TEX1D r0, t0, s0 # TEXTURE LOAD! MUL r1, r0, v0 # Multiply MOV oC0, r1 # Store to framebuffer
![Page 14: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/14.jpg)
GPU Multi-threading: Hide Latency
frag4 frag3 frag2 frag1 Run until stall at
texture fetch
Fermi: 48 threads x 16 cores x 32 SIMD ALUs = 24,576 tasks
![Page 15: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/15.jpg)
NVIDIA Historicals Year Product Tri rate CAGR Tex rate CAGR 1998 Riva ZX 3m - 100m - 1999 Riva TNT2 9m 3.0 350m 3.5 2000 GeForce2 GTS 25m 2.8 664m 1.9 2001 GeForce3 30m 1.2 800m 1.2 2002 GeForce Ti 4600 60m 2.0 1200m 1.5 2003 GeForce FX 167m 2.8 2000m 1.7 2004 GeForce 6800 Ultra 170m 1.0 6800m 2.7 2005 GeForce 7800 GTX 940m 3.9 10300m 2.0 2006 GeForce 7900 GTX 1400m 1.5 15600m 1.4 2007 GeForce 8800 GTX 1800m 1.3 36800m 2.3 2008 GeForce GTX 280 48160m 1.3 2010 GeForce GTX 480 42000m 0.9 2011 GeForce GTX 580 49400m 1.2
1.7 1.7
![Page 16: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/16.jpg)
SGI Historicals
Year Product Fragment Rate Triangle Rate 1984 Iris 2000 100K - 0.8K -
1988 GTX 40M 4.5 135K 3.6
1992 RE 380M 1.8 2M 2.0
1996 IR 1000M 1.3 12M 1.6
2.2 2.2
Performance of Z-buffered rendering
![Page 17: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/17.jpg)
GPUs 10x More Efficient
# CPU cores 2 out of order 10 in-order Instructions per issue 4 per clock 2 per clock VPU lanes per core 4-wide SSE 16-wide L2 cache size 4 MB 4 MB Single-stream 4 per clock 2 per clock Vector throughput 8 per clock 160 per clock
20 times greater throughput for same area and power ½ the sequential performance
Larrabee: A many-core x86 architecture for visual computing, D. Carmean, E. Sprangle, T.
Forsythe, M. Abrash, L. Seiler, A. Lake, P. Dubey, S. Junkins, J. Sugerman, P. Hanrahan,
SIGGRAPH 2008 (IEEE Micro 2009, Top Pick)
![Page 18: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/18.jpg)
Software
![Page 19: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/19.jpg)
Software is Inefficient
A C program – base line A ruby/php program – 100x slower A well-written C program – 10x faster A crazy assembly language program – 2x-5x faster yet
![Page 20: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/20.jpg)
Big Challenge
Graphics hardware specialization(s) Multiple implementations with different characteristics Software needs to be optimized for each platform
The resulting software is not portable costly to develop
![Page 21: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/21.jpg)
Heterogeneous Platforms
LANL IBM Roadrunner (Opteron + Cell)
Tianhe-1A (Xeon + Tesla M2050 + NUND 160GBps)
ORNL Titan
![Page 22: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/22.jpg)
Even Bigger Challenges Ahead
Specialization leads to hybrid or heterogeneous systems Heterogeneity leads to combinatorial complexity Complexity makes it even harder to develop software
![Page 23: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/23.jpg)
How Do We Handle Heterogeneity?
![Page 24: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/24.jpg)
Program at a Higher-Level!
![Page 25: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/25.jpg)
Graphics Libraries are High-Level
glPerspective(45.0); for( … ) { glTranslate(1.0,2.0,3.0); glBegin(GL_TRIANGLES); glVertex(…); glVertex(…); … glEnd(); } glSwapBuffers();
![Page 26: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/26.jpg)
OpenGL “Grammar”
<Scene> = <BeginFrame> <Camera> <World> <EndFrame>
<Camera> = glMatrixMode(GL_PROJECTION) <View> <View> = glPerspective | glOrtho
<World> = <Objects>* <Object> = <Transforms>* <Geometry> <Transforms> = glTranslatef | glRotatef | … <Geometry> = glBegin <Vertices> glEnd <Vertices> = [glColor] [glNormal] glVertex
![Page 27: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/27.jpg)
Advantages
Portability Runs on wide range of GPUs
![Page 28: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/28.jpg)
Advantages
Portability Performance
Carefully designed to map efficiently to hardware “Driver-Compiler” uses domain knowledge Vertices/Fragments are independent Textures are read-only; texture filtering hw Efficient framebuffer scatter-ops …
![Page 29: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/29.jpg)
Advantages
Portability Allows hardware innovation
Performance
![Page 30: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/30.jpg)
Advantages
Portability Performance Productivity
Graphics libraries are easy to learn and use
![Page 31: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/31.jpg)
Advantages
Portability Performance Productivity
Having your cake and eating it too!
![Page 32: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/32.jpg)
Can We Apply this Idea to
Scientific Computing?
![Page 33: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/33.jpg)
Liszt Z. DeVito, N. Joubert, M. Medina,
M. Barrientos, E. Elsen, S. Oakley,
J. Alonso, E. Darve, F. Ham, P. Hanrahan
“…the most technically advanced and perhaps greatest pianist of all time… made playing complex pieces on the piano seem effortless…”
![Page 34: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/34.jpg)
Liszt: Solving PDEs on Meshes
val pos = new Field[Vertex,double3]val A = new SparseMatrix[Vertex,Vertex]
for( c <- cells(mesh) ) { val center = avg(pos(c.vertices)) for( f <- faces(c) ) { val face_dx = avg(pos(f.vertices)) – center for ( e <- f edgesCCW c ) { val v0 = e.tail val v1 = e.head val v0_dx = pos(v0) – center val v1_dx = pos(v1) – center val face_normal = v0_dx cross v1_dx // calculate flux for face … A(v0,v1) += … A(v1,v0) -= …
![Page 35: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/35.jpg)
Challenges in Compiling GP Language
Compiler needs to reason about Parallelism Locality Synchronization
Fundamentally, analyzing dependencies is hard 1. Analyzing functions: A[i] = B[pow(2,i) / mod(i,4) + f(i)] 2. Analyzing pointers: A[i] = *ptrA
![Page 36: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/36.jpg)
Liszt Enables Dependency Analysis
Mesh neighborhood accessed through built-in functions Pattern of access defines stencil Stencil shape is fixed and can be determined by
static analysis
Fields accessed consistently during loops Field accesses are organized into “phases” Within a forall, either read-only, write-only or
reduce-only access pattern
![Page 37: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/37.jpg)
Domain Decomposition / Ghost Cells Given a program and a mesh, Liszt automatically creates a graph of mesh adjacencies needed to run the algorithm
Graph is handed to ParMETIS to determine optimal partition
Communication of information in ghost cells is automatically handed
Node 0
Owned Cells
![Page 38: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/38.jpg)
Scalable to Large Clusters
4-socket 6-core 2.66Ghz Xeon CPU per node (24 cores), 16GB RAM per node. 256 nodes, 8 cores per node
32
128
256
512
1024
32 128 256 512 1024
Spee
dup
Cores
Euler23M cell mesh
LisztC++
32
128
256
512
1024
32 128 256 512 1024Cores
Navier-Stokes21M cell mesh
LisztC++
![Page 39: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/39.jpg)
Runs Very Fast on GPUs Tesla C2050 vs. 1 core Nehalem E5520 (2.26 Ghz) Double Precision
0
5
10
15
20
25
30
35
40
Euler NS FEM SW
Spee
dup
over
Sca
lar (
x)
Applications
GPU Performancecuda on tesla c2050
![Page 40: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/40.jpg)
And Even SMPs
0
5
10
15
20
25
30
35
40
Euler NS FEM SW
Spee
dup
over
Sca
lar (
x)
Applications
Comparison between Liszt runtimespthreads on 8-core
mpi on 8-corepthreads on 32-core
mpi on 32-corecuda on tesla c2050
![Page 41: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/41.jpg)
Performance
Productivity Completeness
The Ideal Parallel Programming Language
From Workshop on Concurrency for Application Programmers
![Page 42: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/42.jpg)
Successful Languages
Performance
Productivity Completeness
![Page 43: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/43.jpg)
Successful Languages
Performance
Productivity Completeness
?
![Page 44: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/44.jpg)
Additional Possibility
Performance
Productivity Completeness
Domain Specific
Frameworks &
Languages
![Page 45: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/45.jpg)
Wrap Up
![Page 46: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/46.jpg)
Summary
Graphics systems require advanced simulation Not having enough cycles forced us to be efficient Both performance and portability are important Leads to rapid evolution of innovative hardware
![Page 47: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/47.jpg)
High-Level Abstractions
Applications are written using High-level frameworks: game engines Domain-specific languages: shading languages
Advantage of high level approach … makes programmers productive … allows efficient automatic parallelization
![Page 48: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/48.jpg)
Careful Co-Design
This strategy works because of careful co-design of Applications (features) Algorithms Software Hardware
![Page 49: A Bit about Me: Rendering Systems](https://reader031.vdocument.in/reader031/viewer/2022012103/61dbc4734cc96718786e5bb6/html5/thumbnails/49.jpg)
Thank you