Getting the Most Out of Intel® Graphics
Ray Paik & Katen Shah
February 21, 2008

Agenda
• Graphics Market Trends
• Intel® Integrated Graphics Roadmap
• Intel GenX Architecture Overview
• GenX Features & Tips for Developers
• Gaming Performance Demo
• Developer Resources
• Summary

Integrated Graphics Market Will Continue to Grow
• Integrated graphics will continue to account for a significant volume
• This is especially true in the mobile segment
• Intel continued to have the largest market segment share in integrated graphics in 2007
[Chart: graphics volume by year, 2006 to 2012, for Desktop Integrated, Desktop Discrete, Mobile Integrated and Mobile Discrete; vertical axis 0 to 250,000. Source: Mercury Research (Q4'07)]

Client Graphics Roadmap
2007
• Desktop: Intel® G35 (GMA X3500): DirectX* 9/10†, OpenGL* 1.5/2.0††, Shader Model 3 (DX9) / 4 (DX10)
• Desktop: Intel® G33 (GMA 3100): DirectX* 9, OpenGL* 1.4 + Ext, Shader Model 2 (in SW)
• Mobile: Intel® GM965 (GMA X3100): DirectX* 9/10†, OpenGL* 1.5/2.0††, Shader Model 3 (DX9) / 4 (DX10)
2008
• Desktop: Eaglelake: DirectX* 10, OpenGL 2.0, Shader Model 4
• Mobile: Cantiga: DirectX* 10, OpenGL 2.0, Shader Model 4
2009+
• Innovation continues: leading process technology, integration
† DX10 driver expected 1H'08
†† OGL 2.0 driver expected 2H'08

What Games Play on Intel® Integrated Graphics?

Intel® Graphics Architecture
• Intel Integrated Graphics is architected to support Direct3D* 10
  • Delivers consistent features (no cap bits) and generalized unified SM
• Executes vertex, geometry and pixel shaders on the array of execution units (EUs)
  • EUs are multi-threaded for covering latency
  • Support 128-bit execution per clock
[Block diagram: Cmd Streamer, VF, VS, GS, Clip, Setup, Rast/Z and Thread Dispatch feeding an array of unified execution units (Row 0 to Row M, each with EU0, EU1 ... EUn), together with I$ Cache, Sampler, Texture Cache, Pixel Ops, Render Cache, Memory/Cache, Video Processing, 2D and Display blocks. Legend: Memory, Commands, Internal buses]

Thoughts Before We Begin
• Generally speaking, most of the guidelines, tips and recommendations that follow apply to most graphics devices
• The exception is that they are typically even more important for integrated and volume graphics devices
• Most of the discussion that follows is centered on Intel® Integrated Graphics

Core/Memory Capabilities
• Integrated Graphics will continue to utilize UMA, sharing the memory bandwidth between all the agents
• Dynamically allocated video memory (DVMT) is dependent upon the total system memory in the platform as well as the O/S. However, we continue to increase the max limit

Product                                 | Intel® G35      | Intel® GM965    | Eaglelake             | Cantiga
Year                                    | 2007            | 2007            | 2008                  | 2008
Gfx Arch                                | Gen 4.0         | Gen 4.0         | Gen 5.0               | Gen 5.0
Memory BW (GBps)                        | 10.7 to 12.8    | 8.5 to 10.7     | 12.8 to 23.1          | 10.7 to 17.1
UMA Capability                          | 2x DDR3-667/800 | 2x DDR2-533/667 | 2x DDR3-800/1066/1333 | 2x DDR3-667/800/1067
Max DVMT (XP), 1 or 2GB system memory   | 384MB           | 384MB           | >512MB                | >512MB
Max DVMT (Vista), 1GB/2GB system memory | 256MB/384MB     | 256MB/384MB     | 256MB/>512MB          | 256MB/>512MB
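
Most of the Memory BW figures can be cross-checked against the UMA Capability row. The sketch below assumes that "2x" means two 64-bit (8-byte) memory channels and that the DDR suffix is the transfer rate in MT/s; neither is stated on the slide, and the function name is made up for illustration.

```python
def peak_bandwidth_gbps(transfers_mt_s, channels=2, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s.

    Assumes `channels` independent channels, each moving `bus_bytes`
    bytes per transfer at `transfers_mt_s` million transfers per second.
    """
    return channels * bus_bytes * transfers_mt_s / 1000.0


if __name__ == "__main__":
    for speed in (533, 667, 800, 1067):
        print(f"2x DDR-{speed}: {peak_bandwidth_gbps(speed):.1f} GB/s")
```

Since this bandwidth is shared with the CPU and other agents under UMA, the figure is an upper bound on what the graphics core can actually use.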

Vertex/Primitive Processing Capabilities
• Support for HWVP or SWVP is provided
  • HWVP is enabled for all titles by default
  • SWVP may offer performance enhancements on Intel® Core™2 Duo/Core™2 Quad CPUs
  • For SWVP, VS/GS/Clip stages behave as pass-through
    • Reallocate the compute resources back for pixel processing for overall performance gain
• Driver will always export full HWVP support
  • SWVP may be used based on configuration and workload
  • SWVP has optimizations beyond the current PSGP including support for evolving CPU instructions
• Peak vertex throughput through the fixed function pipe is defined by the cull rate
  • Gen5 has ~2x the vertex processing throughput over Gen4 HWVP
• Peak Early-Z reject rate is at 4 pixels/clk

Tips on Vertex/Primitive Processing
• Vertex Processing
  • Use DrawIndexedPrimitive() to maximize reuse of vertex cache
  • Vertex Cache Size will increase over time. Use VCACHE_QUERY() for size
  • Use strips over lists when possible
  • Avoid fans (deprecated on DX10)
• Utilize visibility tests to reject objects that fall outside the view frustum to reduce the impact on clipping
  • Use D3DRS_CLIPPING == FALSE for objects that don't need clipping
• Ensure adequate Batching to amortize runtime and driver overhead
  • Bigger is better; >200-1K recommended
  • Minimize state changes between batches. Reduces h/w pipeline flushes
• Render with Z-only pass followed by a normal render pass and/or in a rough front-to-back order
  • Utilizes the higher performance of Early-Z to reject occluded fragments, reducing computes and raster ops
  • Balance this against the cost of an additional pass and more render state changes, or worse batching due to sorting
  • Avoid usage of modified Z value (oDepth) in the pixel shader
• Occlusion Query can be used to reduce overdraw for complex scenes
  • OQ can be used to check the visibility of an object by rendering the bounding box. If it returns zero, the object does not need to be rendered
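
The visibility-test and front-to-back tips above can be sketched on the CPU side as follows. This is a minimal illustration, not code from the presentation: it assumes each object has a bounding sphere, that the frustum is given as inward-facing planes (nx, ny, nz, d) with unit normals, and all names are hypothetical.

```python
import math


def sphere_in_frustum(center, radius, planes):
    """True unless the sphere lies entirely behind one of the planes.

    Each plane is (nx, ny, nz, d) with a unit normal pointing into the
    frustum, so the signed distance of point p is dot(n, p) + d.
    """
    for nx, ny, nz, d in planes:
        dist = nx * center[0] + ny * center[1] + nz * center[2] + d
        if dist < -radius:
            return False
    return True


def visible_front_to_back(objects, planes, eye):
    """Cull objects outside the frustum, then sort nearest-first.

    `objects` is a list of (name, center, radius) tuples.
    """
    visible = [o for o in objects if sphere_in_frustum(o[1], o[2], planes)]
    return sorted(visible, key=lambda o: math.dist(o[1], eye))


if __name__ == "__main__":
    # Simplified box-shaped view volume: 1 <= z <= 100, -10 <= x <= 10.
    planes = [(0, 0, 1, -1), (0, 0, -1, 100), (1, 0, 0, 10), (-1, 0, 0, 10)]
    scene = [
        ("far_rock", (0, 0, 50), 1.0),
        ("near_crate", (0, 0, 5), 1.0),
        ("off_screen", (50, 0, 5), 1.0),
        ("behind_camera", (0, 0, -5), 1.0),
    ]
    for name, _, _ in visible_front_to_back(scene, planes, eye=(0, 0, 0)):
        print(name)
```

A strict distance sort can fragment batches, which is the trade-off the slide warns about; sorting coarse groups of objects rather than individual draws is one way to keep the order only roughly front-to-back.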

Shader Capabilities
• Gen5 significantly improves compute capability over Gen4
  • Significant improvement of Transcendental instructions performance
  • Increased support of latency coverage at higher frequency
  • Support of shaders with longer instruction lengths

Product                   | Intel® G35, Intel® GM965 (2007)        | Eaglelake, Cantiga (2008)
Gfx Arch                  | Gen 4.0                                | Gen 5.0
Shader Model Profile      | vs_3_0, ps_3_0; vs_4_0, ps_4_0, gs_4_0 | vs_4_0, ps_4_0, gs_4_0

The remaining rows apply to both generations:
Max # of Insts            | SM3.0 = 512; SM4.0 = Unlimited
Max # of Constants        | SM3.0 = 256; SM4.0 = 4Kx16
Max # of Temp Registers   | Temp storage per shader execution instance is 4096 elements, which can be used in any combination of registers/arrays, i.e., the total number of r# and x# declared must be <= 4096
Precision                 | 32-bit floating point
Vertex Texture/Instancing | SM3.0 / 4.0
Flow Control              | Static and Dynamic

Tips on Shader Capabilities
• Utilize the higher Shader Models wherever possible
  • There is no performance gain using SM older than 2.0 (Vista* requirement)
  • Use programmable shaders over fixed function (FF) as much as possible
  • Utilize shader based fog (SM3.0 deprecates fixed function fog)
• Shaders are currently compiled at Draw time and reused
  • Mid-scene compile time is impacted by the length of the shader and the number of unique state/shader combinations
• Strike a balance between texture samples and complexity of the shader
  • General trend is increased ALU to Sample ratio
    • >4:1 is recommended and provides better latency coverage
    • A larger ratio may be better for floating point textures, higher order filtering and 3D textures
• Supports unlimited shader lengths via cache structure, but limited registers cause spills and fills which are very expensive
  • There is a limited number of registers per Execution Unit per thread. Impacts EU efficiency
• Reduce use of macro/transcendental functions where possible, especially for Gen4
  • LOG, LIT, ARL, POW, EXP, etc. are particularly expensive
• Mask alpha when not needed
• Use full precision for non-transcendental instructions
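
The ALU-to-sample guideline and the list of expensive instructions above lend themselves to a quick offline check of shader assembly. The following is a crude sketch under stated assumptions: it treats every instruction that is not a texture load, declaration or flow-control keyword as one ALU op, the "expensive" set is simply the list from the slide, and the function name and sample shader are hypothetical.

```python
SAMPLE_OPS = {"texld", "texldl", "texldb", "texldp", "texldd"}
EXPENSIVE_OPS = {"log", "lit", "arl", "pow", "exp"}  # list taken from the slide
SKIPPED = {"ps", "vs", "dcl", "def", "defi", "defb", "if", "else", "endif"}


def analyze_shader(asm_text):
    """Return (alu_count, sample_count, alu_to_sample_ratio, expensive_ops)."""
    alu = samples = 0
    expensive = []
    for raw in asm_text.splitlines():
        line = raw.split("//")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # "mul_sat" -> "mul", "dcl_2d" -> "dcl", "ps_3_0" -> "ps"
        base = line.split()[0].lower().split("_")[0]
        if base in SKIPPED:
            continue
        if base in SAMPLE_OPS:
            samples += 1
        else:
            alu += 1
            if base in EXPENSIVE_OPS:
                expensive.append(base)
    ratio = alu / samples if samples else float("inf")
    return alu, samples, ratio, expensive


EXAMPLE = """
ps_3_0
dcl_texcoord0 v0.xy
dcl_2d s0
texld r0, v0, s0
mul r1, r0, c0
mad r1, r1, c1, c2
pow r2.x, r1.x, c3.x   // flagged as expensive
dp3 r3.x, r1, c4
add r1, r1, r3.x
mov oC0, r1
"""

if __name__ == "__main__":
    alu, samples, ratio, expensive = analyze_shader(EXAMPLE)
    print(f"ALU={alu} samples={samples} ratio={ratio:.1f}:1 expensive={expensive}")
```

Instruction counts in the source assembly only approximate what the driver's compiler emits for the hardware, so a check like this is best used to spot outliers rather than to tune exact ratios.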

Tips on Shader Capability (cont'd)
• SM4.0 requires support of Vertex Texture (optional in SM3.0)
  • Unified Shader uses the same sampler for pixels and vertices
  • Sampler supports all the filtering types
• Instancing is supported for SM3.0 and SM4.0
  • Enables better vertex throughput by minimizing vertex data
[Chart: HW Instancing vs No Instancing (DIP); FPS (0 to 70) against Batch Size (1 to 3000) for the HW Instancing and No Instancing series]

Tips on Shader Capability (cont'd)
• Smart usage of flow control
  • Gen4 implements both static and dynamic flow control and provides low penalty for early outs
  • Dynamic flow control can provide significant benefits by skipping a large number of computations
    • Ensure it is used where large portions of the shader can be skipped
  • To maximize performance, the pixel shader executes on 16 pixels in parallel
    • The benefit can be significant or small depending on how likely the pixels are to take the same branch
  • Predication is preferred over dynamic flow control, especially for shorter branching instruction sequences
• Constants
  • Try to use under 32 for highest performance
  • Limit usage of indexed constants c[ax]
    • Incurs high latency in shaders (typically used mainly for Vertex shaders)
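
The point about 16 pixels executing in parallel can be made concrete with a small probability model. This is a simplified sketch, not a description of the hardware: it assumes a group only saves work when all 16 of its pixels take the early out, and that each pixel qualifies independently with the same probability. Real scenes are spatially coherent, so this is a pessimistic bound; the function names and cost units are made up.

```python
def all_skip_probability(p_skip, group_size=16):
    """Chance that every pixel in a group takes the early-out branch."""
    return p_skip ** group_size


def expected_group_cost(p_skip, cost_full, cost_skip, group_size=16):
    """Expected cost of one group under the independence assumption.

    The group pays `cost_skip` only if all pixels skip; otherwise the
    whole group pays `cost_full`.
    """
    p_all = all_skip_probability(p_skip, group_size)
    return p_all * cost_skip + (1.0 - p_all) * cost_full


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        print(f"p_skip={p}: all-skip chance {all_skip_probability(p):.4f}, "
              f"expected cost {expected_group_cost(p, 100.0, 10.0):.1f}")
```

Under these assumptions, even when 90% of pixels could skip, fewer than one group in five skips as a whole, which is why the slide recommends dynamic branches only where large, spatially coherent regions take the same path.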

Texture Sampler/Pixel Operations
• Supports all sampler filtering types including dynamic anisotropic filtering
• Gen5 significantly improves the 32 bpp Fixed trilinear filtering and 16bpp Float bilinear performance

Capabilities (same for Gen4: Intel® G35, Intel® GM965, 2007 and Gen5: Eaglelake, Cantiga, 2008)
Format Support           | 16/32-bit fixed point; 16/32-bit floating point
Max # of Samples         | Up to 16
Vertex Textures          | Yes
Max 2D/3D/Cube Textures  | 8Kx8K/2Kx2K/8K
Filtering Type Support   | BLF, TLF and Dynamic AF w. up to 16 sub-samples
Texture Compression      | DX9: DXT1/3/5; DX10: BCx
Non Power of 2 Textures  | Yes
Render to Texture        | Yes, incl. off-screen surface support
Multi-Sample Render      | Single sample only
Multi-Target Render      | Max = 8
Alpha-Blend FP formats   | Both FP16/FP32 formats are supported

Relative sampler rates
Texture format                      | Filter      | Gen4 (G35, GM965) | Gen5 (Eaglelake, Cantiga)
4 Ch 32bit Fixed / 2 Ch 16bit Float | Point       | 1X                | 1X
                                    | Bilinear    | 1X                | 1X
                                    | Trilinear   | 1X                | 2X
                                    | Anisotropic | 1X/n              | 1X/n
4 Ch 16bit Float                    | Point       | 0.5X              | 1X
                                    | Bilinear    | 0.5X              | 1X
                                    | Trilinear   | 0.25X             | 0.5X
                                    | Anisotropic | 0.5X/n            | 0.5X/n
4 Ch 32bit Float / 1 Ch 32bit Float | Point       | 0.25X             | 0.25X

Tips on Texture Sampler/Pixel Ops
• Avoid 32-bit Float wherever possible
  • FP32 filtering is optional on DX9* and DX10, and is not supported on Gen4/5. Check the caps
• Keep MRTs to <4
  • Can provide performance upside in some cases vs. 2 passes
  • Use the same format if possible
  • Keep size to under 128x128 if possible
• Minimize number of Clear()s
  • Clear() Color and Z/Stencil buffer at the same time (when both are required)
• Minimize Lock/Blit of Z and/or Stencil Buffer
• Textures
  • Ideally provide this as a scalability option
  • Use compressed textures and mip-maps whenever possible
  • Minimize use of large textures even though GenX supports up to 8Kx8K
  • Filtering type has generally shown a low impact on performance but should be used judiciously
• Stencil Shadows are generally fill intensive
  • Shadow Map is preferred for performance
• Dynamic memory allocation
  • Allocate surfaces in priority order: render surfaces used most frequently should be allocated first
  • Minimize Lock()
  • D3DPOOL_DEFAULT for lockable memory
    • Dynamic Vertex/Index buffers
  • D3DPOOL_MANAGED for non-lockable memory
    • Textures, Back buffer chain, Vertex/Index buffer
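
The advice on compressed textures and large textures is easy to quantify. The sketch below computes the memory footprint of a full mip chain, using the standard block sizes for the DXT formats (DXT1 stores each 4x4 block in 8 bytes, DXT3 and DXT5 in 16 bytes) and 4 bytes per texel for 32-bit RGBA. The format labels and function name are illustrative, and driver-specific padding or alignment is ignored.

```python
BLOCK_BYTES = {"DXT1": 8, "DXT3": 16, "DXT5": 16}  # bytes per 4x4 texel block


def mip_chain_bytes(width, height, fmt):
    """Total bytes for a texture and all of its mip levels down to 1x1.

    `fmt` is "RGBA8" (uncompressed, 4 bytes per texel) or a key of
    BLOCK_BYTES. Alignment and padding are not modeled.
    """
    total = 0
    w, h = width, height
    while True:
        if fmt == "RGBA8":
            total += w * h * 4
        else:
            blocks = ((w + 3) // 4) * ((h + 3) // 4)  # partial blocks round up
            total += blocks * BLOCK_BYTES[fmt]
        if w == 1 and h == 1:
            break
        w, h = max(1, w // 2), max(1, h // 2)
    return total


if __name__ == "__main__":
    for fmt in ("RGBA8", "DXT5", "DXT1"):
        size = mip_chain_bytes(1024, 1024, fmt)
        print(f"1024x1024 {fmt} with mips: {size / (1024 * 1024):.2f} MiB")
```

For scale, the top level alone of an uncompressed 8Kx8K 32-bit texture is 8192 x 8192 x 4 = 268,435,456 bytes (256 MiB), a large share of the DVMT limits listed on the Core/Memory Capabilities slide.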

Direct3D* 10
• Intel® Integrated Graphics has been architected with DX10* in mind
• DX10 design goals were to make significant optimizations over DX9
  • Reduced runtime overhead, ensure feature consistency, optimize state management and add new features that increase GPU usage
  • However, there are optimizations such as constant reorganization or using constants as immediates that we can do in DX9 that cannot be done in DX10
• DX10 Driver status
  • DX10 drivers for Eaglelake and Cantiga will be available at launch
  • DX10 drivers will also support Gen4 products such as Intel® G35 Express and Intel GM965 Express chipsets

Direct3D* 10
• DX10 still has a few optional features
  • >1 Sample MSAA: Gen4/5 support single sampled rendering
  • 32-bit FP Filtering: Gen4/5 DOES NOT support 32b FP Point Filtering
  • RGB32 Rendertarget: Not supported
  • 16-bit UNORM Blending: Not supported in Gen4. Supported in Gen5 onwards
  • Check format support using ID3D10Device::CheckFormatSupport
• Constant Management
  • Group constant buffers based on frequency of updates
  • Ensure constant buffers are sized according to usage
• Minimize Usage of Geometry Shader and StreamOut
  • Performance not yet characterized
    • Likely to be a limiter: should be avoided
  • DrawAuto() causes synchronization in current devices and should be avoided
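
The constant-management tip can be illustrated with a simple upload-cost model. The sketch assumes that a constant buffer is re-uploaded in full whenever any of its members changes, so a buffer's update rate is that of its most frequently changing member; the constant names, sizes and draw count are invented for the example.

```python
def updates_per_frame(frequency, draws_per_frame):
    """How often a constant with this update frequency changes per frame."""
    return {"once": 0, "frame": 1, "draw": draws_per_frame}[frequency]


def upload_bytes_per_frame(buffers, draws_per_frame):
    """Bytes uploaded per frame for a given constant buffer layout.

    `buffers` is a list of buffers; each buffer is a list of
    (name, size_bytes, frequency) members. A buffer is assumed to be
    uploaded whole at the rate of its most frequently updated member.
    """
    total = 0
    for members in buffers:
        size = sum(size_bytes for _, size_bytes, _ in members)
        rate = max(updates_per_frame(freq, draws_per_frame)
                   for _, _, freq in members)
        total += size * rate
    return total


# Hypothetical constants for illustration only.
STATIC = [("fog_table", 256, "once")]
PER_FRAME = [("view_proj", 64, "frame"), ("lights", 128, "frame")]
PER_DRAW = [("world", 64, "draw"), ("material", 32, "draw")]

if __name__ == "__main__":
    monolithic = [STATIC + PER_FRAME + PER_DRAW]
    grouped = [STATIC, PER_FRAME, PER_DRAW]
    print("one big buffer:", upload_bytes_per_frame(monolithic, 500), "bytes/frame")
    print("grouped by rate:", upload_bytes_per_frame(grouped, 500), "bytes/frame")
```

In this made-up case grouping by update frequency cuts the per-frame upload from 272,000 to 48,192 bytes; sizing each buffer to what is actually used shrinks the per-draw term further.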

Direct3D* 10
• Some games enable heavier effects on the DX10 render path; these games recommend falling back to DX9 on less capable devices
  • In general, DX10 can provide a higher performance path based on the inherent design goals
  • Intel's focus for optimizations moving forward will be to concentrate on DX10
  • Additionally, there are features that can improve performance in the DX10 path
    • For example, in DX9, FP16 is used for HDR effects. With DX10, an R11G11B10 format is available which can be used as a destination format for HDR rendering; for integrated graphics this can be a significant bandwidth saving
• Prefer scaling to be API independent
[Table: "Game Scaling" matrix of High Detail, Standard Detail and Low Detail against DX8, DX9 and DX10, with Observation and Recommendation markings; cell contents not recoverable]
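
The bandwidth saving from R11G11B10 mentioned above follows directly from the pixel sizes: a four-channel FP16 target is 64 bits (8 bytes) per pixel, while R11G11B10 packs into 32 bits (4 bytes). The sketch below estimates render-target write traffic; the resolution, frame rate and overdraw factor are arbitrary example inputs, and reads, blending and texture fetches from the target are not modeled.

```python
BYTES_PER_PIXEL = {
    "R16G16B16A16_FLOAT": 8,  # four FP16 channels
    "R11G11B10_FLOAT": 4,     # 11 + 11 + 10 bits packed into 32
}


def render_target_write_mb_per_s(width, height, fps, fmt, overdraw=1.0):
    """Approximate write traffic to one render target in MB/s (10^6 bytes)."""
    bytes_per_frame = width * height * BYTES_PER_PIXEL[fmt] * overdraw
    return bytes_per_frame * fps / 1e6


if __name__ == "__main__":
    for fmt in BYTES_PER_PIXEL:
        rate = render_target_write_mb_per_s(1280, 720, 60, fmt)
        print(f"{fmt}: {rate:.1f} MB/s")
```

The traffic halves regardless of resolution, at the cost of dropping the alpha channel and some precision, so the format suits HDR color targets that do not need destination alpha.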

Summary
• Integrated graphics market will continue to grow, especially for mobile platforms
• Aggressive improvement planned for integrated graphics
  • Consistent Feature Set with DX10
  • Intel will push more optimizations and improvements in DX10 driver vs. DX9
• Summary of tips & tricks
  • Most of the tips and tricks for integrated graphics devices are applicable to other graphics adapters
  • In general, the need for optimization is even more important for integrated devices
• Resources are available for developers
  • Plans to improve visibility into performance of Intel® Integrated Graphics

Take Advantage of Graphics Developer Resources from Intel
• Latest Intel graphics drivers
  http://www.intel.com/support/graphics/
• Graphics Developer Community
  http://www.intel.com/software/graphics
• Quick Reference Guide to Intel® Integrated Graphics
  http://softwarecommunity.intel.com/articles/eng/1488.htm
• Intel® GMA 3000 and X3000 Developer's Guide
  http://softwarecommunity.intel.com/articles/eng/1487.htm
• Tools and Tips for Debugging Issues on Intel® 3000 and X3000 Series Integrated Graphics
  http://softwarecommunity.intel.com/articles/eng/1489.htm

Other Intel Sessions at GDC
www.intel.com/software/graphics

Wednesday (February 20th)
09:00am - Optimizing DirectX* Rendering on Multi-Core Hardware
10:30am - Gaming on the Go
12:00pm - COLLADA in the Game
02:30pm - Interactive Ray Tracing in Games
04:00pm - Speed Up Synchronization Locks

Thursday (February 21st)
09:00am - The Future of Programming for Multi-Core with the Intel Compilers
10:30am - Getting the Most Out of Intel Graphics
12:00pm - Comparative Analysis of Game Parallelization
02:30pm - Threading Quake 4* and Quake Wars*