Getting the Most Out of Intel® Graphics

Ray Paik & Katen Shah

February 21, 2008

Agenda

• Graphics Market Trends
• Intel® Integrated Graphics Roadmap
• Intel GenX Architecture Overview
• GenX Features & Tips for Developers
• Gaming Performance Demo
• Developer Resources
• Summary

Integrated Graphics Market will Continue to Grow

Integrated graphics will continue to account for a significant volume

This is especially true in the mobile segment

Intel continued to have the largest market segment share in integrated graphics in 2007

Source: Mercury Research (Q4’07)

[Chart: graphics volume by segment, 2006 to 2012; y-axis from 0 to 250,000 (units not stated); series: Desktop Integrated, Desktop Discrete, Mobile Integrated, Mobile Discrete]

Client Graphics Roadmap

2007
• Desktop: Intel® G35 (GMA X3500): DirectX* 9/10†, OpenGL* 1.5/2.0††, Shader Model 3 (DX9)/4 (DX10)
• Desktop: Intel® G33 (GMA 3100): DirectX* 9, OpenGL* 1.4 + Ext, Shader Model 2 (in SW)
• Mobile: Intel® GM965 (GMA X3100): DirectX* 9/10†, OpenGL* 1.5/2.0††, Shader Model 3 (DX9)/4 (DX10)

2008
• Desktop: Eaglelake: DirectX* 10, OpenGL 2.0, Shader Model 4
• Mobile: Cantiga: DirectX* 10, OpenGL 2.0, Shader Model 4

2009+
• Innovation continues: leading process technology, integration

† DX10 driver expected 1H’08
†† OGL 2.0 driver expected 2H’08

What Games Play on Intel® Integrated Graphics?

Intel® Graphics Architecture

• Intel Integrated Graphics is architected to support Direct3D* 10
  • Delivers consistent features (no cap bits) and a generalized unified shader model (SM)
• Executes vertex, geometry and pixel shaders on the array of execution units (EUs)
  • EUs are multi-threaded to cover latency
  • Support 128-bit execution per clock

[Block diagram: Cmd Streamer, VF, VS, GS, Clip, Setup, Rast / Z, Thread Dispatch, Array of Unified Execution Units (Row 0 to Row M, each with EU0, EU1 ... EUn), I$ Cache, Sampler, Texture Cache, Pixel Ops, Render Cache, Memory / Cache, Video Processing, 2D, Display. Legend: Memory, Commands, Internal buses]

Thoughts Before We Begin

• Generally speaking, most of the guidelines, tips and recommendations that follow are similar to, and applicable to, most graphics devices
  • The exception is that they are typically even more important for integrated and volume graphics devices
• Most of the discussion that follows is centered around Intel® Integrated Graphics

Core/Memory Capabilities

• Integrated Graphics will continue to utilize the UMA, sharing the memory bandwidth between all the agents
• Dynamically allocated video memory (DVMT) is dependent upon the total system memory in the platform as well as the O/S. However, we continue to increase the max limit

                                            2007                                2008
Product                                     Intel® G35        Intel® GM965      Eaglelake               Cantiga
Gfx Arch                                    Gen 4.0           Gen 4.0           Gen 5.0                 Gen 5.0
Memory BW (GBps)                            10.7 – 12.8       8.5 – 10.7        12.8 – 23.1             10.7 – 17.1
UMA Capability                              2x DDR3-667/800   2x DDR2-533/667   2x DDR3-800/1066/1333   2x DDR3-667/800/1067
Max DVMT (XP), 1 or 2GB system memory       384MB             384MB             >512MB                  >512MB
Max DVMT (Vista), 1GB / 2GB system memory   256MB/384MB       256MB/384MB       256MB/>512MB            256MB/>512MB
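The bandwidth ranges in the table above are consistent with a simple peak-bandwidth estimate for dual-channel memory. A minimal sketch in Python, assuming that the "2x" prefix means two channels and that each channel has a 64-bit (8-byte) data bus; neither assumption is stated on the slide, and the function name is illustrative only:

```python
def peak_bandwidth_gbps(transfers_mt_s, channels=2, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes each channel moves `bus_bytes` per transfer. A 64-bit DDR
    channel is assumed here; the slides do not state the bus width.
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9


# Dual-channel DDR2-533 and DDR2-667 (the Intel GM965 column of the table).
for speed in (533, 667):
    print(f"2x DDR2-{speed}: {peak_bandwidth_gbps(speed):.1f} GB/s")
```

Because UMA shares this bandwidth between all the agents, the result is an upper bound on what the graphics core can actually use.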

Vertex/Primitive Processing Capabilities

• Support for hardware vertex processing (HWVP) or software vertex processing (SWVP) is provided
  • HWVP is enabled for all titles by default
  • SWVP may offer performance enhancements on Intel® Core™2 Duo/Core™2 Quad CPUs
• For SWVP, the VS/GS/Clip stages behave as pass-through
  • This reallocates the compute resources back to pixel processing for an overall performance gain
• Driver will always export full HWVP support
  • SWVP may be used based on configuration and workload
  • SWVP has optimizations beyond the current PSGP, including support for evolving CPU instructions
• Peak vertex throughput through the fixed-function pipe is defined by the cull rate
  • Gen5 has ~2x the vertex processing throughput of Gen4 HWVP
• Peak Early-Z reject rate is 4 pixels/clk

Tips on Vertex/Primitive Processing

Vertex Processing
• Use DrawIndexedPrimitive() to maximize reuse of the vertex cache
  • Vertex cache size will increase over time; use VCACHE_QUERY() for the size
• Use strips over lists when possible
• Avoid fans (deprecated on DX10)
• Use visibility tests to reject objects that fall outside the view frustum, to reduce the impact on clipping
  • Use D3DRS_CLIPPING == FALSE for objects that don’t need clipping
• Ensure adequate batching to amortize runtime and driver overhead
  • Bigger is better: >200-1K recommended
  • Minimize state changes between batches; this reduces h/w pipeline flushes
• Render with a Z-only pass followed by a normal render pass, and/or in rough front-to-back order
  • Utilizes the higher performance of Early-Z to reject occluded fragments, reducing computes and raster ops
  • Balance this against the cost of an additional pass and more render state changes, or worse batching due to sorting
  • Avoid use of a modified Z value (oDepth) in the pixel shader
• Occlusion Query (OQ) can be used to reduce overdraw for complex scenes
  • OQ can be used to check the visibility of an object by rendering its bounding box; if the query returns zero, the object does not need to be rendered
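The advice to minimize state changes between batches can be illustrated without any graphics API. The Python sketch below models each draw call as a state key plus a mesh name and counts how many state changes a given submission order would trigger. `DrawCall`, `count_state_changes`, `sort_by_state` and the state and mesh names are all hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawCall:
    state: str   # hypothetical key for a shader/texture/render-state combination
    mesh: str    # hypothetical mesh name


def count_state_changes(draws):
    """Count how often consecutive draws need a different state."""
    changes = 0
    current = None
    for draw in draws:
        if draw.state != current:
            changes += 1
            current = draw.state
    return changes


def sort_by_state(draws):
    """Group draws that share a state. sorted() is stable, so the
    original order is kept within each state group."""
    return sorted(draws, key=lambda d: d.state)


frame = [
    DrawCall("brick", "wall_a"),
    DrawCall("metal", "door"),
    DrawCall("brick", "wall_b"),
    DrawCall("metal", "rail"),
    DrawCall("brick", "wall_c"),
]
print(count_state_changes(frame))                 # 5
print(count_state_changes(sort_by_state(frame)))  # 2
```

Sorting by state competes with the rough front-to-back ordering recommended for Early-Z, which is the trade-off the tips above ask developers to balance.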

Shader Capabilities

• Gen5 significantly improves compute capability over Gen4
  • Significant improvement in transcendental instruction performance
  • Increased support for latency coverage at higher frequency
  • Support for shaders with longer instruction lengths

Product (2007): Intel® G35, Intel® GM965; Gfx Arch Gen 4.0
Product (2008): Eaglelake, Cantiga; Gfx Arch Gen 5.0
Shader Model Profile (Gen 4.0): vs_3_0, ps_3_0; vs_4_0, ps_4_0, gs_4_0
Shader Model Profile (Gen 5.0): vs_4_0, ps_4_0, gs_4_0
Max # of Insts: SM3.0 = 512; SM4.0 = Unlimited
Max # of Constants: SM3.0 = 256; SM4.0 = 4Kx16
Max # of Temp Registers: Temp storage per shader execution instance is 4096 elements, which can be used in any combination of registers/arrays, i.e., the total number of r# and x# declared must be <= 4096
Precision: 32-bit floating point
Vertex Texture/Instancing: SM3.0 / 4.0
Flow Control: Static and Dynamic

Tips on Shader Capabilities

• Utilize the higher Shader Models wherever possible
  • There is no performance gain from using an SM older than 2.0 (Vista* requirement)
  • Use programmable shaders over fixed function (FF) as much as possible
  • Utilize shader-based fog (SM3.0 deprecates fixed-function fog)
• Shaders are currently compiled at Draw time and reused
  • Mid-scene compile time is impacted by the length of the shader and the number of unique state/shader combinations
• Strike a balance between texture samples and complexity of the shader
  • The general trend is an increased ALU-to-sample ratio
  • >4:1 is recommended and provides better latency coverage
  • A larger ratio may be better for floating-point textures, higher-order filtering and 3D textures
• Supports unlimited shader lengths via a cache structure, but limited registers cause spills and fills, which are very expensive
  • There is a limited number of registers per Execution Unit per thread; this impacts EU efficiency
• Reduce use of macro/transcendental functions where possible, especially for Gen4
  • LOG, LIT, ARL, POW, EXP, etc. are particularly expensive
• Mask alpha when not needed
• Use full precision for non-transcendental instructions
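The ALU-to-sample guideline above is easy to check once the instruction counts of a shader are known, for example from a shader compiler's instruction statistics. A minimal Python sketch, assuming the two counts are supplied by the caller; the function names are illustrative and not part of any Intel or Microsoft tool:

```python
def alu_to_sample_ratio(alu_instructions, texture_samples):
    """Return ALU instructions per texture sample (inf if no samples)."""
    if texture_samples == 0:
        return float("inf")
    return alu_instructions / texture_samples


def meets_guideline(alu_instructions, texture_samples, minimum=4.0):
    """True when the ratio exceeds the >4:1 guideline from the slide."""
    return alu_to_sample_ratio(alu_instructions, texture_samples) > minimum


print(meets_guideline(40, 8))   # True  (5:1)
print(meets_guideline(12, 6))   # False (2:1)
```

A shader that fails the check is not necessarily slow; the ratio only indicates how much arithmetic is available to hide sampler latency.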

Tips on Shader Capability (cont’d)

• SM4.0 requires support of Vertex Texture (optional in SM3.0)
  • Unified Shader uses the same sampler for pixels and vertices
  • Sampler supports all the filtering types
• Instancing is supported for SM3.0 and SM4.0
  • Enables better vertex throughput by minimizing vertex data

[Chart: HW Instancing vs No Instancing (DIP); x-axis: Batch Size (1 to 3000); y-axis: FPS (0 to 70); series: HW Instancing, No Instancing]

Tips on Shader Capability (cont’d)

• Smart usage of flow control
  • Gen4 implements both static and dynamic flow control and provides a low penalty for early outs
  • Dynamic flow control can provide significant benefits by skipping a large number of computations
    • Ensure it is used where large portions of the shader can be skipped
  • To maximize performance, the pixel shader executes on 16 pixels in parallel
    • The benefit can be significant or small depending on how likely the pixels are to take the same branch
  • Predication is preferred over dynamic flow control, especially for shorter branching instruction sequences
• Constants
  • Try to use under 32 for highest performance
  • Limit usage of indexed constants c[ax]
    • Incurs high latency in shaders (typically used mainly for vertex shaders)
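A rough way to reason about dynamic flow control on a 16-pixel group: the expensive path can only be skipped when every pixel in the group skips it. The Python sketch below assumes, purely for illustration, that each pixel skips independently with probability p. Real pixels in a group are spatially correlated, so this is a pessimistic model rather than a measurement, and the cost units are arbitrary:

```python
GROUP_SIZE = 16  # pixels processed in parallel, per the slide


def group_skip_probability(p_pixel_skips, group_size=GROUP_SIZE):
    """Probability that all pixels in a group skip the branch,
    assuming independent pixels (an illustrative simplification)."""
    return p_pixel_skips ** group_size


def expected_cost(p_pixel_skips, branch_cost, test_cost=1.0):
    """Expected per-group cost: the branch test is always paid, the
    branch body is paid unless the whole group skips it."""
    p_skip = group_skip_probability(p_pixel_skips)
    return test_cost + (1.0 - p_skip) * branch_cost


for p in (0.5, 0.9, 0.99):
    print(f"p={p}: group skips {group_skip_probability(p):.3f} of the time")
```

Under this model a branch only pays off when nearly every pixel agrees, which is consistent with the advice to use dynamic flow control where large, coherent regions can skip work and to prefer predication for short sequences.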

Texture Sampler/Pixel Operations

• Supports all sampler filtering types including dynamic anisotropic filtering
• Gen5 significantly improves the 32 bpp Fixed trilinear filtering and 16bpp Float bilinear performance

Product (2007): Intel® G35, Intel® GM965; Gfx Arch Gen4
Product (2008): Eaglelake, Cantiga; Gfx Arch Gen5
Format Support: 16/32-bit fixed point; 16/32-bit floating point ops
Max # of Samples: Up to 16
Vertex Textures: Yes
Max 2D/3D/Cube Textures: 8Kx8K/2Kx2K/8K
Filtering Type Support: BLF, TLF and Dynamic AF w. up to 16 sub-samples
Texture Compression: DX9: DXT1/3/5; DX10: BCx
Non Power of 2 Textures: Yes
Render to Texture: Yes, incl. off-screen surface support
Multi-Sample Render: Single Sample Only
Multi-Target Render: Max = 8
Alpha-Blend FP formats: Both FP16/FP32 formats are supported

                                        2007 (Intel G35, Intel GM965)   2008 (Eaglelake, Cantiga)
Gfx Arch                                Gen4                            Gen5
4 Ch 32bit Fixed / 2 Ch 16bit Float
  Point                                 1X                              1X
  Bilinear                              1X                              1X
  Trilinear                             1X                              2X
  Anisotropic                           1X/n                            1X/n
4 Ch 16bit Float
  Point                                 0.5X                            1X
  Bilinear                              0.5X                            1X
  Trilinear                             0.25X                           0.5X
  Anisotropic                           0.5X/n                          0.5X/n
4 Ch 32bit Float / 1 Ch 32bit Float
  Point                                 0.25X                           0.25X

Tips on Texture Sampler/Pixel Ops

• Avoid 32-bit Float wherever possible
  • FP32 filtering is optional on DX9* and DX10, and is not supported. Check the caps
• Keep MRTs to <4
  • Can provide performance upside in some cases vs. 2 passes
  • Use the same format if possible
  • Keep size to under 128x128 if possible
• Minimize the number of Clear()s
  • Clear() the Color and Z/Stencil buffers at the same time (when both are required)
• Minimize Lock/Blit of Z and/or Stencil Buffer
• Textures
  • Ideally provide this as a scalability option
  • Use compressed textures and mip-maps whenever possible
  • Minimize use of large textures even though GenX supports up to 8Kx8K
  • Filtering type has generally shown a low impact on performance but should be used judiciously
• Stencil shadows are generally fill intensive
  • Shadow Map is preferred for performance
• Dynamic memory allocation
  • Allocate surfaces in priority order: render surfaces used most frequently should be allocated first
  • Minimize Lock()
  • D3DPOOL_DEFAULT for lockable memory
    • Dynamic Vertex/Index buffers
  • D3DPOOL_MANAGED for non-lockable memory
    • Textures, Back buffer chain, Vertex/Index buffer
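To make the compressed-texture recommendation concrete, the Python sketch below compares the memory footprint of an uncompressed texture with a DXT1 one, including a full mip chain. It assumes 4 bytes per texel for the uncompressed case (32-bit RGBA) and the DXT1 layout of 8 bytes per 4x4 texel block; the 1024x1024 size and the function names are illustrative only:

```python
def level_bytes(width, height, compressed):
    """Bytes for one mip level.

    Uncompressed: 4 bytes per texel (32-bit RGBA assumed).
    Compressed: DXT1, 8 bytes per 4x4 texel block.
    """
    if compressed:
        blocks_x = (width + 3) // 4
        blocks_y = (height + 3) // 4
        return blocks_x * blocks_y * 8
    return width * height * 4


def texture_bytes(width, height, compressed, mipmaps=True):
    """Total bytes for a texture, optionally with a full mip chain."""
    total = 0
    while True:
        total += level_bytes(width, height, compressed)
        if not mipmaps or (width == 1 and height == 1):
            return total
        width = max(1, width // 2)
        height = max(1, height // 2)


size = 1024
print(texture_bytes(size, size, compressed=False))  # 5592404
print(texture_bytes(size, size, compressed=True))   # 699064
```

Under these assumptions DXT1 takes one eighth of the memory at the base level, while the full mip chain adds about one third on top of the base level. On a UMA platform the smaller footprint also means less shared memory bandwidth spent on texture fetches.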

Direct3D* 10

• Intel® Integrated Graphics has been architected with DX10* in mind
• DX10 design goals were to make significant optimizations over DX9
  • Reduce runtime overhead, ensure feature consistency, optimize state management and add new features that increase GPU usage
  • However, there are optimizations, such as constant reorganization or using constants as immediates, that we can do in DX9 that cannot be done in DX10
• DX10 driver status
  • DX10 drivers for Eaglelake and Cantiga will be available at launch
  • DX10 drivers will also support Gen4 products such as Intel® G35 Express and Intel GM965 Express chipsets

Direct3D* 10

• DX10 still has a few optional features
  • >1 Sample MSAA: Gen4/5 support single-sampled rendering
  • 32-bit FP Filtering: Gen4/5 DOES NOT support 32b FP Point Filtering
  • RGB32 Rendertarget: Not supported
  • 16-bit UNORM Blending: Not supported in Gen4. Supported in Gen5 onwards
  • Check format support using ID3D10Device::CheckFormatSupport
• Constant Management
  • Group constant buffers based on frequency of updates
  • Ensure constant buffers are sized according to usage
• Minimize usage of Geometry Shader and StreamOut
  • Performance not yet characterized
    • Likely to be a limiter: should be avoided
  • DrawAuto() causes synchronization in current devices and should be avoided
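The constant-management advice can be sketched with a simple cost model. The Python example below assumes that a constant buffer is re-sent in full whenever any constant in it changes; the buffer names, sizes and draw count are hypothetical and chosen only to show the effect of grouping by update frequency:

```python
def bytes_uploaded_per_frame(buffers, draws_per_frame):
    """Sum the bytes re-sent to the GPU in one frame.

    `buffers` maps a buffer name to (size_in_bytes, update_frequency),
    where update_frequency is "frame" or "draw". A buffer is assumed to
    be re-uploaded in full whenever any constant in it changes.
    """
    total = 0
    for size, frequency in buffers.values():
        updates = draws_per_frame if frequency == "draw" else 1
        total += size * updates
    return total


# One buffer holding everything is dirtied by the per-draw constants.
monolithic = {"all_constants": (1024, "draw")}

# Split by update frequency: only the small per-draw buffer is re-sent.
grouped = {
    "per_frame": (960, "frame"),   # e.g. camera and lighting constants
    "per_draw": (64, "draw"),      # e.g. one 4x4 float matrix
}

print(bytes_uploaded_per_frame(monolithic, draws_per_frame=500))  # 512000
print(bytes_uploaded_per_frame(grouped, draws_per_frame=500))     # 32960
```

The same model shows why buffers should be sized according to usage: padding a per-draw buffer multiplies the wasted bytes by the number of draws.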

Direct3D* 10

• Some games have heavier effects enabled for the DX10 render path; these games recommend falling back to DX9 for less capable devices
  • In general, DX10 can provide a higher-performance path based on its inherent design goals
  • Intel’s focus for optimizations moving forward will be to concentrate on DX10
  • Additionally, there are features that can improve performance in the DX10 path
    • For example, in DX9, FP16 is used for HDR effects. However, DX10 offers an R11G11B10 format that can be used as a destination format for HDR rendering; for integrated graphics this can be a significant bandwidth saving
• Prefer scaling to be API independent

[Diagram: Game Scaling; columns: DX8, DX9, DX10; rows: High Detail, Standard Detail, Low Detail; labels: Observation, Recommendation]
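The bandwidth saving from the R11G11B10 format mentioned on this slide can be estimated with simple arithmetic. The Python sketch below assumes the DX9 HDR target is a four-channel FP16 surface (64 bits per pixel) and compares it with the 32 bits per pixel of R11G11B10; the 1280x720 resolution is an arbitrary example, not a figure from the slides:

```python
def render_target_bytes(width, height, bits_per_pixel):
    """Bytes written when every pixel of the target is touched once."""
    return width * height * bits_per_pixel // 8


WIDTH, HEIGHT = 1280, 720   # example resolution, not from the slides

fp16_rgba = render_target_bytes(WIDTH, HEIGHT, 64)   # 4 channels x 16 bits
r11g11b10 = render_target_bytes(WIDTH, HEIGHT, 32)   # 11 + 11 + 10 bits

print(fp16_rgba)              # 7372800
print(r11g11b10)              # 3686400
print(r11g11b10 / fp16_rgba)  # 0.5
```

Every write, blend read and later texture fetch of the HDR target moves half as many bytes, at the cost of lower per-channel precision and no alpha channel.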

Summary

• Integrated graphics market will continue to grow, especially for mobile platforms
• Aggressive improvement planned for integrated graphics
  • Consistent feature set with DX10
  • Intel will push more optimizations and improvements in the DX10 driver vs. DX9
• Summary of tips & tricks
  • Most of the tips and tricks for integrated graphics devices are applicable to other graphics adapters
  • In general, the need for optimization is even more important for integrated devices
• Resources are available for developers
  • Plans to improve visibility into performance of Intel® Integrated Graphics

Take Advantage of Graphics Developer Resources from Intel

Latest Intel graphics drivers

http://www.intel.com/support/graphics/

Graphics Developer Community

http://www.intel.com/software/graphics

Quick Reference Guide to Intel® Integrated Graphics

http://softwarecommunity.intel.com/articles/eng/1488.htm

Intel® GMA 3000 and X3000 Developer’s Guide

http://softwarecommunity.intel.com/articles/eng/1487.htm

Tools and Tips for Debugging Issues on Intel® 3000 and X3000 Series Integrated Graphics

http://softwarecommunity.intel.com/articles/eng/1489.htm

Other Intel Sessions at GDC
www.intel.com/software/graphics

Wednesday (February 20th)
09:00am - Optimizing DirectX* Rendering on Multi-Core Hardware
10:30am - Gaming on the Go
12:00pm - COLLADA in the Game
02:30pm - Interactive Ray Tracing in Games
04:00pm - Speed Up Synchronization Locks

Thursday (February 21st)
09:00am - The Future of Programming for Multi-Core with the Intel Compilers
10:30am - Getting the Most Out of Intel Graphics
12:00pm - Comparative Analysis of Game Parallelization
02:30pm - Threading Quake 4* and Quake Wars*
