Game Connection, October 29-31, 2014
Next-gen Mobile GPUs and Rendering Techniques
Niklas Smedberg
Senior Engine Programmer, Epic Games
Introduction
• Niklas Smedberg, a.k.a. “Smedis”
– 17 years in the game industry
– Graphics programmer since the C64 demo scene
Content
• Next-gen Hardware
– Tile-based GPU vs direct-rendering GPU
• Next-gen Rendering Techniques
– Unreal Engine 4
– Previous goal: Bringing AAA Console graphics to mobile (DONE!)
– New goal: Bringing AAA PC graphics to mobile (DONE!??)
Next Generation Mobile Hardware
• Big leap in features and performance
• Full-featured (e.g. OpenGL ES 3.1+AEP, DirectX 10/11)
• Peak performance now comparable to consoles (Xbox 360/PS3)
– About 300+ GFLOPS and 26 GB/s
• New goal: Bring AAA PC graphics to mobile
Performance Trends (FP16 GFLOPS)
[Bar chart: peak FP16 GFLOPS by year]
• 2010, SGX 535: 6.4
• 2011, SGX 543MP2: 12.8
• 2012, SGX 543MP3: 25.5
• 2013, G6430: 154
• 2014, Adreno, K1, GX6650: 300+
iOS 8 Metal
• New rendering API for iOS
• Better match for the hardware
– Like a “game console API”
– Much more efficient on the CPU
– Exposes hardware features
• 20x faster on our rendering thread
– OpenGL ES API: 30% of renderthread
– Metal API: 1.6% of renderthread
• All power and perf responsibility is now in the hands of the developer!
Tech Demo: Zen Garden
Tile-based Mobile GPU
• Mobile GPUs are usually tile-based (next-gen too)
– Tile-based: ImgTec, Qualcomm*, ARM
– Direct: NVIDIA, Intel, Vivante
* Qualcomm Adreno can render either tile-based or direct to frame buffer
– Extension: GL_QCOM_binning_control
Tile-Based Mobile GPU
Summary:
• Split the screen into tiles
– E.g. 32x32 pixels (ImgTec) or 300x300 (Qualcomm)
• The whole tile fits within GPU, on chip
• Process all drawcalls for one tile
– Write out final tile results to RAM
• Repeat for each tile to fill the image in RAM
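To give a feel for the tile sizes quoted above, here is a minimal sketch that counts how many tiles cover a render target. The 1136x640 target is only an example resolution (the iPhone 5S figure quoted later in this deck), and `tile_count` is a hypothetical helper, not part of any GPU API.

```python
import math


def tile_count(width, height, tile_w, tile_h):
    """Number of tiles needed to cover a width x height render target."""
    return math.ceil(width / tile_w) * math.ceil(height / tile_h)


# Example target only; real drivers choose tile/bin sizes themselves.
small_tiles = tile_count(1136, 640, 32, 32)    # 32x32 tiles (ImgTec-style)
large_tiles = tile_count(1136, 640, 300, 300)  # 300x300 bins (Qualcomm-style)

print(small_tiles, large_tiles)
```

Every drawcall that touches a tile is processed once per tile, so the same frame is split into hundreds of small tiles in one case and about a dozen large bins in the other.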
ImgTec Tile-based Rendering Process
[Diagram. Blocks shown: Game, Cmd Buffer (RAM), Vertex Processing, Tile Data (RAM), Frame Buffer (RAM). Per tile: Hidden Surface Removal, Pixel Processing (top-most only), Tile Memory]
ImgTec Series 6 G6430
• One GPU core
• Four shader units (USC)
– 16-way scalar
• FP16: Two SOP per clock
– (a*b + c*d)
• FP32: Two MADD per clock
– (a*b + c)
• 154 GFLOPS @ 400 MHz
– 16-bit floating point
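The 154 GFLOPS figure can be reproduced from the numbers on this slide. The sketch below assumes a SOP (a*b + c*d) counts as 3 floating-point operations and a MADD (a*b + c) as 2; that counting convention is an assumption, chosen because it matches the quoted peak. `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(shader_units, lanes, ops_per_clock, flops_per_op, clock_mhz):
    """Theoretical peak throughput in GFLOPS."""
    return shader_units * lanes * ops_per_clock * flops_per_op * clock_mhz / 1000.0


# 4 USCs x 16 lanes, two operations per lane per clock, 400 MHz.
fp16 = peak_gflops(4, 16, 2, 3, 400)  # SOP: 2 multiplies + 1 add (assumed 3 flops)
fp32 = peak_gflops(4, 16, 2, 2, 400)  # MADD: 1 multiply + 1 add (2 flops)

print(fp16, fp32, fp16 / fp32)
```

Under the same counting, the FP16 path does 1.5 times the work of the FP32 path per clock, which lines up with a "50% faster" claim for Series 6.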
FP16 Is Faster Than FP32
• ImgTec Series 6: 50% faster
– FP16 pipeline: Two SOP per clock
– FP32 pipeline: Two MADD per clock
• ImgTec Series 6XT: 100% faster
– FP16 pipeline: Four MADD per clock
– FP32 pipeline: Two MADD per clock
• Qualcomm Snapdragon: 100% faster
ImgTec Rendering Tips
• Hidden Surface Removal
– For opaque only
– Don’t keep alpha-test enabled all the time
– Don’t keep “discard” keyword in shader source, even if it’s not used
• Group opaque drawcalls together
• Sort on state, not distance
Framebuffer Resolve/Restore
• Expensive to switch Frame Buffer Object on Tile-based GPUs
– Saves the current FBO to RAM
– Reloads the new FBO from RAM
• Best performance:
– A single rendertarget for the entire frame
– No post-processing passes
• Does not apply to NVIDIA Tegra GPUs!
– This made it simpler for us to make our “Rivalry” tech demo for K1
Framebuffer Resolve/Restore
• Clear ALL FBO attachments after new frame/rendertarget
– Clear after eglSwapBuffers / glBindFramebuffer
– Avoids reloading FBO from RAM
– NOTE: Do NOT clear unnecessarily on non-tile-based GPUs (e.g. NVIDIA)
• Discard unused attachments before new frame/rendertarget
– Discard before eglSwapBuffers / glBindFramebuffer
– Avoids saving unused FBO attachments to RAM
– glDiscardFramebufferEXT / glInvalidateFramebuffer
iOS Performance Profiling
• Screenshot from Xcode, which shows:
– How we clear FBO at the beginning of every render pass
– Other important performance info
Programmable Blending
• GL_EXT_shader_framebuffer_fetch (gl_LastFragData)
• Reads current pixel background “for free”
• Potential uses:
– Custom color blending
– Blend by background depth value (depth in alpha)
    • E.g. Soft intersection against world geometry for particles
– Tone-mapping within tile-memory
– Deferred shading without resolving G-buffer
• Stay on GPU and avoid expensive round-trip to RAM
• See also: GL_EXT_shader_pixel_local_storage
Soul: Programmable Blending
Core Optimization: Opaque Draw Ordering
• All platforms:
1. Group draws by material (shader) to reduce state changes
• Then for all platforms except ImgTec:
2. Skybox last: 5 ms/frame savings (vs drawing skybox first)
3. Sort groups nearest first: extra 3 ms/frame savings
4. Sort inside groups nearest first: extra 7 ms/frame savings
• (Timings from a UE4 map on a Qualcomm-based device)
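The four ordering steps above can be sketched as a small sort routine. This is only an illustration under stated assumptions: the `Draw` record, its fields, and the rule that a group's distance is that of its nearest member are all hypothetical, not UE4's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Draw:
    name: str
    material: str       # stands in for shader/state
    distance: float     # distance from the camera
    is_skybox: bool = False


def order_opaque(draws, sort_by_distance=True):
    """Order opaque draws: group by material, nearest first, skybox last."""
    if not sort_by_distance:
        # ImgTec-style path: sort on state only, not on distance.
        return sorted(draws, key=lambda d: d.material)
    sky = [d for d in draws if d.is_skybox]
    groups = {}
    for d in draws:
        if not d.is_skybox:
            groups.setdefault(d.material, []).append(d)
    for group in groups.values():
        group.sort(key=lambda d: d.distance)  # step 4: nearest first inside a group
    # Step 3: nearest group first (assumption: a group is as near as its nearest draw).
    ordered = sorted(groups.values(), key=lambda g: g[0].distance)
    return [d for g in ordered for d in g] + sky  # step 2: skybox last


draws = [
    Draw("sky", "sky", 1e9, is_skybox=True),
    Draw("rock_far", "rock", 50.0),
    Draw("tree", "bark", 10.0),
    Draw("rock_near", "rock", 5.0),
]
names = [d.name for d in order_opaque(draws)]
print(names)
```

Drawing near geometry first lets the depth test reject hidden pixels early on GPUs without hardware hidden surface removal, which is where the quoted savings would come from.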
Optimization: Resolution
• Resolution does not match GPU performance!
• Use custom resolution to match perf/pixel
• Examples of same GPU:
– iPhone 5S = 0.7 Mpix (1136x640)
– iPad Air = 3.1 Mpix (2048x1536), more than 4 times slower!
• Other examples:
– Prev-Gen Console = 0.9 Mpix (1280x720)
– Current-Gen Console = 2.1 Mpix (1920x1080)
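The megapixel figures above follow directly from the resolutions. The sketch below also shows one possible way to pick a custom resolution: scale both axes uniformly until the pixel count matches a target. `scaled_resolution` is a hypothetical helper, and matching the iPhone 5S pixel count on an iPad Air is just an example target.

```python
def megapixels(width, height):
    """Pixel count in millions."""
    return width * height / 1e6


def scaled_resolution(width, height, target_mpix):
    """Scale uniformly (never up) so the pixel count is close to target_mpix."""
    scale = min(1.0, (target_mpix * 1e6 / (width * height)) ** 0.5)
    return round(width * scale), round(height * scale)


iphone = megapixels(1136, 640)   # about 0.7 Mpix
ipad = megapixels(2048, 1536)    # about 3.1 Mpix
ratio = ipad / iphone            # more than 4x the pixels on the same GPU

custom_w, custom_h = scaled_resolution(2048, 1536, iphone)
print(round(iphone, 2), round(ipad, 2), round(ratio, 2), custom_w, custom_h)
```

Rendering the 3D scene at the reduced size and upscaling to the native display keeps the per-pixel budget roughly equal across both devices.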
Single Content, Multiple Platforms
• Core motivating factor in designing UE4
– Authoring consistency between PC and Mobile
• Authoring environment for both platforms
– Physically-based shading model
– High dynamic range linear color space
– High quality post processing
Comparison: PC
Comparison: Mobile
Consistency: Material Editor
• One material for many platforms
– Artist authored feature levels to scale shader perf from PC to mobile
• Using cross compiler tool to retarget HLSL into GLSL
Directional Light + SDF Shadows
Consistency: HDR Directional Lightmaps
• Two compressed + One uncompressed textures:
1. HDR color with log luma encoding
2. World space 1st order spherical harmonic luma directionality
3. SDF shadow texture
• Optimized for Mobile and PVRTC compression:
– PVRTC (ImgTec) lacks separately compressed encoding for alpha
– PC color: RGB/Luma, LogLuma in Alpha
– Mobile color: RGB/Luma * LogLuma, no Alpha (LogLuma derived with ALU)
– Mobile: A single dynamic directional light as the “sun”
Image-Based Lighting
Image-Based Lighting
• Selects a mip-level in an IBL cubemap based on per-pixel roughness
– Same as PC, filtering same as PC
• PC: FP16
• Mobile: Decode(RGBM)^2
• PC: Blend multiple cubemaps per surface and parallax correction
• Mobile: One infinite-distance cubemap per object
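A minimal sketch of what "Decode(RGBM)^2" could look like: RGBM stores a color as RGB times a shared multiplier M, and squaring the decoded value means the texture holds the square root of the HDR color. The `MAX_RANGE` constant and both function names are assumptions for illustration; the slides do not state the range used.

```python
import math

MAX_RANGE = 16.0  # assumed multiplier range; not stated in the slides


def rgbm_encode_sqrt(color):
    """Encode the square root of an HDR color as (r, g, b, m), each in 0..1."""
    root = [math.sqrt(c) for c in color]
    m = min(1.0, max(max(root) / MAX_RANGE, 1e-6))
    return tuple(min(1.0, r / (m * MAX_RANGE)) for r in root) + (m,)


def rgbm_decode_squared(rgbm):
    """Decode RGBM, then square the result to get back linear HDR color."""
    r, g, b, m = rgbm
    scale = m * MAX_RANGE
    return tuple((channel * scale) ** 2 for channel in (r, g, b))


hdr = (4.0, 1.0, 0.25)
decoded = rgbm_decode_squared(rgbm_encode_sqrt(hdr))
print(decoded)
```

With these assumed numbers, four 8-bit channels can represent values up to MAX_RANGE squared, which is how an LDR texture format can stand in for the FP16 cubemap used on PC. Real textures would also quantize each channel to 8 bits, which this sketch skips.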
Mobile Post Processing Pipeline
[Diagram. Input: FP16, RGB=Color, A=Depth. Stages shown:]
• Light Shaft (Sun) Mask, A=Depth to A=CoC+Sun [conversion done on chip if possible]
• 1/2 DOF Downsample
• 1/2 DOF Filter
• 1/4 Smart Reduction
• 1/4 DOF Near Dilation
• 1/4 Lightshaft Filter, Pass 1
• 1/4 Lightshaft Filter, Pass 2
• Bloom Filter Tree: 1/8, 1/16, 1/32, 1/64, 1/32, 1/16, 1/8
• 1/4 Final Filter Passes and Merge {Light Shafts, Bloom, Vignette}
• Tonemapping
• Anti-Aliasing
Mobile Depth of Field
Packed Circle of Confusion and Sun Intensity
• Optimization done for Depth of Field and Light Shafts
– Represent sun intensity and circle of confusion (CoC) in one FP16 value
– Saves needing an extra render target
• Depth (0 to 65504) is converted to CoC (0 to 1) and Sun Intensity (1 to 65504)
– 0 = Max Near Bokeh
– 0.5 = In Focus
– 1 = Max Far Bokeh and No Sun
– 65504 = Max Sun (still at Max Far Bokeh)
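The value ranges above suggest a simple packing rule: values up to 1 carry the CoC, and values above 1 carry sun intensity while implying max far bokeh. The sketch below follows that reading; the function names, and the choice to return the raw value as the sun intensity, are assumptions rather than the shipped shader logic.

```python
FP16_MAX = 65504.0  # largest finite half-float value


def pack_coc_sun(coc, sun):
    """Pack CoC (0..1) and sun intensity into one value.

    Assumption: sun <= 1 means "no sun"; sun pixels are always at max far bokeh.
    """
    if sun > 1.0:
        return min(sun, FP16_MAX)
    return min(max(coc, 0.0), 1.0)


def unpack_coc_sun(packed):
    """Return (coc, sun); sun is 0.0 when the pixel carries no sun."""
    coc = min(packed, 1.0)
    sun = packed if packed > 1.0 else 0.0
    return coc, sun


in_focus = unpack_coc_sun(pack_coc_sun(0.5, 0.0))
sun_pixel = unpack_coc_sun(pack_coc_sun(0.2, 100.0))
far_no_sun = unpack_coc_sun(pack_coc_sun(1.0, 0.0))
print(in_focus, sun_pixel, far_no_sun)
```

The packing works because the sun is only visible through the most distant pixels, so a pixel never needs both a non-trivial CoC and a sun intensity at the same time.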
Vignette + Bloom + Light Shafts
Vignette + Bloom + Light Shafts: Optimized
• Composited at quarter resolution
– Using pre-multiplied alpha FP16
– Also performs the final filtering passes for bloom and light shafts
• Optimized to minimize rendertarget switches
• 3 effects applied to scene with one full-res TEX fetch
– Applied in the tonemap pass
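To illustrate why a single pre-multiplied alpha texture can carry all three effects, here is a sketch using the standard pre-multiplied "over" operator. Whether the demo uses exactly this formula is an assumption; `composite_premultiplied` is a hypothetical helper.

```python
def composite_premultiplied(scene_rgb, overlay_rgba):
    """Standard pre-multiplied alpha 'over': out = overlay.rgb + scene * (1 - overlay.a)."""
    r, g, b, a = overlay_rgba
    keep = 1.0 - a
    return tuple(s * keep + o for s, o in zip(scene_rgb, (r, g, b)))


scene = (0.5, 0.5, 0.5)

# Alpha only: darkens the scene, as a vignette would.
vignette_only = composite_premultiplied(scene, (0.0, 0.0, 0.0, 0.3))
# RGB only: adds light on top of the scene, as bloom or light shafts would.
glow_only = composite_premultiplied(scene, (0.2, 0.1, 0.0, 0.0))
# Both at once, still a single texture fetch per pixel.
combined = composite_premultiplied(scene, (0.2, 0.1, 0.0, 0.5))
print(vignette_only, glow_only, combined)
```

The alpha channel attenuates the scene while the RGB channels add to it, so darkening and additive glow fit in one RGBA sample.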
Mobile Light Shafts
• Filtered at quarter resolution
• Monochromatic with artist controlled tint on final composite
– Bloom and light shaft down-sample pass shared
– RGB = color for bloom, A = light shaft intensity
• Light shaft filter runs in tonemapped space (8-bit/channel)
– Applied in linear (reverse tonemap before tint and composite)
• Filtering done by 3 passes of 8 taps
Mobile Bloom
Mobile Bloom
• Lower quality version to minimize resolves
– Limited effect radius and fewer passes
• Turns out to be faster and higher-quality
– Standard hierarchical algorithm with some optimizations
– Down-sample from 1:1/4 res first (shared with light shaft)
– Then down-sample in 1:1/2 resolution passes
– Single pass circle-based filter (instead of two-pass Gaussian)
    • 15 taps on circle during down-sampling
    • 7 taps for both circles during up-sample+merge pass
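The down-sample chain described above (one 1:1/4 step, then repeated 1:1/2 steps) can be sketched as a list of render target sizes. Five levels corresponds to the 1/4 through 1/64 steps in the pipeline diagram; the floor rounding and the 1136x640 source size are assumptions for illustration.

```python
def bloom_chain(width, height, levels=5):
    """Render target sizes for the bloom down-sample chain.

    First pass goes from full resolution to 1/4; each later pass halves again.
    """
    w, h = max(1, width // 4), max(1, height // 4)
    sizes = [(w, h)]
    for _ in range(levels - 1):
        w, h = max(1, w // 2), max(1, h // 2)
        sizes.append((w, h))
    return sizes


chain = bloom_chain(1136, 640)  # 1/4, 1/8, 1/16, 1/32, 1/64
total_pixels = sum(w * h for w, h in chain)
print(chain, total_pixels)
```

Because each level has a quarter of the pixels of the one before it, the whole chain costs only a small fraction of a full-resolution pass, even with 15 taps per pixel.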
Film Post Tonemapping
Film Post Tonemapping: Mobile Shader
[Diagram. Input: RGBA16F HDR Linear Color. Output: RGBA8 LDR sRGB. Stages shown, input to output:]
• Blend in DOF (optional), from 1/2 DOF
• Film Grain Jitter (optional)
• Blend in {Bloom, Vignette, Light Shafts} (optional), from 1/4 Pre-multiplied Alpha
• Film Grain (optional)
• Shadow Tint (optional)
• Color Matrix {Saturation, Channel Mixer} (optional)
• Tint, Tonemap Curve
• Linear to sRGB
• Quantization Grain (optional)
Anti-Aliasing
Anti-Aliasing on Mobile
• Using a spatial & temporal anti-aliasing filter
– 2x temporal super-sampling (higher quality surface shading)
– Blending two jittered frames after tonemapping (32 bpp, perceptual color space)
– Some special logic to remove ghosting/judder (not using motion re-projection)
• Not currently using MSAA
– Needed super-sampled shading for HDR physically based shading model
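The core of the temporal part, blending two jittered frames after tonemapping, can be sketched as a per-pixel average of two 8-bit images. The jitter offsets and the plain 50/50 blend are assumptions, and the ghosting/judder logic mentioned above is deliberately left out.

```python
# Hypothetical sub-pixel camera offsets used on alternating frames.
JITTER = [(0.25, 0.25), (-0.25, -0.25)]


def blend_frames(frame_a, frame_b):
    """Average two same-sized frames of 8-bit (r, g, b) pixels, rounding to nearest."""
    return [
        [tuple((a + b + 1) // 2 for a, b in zip(pa, pb)) for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


# An edge pixel that is covered in one jittered frame but not in the other.
frame_even = [[(0, 0, 0), (255, 255, 255)]]
frame_odd = [[(255, 255, 255), (255, 255, 255)]]
resolved = blend_frames(frame_even, frame_odd)
print(resolved)
```

The edge pixel ends up at half coverage, which is the effect of two shading samples per pixel spread over two frames. Blending after tonemapping keeps very bright HDR samples from dominating the average.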
All Techniques Combined
Android Extension Pack
• Released with Android “L”
• OpenGL ES 3.1 plus a specific set of extensions
– Compute shaders
– Tessellation
– ASTC
– Detailed spec on www.khronos.org: GL_ANDROID_extension_pack_es31a
• Full UE4 desktop rendering pipeline on Android!
– Deferred rendering with G-buffer
– Physically-based shading
– Image-based lighting
– Solid 30 FPS on NVIDIA K1 mobile GPU
Tech Demo: Rivalry
Unreal Engine 4
Full source code available!
unrealengine.com
Includes all C++, shaders, tools, content
$19/mo