© Copyright Khronos Group 2013 - Page 1
Accelerating Extraordinary User Experiences on Mobile Devices
Neil Trevett
Khronos, President
NVIDIA, Vice President Mobile Content
A New Era in Computing
1990s: PC
2000s: Internet
2010s: Mobile
Why is Mobile Parallelism Needed?
Courtesy Metaio http://www.youtube.com/watch?v=xw3M-TNOo44&feature=related
State-of-the-art Augmented Reality without GPU Compute
Augmented Reality with GPU Parallelism
High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria
Research today runs on CUDA-equipped laptop PCs
How will this compute capability migrate from high-end PCs to mobile?
Mobile SOC Performance Increases
[Chart: CPU/GPU aggregate performance (log scale, 1 to 100) vs. mobile device shipping dates (2011 to 2015)]
Chips plotted: Tegra 2 (first dual A9), Tegra 3 (quad A9), Tegra 4 (quad A15), Logan (full Kepler GPU, CUDA 5.0, OpenGL 4.3), Parker (64-bit Denver CPU, Maxwell GPU)
Reference points: Core 2 Duo, Core i5
Devices shown: HTC One X+, Google Nexus 7
100x perf increase in four years
Mobile GPU Compute Today (almost)
• Kayla = Tegra 3 + discrete Kepler GPU
- Tuned to match Logan GPU performance
• Full set of drivers on Linux
- OpenGL ES 3.0
- OpenGL 4.3
- CUDA 5.0
Power is the New Design Limit
• The Process Fairy keeps bringing more transistors...
• ...but the ‘End of Voltage Scaling’ means power is much more of an issue than in the past
In the Good Old Days
Leakage was not important, and voltage scaled with feature size
$L' = L/2$, $D' = 1/L'^2 = 4D$, $f' = 2f$, $V' = V/2$, $E' = C'V'^2 = E/8$, $P' = P$
Halve L and get 4x the transistors and 8x the capability for the same power
The New Reality
Leakage has limited threshold voltage, largely ending voltage scaling
$L' = L/2$, $D' = 1/L'^2 = 4D$, $f' \approx 2f$, $V' \approx V$, $E' = C'V'^2 = E/2$, $P' = 4P$
Halve L and get 4x the transistors and 8x the capability for 4x the power!!
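The two sets of scaling relations on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes switched capacitance halves along with L, which is what the slide's E/8 figure implies but does not state.

```python
def scale_one_generation(voltage_scales: bool) -> dict:
    """Relative change per unit area when feature size L is halved.

    All values are ratios against the previous process generation.
    """
    density = 4.0        # D' = 4D: four times the transistors per area
    frequency = 2.0      # f' = 2f (only approximately true in the new regime)
    capacitance = 0.5    # ASSUMPTION: C' = C/2, capacitance scales with L
    voltage = 0.5 if voltage_scales else 1.0  # V' = V/2 vs. V' ~ V

    energy_per_op = capacitance * voltage ** 2   # E' = C'V'^2
    capability = density * frequency             # ops per second per area
    power = capability * energy_per_op           # P' = D' * f' * E'
    return {
        "transistors": density,
        "capability": capability,
        "energy_per_op": energy_per_op,
        "power": power,
    }


if __name__ == "__main__":
    for label, scales in (("Good Old Days", True), ("New Reality", False)):
        print(label, scale_one_generation(scales))
```

With voltage scaling, 8x the capability costs the same power; without it, the same 8x costs 4x the power, which is the gap the rest of the deck is about closing.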
Mobile Thermal Design Point
Typical max system power levels before thermal failure: 2-4W, 4-7W, 6-10W, 30-90W
Even as battery technology improves, these thermal limits remain
4-5” screen takes 250-500mW
7” screen takes 1W
10” screen takes 1-2W
Resolution makes a difference: the iPad3 screen takes up to 8W!
How to Save Power?
• Much more expensive to MOVE data than COMPUTE data
• Process improvements WIDEN the gap
- 10nm process will increase ratio another 4X
• Energy efficiency must be key metric during silicon AND app design
- Awareness of where data lives, where computation happens, how it is scheduled
Energy per operation (for 40nm, 1V process):
- 32-bit integer add: 1pJ
- 32-bit float operation: 7pJ
- 32-bit register write: 0.5pJ
- Send 32 bits 2mm: 24pJ
- Send 32 bits off-chip: 50pJ
- Write 32 bits to LP-DDR2: 600pJ
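To make the MOVE versus COMPUTE gap concrete, the sketch below tallies energy for a hypothetical per-pixel pipeline using the per-operation figures listed on this slide. The pipeline shape (number of stages, float ops per stage) and the function names are illustrative assumptions, and only write costs are counted because the slide gives no read figures.

```python
# Energy per 32-bit operation in picojoules, copied from the slide
# (40nm, 1V process).
ENERGY_PJ = {
    "int_add": 1.0,
    "float_op": 7.0,
    "register_write": 0.5,
    "send_2mm": 24.0,
    "send_off_chip": 50.0,
    "lpddr2_write": 600.0,
}


def cost_ratio(move: str, compute: str) -> float:
    """How many times more energy `move` takes than `compute`."""
    return ENERGY_PJ[move] / ENERGY_PJ[compute]


def pipeline_energy_pj(stages: int, float_ops_per_stage: int,
                       spill_to_dram: bool) -> float:
    """Energy per pixel for a hypothetical multi-stage pipeline.

    Each stage does some float math and then stores one 32-bit result,
    either in a register or out in LP-DDR2.
    """
    compute = float_ops_per_stage * ENERGY_PJ["float_op"]
    store = ENERGY_PJ["lpddr2_write" if spill_to_dram else "register_write"]
    return stages * (compute + store)


if __name__ == "__main__":
    print("DRAM write vs. integer add:", cost_ratio("lpddr2_write", "int_add"))
    print("3 stages, spilling to DRAM:", pipeline_energy_pj(3, 10, True), "pJ")
    print("3 stages, kept in registers:", pipeline_energy_pj(3, 10, False), "pJ")
```

Under these assumptions the math is identical in both runs; the difference comes entirely from where each intermediate value is stored.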
Mobile Parallel Compute Use Case Pipeline …
• … largely driven by using cameras as sensors
- Augmented Reality
- Face, Body and Gesture Tracking
- Computational Photography and Videography
- 3D Scene/Object Reconstruction
Camera Sensor Processing
• CPU
- Single processor or Neon SIMD, running fast
- Makes heavy use of general memory
- Non-optimal performance and power
• GPU
- Programmable and flexible
- Many-way parallelism, run at lower frequency
- Efficient image caching close to processors
- BUT cycles frames in and out of memory
• Camera ISP (Image Signal Processor)
- Little or no programmability
- Data flows through compact hardware pipe
- Scan-line-based, no global memory
- Best perf/watt
~760 math ops, ~42K vals = 670Kb, 300MHz, ~250Gops
Dark Silicon
• GPUs are much more power efficient than CPUs
- When exploiting data parallelism, can be 10x as efficient, but can go further…
• Lots of space for transistors on SOC, but can’t turn them all on at the same time!
- Would exceed Thermal Design Point
• Dark Silicon: specialized hardware, only turned on when needed
- Dedicated units can increase locality and parallelism of computation
Enabling new mobile experiences requires pushing computation onto GPUs and dedicated hardware
[Figure: power efficiency vs. computation flexibility. Multi-core CPU: X1; GPU Compute: X10; Dedicated Hardware: X100]
Low Power Environment Scanning
• Many sensor use cases would consume too much power to be running 24/7
- Environment aware use cases have to be very low power
• ‘Scanners’: very low power, always on, detect things in the environment
- Trigger the next level of processing capability
ARM 7: 1 MIPS and accelerometers can detect someone in the vicinity
DSP: low power activation of camera to detect someone in field of view
GPU: GPU acceleration for precision gesture processing
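The escalation described above, where each low-power scanner only wakes the next level when it detects something, can be sketched as a simple cascade. This is a toy model under stated assumptions: the `Stage` type, the `run_cascade` function, and the dictionary keys in the sample are hypothetical, and no real sensor or power management API is involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    detect: Callable[[Dict[str, bool]], bool]  # True means "wake the next stage"


# Stage names follow the slide; the detection keys are illustrative only.
STAGES: List[Stage] = [
    Stage("ARM7 + accelerometers", lambda s: s.get("someone_nearby", False)),
    Stage("DSP + camera", lambda s: s.get("someone_in_view", False)),
    Stage("GPU gesture processing", lambda s: s.get("gesture", False)),
]


def run_cascade(stages: List[Stage], sample: Dict[str, bool]) -> List[str]:
    """Return the names of the stages that had to be powered up for this sample."""
    woken = []
    for stage in stages:
        woken.append(stage.name)       # this stage is now running
        if not stage.detect(sample):   # nothing found: stop escalating
            break
    return woken


if __name__ == "__main__":
    print(run_cascade(STAGES, {}))
    print(run_cascade(STAGES, {"someone_nearby": True}))
    print(run_cascade(STAGES, {"someone_nearby": True, "someone_in_view": True}))
```

In the common case where nothing is happening, only the first stage ever runs, which is what lets the use case stay always on.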
Advanced Camera Control Use Cases
• High-dynamic range (HDR) and computational flash photography
- High-speed burst with individual frame control over exposure and flash
• Rolling shutter elimination
- High-precision intra-frame synchronization between camera and motion sensor
• HDR Panorama, photo-spheres
- Continuous frame capture with constant exposure and white balance
• Subject isolation and depth detection
- High-speed burst with individual frame control over focus
• Time-of-flight or structured light depth camera processing
- Aligned stacking of data from multiple sensors
• Augmented Reality
- 60Hz, low-latency capture with motion sensor synchronization
- Multiple Region of Interest (ROI) capture
- Multiple sensors for scene scaling
- Detailed feedback on camera operation per frame
Camera Control API Complements Acceleration
[Diagram: Sensor, Lens, Flash -> Pre-ISP Processing (Bayer Space) -> ISP -> Post-ISP Processing (YUV Space) -> Application; Image/Vision Acceleration APIs?]
Need Camera API to feed processing pipeline for advanced use cases
New Camera Working Group: call for participation being publicly announced today!
http://www.khronos.org/camera
Precursor APIs for Camera Control Initiative
• FCAM – Open source project
- Capture of stream of camera images with precision control
- A pipeline that converts requests into image stream
- All parameters packed into the requests, no global state
- Programmer has full control over sensor settings for each frame in stream
- Control over focus and flash
- No hidden daemon running
- Control ISP
- Can access supplemental statistics from ISP if available
• Android New Camera HAL (2013)
- Uses some of these concepts
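The request-based model described for FCAM, where every frame is produced from a self-contained request and there is no global camera state, can be sketched as follows. This is not the FCAM API: `CaptureRequest`, `Frame`, `capture_stream` and `hdr_burst` are hypothetical names chosen for illustration, and the camera here is simulated.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Sequence


@dataclass(frozen=True)
class CaptureRequest:
    """All parameters for one frame travel with the request (no global state)."""
    exposure_us: int
    gain: float = 1.0
    flash: bool = False


@dataclass(frozen=True)
class Frame:
    """A captured frame remembers exactly which request produced it."""
    index: int
    request: CaptureRequest


def capture_stream(requests: Iterable[CaptureRequest]) -> Iterator[Frame]:
    """Simulated pipeline: converts a stream of requests into a stream of frames."""
    for index, request in enumerate(requests):
        yield Frame(index=index, request=request)


def hdr_burst(base_exposure_us: int,
              stops: Sequence[int] = (-2, 0, 2)) -> List[CaptureRequest]:
    """Build a bracketed burst with individual per-frame exposure control."""
    return [CaptureRequest(exposure_us=int(round(base_exposure_us * 2 ** stop)))
            for stop in stops]


if __name__ == "__main__":
    for frame in capture_stream(hdr_burst(10_000)):
        print(frame.index, frame.request.exposure_us, "us")
```

Because each frame carries its own request, downstream code never has to guess which settings were active when a given frame was captured.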
Camera Control API Usage
[Diagram labels: Application; Camera Control API; Sensor, Lens, Flash; ISP Image Processing; Statistics Feedback; Output Scalers and Formatters; EGLStream; Downstream Processing; MEMS Sensors]
- Burst control of sensor, flash, lens with precision timestamps on frames
- Precision timestamps on positional sensor samples and EGL access to sample timestamps
OpenVX
• Vision Hardware Acceleration Layer
- Enables hardware vendors to implement accelerated imaging and vision algorithms
- For use by high-level libraries or apps
• Focus on enabling real-time vision
- On mobile and embedded systems
• Diversity of efficient implementations
- From programmable processors, through GPUs to dedicated hardware pipelines
[Diagram: Application, on top of OpenCV open source library and other higher-level CV libraries, on top of OpenVX (open source sample implementation, hardware vendor implementations)]
Dedicated hardware can help make vision processing performant and low-power enough for pervasive ‘always-on’ use
OpenVX and OpenCV are Complementary
Governance
- OpenCV: Open source, community driven, no formal specification
- OpenVX: Formal specification and full conformance tests, implemented by hardware vendors
Scope
- OpenCV: Very wide, 1000s of functions of imaging and vision, multiple camera APIs/interfaces
- OpenVX: Tight focus on hardware accelerated functions for mobile vision, use external camera API
Conformance
- OpenCV: No conformance testing, every vendor implements different subset
- OpenVX: Full conformance test suite / process, reliable acceleration platform
Use Case
- OpenCV: Rapid prototyping
- OpenVX: Production deployment
Efficiency
- OpenCV: Memory-based architecture, each operation reads and writes memory, sub-optimal power / performance
- OpenVX: Graph-based execution, optimized nodes and data transfer, highly efficient
OpenVX Power Efficiency
• OpenVX Graph for power and performance efficiency
- Each Node can be implemented in software or accelerated hardware
- Nodes may be fused by the implementation
- Eliminates transferring the image to and from memory
• EGLStreams can provide data and event interop with other APIs
- BUT use of other Khronos APIs is not mandated
• VXU Utility Library provides efficient access to single nodes
- Open source implementation: easy way to start using OpenVX
[Diagram: graph of four OpenVX Nodes, labeled Heterogeneous Processing, with Native Camera Control]
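The benefit of node fusion can be illustrated with a toy model that counts how many times a full image is written out. This is not the OpenVX API: the node functions, `run_unfused`, `fuse` and `run_fused` are hypothetical, and the sketch only covers per-pixel nodes, which are the easiest case to fuse.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple

PixelOp = Callable[[int], int]


def run_unfused(ops: Sequence[PixelOp], image: List[int]) -> Tuple[List[int], int]:
    """Memory-based execution: every node writes a full intermediate image."""
    image_writes = 0
    for op in ops:
        image = [op(pixel) for pixel in image]  # whole image round-trips memory
        image_writes += 1
    return image, image_writes


def fuse(ops: Sequence[PixelOp]) -> PixelOp:
    """Compose a chain of per-pixel nodes into a single node."""
    return lambda pixel: reduce(lambda value, op: op(value), ops, pixel)


def run_fused(ops: Sequence[PixelOp], image: List[int]) -> Tuple[List[int], int]:
    """Graph-based execution: one pass, only the final image is written."""
    fused = fuse(ops)
    return [fused(pixel) for pixel in image], 1


# Illustrative three-node graph: offset, gain, clamp to 8 bits.
NODES: List[PixelOp] = [lambda p: p + 1, lambda p: p * 2, lambda p: min(p, 255)]

if __name__ == "__main__":
    print(run_unfused(NODES, [0, 100, 200]))
    print(run_fused(NODES, [0, 100, 200]))
```

Both paths produce the same pixels; the fused path simply avoids writing two intermediate images, which is where the memory round trips (and the energy) go.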
Android Three Layer Ecosystem
API Drivers: Java (SDK) and Native (NDK)
Apps and Games: most use Java, cutting-edge apps/games use native APIs
Middleware and Apps Engines: use native APIs for power and performance
ISVs
Most developers use the Java SDK for easy development and portability BUT
leading edge apps, games and middleware need the power efficiency and performance of native C
SOCs
General Native GPU Compute
APIs for Android GPU Compute and Graphics
Java
- Graphics: Java Binding to OpenGL ES (similar to JSR239)
- GPU Compute: RenderScript. Run performance critical sections as native C. Automatically offload C code segments to the GPU if possible
Native
- Graphics: GLSL shaders for GPGPU compute
- GPU Compute: ? Full ANSI C programming of heterogeneous CPUs and GPUs
OpenCL as Parallel Compute Foundation
- OpenCL HLM: C++ syntax/compiler extensions
- Aparapi: Java language extensions for parallelism
- WebCL: JavaScript binding to OpenCL for initiation of OpenCL C kernels
- River Trail: Language extensions to JavaScript
- C++ AMP: Shevlin Park, uses Clang and LLVM
- PyOpenCL: Python wrapper around OpenCL
CUDA or DirectCompute may also be used as compiler targets – BUT OpenCL provides cross-platform, cross-vendor coverage
Visual Sensor Revolution
• Single sensor RGB cameras are just the start of the mobile visual revolution
- IR sensors – LEAP Motion, eye-trackers
• Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras
- Even stereo pair can enable object scaling and enhanced depth extraction
- Plenoptic Field processing needs FFTs and ray-casting
• Hybrid visual sensing solutions
- Different sensors mixed for different distances and lighting conditions
• GPUs today – more dedicated ISPs tomorrow?
Examples: Stereo Camera (LG Electronics); Plenoptic Array (Pelican Imaging); Capri Structured Light 3D Camera (PrimeSense)
Paths to Low Power Parallelism
• Dark Silicon – use silicon area to implement range of power/perf processor types
• Localize and parallelize – port code from CPUs -> GPUs -> DSPs -> Hardware
• Smart triggering of compute resources only when needed – sensor fusion
• Graph-based processing APIs – node fusion to avoid memory round trips
• Developer instrumentation for power!
- Dynamic and feedback-driven software power optimization
- Instrumentation for energy-aware compilers and profilers
- Most compilers just look at one thread; take a more global view
- Power optimizing compiler back-end / installers
• Neil Trevett • [email protected]