| © 2013 Aptina Imaging Corporation | Aptina Confidential 1
© 2013 Aptina Imaging Corporation. All rights reserved. Products are warranted only to meet Aptina’s production data sheet specifications. Information, products, and/or specifications are subject to change without notice. All information is provided on an “AS IS” basis without warranties of any kind. Dates are estimates only. Drawings not to scale. Aptina and the Aptina logo are trademarks of Aptina Imaging Corporation. All other trademarks are the property of their respective owners.
Imaging on Embedded GPUs
Investigating flexible imaging pipelines using embedded GPUs
Mikaël Bourges-Sévenier (msevenier at aptina dot com)
Director, High-Performance Imaging
December 19, 2013
Bay Area Multimedia
Agenda
• Overview: the need for computational imaging
• What is imaging?
• Architecture of some embedded GPUs
• 8MP MobileHDR pipeline on ARM Mali T604
• Khronos Camera: a standard API for computational imaging
• Q&A
Computational Imaging evolution
[Figure: evolution from Web Cam to Smart Camera. Steps shown: Spatial (Volumetric), Gesture, AR, Face Detect, Face Track, Presence, Colorimetry, Brightness. Capabilities shown: True Color, Brightness Compensation, Exposure control; User Identity Access Control; Augmented Information; 3D Imaging; Interactive Services.]
Mobile Compute driving Imaging use cases
• Requires significant computing over large data sets
[Figure: use cases plotted over Time: Augmented Reality; Face, Body and Gesture Tracking; Computational Photography; 3D Scene/Object Reconstruction]
Increasing Use of Imaging Sensors
[Figure: Differentiation Opportunity (vertical axis) versus Time (horizontal axis)]
• Photography
‣ Input = 2D Camera
‣ Processors = ISP + CPU
‣ Product = Static Images
• Computational Photography
‣ Input = MEMS + 2D Camera
‣ Processors = ISP + CPU + GPU
‣ Product = Real Time Images
We are here
• Perceptual Imaging
‣ Input = MEMS + Depth Camera
‣ Processors = ISP + CPU + GPU + DSP
‣ Product = Real Time Extracted Information
Perceptual Imaging
1. Uses the full array of mobile sensors
2. to extract information in real-time
3. about the user and environment
4. to generate enhanced user interactions
Hardware Saves Power, e.g. Camera Sensor ISP
• CPU
‣ Single processor or Neon SIMD - running fast
‣ Makes heavy use of general memory
‣ Non-optimal performance and power
• GPU
‣ Programmable and flexible
‣ Many-way parallelism - runs at lower frequency
‣ Efficient image caching close to processors
‣ BUT cycles frames in and out of memory
• Camera ISP (Image Signal Processor)
‣ Little or no programmability
‣ Data flows through a compact hardware pipe
‣ Scan-line-based - no global memory
‣ Best perf/watt
Evolution of Embedded GPUs
[Figure: GFLOPS (0 to 450) versus date (Sep-2011 to Jun-2014) with a trend line. GPUs plotted: Adreno 320, Adreno 330, Mali T628, PowerVR 6, Tegra 5, PowerVR 5XT, Mali T604.]
• Trend: 40% more GFLOPS/quarter
• Estimated at sustained peak performance. Likely to be much less in practice.
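As a rough check on that trend, 40% more GFLOPS per quarter compounds to a bit under 4x per year. The sketch below works the arithmetic out; the function name is illustrative and nothing here is a data point read off the chart.

```python
def compound_growth(rate_per_quarter: float, quarters: int) -> float:
    """Return the growth factor after the given number of quarters."""
    return (1.0 + rate_per_quarter) ** quarters


# 40% per quarter, as stated on the slide, compounded over one year.
per_year = compound_growth(0.40, 4)
print(f"growth per year: {per_year:.2f}x")

# Number of whole quarters needed to gain at least 10x at that rate.
quarters_to_10x = 0
while compound_growth(0.40, quarters_to_10x) < 10.0:
    quarters_to_10x += 1
print(f"quarters to reach 10x: {quarters_to_10x}")
```

Since the slide warns that these are peak estimates, the result is best read as an upper bound on what an imaging kernel would see.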
Computational Imaging pipeline
• Pre-processing: for non-standard Bayer pixels (e.g. iHDR)
• ISP: for fast demosaic, lens shading, denoising, 3A, statistics …
• Post-processing: for special reconstruction of colors (e.g. Clarity+)
• Processing requires control of metadata aligned with data
[Figure: pipeline blocks: Lens, CMOS sensor with Color Filter Array, Pre-processing, Image Signal Processor (ISP), Post-processing, App. Data formats along the pipeline: Bayer, RGB, YUV. Side channels: lens, sensor, aperture control; metadata; 3A stats.]
Different Computing Devices
• Latency-Optimized CPU: fast serial processing
‣ Lots of big on-chip caches
‣ Sophisticated control
• Throughput-Optimized GPU: scalable parallel processing
‣ Multithreading can hide latency
‣ Simpler control, cost amortized over ALUs via SIMD
• DSPs are similar to CPUs
‣ Typically integer optimized (some have rudimentary floating point support)
‣ With signal processing intrinsics
• FPGA
‣ Can be tailored to a cross between CPU/DSP and GPU
[Figure: SISD (scalar ALU) computes c = a + b; SIMD (vector ALU) computes c1..c4 = a1..a4 + b1..b4]
OpenCL works on all devices but performance isn’t guaranteed
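The SISD versus SIMD distinction can be illustrated with a toy model. This is only a sketch in plain Python: real SIMD happens in hardware, and the 4-lane width and function names below are assumptions chosen to mirror the a1..a4 / b1..b4 figure.

```python
from typing import List


def scalar_add(a: float, b: float) -> float:
    """SISD: one add instruction produces one result."""
    return a + b


def vector_add4(a: List[float], b: List[float]) -> List[float]:
    """Models a 4-wide SIMD add: one instruction produces four results."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("this toy vector ALU is exactly 4 lanes wide")
    return [a[i] + b[i] for i in range(4)]


def add_instructions(n_elements: int, lanes: int) -> int:
    """Add instructions needed for n_elements on an ALU with `lanes` lanes."""
    return -(-n_elements // lanes)  # ceiling division


print(vector_add4([1, 2, 3, 4], [10, 20, 30, 40]))
print(add_instructions(16, 1), add_instructions(16, 4))
```

The instruction count is the point: control cost is paid once per vector instruction rather than once per element.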
Stream-based vs. Frame-based
• Stream-based (ISP)
‣ For low-memory devices
‣ Set of lines processed by kernels
‣ Delay: #lines a kernel needs
• Frame-based (GPU)
‣ For fast data-parallel devices
‣ Full image frame processed
‣ Delay: whole frame(s)
[Figure: stream-based: a continuous stream of pixels flows through kernels (with a block labeled Q) and the final image accumulates lines; frame-based: a chain of kernels, each reading a frame and writing a frame]
• Completely different kernels
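To make the difference concrete, here is a minimal sketch of the same 3-line vertical averaging filter written both ways. The filter, function names, and test frame are illustrative assumptions, not kernels from the MobileHDR pipeline.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List

Line = List[int]


def stream_vertical_mean(lines: Iterable[Line], taps: int = 3) -> Iterator[Line]:
    """Stream-based kernel: keeps only `taps` lines in memory and emits one
    output line per input line once the window is full (delay: taps - 1 lines)."""
    window: Deque[Line] = deque(maxlen=taps)
    for line in lines:
        window.append(line)
        if len(window) == taps:
            yield [sum(col) // taps for col in zip(*window)]


def frame_vertical_mean(frame: List[Line], taps: int = 3) -> List[Line]:
    """Frame-based kernel: the whole frame must be in memory before it runs."""
    return [
        [sum(col) // taps for col in zip(*frame[y:y + taps])]
        for y in range(len(frame) - taps + 1)
    ]


# A 6-line, 4-pixel-wide test frame; pixel value encodes its position.
frame = [[y * 10 + x for x in range(4)] for y in range(6)]
streamed = list(stream_vertical_mean(iter(frame)))
print(streamed == frame_vertical_mean(frame))
```

Both produce the same pixels, but the streaming version needs a 3-line buffer while the frame version needs the full frame, which is why the two styles call for different kernel code.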
What is Imaging?
Capture an image from a camera sensor and process it to get a renderable image.
How Imaging Sensors work
http://www.photoaxe.com
Bayer GRBG pattern
• 50% green
• 25% red and 25% blue
Bayer CFA is one type of pattern
Bayer Demosaicing
• Twice as many G pixels as R or B pixels, since the eye is more sensitive to luminance than to chrominance
• Convert pixel colors from Bayer space to full RGB color
• Complex interpolation to avoid artifacts (e.g. on edges)
[Figure: pixel grid in which every pixel carries a full RGB value after demosaicing; a 2x2 tile with positions numbered 0 1 / 2 3]
Bayer patterns: 0 = GRBG, 1 = RGGB, 2 = GBRG, 3 = BGGR
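The four pattern names can be read as the 2x2 tile in row-major order. The sketch below assumes that convention, looks up the filter color at any pixel, confirms the 50/25/25 split, and shows the simplest possible green interpolation. It is a bilinear toy, not the edge-aware interpolation the slide alludes to, and all function names are illustrative.

```python
from collections import Counter
from typing import Dict, List

PATTERNS = ("GRBG", "RGGB", "GBRG", "BGGR")  # indices 0..3 as listed above


def bayer_color(pattern: str, x: int, y: int) -> str:
    """Filter color over pixel (x, y); the pattern string is assumed to list
    the 2x2 tile in row-major order (top-left, top-right, bottom-left, bottom-right)."""
    return pattern[(y % 2) * 2 + (x % 2)]


def color_fractions(pattern: str, width: int, height: int) -> Dict[str, float]:
    """Fraction of pixels under each filter color."""
    counts = Counter(
        bayer_color(pattern, x, y) for y in range(height) for x in range(width)
    )
    return {c: counts[c] / (width * height) for c in "RGB"}


def green_at(raw: List[List[int]], pattern: str, x: int, y: int) -> float:
    """Bilinear green estimate at an interior pixel: at a red or blue site
    the four direct neighbors are all green, so average them."""
    if bayer_color(pattern, x, y) == "G":
        return float(raw[y][x])
    return (raw[y][x - 1] + raw[y][x + 1] + raw[y - 1][x] + raw[y + 1][x]) / 4


print(color_fractions("GRBG", 8, 8))
```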
OpenCL (memory system)
• Desktop: non-uniform memory
‣ Data is physically copied between GPU and CPU memory
• Embedded: uniform memory
‣ __local memory may be in __global
‣ Cheap data exchange between CPU and GPU
A tour of some embedded GPUs
ARM Mali T604, Qualcomm Adreno 330
ARM Mali T604, T628
• Found in Samsung Exynos 5 Dual (T604) / Octa (T628) Application Processors
‣ Chromebook, Nexus 10, Samsung S4…
• 32nm process for T604, 28nm for T628
• T604 has 4 shader cores, T628 has 8 cores
• Tri-pipe architecture: each GPU core has 3 types of instruction pipelines
‣ 1x load/store
‣ 1x texture
‣ 2x ALU (T604) / 4x ALU (T628)
• 64-bit integers and IEEE 754 floating-point ALUs
The Vithar Architecture: OpenCL and OpenGL ES
[Figure: block diagram shared by OpenGL ES and OpenCL, with labels Thread Issue, Load/Store Pipeline, Arithmetic Pipeline (x2), Texturing Pipeline, Thread Completion]
• 3 kinds of pipelines
‣ Arithmetic
‣ Load/Store
‣ Texture
• Barrel-threaded (like AMD/NVIDIA)
• No SIMT execution (unlike AMD/NVIDIA)
• SIMD (like AMD)
‣ Use vectors for best performance!
• 256 threads max (64 in practice)
Midgard Job Execution and Load-balancing
• Automatic hardware load balancing
• Seamless concurrent execution
• Integrated seamless power manager
Qualcomm MSM8974
• Process: 28nm
• CPU: 4x Krait 2.3 GHz
‣ ARMv7A Neon instruction set
‣ Power and performance efficiencies over ARM
‣ 4KB+4KB L0, 16KB+16KB L1, 2MB L2 cache
‣ No 64b support
• GPU: Adreno 330 450 MHz
‣ 32x 32b scalar ALUs/pipeline, 8 pipelines, 129.6 GFLOPS
• 16b kernels provide 2x performance
‣ 128b registers
‣ 8 KB local memory per shader core
‣ 8 KB constant memory
‣ 12 reads, 4 writes simultaneous per clock
‣ 512 work-items max
‣ 1.5 MB on-chip SRAM
‣ Tiled renderer max 3.6 GPix/s
• Hexagon DSP
‣ 3x core, 600 MHz, 16 KB L1, 256 KB L2, integrated MMU
‣ Limited floating-point support (no division, no log/exp…)
• RAM: 2GB 2x LP-DDR3 800 MHz (12.8 GB/s)
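The 12.8 GB/s figure is consistent with two LP-DDR3 channels clocked at 800 MHz and transferring on both clock edges, if each channel is 32 bits wide. The channel width is an assumption here (it is not stated on the slide), and the function name is illustrative.

```python
def ddr_bandwidth_gb_s(clock_mhz: float, channels: int, bus_bits: int = 32) -> float:
    """Peak DDR bandwidth in GB/s (10^9 bytes per second).

    bus_bits=32 per channel is an assumption, not a figure from the slide.
    """
    transfers_per_s = clock_mhz * 1e6 * 2   # double data rate: 2 transfers per clock
    bytes_per_transfer = bus_bits / 8
    return transfers_per_s * bytes_per_transfer * channels / 1e9


peak = ddr_bandwidth_gb_s(800, channels=2)
print(f"{peak:.1f} GB/s")
```

This is a peak number shared by CPU, GPU, DSP, and display, so a frame-based pipeline that cycles frames in and out of memory only gets a fraction of it.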
Qualcomm Confidential and Proprietary | MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION PAGE 23 80-NA157-29 A Aug 2012
MSM8974 Adreno 330 vs Adreno 320
• Adreno 330 has better performance
‣ 450 MHz GPU clock (up from 400 MHz in Adreno 320)
‣ 2x better shader performance than A320: 2x more ALU blocks
• Dedicated GPU power rail
‣ Will allow GPU to be at a lower frequency and voltage than the FABRIC
‣ Gives many more options to the GPU-DCVS algorithm for scaling voltage and frequency, leading to lower power consumption
• MSM8974 has significantly increased memory bandwidth: dual-channel 800 MHz DDR3 support
• A330 is software compatible with A320
Adreno 330 Shader Processor “SP” Block
• Total of 32 (32-bit) scalar ALUs
• 16-bit ALUs are used if the whole kernel is 16-bit, otherwise the 32-bit ALUs are used
MobileHDR pipeline
Arndale Samsung Exynos 5 Dual board
• Arndale Samsung Exynos 5 board
‣ CPU: ARM Cortex-A15 (2-core) 1.7 GHz 32nm
• 32KB L1 cache, 1MB L2 cache
‣ GPU: ARM Mali T604
• 64 concurrent threads
• Vector ALUs
• 128b registers
• OpenCL 1.1 Full Profile
‣ RAM: 2GB LP-DDR3 800 MHz (12.8 GB/s)
‣ Truly unified cached memory
• CPU and GPU memory is shared – NO COPY!
• 128b wide L1 and L2 access
ARM Mali T604 GPU in Samsung Exynos 5 Dual
Type: Vector GPU
Process: 32nm
OpenCL: 1.1 Full Profile
Unified memory: Yes
Rendering: Tile
Work-items: 256
Clock: 533 MHz
L2 cache: 1MB
Register width: 128b
Global memory: 2GB LP-DDR3 800 MHz (12.8 GB/s)
ALUs: 8 (2 ALUs/core)
Throughput: 100 GFLOPS
Local memory: 32KB/core (global)
Constant memory: 64KB
Texture cache: Yes
Compute devices (shader cores): 4
Cacheline: 64 bytes
16/32/64b floats: No/yes/yes
Avoid buffer copy
• Mali/Adreno have unified memory
‣ Use CL_MEM_ALLOC_HOST_PTR to avoid copy between CPU and GPU
• Mali has no local memory
• Adreno has local memory (1.5MB SRAM 115GB/s)
[Figure: three diagrams of host data pointers, global memory, CPU (Host) and GPU (Compute Device), one per allocation method]
• malloc(): buffers created by the user (malloc) are not mapped into the GPU memory space
• clCreateBuffer(CL_MEM_USE_HOST_PTR): creates a new buffer and copies the data over (but the copy operations are expensive)
• clCreateBuffer(CL_MEM_ALLOC_HOST_PTR): creates a buffer visible by both GPU and CPU
• Where possible don’t use CL_MEM_USE_HOST_PTR
‣ Create buffers at the start of your application
‣ Use CL_MEM_ALLOC_HOST_PTR instead of malloc()
‣ Then you can use the buffer on both CPU host and GPU
Aptina Sensor with MobileHDR™ Turned off
Aptina Sensor with MobileHDR™ Turned on
AR0833 8MP Camera sensor
• Frame is inscribed in a 1/3.2” circle
‣ 4:3 for images, e.g. 8MP 3264 x 2448
‣ 16:9 for video, e.g. 6MP 3264 x 1836
• 10 bits per pixel (stored in 16 bits)
• At 30 fps, 16:9 video needs about 180 MPix/s, i.e. about 343 MiB/s at 16 bits per pixel
• Interlaced HDR feature
• Interface with ISP
‣ Data over MIPI CSI-2 (serial)
‣ Control over I2C
[Figure: 4:3 (3264 x 2448) and 16:9 (3264 x 1836) frames inscribed in the 1/3.2" image circle]
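The slide's 343 MB/s and 180 MPix/s figures match the 16:9 video mode at 30 fps when each 10-bit pixel occupies 2 bytes and "MB" is read as 2^20 bytes. The short calculation below reproduces them and shows the 4:3 mode for comparison; the function name is illustrative.

```python
from typing import Tuple


def sensor_rate(width: int, height: int, fps: int, bytes_per_pixel: int = 2) -> Tuple[float, float]:
    """Return (pixel rate in MPix/s, data rate in MiB/s) for a raw stream.

    bytes_per_pixel=2 models 10-bit samples stored in 16-bit words.
    """
    pixels_per_s = width * height * fps
    bytes_per_s = pixels_per_s * bytes_per_pixel
    return pixels_per_s / 1e6, bytes_per_s / 2**20


video = sensor_rate(3264, 1836, 30)   # 16:9 video mode
still = sensor_rate(3264, 2448, 30)   # 4:3 full-resolution mode
print(f"16:9: {video[0]:.0f} MPix/s, {video[1]:.0f} MiB/s")   # 180 MPix/s, 343 MiB/s
print(f"4:3:  {still[0]:.0f} MPix/s, {still[1]:.0f} MiB/s")   # 240 MPix/s, 457 MiB/s
```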
Feature: Interlaced HDR
• 1 frame contains 2 exposures interlaced
• Ratio between odd and even pairs
‣ User controlled: 1x, 2x, 4x, 8x
Source: AR0833: 1/3.2-Inch 8Mp CMOS Digital Image Sensor data sheet (AR0833_DS, Rev. F, Pub. 4/13), Features, Interlaced HDR Readout:
The sensor enables HDR by outputting frames where even and odd row pairs within a single frame are captured at different integration times. This output is then matched with an algorithm designed to reconstruct this output into an HDR still image or video.
The sensor HDR is controlled by two shutter pointers (Shutter pointer1, Shutter pointer2) that control the integration of the odd (Shutter pointer1) and even (Shutter pointer 2) row pairs.
[Figure 16: HDR Integration Time. Labels: Tint 1, Tint 2, Sample pointer, Shutter pointer 1, Shutter pointer 2; I-FRAME 1 (Exposure 1) and I-FRAME 2 (Exposure 2) are interleaved in the output frame from the sensor.]
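A minimal sketch of what "row pairs at different integration times" means for software follows. It is not Aptina's reconstruction algorithm: the 0-based row numbering, which pair comes first, and the saturation-based merge rule are all assumptions made for illustration.

```python
from typing import List


def exposure_index(row: int) -> int:
    """0 for the first exposure, 1 for the second. Rows alternate in pairs:
    rows 0-1 -> 0, rows 2-3 -> 1, rows 4-5 -> 0, and so on.
    (0-based numbering and which pair comes first are assumptions.)"""
    return (row // 2) % 2


def merge_pixel(long_px: int, short_px: int, ratio: int, max_code: int = 1023) -> int:
    """Naive HDR merge of two 10-bit samples of the same scene point:
    trust the long exposure unless it is saturated, otherwise scale the
    short exposure by the exposure ratio (1, 2, 4 or 8)."""
    if long_px < max_code:
        return long_px
    return short_px * ratio


rows: List[int] = [exposure_index(r) for r in range(8)]
print(rows)
print(merge_pixel(500, 62, 8), merge_pixel(1023, 400, 8))
```

A real reconstruction also has to interpolate the rows each exposure is missing, since every row pair carries only one of the two exposures.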
mobileHDR demo
• Zero-copy between sensor/OpenCL and OpenCL/OpenGL
• On Arndale board (Samsung Exynos 5 Dual with Mali T604 GPU)
[Figure: pipeline diagram. Input: 10b iHDR, 3264x1836. OpenCL stages: Noise Reduction, iHDR Reconstruction, Bayer scaler, Tone Mapping, Color Correction, with a 14b intermediate. Output: RGB888, 1080p, passed as CL Image / EGLImage to a GL Texture in OpenGL ES.]
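Two of the stages lend themselves to a quick numeric sketch. The scale factor from the 3264x1836 input to 1080p (assumed here to mean 1920x1080) is the same in both directions, and tone mapping has to compress 14-bit HDR codes into 8 bits per channel. The log curve below is a generic stand-in, not the tone-mapping operator used in the demo, and the function name is illustrative.

```python
import math

# Bayer scaler: 3264x1836 input to 1080p output (1920x1080 is an assumption).
scale_x = 3264 / 1920
scale_y = 1836 / 1080


def tone_map(value: int, in_bits: int = 14, out_bits: int = 8) -> int:
    """Map a linear HDR code to a display code with a simple log curve."""
    in_max = (1 << in_bits) - 1
    out_max = (1 << out_bits) - 1
    v = min(max(value, 0), in_max)          # clamp to the valid input range
    return round(out_max * math.log1p(v) / math.log1p(in_max))


print(scale_x, scale_y)
print([tone_map(v) for v in (0, 16, 256, 4096, 16383)])
```

A global curve like this is one value in, one value out per pixel, which is the kind of data-parallel work that maps well onto GPU work-items.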
Summary
• Embedded GPUs are ideal candidates for computational imaging
‣ Performance at reasonable image size is now available
‣ Power efficiency is being addressed
• OpenCL 1.1 is available on all recent application processors
‣ But access may be reserved for OEMs
‣ Performance portability isn’t guaranteed (but that is true of any high-performance application)
• Opening the camera image-processing “black box” is now feasible, enabling incredible new applications
Khronos Camera
A standard to control image acquisition and processing.
Typical Imaging Pipeline
• Pre- and Post-processing can be done on CPU, GPU, DSP…
• ISP controls camera via 3A algorithms: Auto Exposure (AE), Auto White Balance (AWB), Auto Focus (AF)
• ISP may be a separate chip or within Application Processor
[Figure: pipeline blocks: Lens, CMOS sensor with Color Filter Array, Pre-processing, Image Signal Processor (ISP), Post-processing, App. Data formats: Bayer, RGB/YUV. Control path: 3A; lens, sensor, aperture control.]
Need for advanced camera control API:
- to drive more flexible app camera control
- over more types of camera sensors
- with tighter integration with the rest of the system
Advanced Camera Control Use Cases
• High-dynamic range (HDR) and computational flash photography
‣ High-speed burst with individual frame control over exposure and flash
• Rolling shutter elimination
‣ High-precision intra-frame synchronization between camera and motion sensor
• HDR Panorama, photo-spheres
‣ Continuous frame capture with constant exposure and white balance
• Subject isolation and depth detection
‣ High-speed burst with individual frame control over focus
• Time-of-flight or structured light depth camera processing
‣ Aligned stacking of data from multiple sensors
• Augmented Reality
‣ 60Hz, low-latency capture with motion sensor synchronization
‣ Multiple Region of Interest (ROI) capture
‣ Multiple sensors for scene scaling
‣ Detailed feedback on camera operation per frame
Camera API Architecture (FCAM based)
• No global state
‣ State travels with image requests
‣ Every stage in the pipeline may have different state
• -> allows fast, deterministic state changes
• Synchronize devices
‣ Lens, flash, sound capture, gyro…
‣ Devices can schedule Actions
• E.g. to be triggered on exposure change
• Enables device synchronization
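The "no global state" idea can be sketched as follows. The class names are loosely modeled on the FCam concepts of shots, frames, and actions, but every name, field, and the fake sensor here are hypothetical; this is not the FCam API or the Khronos Camera API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """Something a device (flash, lens, ...) does at an offset from exposure start."""
    offset_us: int
    run: Callable[[], None]


@dataclass
class Shot:
    """All capture state for one frame travels with the request."""
    exposure_us: int
    gain: float
    actions: List[Action] = field(default_factory=list)


@dataclass
class Frame:
    shot: Shot      # the exact request that produced this frame
    pixels: bytes


class FakeSensor:
    """Stand-in for a camera: each Shot affects only its own frame."""

    def capture(self, shot: Shot) -> Frame:
        for action in sorted(shot.actions, key=lambda a: a.offset_us):
            action.run()                     # a real pipeline would schedule these in time
        return Frame(shot=shot, pixels=b"")  # no real image data in this sketch


# An HDR-style burst: three exposures, with a flash fired only on the second.
log: List[str] = []
burst = [Shot(exposure_us=t, gain=1.0) for t in (1000, 4000, 16000)]
burst[1].actions.append(Action(0, lambda: log.append("flash")))
sensor = FakeSensor()
frames = [sensor.capture(s) for s in burst]
print([f.shot.exposure_us for f in frames], log)
```

Because each frame comes back tagged with the request that produced it, an application can change exposure on every frame without waiting for a global setting to take effect.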
Visual Sensor Revolution
• Single sensor RGB cameras are just the start of the mobile visual revolution
‣ IR sensors – LEAP Motion, eye-trackers
• Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras
‣ Stereo pair can enable object scaling and enhanced depth extraction
‣ Plenoptic Field processing needs FFTs and ray-casting
• Hybrid visual sensing solutions
‣ Different sensors mixed for different distances and lighting conditions
• GPUs today – more dedicated ISPs tomorrow?
[Images: Dual Camera (LG Electronics); Plenoptic Array (Pelican Imaging); Capri Structured Light 3D Camera (PrimeSense)]
Khronos APIs for Augmented Reality
[Figure: block diagram with labels: MEMS Sensors; Sensor Fusion; Camera Control API; Advanced Camera Control and stream generation; Vision Processing; Application on CPUs, GPUs and DSPs; 3D Rendering and Video Composition on GPU; Audio Rendering; EGLStream (stream data between APIs); precision timestamps on all sensor samples]
AR needs not just advanced sensor processing, vision acceleration, computation and rendering, but also for all these subsystems to work efficiently together.
Khronos Camera API
• Catalyze camera functionality not available on any current platform
‣ Open API that aligns with future platform directions for easy adoption
‣ E.g. could be used to implement future versions of Android Camera HAL
• Control multiple sensors with synch and alignment
‣ E.g. Stereo pairs, Plenoptic arrays, TOF or structured light depth cameras
• More detailed control per frame
‣ Format flexibility, Region of Interest (ROI) selection
• Global Timing & Synchronization
‣ E.g. Between cameras and MEMS sensors
• Application control over ISP processing (including 3A)
‣ Including multiple, re-entrant ISPs
• Flexible processing/streaming
‣ Multiple output streams and streaming rows (not just frames)
‣ RAW, Bayer and YUV Processing
Camera API Design Milestones and Philosophy
• C-language API starting from proven designs
‣ e.g. FCAM, Android camera HAL V3
• Design alignment with widely used hardware standards
‣ e.g. MIPI CSI
• Focus on mobile, power-limited devices
‣ But do not preclude other use cases such as automotive, surveillance, DSLR…
• Minimize overlap and maximize interoperability with other Khronos APIs
‣ But other Khronos APIs are not required
• Provide support for vendor-specific extensions
[Figure: milestone timeline. Dates shown: Apr13, Jul13, 4Q13, 1Q14, 2Q14, 3Q14. Milestones shown: Group charter approved; Provisional specification; First draft specification; Sample implementation and tests; Specification ratification.]
Questions & Answers
Thank you!