"taming the beast: performance and energy optimization across embedded feature detection and...
TRANSCRIPT
Copyright © 2014 Cadence Design Systems 1
Chris Rowen, Cadence Fellow
May 2014
Taming the Beast: Performance and Energy Optimization Across
Embedded Feature Detection and Tracking
Agenda
• What’s the problem in imaging and vision?
• An extended example: feature/gesture recognition/tracking pipeline
• A quick look at feature detectors
• A deep dive on connected component identification
• Mapping to a vision DSP: issues and opportunities
• Performance and energy optimization results
• Wrap-up
IVP-EP: Platform for Imaging Applications Everywhere
• Advanced Driver Assistance Systems (Auto ADAS)
  • Front-collision warning
  • Automatic high beam
  • Traffic sign detection/recognition
  • Lane tracking
  • Pedestrian detection and tracking
• Computer Vision (DTV, Tablets, PCs, Consumer Gaming)
  • Gesture control
  • Face detection, recognition and tracking
  • Scene analysis
  • Object detection, tracking and identification
  • Augmented reality registration
• Still Image and Video Capture (Handsets, Tablets, PCs, DSCs)
  • High dynamic range (HDR) image/video capture
  • Video pre-processing for improved encoding
  • Stabilization
  • Digital zoom
  • Low-light image enhancement
  • Digital effects photography
• Video Post-Processing (DTV, Mobile)
  • Decode artifact compensation
  • Scaling, frame rate adjustment
  • Sharpening
  • Display adaptation
  • 3D effects
Imaging Computation Chain — Sensor to Display
Imaging processors fit where performance and complexity collide:
• High performance is required (implying significant power consumption or latency sensitivity)
• Complex multi-YUV-frame algorithms require high non-local memory bandwidth
• New and complex algorithms are required in the Bayer domain for ISP performance
• Product differentiation depends on rapidly evolving, sometimes proprietary, algorithms or performance
• Complete applications are built by chaining a range of imaging/video functions
• The Future: Acceleration of demand from both RTL soft and CPU offload
[Figure: block diagram of the imaging computation chain, running from Sensor to Display and Storage. Recoverable labels, grouped by the nearest block name:]
• Single frame ISP (Bayer domain ISP processing): RGB Bayer filtering, defect correction, lens shading correction, demosaicing, sharpening, color/gamma correction, noise filtering, auto exposure correction, white balance, color space conversion, black level compensation, gain
• Multi-frame ISP: 3D noise reduction, still image stabilization, video stabilization, High Dynamic Range (HDR)
• Smart Photo: face, blink, and smile detection; red-eye reduction; skin beautification; panorama stitching
• Video Encode: JPEG compression, video compression, transcoding, 3D capture
• Decode: pixel processing, filtering (incl. deblocking), bitstream processing
• Video Post Process: frame rate conversion, de-interlacing, 3D noise filtering, display compensation
• Image/Video Analysis: edge/corner/blob detection, feature detection, object tracking, barcode/QR code detection, motion analysis, gesture detection, face/gesture recognition, scene analysis, augmented reality
• Other labelled functions: extended depth of field, scaling, image warping, super-resolution/digital zoom, multi-sensor array integration
• Control: 3A control (histogram, face, …), multi sensor processing (HDR, stereo depth)
• Region labels: Bayer domain ISP processing, Intelligent image post-processing
Example Gesture Recognition Pipeline
The pipeline moves from dense processing in the early stages to sparse processing in the later stages:
• Preprocess (denoise, contrast)
• Region of interest (motion detect, skin tone, etc.)
• Morphological operations (dilate, erode)
• Bounding box (connected components)
• Hand/pose detection (classifier)
• Hand tracking (mean shift, feature detection + optical flow)
• Gesture recognition
The Gesture Recognition Pipeline Visualized
[Figure panels: input image, denoising, background subtraction, dilate/erode, connected components, bounding box]
Connected Component Labelling
[Figure panels: input image, output image]
In the output image, each shade of gray represents a different object and is assigned a unique label (1, 2, 3, …)
Connected Components
• Each set of connected pixels is uniquely labeled
• Connectivity checks are performed on 4 or 8 neighbor pixels
• Input is usually binary (for example, all foreground pixels are considered connected) or grayscale (similar pixels are considered connected)
• Approaches to connected component labelling
  • Two-pass algorithm
    • First pass: scan and propagate labels top-down and left-right, give new labels to unconnected points, and maintain label mappings when two labels are found to be connected
    • Second pass: re-label using the label mappings from the first pass
    • Vector friendly; performance depends on the number of labels generated in the first pass
  • Single-pass algorithm
    • Often based on contour tracing
    • Not vector friendly; requires access to the entire image
  • A combination of the two approaches
[Figure panels: 8-connectivity, 4-connectivity. Images from Wikipedia]
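The two-pass approach can be sketched in plain Python as follows. This is a minimal scalar illustration, not the implementation used in the talk: the function name `label_components`, the list-of-rows image format, and the union-find bookkeeping for the label mappings are all assumptions made for the example.

```python
def label_components(image, connectivity=4):
    """Two-pass labelling of a binary image given as a list of rows (0 = background)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[i]: union-find parent of provisional label i (0 is background)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Already-visited neighbours: the row above and the pixel to the left.
    if connectivity == 4:
        offsets = [(-1, 0), (0, -1)]
    else:
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]

    # First pass: assign provisional labels and record label mappings.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            neighbours = []
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    neighbours.append(labels[rr][cc])
            if not neighbours:
                parent.append(len(parent))      # new provisional label
                labels[r][c] = len(parent) - 1
            else:
                smallest = min(neighbours)
                labels[r][c] = smallest
                for n in neighbours:
                    union(smallest, n)          # these labels are connected

    # Second pass: replace each provisional label with a consecutive final label.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels


example = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]
four = label_components(example, connectivity=4)
eight = label_components(example, connectivity=8)
```

On this example the diagonal pixels in the lower-left corner form separate components under 4-connectivity and a single component under 8-connectivity, which is the difference the two connectivity figures illustrate.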
Connected Components — Vector Processing
• Checking connections and propagating labels in the down, right-down, or left-down direction is easy. [Figure labels: current vector, above vector, left label]
• Checking connections and propagating labels in the left-right direction potentially requires sequential propagation. The ability to terminate once labels have stopped changing improves performance. [Figure label: current vector]
• Many vectors consist entirely of background pixels. The ability to skip entire vectors of background pixels improves performance. [Figure label: current vector]
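Two of the ideas above, stopping left-right propagation as soon as labels stop changing and skipping vectors that are entirely background, can be emulated in scalar Python. This is only a sketch of the control flow, not vector DSP code: the names `propagate_left_right` and `process_row`, the vector width `VEC`, and the choice to spread the minimum label across each run are assumptions made for illustration.

```python
VEC = 8  # hypothetical vector width in lanes


def propagate_left_right(pixels, labels):
    """Spread the minimum label across each horizontal run of foreground pixels.

    Each sweep stands in for one whole-vector step (shift, compare, select).
    Returns the stabilised labels and the number of sweeps that were needed.
    """
    labels = list(labels)
    sweeps = 0
    while True:
        sweeps += 1
        left = [0] + labels[:-1]    # label of the left neighbour (0 = none)
        right = labels[1:] + [0]    # label of the right neighbour (0 = none)
        new = []
        for p, own, l, r in zip(pixels, labels, left, right):
            if not p:
                new.append(0)       # background keeps label 0
                continue
            candidates = [own] + [x for x in (l, r) if x]
            new.append(min(candidates))
        if new == labels:           # labels stopped changing: terminate early
            return labels, sweeps
        labels = new


def process_row(pixels, labels, vec=VEC):
    """Process one row in vec-wide chunks, skipping all-background chunks."""
    out = [0] * len(pixels)
    skipped = 0
    span_start = None
    for start in list(range(0, len(pixels), vec)) + [len(pixels)]:
        busy = start < len(pixels) and any(pixels[start:start + vec])
        if busy and span_start is None:
            span_start = start      # a span of non-empty chunks begins here
        elif not busy:
            if span_start is not None:
                seg, _ = propagate_left_right(pixels[span_start:start],
                                              labels[span_start:start])
                out[span_start:start] = seg
                span_start = None
            if start < len(pixels):
                skipped += 1        # this chunk needed no work at all
    return out, skipped


row_pixels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
row_labels = [0, 0, 0, 0, 0, 5, 5, 5, 2, 2, 0, 0, 0, 0, 0, 0]
result, skipped_chunks = process_row(row_pixels, row_labels, vec=4)
```

The number of sweeps grows with the length of the longest run, which is why left-right propagation is the step that resists vectorization, while all-background chunks cost only a single test.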
Connected Components — Vector Processing
Re-labelling is a lookup operation, but the table may be large, depending on how many initial labels were assigned. Fast vector lookup operations can improve performance.
Initial labels: 1 2 2 2 5 5
Label map: 1->1, 2->1, 5->2
Final labels: 1 1 1 1 2 2
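The re-labelling step can be sketched as a dense table lookup, using the slide's own numbers. The helper names `build_table` and `relabel` are hypothetical, and leaving unmapped labels unchanged is an assumption of this sketch.

```python
def build_table(label_map, max_label):
    """Dense lookup table indexed by initial label; index 0 is background."""
    table = list(range(max_label + 1))   # assumption: unmapped labels stay as they are
    for src, dst in label_map.items():
        table[src] = dst
    return table


def relabel(labels, table):
    """Second pass: one table lookup per pixel."""
    return [table[label] for label in labels]


initial = [1, 2, 2, 2, 5, 5]
label_map = {1: 1, 2: 1, 5: 2}
table = build_table(label_map, max(initial))
final = relabel(initial, table)   # [1, 1, 1, 1, 2, 2], as on the slide
```

The table has one entry per possible initial label (six entries here, including background), so its size grows with the number of labels generated in the first pass rather than with the number of final objects.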
Connected Components — Vector Processing
Using these vector techniques, connected component performance improves by 30X over highly optimized scalar code.
Test image from http://pets2012.net/
Feature Detection
• Detection
  • Feature response: scale-space creation to search for keypoints (millions of pixels)
  • Find points with extrema in the scale-space response (millions of pixels input, a few hundred or thousand points output)
  • Feature localization: interpolation in scale/space and rejection of uninteresting features (a few hundred or thousand points scattered in scale and space)
• Descriptor
  • Feature descriptor creation (a few hundred or thousand points scattered in the image)
Feature Detection: Three Popular Approaches
• SIFT (Scale Invariant Feature Transform)
  • Uses difference of Gaussians to calculate the feature response
  • The filters used are symmetric; data is typically 8-bit
• SURF (Speeded Up Robust Features), Fast Hessian
  • Uses difference of boxes to calculate the feature response
  • Integral images speed up computing the sum of pixels in a box by reducing it to a few sums and differences
  • Integral data is usually 32-bit
• FAST (Features from Accelerated Segment Test)
  • Uses pixel intensity comparisons to find “interesting” points
  • Finds N contiguous pixels in a circle around the point of interest that are all brighter or all darker than the point of interest
  • Operations involve comparisons and bit manipulations; intensities are generally 8-bit
• A good architecture needs to support a range of data types and accelerate a range of operations
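The integral-image idea used by SURF can be sketched as follows. This is a minimal illustration in plain Python, not code from the talk; the function names, the zero-padded table layout, and the half-open box convention are assumptions of the example.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over rows < r and columns < c."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1.

    Four table reads, combined with sums and differences, regardless of box size.
    """
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
whole = box_sum(ii, 0, 0, 3, 3)        # sum of every pixel
lower_right = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9

# Worst-case total for an 8-bit 1920x1080 frame still fits in a signed 32-bit integer.
assert 1920 * 1080 * 255 < 2 ** 31
```

The final check shows why 32-bit integral data is a natural fit: even a fully saturated 8-bit 1080p frame sums to 528,768,000, far more than 16 bits can hold but comfortably within 32.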
Mapping to a Vision DSP: Issues and Opportunities
[Figure: IVP-EP subsystem organization. Recoverable block labels: instruction fetch/dispatch (variable length); instruction memory (configurable); data memory (configurable); Xtensa control processing (VLIW); scalar execution pipelines; mDMA; on-chip network bridge; load/store units; memory data rotator; cross-element select/reduction network; and repeated vector lanes, each containing a pixel register file, predicates, shift/select, MUL, shift, multiple ALUs, a pixel accumulator, and user-defined functional units]
1. Exploit data locality:
• Compiler-automated vectorization
• Tile manager runtime layer hides integrated mDMA programming
• Vector data types
• Extended native C operators
• State-of-the-art code scheduling
2. Leverage libraries:
• New mappings only when needed
• >700 OpenVX/OpenCV-based functions
3. Use tools in tuning:
• Instant single/multi-core simulation
• Multi-dimensional profiling
• Memory analysis
• User-defined ISA extension
IVP-EP Performance: Up to 4x Boost Over
Previous Generation IVP Processor
[Figure: bar chart of speed-up over IVP, vertical axis 0 to 5; bar labels are not recoverable. Two bars are marked with an asterisk.]
*with instruction set option package
Connected Components
Performance and energy comparison
[Figure: bar chart of relative performance (vertical axis 0 to 35) comparing a RISC core with IVP-EP on four metrics: frames per second, frames per watt (core), frames per watt (memory), and frames per watt (total)]
Lessons Learned
• More than enough interesting, hard vision problems
• Algorithms are diverse in structure and evolving
• Implementation approach must balance ease/agility of development against speed/efficiency of the result
• Object detection and tracking is a multi-phase algorithm, with different techniques for exploiting parallelism in each phase
• Tools matter
• Target architecture matters
• Libraries matter
• Superior frames per second, mW, and time-to-solution are achievable
Expanded Code Tuning Checklist
1. Measure reference cycle performance/quality.
2. Convert floating-point types to the lowest fixed-point precision that meets the image quality need.
3. Identify any inherent loop recurrences and move dependencies from inner loops to outer loops.
4. Decompose deep table lookups into a computed function of shallow table lookups.
5. Choose the best native scalar/vector data types.
6. Compile with auto-vectorization.
7. If necessary, add vector reorganization operations to maximize vector usage.
8. Use an abstract DMA or “tile manager” library to pre-load/post-store data in the background.
9. If multiprocessor (MP), partition the data and add task/communications library API calls to initiate tasks and coordinate computation across processors.
10. Use memory tools to validate stack/heap usage.
11. Measure and compare final performance/energy.
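The checklist item on converting floating point to the lowest adequate fixed-point precision can be illustrated with a small search over fractional bit widths. This is a sketch under stated assumptions rather than a recipe from the talk: the helper names, the 3-tap filter, the sample windows, and the use of worst-case absolute error as the quality measure are all hypothetical.

```python
def to_fixed(value, frac_bits):
    """Quantise a coefficient to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def filter_fixed(pixels, coeffs_q, frac_bits):
    """Integer multiply-accumulate, then round to nearest and scale back down."""
    acc = sum(p * c for p, c in zip(pixels, coeffs_q))
    return (acc + (1 << (frac_bits - 1))) >> frac_bits


def smallest_frac_bits(coeffs, samples, max_error, limit=15):
    """Return the fewest fractional bits whose worst-case error is within max_error."""
    for frac_bits in range(1, limit + 1):
        coeffs_q = [to_fixed(c, frac_bits) for c in coeffs]
        worst = max(
            abs(filter_fixed(s, coeffs_q, frac_bits)
                - sum(p * c for p, c in zip(s, coeffs)))
            for s in samples
        )
        if worst <= max_error:
            return frac_bits
    return None  # nothing within the limit met the quality target


coeffs = [0.25, 0.5, 0.25]                          # hypothetical 3-tap smoothing filter
samples = [(0, 255, 0), (255, 0, 255), (10, 20, 30)]  # hypothetical 8-bit pixel windows
bits = smallest_frac_bits(coeffs, samples, max_error=1.0)
```

In practice the quality measure would be whatever the application cares about (for example detection accuracy on reference images), measured against the floating-point reference recorded in the first checklist step.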
Resources
• Cadence Tensilica Imaging/Video Processor (IVP)
  • http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing
• Connected Component Labelling
  • http://en.wikipedia.org/wiki/Connected-component_labeling
• SIFT
  • Lowe, D.G. (2004), "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision 60 (2), http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/lowe_ijcv2004.pdf
  • http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  • http://w3.inf.fu-berlin.de/lehre/SS09/CV/uebungen/uebung09/SIFT.pdf
• SURF
  • http://www.vision.ee.ethz.ch/~surf/papers.html
• FAST
  • Rosten, Edward; Tom Drummond (2005), "Fusing points and lines for high performance tracking", IEEE International Conference on Computer Vision 2, http://edwardrosten.com/work/rosten_2005_tracking.pdf
  • http://en.wikipedia.org/wiki/Features_from_accelerated_segment_test#cite_note-2
© 2014 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and Xtensa are registered trademarks of Cadence
Design Systems, Inc. in the United States and other countries. All other trademarks are the property of their respective owners and are not affiliated with
Cadence.