"taming the beast: performance and energy optimization across embedded feature detection and...

Copyright © 2014 Cadence Design Systems

Chris Rowen, Cadence Fellow

May 2014

Taming the Beast: Performance and Energy Optimization Across Embedded Feature Detection and Tracking


Agenda

• What’s the problem in imaging and vision?
• An extended example: feature/gesture recognition/tracking pipeline
• A quick look at feature detectors
• A deep dive on connected component identification
• Mapping to a vision DSP: issues and opportunities
• Performance and energy optimization results
• Wrap-up


IVP-EP: Platform for Imaging Applications Everywhere

Still Image and Video Capture (handsets, tablets, PCs, DSCs)
• High dynamic range (HDR) image/video capture
• Video pre-processing for improved encoding
• Stabilization
• Digital zoom
• Low-light image enhancement
• Digital effects photography

Video Post-Processing (DTV, mobile)
• Decode artifact compensation
• Scaling, frame rate adjustment
• Sharpening
• Display adaptation
• 3D effects

Computer Vision: Auto (ADAS, Advanced Driver Assistance Systems)
• Front-collision warning
• Automatic high beam
• Traffic sign detection/recognition
• Lane tracking
• Pedestrian detection and tracking

Computer Vision: DTV, tablets, PCs, consumer gaming
• Gesture control
• Face detection, recognition and tracking
• Scene analysis
• Object detection, tracking and identification
• Augmented reality registration


Imaging Computation Chain — Sensor to Display

Imaging processors fit where performance and complexity collide:

• High performance required (implying significant power consumption or latency sensitivity)
• Complex multi-frame YUV algorithms that require high non-local memory bandwidth
• New and complex algorithms are required in the Bayer domain for ISP performance
• Product differentiation depends on rapidly evolving, sometimes proprietary, algorithms or performance
• Complete applications are built by chaining a range of imaging/video functions
• The future: accelerating demand from both RTL soft and CPU offload

Figure: block diagram of the imaging computation chain, from sensor through ISP to display and storage. Labels recovered from the diagram:

• Sensor; single-frame ISP; multi-frame ISP
• Multi-frame ISP, Smart Photo and Video Encode blocks: 3D noise reduction, still image stabilization, video stabilization, high dynamic range (HDR), face/blink/smile detection, red-eye reduction, skin beautification, panorama stitching, JPEG compression, video compression, transcoding, 3D capture
• Decode: pixel processing, filtering (incl. deblocking), bitstream processing
• Video Post Process: frame rate conversion, de-interlacing, 3D noise filtering, display compensation
• Image/Video Analysis: edge/corner/blob detection, feature detection, object tracking, barcode/QR code detection, motion analysis, gesture detection, face/gesture recognition, scene analysis, augmented reality
• Bayer domain ISP processing: RGB Bayer filtering, defect correction, lens shading correction, demosaicing, sharpening, color/gamma correction, noise filtering, auto exposure correction, white balance, color space conversion, black level compensation, gain
• Intelligent image post-processing: extended depth of field, scaling, image warping, super-resolution/digital zoom, multi-sensor array integration
• 3A control (histogram, face, …); multi-sensor processing (HDR, stereo depth)
• Outputs: display, storage


Example Gesture Recognition Pipeline

1. Preprocess (denoise, contrast)
2. Region of interest (motion detect, skin tone, etc.)
3. Morphological operations (dilate, erode)
4. Bounding box (connected components)
5. Hand/pose detection (classifier)
6. Hand tracking (meanshift, feature detection + optical flow)
7. Gesture recognition

The stages progress from dense processing to sparse processing.


The Gesture Recognition Pipeline Visualized

Figure panels: input image, background subtraction, denoising, dilate/erode, connected components, bounding box.


Connected Component Labelling

Figure panels: input image and output image. In the output image, each shade of gray represents a different object and is assigned a unique label (1, 2, 3, …).


Connected Components

• Each set of connected pixels is uniquely labeled
• Connectivity checks are performed on 4 or 8 neighbor pixels
• Input is usually binary (for example, all foreground pixels are considered connected) or grayscale (similar pixels are considered connected)
• Approaches to connected component labelling
  • Two-pass algorithm
    • First pass: scan and propagate labels top-down and left-right, give new labels to unconnected points, and maintain label mappings when two labels are found to be connected
    • Second pass: re-label using the label mappings from the first pass
    • Vector friendly; performance depends on the number of labels generated in the first pass
  • Single-pass algorithm
    • Often based on contour tracing
    • Not vector friendly; requires access to the entire image
  • A combination of the two approaches

Figure: 4-connectivity and 8-connectivity neighborhoods (images from Wikipedia)
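The two-pass approach listed on this slide can be sketched in plain Python as follows. This is a scalar illustration only, not the vectorized DSP code the talk refers to. The function name `label_components` is hypothetical, and using a union-find structure to maintain the label mappings is an assumption made here; the slide does not say how the mappings are stored.

```python
def label_components(image, connectivity=4):
    """Two-pass connected component labelling of a binary image (list of rows)."""
    height = len(image)
    width = len(image[0]) if height else 0
    parent = [0]  # parent[i]: union-find parent of provisional label i (0 = background)

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Neighbours already visited in raster order: left and above,
    # plus the two upper diagonals for 8-connectivity.
    offsets = [(0, -1), (-1, 0)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1)]

    labels = [[0] * width for _ in range(height)]

    # First pass: propagate labels and record which labels are connected.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            neighbours = []
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny and 0 <= nx < width and labels[ny][nx]:
                    neighbours.append(labels[ny][nx])
            if not neighbours:
                parent.append(len(parent))  # new provisional label
                labels[y][x] = len(parent) - 1
            else:
                smallest = min(neighbours)
                labels[y][x] = smallest
                for other in neighbours:
                    union(smallest, other)

    # Second pass: replace each provisional label with a consecutive final label.
    final = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = final.setdefault(root, len(final) + 1)
    return labels
```

On a U-shaped input such as `[[1, 0, 1], [1, 1, 1]]`, the two arms receive different provisional labels in the first pass and are merged into a single label in the second pass.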


Connected Components — Vector Processing

• Checking connections and propagating labels in the down, right-down or left-down direction is easy (figure labels: above vector, current vector)
• Checking connections and propagating labels in the left-right direction potentially requires sequential propagation. The ability to terminate when labels have stopped changing improves performance (figure labels: left label, current vector)
• Many vectors are all background pixels. The ability to skip over entire vectors of background pixels improves performance (figure label: current vector)
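Two of the ideas above, skipping all-background vectors and stopping left-right propagation once labels stop changing, can be illustrated with a small Python model. Python lists stand in for vector registers here. `VECTOR_WIDTH`, `foreground_chunks` and `propagate_left_right` are hypothetical names chosen for this sketch and do not come from the IVP-EP toolchain.

```python
VECTOR_WIDTH = 8  # assumed lane count, for illustration only


def foreground_chunks(row, width=VECTOR_WIDTH):
    """Yield the start index of each vector-sized chunk with any foreground pixel."""
    for start in range(0, len(row), width):
        if any(row[start:start + width]):
            yield start


def propagate_left_right(pixels, labels):
    """Spread the smallest label along each horizontal run of foreground pixels.

    Every foreground pixel must start with a non-zero label and every
    background pixel with label 0. Each step is an element-wise minimum with
    the shifted left and right neighbours, the kind of operation a vector
    unit can apply to all lanes at once. Returns (labels, steps taken).
    """
    steps = 0
    while True:
        left = [0] + labels[:-1]
        right = labels[1:] + [0]
        updated = [
            min(v for v in (own, l, r) if v) if pixel else 0
            for pixel, own, l, r in zip(pixels, labels, left, right)
        ]
        steps += 1
        if updated == labels:  # labels stopped changing: terminate early
            return updated, steps
        labels = updated
```

For `pixels = [1, 1, 1, 0, 1, 1]` and `labels = [3, 1, 2, 0, 5, 4]`, the loop settles on `[1, 1, 1, 0, 4, 4]` after two steps, the second of which only confirms that nothing changed.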



Initial labels: 1 2 2 2 5 5
Label map: 1->1, 2->1, 5->2
Final labels: 1 1 1 1 2 2

Re-labelling is a lookup operation, but the table size may be large depending on how many initial labels were assigned. Fast vector lookup operations can improve performance.
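The re-labelling example on this slide can be reproduced with a dense lookup table. This is a minimal sketch: the names `build_table` and `relabel` are hypothetical, and the table is indexed directly by initial label, which is why its size grows with the largest label assigned in the first pass.

```python
def build_table(label_map):
    """Turn a sparse {initial: final} mapping into a dense lookup table."""
    table = [0] * (max(label_map) + 1)  # index 0 stays background
    for initial, final in label_map.items():
        table[initial] = final
    return table


def relabel(labels, table):
    """Second pass: one table lookup per pixel."""
    return [table[label] for label in labels]


initial_labels = [1, 2, 2, 2, 5, 5]
table = build_table({1: 1, 2: 1, 5: 2})
final_labels = relabel(initial_labels, table)  # [1, 1, 1, 1, 2, 2]
```

Even though only three labels are in use, the table needs six entries because the largest initial label is 5.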



Test image from http://pets2012.net/

Using these vector techniques, connected component labelling runs 30X faster than highly optimized scalar code.


Example Gesture Recognition Pipeline

1. Preprocess (denoise, contrast)
2. Region of interest (motion detect, skin tone, etc.)
3. Morphological operations (dilate, erode)
4. Bounding box (connected components)
5. Hand/pose detection (classifier)
6. Hand tracking (meanshift, feature detection + optical flow)
7. Gesture recognition


Feature Detection

Detection
• Feature response: scale-space creation to search for keypoints (operates on millions of pixels)
• Find points with extrema in the scale-space response (millions of pixels in, a few hundred or a few thousand points out)
• Feature localization (a few hundred or a few thousand points scattered in scale and space)
  • Interpolation in scale/space
  • Rejecting uninteresting features

Descriptor
• Feature descriptor creation (a few hundred or a few thousand points scattered in the image)


• SIFT (Scale-Invariant Feature Transform)
  • Uses difference of Gaussians to calculate feature response
  • The filters used are symmetric; data is typically 8-bit
• SURF (Speeded Up Robust Features), Fast Hessian
  • Uses difference of boxes to calculate feature response
  • Integral images speed up calculating the sum of pixels in a box by reducing it to a few sums and differences
  • Integral data is usually 32-bit
• FAST (Features from Accelerated Segment Test)
  • Uses pixel intensity comparisons to find “interesting” points
  • Finds N contiguous pixels in a circle around the point of interest that are either brighter or darker than the point of interest
  • Operations involve comparisons and bit manipulations; intensities are generally 8-bit
• A good architecture needs to support a range of data types and accelerate a range of operations
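The integral-image point in the SURF bullets can be made concrete with a short sketch. The function names `integral_image` and `box_sum` are hypothetical. The table carries an extra row and column of zeros, a common convention assumed here, so that any box sum needs exactly four table reads.

```python
def integral_image(image):
    """table[y][x] holds the sum of image[0:y][0:x]; row 0 and column 0 are zero."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
lower_right = box_sum(table, 1, 1, 3, 3)  # 5 + 6 + 8 + 9 = 28
```

The cost of `box_sum` does not depend on the box size. The 32-bit width mentioned in the slide follows from the worst case: for an 8-bit 1920x1080 frame (an example size chosen here) the largest entry is 255 × 1920 × 1080 = 528,768,000, which overflows 16 bits but fits in 32.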

Feature Detection — Three Popular Approaches
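The FAST segment test described above can be sketched as follows. The 16-pixel circle of radius 3 is the layout commonly used for FAST. The default threshold of 20 and the run length `n=9` are assumptions for illustration, since the slide only says "N contiguous pixels". The function names are hypothetical.

```python
# (dx, dy) offsets of the 16-pixel circle of radius 3, in clockwise order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def has_run(flags, n):
    """True if at least n consecutive flags are set, treating the list as circular."""
    run = 0
    for flag in flags + flags[:n - 1]:  # repeat the start to handle wrap-around
        run = run + 1 if flag else 0
        if run >= n:
            return True
    return False


def is_fast_corner(image, x, y, threshold=20, n=9):
    """Segment test at (x, y); the point must be at least 3 pixels from every border."""
    centre = image[y][x]
    ring = [image[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [p > centre + threshold for p in ring]
    darker = [p < centre - threshold for p in ring]
    return has_run(brighter, n) or has_run(darker, n)
```

The `brighter` and `darker` lists are simple comparison results, so on a vector machine they map naturally to predicate or bit masks, and the run check becomes a bit-manipulation problem.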


Mapping to a Vision DSP: Issues and Opportunities

Figure: IVP-EP subsystem organization. Labels recovered from the block diagram:

• Instruction fetch/dispatch: variable length
• Instruction memory: configurable
• Data memory: configurable
• Xtensa control processing (VLIW)
• Scalar execution pipelines
• Repeated parallel units, each with a pixel register file, predicates, shift/select, MUL, shift, multiple ALUs, a pixel accumulator and user-defined functional units (FUs)
• Load/store units
• Cross-element select/reduction network
• Memory data rotator
• mDMA and on-chip network bridge

1. Exploit data locality:
• Compiler-automated vectorization
• Tile manager runtime layer hides integrated mDMA programming
• Vector data types
• Extended native C operators
• State-of-the-art code scheduling

2. Leverage libraries:
• New mappings only when needed
• >700 OpenVX/OpenCV-based functions

3. Use tools in tuning:
• Instant single/multi-core simulation
• Multi-dimensional profiling
• Memory analysis
• User-defined ISA extension


IVP-EP Performance: Up to 4x Boost Over Previous Generation IVP Processor

Figure: bar chart of speed-up over IVP (vertical axis from 0 to 5). Bars marked * are measured with the instruction set option package.


Connected Components: Performance and Energy Comparison

Figure: bar chart of relative performance (vertical axis from 0 to 35) comparing a RISC core with IVP-EP on four metrics: frames per second, frames per watt (core), frames per watt (memory) and frames per watt (total).


Lessons Learned

• More than enough interesting, hard vision problems
• Algorithms are diverse in structure and evolving
• Implementation approach must balance ease/agility of development vs. speed/efficiency of result
• Object detection and tracking is a multi-phase algorithm, with different techniques for exploiting parallelism in each phase
• Tools matter
• Target architecture matters
• Libraries matter
• Superior frames per sec, mW and time-to-solution are achievable

Expanded Code Tuning Checklist

• Measure reference cycle performance/quality
• Convert floating-point types to the lowest fixed-point precision that meets the image quality need
• Identify any inherent loop recurrences and move dependencies from inner loops to outer loops
• Decompose deep table lookups into a computed function of shallow table lookups
• Choose the best native scalar/vector data types
• Compile with auto-vectorization
• If necessary, add vector reorganization operations to maximize vector usage
• Use the abstract DMA or “tile manager” library to pre-load/post-store data in the background
• If multiprocessor (MP), partition data and add task/communications library API calls to initiate tasks and coordinate computation across processors
• Use memory tools to validate stack/heap usage
• Measure and compare final performance/energy
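The checklist item on converting floating-point types to fixed point can be illustrated with a small, self-contained sketch. The example task (RGB to luma with the BT.601 weights 0.299, 0.587, 0.114) and the choice of 8 fractional bits are assumptions made here for illustration and are not taken from the talk. The point is the workflow: quantize the constants, then measure the error against the floating-point reference to check whether the quality need is met.

```python
FRACTION_BITS = 8                  # assumed precision (Q8)
SCALE = 1 << FRACTION_BITS         # 256
WEIGHTS_FLOAT = (0.299, 0.587, 0.114)
WEIGHTS_FIXED = tuple(round(w * SCALE) for w in WEIGHTS_FLOAT)  # (77, 150, 29)


def luma_float(r, g, b):
    """Floating-point reference."""
    wr, wg, wb = WEIGHTS_FLOAT
    return wr * r + wg * g + wb * b


def luma_fixed(r, g, b):
    """Integer-only version: multiply, add half an LSB for rounding, shift."""
    wr, wg, wb = WEIGHTS_FIXED
    return (wr * r + wg * g + wb * b + SCALE // 2) >> FRACTION_BITS


def max_error(step=15):
    """Largest difference from the rounded reference over a grid of 8-bit inputs."""
    values = range(0, 256, step)
    return max(abs(luma_fixed(r, g, b) - round(luma_float(r, g, b)))
               for r in values for g in values for b in values)
```

With these weights the fixed-point result stays within one gray level of the rounded reference. If that error were too large for the application, the next step would be to try more fractional bits before falling back to floating point.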


Resources

• Cadence Tensilica Imaging/Video Processor (IVP)
  • http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing
• Connected Component Labelling
  • http://en.wikipedia.org/wiki/Connected-component_labeling
• SIFT
  • Lowe, D.G. (2004). "Distinctive Image Features from Scale-Invariant Keypoints". International Journal of Computer Vision 60 (2). http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/lowe_ijcv2004.pdf
  • http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  • http://w3.inf.fu-berlin.de/lehre/SS09/CV/uebungen/uebung09/SIFT.pdf
• SURF
  • http://www.vision.ee.ethz.ch/~surf/papers.html
• FAST
  • Rosten, Edward; Tom Drummond (2005). "Fusing points and lines for high performance tracking". IEEE International Conference on Computer Vision 2. http://edwardrosten.com/work/rosten_2005_tracking.pdf
  • http://en.wikipedia.org/wiki/Features_from_accelerated_segment_test#cite_note-2

© 2014 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and Xtensa are registered trademarks of Cadence Design Systems, Inc. in the United States and other countries. All other trademarks are the property of their respective owners and are not affiliated with Cadence.
