Wikitude ARM Workshop
TRANSCRIPT
Martin Lechner, CTO
Utilizing NEON for Accelerated Computer Vision Processing in Augmented Reality Scenarios
Who is Wikitude?
Wikitude is the world's leading Augmented Reality ecosystem
● World-class team & technology
● A large and active developer community
● Leading developer and editorial tools for implementing AR applications
● High-profile monetization and distribution network
● Makers of the AR standard “ARML 2.0”
45,000+ registered AR developers
1,500+ AR apps
100+ countries
Wikitude’s Main Products
Wikitude SDK
Studio
Cloud Recognition
Targets API
Publishing App
Powered by World-Class AR Technology
World-class in-house IP bundled in a well-managed and proven product suite, plus AR content creation
Wikitude Computer Vision
● 2D Natural Feature Tracking
● Tracking in 6 Degrees of Freedom
● 3D scene and 3D object recognition and tracking
● Fully integrated into the existing Wikitude SDK and product suite
● Focus on both indoor and outdoor scenarios
● Improved robustness for
- Changing lighting conditions
- Moving objects
- Low-textured environments
● Optimized for mobile computing
- Mobile CPU Architectures (ARMv6, ARMv7, ARMv8)
- Vector Processing/SIMD (ARM NEON™)
- OpenGL ES (ARM Mali™)
- GPU Compute/OpenCL (ARM Mali)
Why utilize NEON in Image Processing?
• Development time well spent!
- Most state-of-the-art mobile devices run on chips based on ARMv7 or ARMv8 architectures
- Most of them include the NEON instruction set
• Image processing: a perfect match for SIMD
- Computationally expensive on the CPU
- Can run in parallel
- Simple operations
- The same operation is applied to multiple data sets (pixels or pixel ranges)
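A canonical example of this pattern is greyscale conversion, where one identical multiply-and-shift sequence is applied to every pixel. The scalar sketch below illustrates the idea; the fixed-point weights are a common BT.601-style approximation (77/150/29 out of 256), an assumption rather than coefficients from the Wikitude SDK:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar RGB-to-greyscale conversion: the same fixed-point
// multiply-and-shift is applied to every pixel, which is exactly the
// "one operation, many data sets" pattern that SIMD accelerates.
std::vector<uint8_t> toGreyscale(const std::vector<uint8_t>& rgb) {
    std::vector<uint8_t> grey(rgb.size() / 3);
    for (size_t i = 0; i < grey.size(); ++i) {
        uint32_t r = rgb[3 * i];
        uint32_t g = rgb[3 * i + 1];
        uint32_t b = rgb[3 * i + 2];
        // weighted sum in 8.8 fixed point, shifted back to 8 bits
        grey[i] = static_cast<uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
    }
    return grey;
}
```

A NEON version would process 8 or 16 pixels per iteration with the same arithmetic per lane.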
How to code for NEON
Intrinsics
• C library
- Contains vector data types and functions (intrinsics)
• Code is compiled to NEON instructions
• Easier to write and read
• Might result in less highly optimized code
Assembler
• Assembler code as you would expect it …
• A bit harder to maintain
• Full control over the optimizations
Why utilize NEON?
The computer vision process is a pipeline that contains many functions that can be SIMD-optimized:
1. Recognition
- Convert the camera image to greyscale
- Downsampling
- Analyze every pixel (range) in the image and perform operations (e.g. gradient image)
2. Tracking
- Calculate image similarities, e.g. Sum of Squared Differences (SSD)
- Matrix operations (pose calculation)
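The SSD similarity measure mentioned under tracking can be sketched in scalar C++ as follows; the function name and signature are illustrative, not taken from the Wikitude SDK:

```cpp
#include <cstddef>
#include <cstdint>

// Sum of Squared Differences between two greyscale patches of `len`
// pixels; lower values mean more similar patches. The per-pixel
// subtract/multiply/accumulate loop is the kind of workload that maps
// directly onto NEON's widening subtract and multiply-accumulate ops.
uint32_t ssd(const uint8_t* a, const uint8_t* b, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i) {
        int32_t d = static_cast<int32_t>(a[i]) - static_cast<int32_t>(b[i]);
        sum += static_cast<uint32_t>(d * d);
    }
    return sum;
}
```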
Example: Calculate Patch Cross Correlation
11
a[0] b[0] c[0] d[0] e[0] f[0] g[0] h[0]
… … … … … … … …
a[7] b[7] c[7] d[7] e[7] f[7] g[7] h[7]
a’[0] b’[0] c’[0] d’[0] e’[0] f’[0] g’[0] h’[0]
… … … … … … … …
a’[7] b’[7] c’[7] d’[7] e’[7] f’[7] g’[7] h’[7]
a[0]*a’[0] + … + h[0]*h’[0] +
+ … +
+ a[7]*a’[7] + … + h[7]*h’[7] sqrSum =
One Step: Calculate Squared Sum of Patches
Wrapper Logic
int calculateSqrSum (…){
  int sqrSum;
#if defined(NEON_AVAILABLE)
  if (!(size % 8)) {
    // too complex with assembler
    sqrSum = calculateSqrSum_neon_intrinsics(…);
  } else {
    sqrSum = calculateSqrSum_neon_assembly(…);
  }
#else
  sqrSum = calculateSqrSum_impl(…);
#endif
  return sqrSum;
}
C++ Implementation
int sqrSum = 0;
// row start offsets into the two images
int rowPtrBase1 = 0;
int rowPtrBase2 = 0;
// running pixel offsets within the current row
int rowPtr1 = 0;
int rowPtr2 = 0;
for (int rowIdx = 0; rowIdx < 8; rowIdx++) {
  rowPtr1 = rowPtrBase1;
  rowPtr2 = rowPtrBase2;
  // unrolled multiply-accumulate over the 8 pixels of the row
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
  // advance to the next row of each image
  rowPtrBase1 += strideWindow;
  rowPtrBase2 += strideTemplate;
}
return sqrSum;
Intrinsics
uint8x8_t a_loaded;
uint8x8_t b_loaded;
uint16x8_t res_loaded;
uint32x4_t allSum = vdupq_n_u32(0);
for (int rowIdx = 0; rowIdx < size; rowIdx++) {
  for (uint32_t i = 0; i < size; i += 8) {
    // load one row of pixels (8x8 bits) into NEON registers
    // (assuming row-major patches with row length `size`)
    a_loaded = vld1_u8(&(image1[rowIdx * size + i]));
    b_loaded = vld1_u8(&(image2[rowIdx * size + i]));
    // element-wise widening multiply (8x16 bits)
    res_loaded = vmull_u8(a_loaded, b_loaded);
    // pair-wise add and accumulate the result (4x32 bits)
    allSum = vpadalq_u16(allSum, res_loaded);
  }
}
return vgetq_lane_u32(allSum, 0) + vgetq_lane_u32(allSum, 1)
     + vgetq_lane_u32(allSum, 2) + vgetq_lane_u32(allSum, 3);
Intrinsics - Algorithm
Per row i, the two 8-pixel rows are multiplied element-wise, then the widened products are pair-wise added into four 32-bit accumulator lanes:

Row i of patch 1:   a[i]  b[i]  c[i]  d[i]  e[i]  f[i]  g[i]  h[i]
Row i of patch 2:   a'[i] b'[i] c'[i] d'[i] e'[i] f'[i] g'[i] h'[i]

Element-wise products (vmull_u8):
a''[i] = a[i]*a'[i], b''[i] = b[i]*b'[i], …, h''[i] = h[i]*h'[i]

Pair-wise add and accumulate (vpadalq_u16), one lane per product pair:
a''' += a''[i] + b''[i]
b''' += c''[i] + d''[i]
c''' += e''[i] + f''[i]
d''' += g''[i] + h''[i]

After all 8 rows, the four lanes are summed:
sqrSum = a''' + b''' + c''' + d'''
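The lane arithmetic above can be checked with a small scalar simulation of one vmull_u8/vpadalq_u16 step; the code is illustrative, not from the SDK:

```cpp
#include <array>
#include <cstdint>

// Simulate one NEON step: multiply two 8-lane rows element-wise
// (widening 8-bit pixels to 16-bit products), then pair-wise add
// adjacent products into a 4-lane 32-bit accumulator, matching the
// lane diagram above.
void accumulateRow(const std::array<uint8_t, 8>& row1,
                   const std::array<uint8_t, 8>& row2,
                   std::array<uint32_t, 4>& acc) {
    for (int lane = 0; lane < 4; ++lane) {
        // two adjacent element-wise products (16-bit, cannot overflow)
        uint16_t p0 = static_cast<uint16_t>(row1[2 * lane]) * row2[2 * lane];
        uint16_t p1 = static_cast<uint16_t>(row1[2 * lane + 1]) * row2[2 * lane + 1];
        // pair-wise add and accumulate into the 32-bit lane
        acc[lane] += static_cast<uint32_t>(p0) + p1;
    }
}
```

Calling this once per row and summing the four accumulator lanes yields the same sqrSum as the scalar C++ implementation.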
Assembly
#ifdef __aarch64__
#ifdef __APPLE__
#define IMAGE_LINE_0 v16
#else
#define IMAGE_LINE_0 V16.8B
#endif
#else
#define IMAGE_LINE_0 d16
#endif
LOAD_LINE IMAGE_LINE_0, 0
LOAD_LINE IMAGE_LINE_1, 1
CALC_LINE IMAGE_LINE_0, PATCH_LINE_0, 0
CALC_LINE IMAGE_LINE_1, PATCH_LINE_1, 1
LOAD_LINE IMAGE_LINE_2, 2
LOAD_LINE IMAGE_LINE_3, 3
CALC_LINE IMAGE_LINE_2, PATCH_LINE_2, 2
CALC_LINE IMAGE_LINE_3, PATCH_LINE_3, 3
Assembly Macros (Essential Parts)
.macro LOAD_LINE IMAGE_LINE line
#ifdef __aarch64__
#ifdef __APPLE__
LD1.8B { \IMAGE_LINE }, [IMAGE_PTR], STRIDE
#else
LD1 { \IMAGE_LINE }, [IMAGE_PTR], STRIDE
#endif
#else
vld1.u8 { \IMAGE_LINE }, [IMAGE_PTR], STRIDE
#endif
.endm
Runtimes
Test set:
• Nexus 4
• 1000 patches, 8x8 pixels each
• 4 test runs, average runtime calculated

                   C++        Intrinsics   Assembly
Absolute runtime   15.49 ms   12.84 ms     7.89 ms
Relative runtime   100%       82.89%       50.94%
Speedup            0%         17.11%       49.06%
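The relative-runtime and speedup rows follow arithmetically from the absolute times; the one-line helper below is illustrative, not part of the benchmark code:

```cpp
// Relative runtime as a percentage of the C++ baseline; the speedup
// row of the table is simply 100% minus this value.
double relativeRuntime(double timeMs, double baselineMs) {
    return 100.0 * timeMs / baselineMs;
}
```

For example, 12.84 ms against the 15.49 ms baseline gives 82.89% relative runtime, i.e. a 17.11% speedup, matching the table.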
What should run on NEON?
Just because you can run an algorithm on NEON doesn’t mean you should …
1. Analyze your bottlenecks
- Use profiling!
- Does it make sense to optimize the bottlenecks?
2. Analyze which bottlenecks can be optimized
3. Check whether the current implementation is already optimized
- Check for flaws in the code, e.g. copying too much data
4. Build prototypes with NEON intrinsics
5. If that is still not fast enough, use assembler
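A minimal way to follow step 1 without a full profiler is a micro-benchmark around the suspected hotspot; the helper below is a sketch (name and structure are illustrative, and a real profiler gives far more detail):

```cpp
#include <chrono>

// Average wall-clock time of a callable over several runs, in
// milliseconds, using the monotonic steady_clock. Useful for a quick
// before/after comparison of a candidate NEON optimization.
template <typename F>
double averageMillis(F&& fn, int runs) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) fn();
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / runs;
}
```

This mirrors the measurement setup of the runtimes slide (4 runs, averaged).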
Other ways to optimize
OpenCL
• Runs code on the GPU
• Low-level API framework standardized by Khronos
• Similar considerations apply: analyze your code and its optimization potential first!
• Not (widely) supported on mobile platforms yet