TRANSCRIPT
A High-Throughput Screening Approach to Discovering Good Forms of Biologically-Inspired Visual Representation

Nicolas Pinto (1), David Cox (2), and James DiCarlo (1)

1. The McGovern Institute for Brain Research and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
2. The Rowland Institute at Harvard
Take home message:
GPU Metaprogramming dramatically accelerates the discovery of bio-inspired vision models that beat state-of-the-art computer vision systems
1. Abstract

While many models of biological object recognition share a common set of "broad-stroke" properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model, e.g. the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large, and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea, or because the correct "parts" have not been tuned correctly, assembled at sufficient scale, or provided with enough training.

Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphics cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art vision systems from the literature.

As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinnings of biological vision.
5. High-throughput screening for good models

7. Analysis: Why are these models good?

3. Metaprogramming

2. A vast space of models to explore
Recipe:
1) Use scripting (e.g. Python) and templates
2) Instrument your code
3) Let the computer auto-tune it!
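The recipe can be sketched in pure Python. The names below (`make_kernel`, the unroll factors, the toy dot-product kernel standing in for a CUDA kernel) are illustrative, not from the poster; the point is the pattern of template, instrumentation, and auto-tuning:

```python
import timeit
from string import Template

# 1) Scripting + templates: a code template with one tunable parameter
#    (the loop unroll factor). A toy dot-product kernel stands in for
#    a real CUDA kernel here.
kernel_template = Template('''
def dot(a, b):
    s = 0.0
    n = len(a)
    i = 0
    while i + $unroll <= n:
$body
        i += $unroll
    while i < n:
        s += a[i] * b[i]
        i += 1
    return s
''')

def make_kernel(unroll):
    # Expand the template into real source, then exec it (metaprogramming).
    body = "\n".join(f"        s += a[i+{k}] * b[i+{k}]" for k in range(unroll))
    source = kernel_template.substitute(unroll=unroll, body=body)
    namespace = {}
    exec(source, namespace)
    return namespace["dot"]

a = [0.5] * 1024
b = [2.0] * 1024

# 2) Instrument the code: time every candidate on representative data.
candidates = {u: make_kernel(u) for u in (1, 2, 4, 8)}
timings = {u: timeit.timeit(lambda f=f: f(a, b), number=200)
           for u, f in candidates.items()}

# 3) Auto-tune: the computer picks the fastest variant.
best_unroll = min(timings, key=timings.get)
best = candidates[best_unroll]
```

Every generated variant computes the same result; only the timing decides which one survives, which is exactly what makes the search automatable.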
Generate random models → Unsupervised learning → Test with a "screening" object recognition task → Choose the best models → Validate on other tasks
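A minimal sketch of this screening loop (the parameter slice and `screen_score` below are hypothetical stand-ins; a real run trains each model unsupervised and tests it on the screening task):

```python
import random

# Hypothetical slice of the 52-dimensional parameter space (names and
# values are illustrative, patterned after the analysis panels).
PARAM_SPACE = {
    "preproc_zero_mean": [False, True],
    "l2_rf_size":        [3, 5, 7, 9],
    "l3_neighborhood":   [1, 3, 5, 7, 9],
    "n_filters":         [16, 32, 64, 128],
}

def generate_random_model(rng):
    # Step 1: draw one random instantiation of the architecture.
    return {name: rng.choice(values) for name, values in PARAM_SPACE.items()}

def screen_score(model, rng):
    # Stand-in for steps 2-3: unsupervised learning followed by the
    # "screening" object recognition task; here, a fake percent-correct.
    return rng.uniform(50, 100)

rng = random.Random(0)
models = [generate_random_model(rng) for _ in range(2500)]
scored = [(screen_score(m, rng), m) for m in models]
scored.sort(key=lambda sm: sm[0], reverse=True)

# Step 4: choose the best models; step 5 validates them on other tasks.
top5 = [m for _, m in scored[:5]]
```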
[Figure: unsupervised learning from a "Law & Order" video "world"; distribution of screening performance]
Performance / Cost

| Hardware | Year | Implementation | # cores used | GFLOPS (observed max / theoretical max) | Full system cost (approx.) | Relative speedup | Relative GFLOPS/$ |
|---|---|---|---|---|---|---|---|
| Intel Q9450 | 2008 | MATLAB/C (MEX) | 1 | 0.5 / 5.3 | $1,500** | 1x | 1 |
| Intel Q9450 | 2008 | MATLAB/C (MEX) | 4 | 2 / 21 | $2,700** | 4x | 2 |
| Intel Q9450 | 2008 | C/SSE2/pthreads | 4 | 40 / 85 | $1,000 | 80x | 120 |
| NVIDIA 7900 GTX | 2006 | Python/C++ (OpenGL/Cg) | 96 | 68 / 187 | $1,500-$3,000* | 136-544x* | 136-272* |
| Sony/IBM/Toshiba PlayStation 3 (Cell) | 2007 | Python/C (Cell SDK) | 8 (2+6) | 111 / 154 (mean 76, min 9) | $400 | 222x | 833 |
| NVIDIA 8800 GTX | 2007 | Python/C (PyCUDA) | 128 | 193 / 518 | $1,500-$3,000* | 386-1544x* | 386-772* |
| NVIDIA GTX 280 | 2008 | Python/C (PyCUDA) | 240 | 339 / 933 | $1,500-$3,000* | 678-2712x* | 678-1356* |

[Figure: theoretical vs. observed processing performance in GFLOPS (log scale, 10^-1 to 10^3), CPU vs. GPU implementations]
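The single-valued speedup and cost-efficiency entries follow directly from the observed GFLOPS and system cost, normalized to the 1-core MATLAB baseline (0.5 GFLOPS at ~$1,500); a quick arithmetic check:

```python
# (observed GFLOPS, approximate full system cost in $) from the table
baseline = (0.5, 1500)    # Intel Q9450, MATLAB/C (MEX), 1 core
sse2     = (40.0, 1000)   # Intel Q9450, C/SSE2/pthreads, 4 cores
ps3      = (111.0, 400)   # PlayStation 3 (Cell), Python/C (Cell SDK)

def relative_speedup(system):
    # Observed GFLOPS relative to the MATLAB baseline.
    return system[0] / baseline[0]

def relative_gflops_per_dollar(system):
    # GFLOPS-per-dollar relative to the MATLAB baseline.
    return (system[0] / system[1]) / (baseline[0] / baseline[1])

speedup_sse2 = relative_speedup(sse2)               # 80x, as in the table
per_dollar_sse2 = relative_gflops_per_dollar(sse2)  # ~120x
speedup_ps3 = relative_speedup(ps3)                 # 222x
per_dollar_ps3 = relative_gflops_per_dollar(ps3)    # ~832.5, the 833 entry
```

The starred ranges depend on which baseline configuration and system cost are assumed, so they are left as given.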
[Figure: histogram of screening performance (Percent Correct, 50-100) over N = 2500 randomly generated models, with the top 5 models highlighted]

• bio-inspired architecture
• 52 free parameters
• >10^25 possible models
Best Models

[Figure: screening performance (Percent Correct, 50-100) on Cars vs. Planes for the top 5 models (5, 4, 3, 2, 1) and their blend, across different initial weights and different videos]
• Why are the best models better?
• How can we adjust the parameter space to get farther?
[Figure: percent correct (50-90) as a function of individual parameter values, e.g. Preprocessing / Norm / Zero-mean (False vs. True), Layer 2 / Linear Filtering / RF size, and Layer 3 / Learning / Neighborhood size]
[Figure: Parameter similarity (top-five vs. rest): histogram of medians of Hamming distances from 5-choose-2 binarized parameter vectors; N = 100,000, p = 0.136]

[Figure: Representation similarity (top-five vs. rest): histogram of medians of Euclidean distances from 5-choose-2 similarity matrices; N = 100,000, p = 0.082]
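The parameter-similarity test can be sketched as follows. Random binary vectors stand in for the real binarized parameter vectors, and 2,000 permutations stand in for the N = 100,000 used above:

```python
import random
from itertools import combinations

def hamming(u, v):
    # Hamming distance between two binary parameter vectors.
    return sum(a != b for a, b in zip(u, v))

def median_pairwise_hamming(group):
    # Median over all 5-choose-2 = 10 pairwise distances in the group.
    d = sorted(hamming(u, v) for u, v in combinations(group, 2))
    mid = len(d) // 2
    return d[mid] if len(d) % 2 else 0.5 * (d[mid - 1] + d[mid])

rng = random.Random(0)
n_models, n_bits = 200, 52
models = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_models)]

top5 = models[:5]    # stand-in for the five best screened models
observed = median_pairwise_hamming(top5)

# Permutation test: how often is a random 5-subset at least as
# self-similar (median pairwise distance at least as small)?
n_perm = 2000
null = [median_pairwise_hamming(rng.sample(models, 5)) for _ in range(n_perm)]
p = sum(m <= observed for m in null) / n_perm
```

A small p would indicate that the top five models cluster in parameter space more tightly than random five-model subsets do.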
[Figure: Parameter significance: histogram of -ln(p) across parameters]

[Figure: Parameters by category (Filtering, Activation, Normalization, Pooling, Learning), split into low, medium, and high significance, with the expected fraction marked]
conv_kernel_template.cu
version A:
...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B (2x faster!):
...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

__global__ void convolve_beta_j${j}(float4 *input, float4 *output) {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        ...
Cheetah Templating
conv_kernel_4x4x4.cu (20 kB)
conv_kernel_8x8x4.cu (64 kB)
#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

__global__ void convolve_beta_j0(float4 *input, float4 *output) {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
    }
    if((threadIdx.x+128*1)<131)
    {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
    }
    __syncthreads();

    // -- compute dot products
    float v, w;
    float sum0 = 0; float sum1 = 0; float sum2 = 0; float sum3 = 0;
    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0]; sum0 += v*w;
    w = constant[0][0][1]; sum1 += v*w;
    w = constant[0][0][2]; sum2 += v*w;
    w = constant[0][0][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+1][0];
    w = constant[0][1][0]; sum0 += v*w;
    w = constant[0][1][1]; sum1 += v*w;
    w = constant[0][1][2]; sum2 += v*w;
    w = constant[0][1][3]; sum3 += v*w;
    v = shared_in[threadIdx.x+2][0];
    w = constant[0][2][0]; sum0 += v*w;
    ...
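The expansion from conv_kernel_template.cu to the unrolled .cu files above can be mimicked with Python's stdlib string templating. This is a sketch only: a toy kernel body stands in for the real convolution source, and the helper names are illustrative:

```python
from string import Template

# Toy stand-in for conv_kernel_template.cu: one kernel per row offset j,
# with the shared-memory load loop unrolled at generation time.
KERNEL = Template('''__global__ void convolve_beta_j${j}(float4 *input, float4 *output) {
    __shared__ float shared_in[${input_block_w}][4+1];
${loads}
    __syncthreads();
}
''')

LOAD = Template('    shared_in[threadIdx.x + ${block_w}*${i}][0] = /* load */ 0.0f;')

def generate_source(filter_h, filter_w, block_w):
    # Mirrors the template's "#for j in xrange($FILTER_H)" expansion.
    input_block_w = block_w + filter_w - 1
    load_iterations = -(-input_block_w // block_w)   # ceil division
    kernels = []
    for j in range(filter_h):
        loads = "\n".join(LOAD.substitute(block_w=block_w, i=i)
                          for i in range(load_iterations))
        kernels.append(KERNEL.substitute(j=j, input_block_w=input_block_w,
                                         loads=loads))
    return "\n".join(kernels)

small = generate_source(filter_h=4, filter_w=4, block_w=128)  # cf. conv_kernel_4x4x4.cu
large = generate_source(filter_h=8, filter_w=8, block_w=128)  # cf. conv_kernel_8x8x4.cu
```

Because all loop bounds are known at generation time, every loop can be fully unrolled in the emitted source, which is why the generated files grow with the filter size (20 kB vs. 64 kB above).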
4. GPU performance
6. Results

Validation
[Figure: validation performance (Percent Correct, 50-100) on four tasks: Cars vs. Planes (validation), Boats vs. Animals, Synthetic Faces, and MultiPIE Hybrid. Each panel compares a v1-like control, state-of-the-art systems from the literature (SIFT, GB, PHOG, PHOW, SLF; regular and blend), and the top 5 models from the high-throughput search (5, 4, 3, 2, 1, and their blend)]
We discovered models that beat state-of-the-art computer vision (object and face recognition).
[Figure: model architecture: input → L1 → L2 → L3 → read-out (standard "back-end" classifier). Each layer exposes free parameters: linear filtering (kernel size, number of filters), normalization (norm strength, threshold/saturation), and learning (rate, trace, "temp. adv.", "auto-reset", ...)]