fast 1d convolution algorithms - drone research
TRANSCRIPT
P. Giannakeris, P. Bassia, N. Kilis, R.T. ChadoulisProf. Ioannis Pitas
Aristotle University of Thessaloniki
www.aiia.csd.auth.grVersion 2.6
Fast 1D Convolution
Algorithms
Outline
β’ Signals and systems
β’ 1D convolutions
Linear & Cyclic 1D convolutions
Discrete Fourier Transform, Fast Fourier Transform
Winograd algorithm
Nested convolutions
Block convolutions.
β’ Machine learning
Convolutional neural networks.
Motivationβ’ Machine learning
Fast implementation of 1D/2D/3D convolutions in
Convolutional Neural Networks (CNNs).
β’ Fast implementation of 1D digital filters
1D signal filtering (e.g., audio/music, ECG, EEG)
1D Signal feature calculation
β’ Fast implementation of 1D correlation
1D template matching
Time-of-flight (distance) calculation (e.g., sonar)
Motivationβ’ Fast implementation of 2D/3D convolutions
Image/video filtering
Image/video feature calculation
β’ Gabor filters
β’ Spatiotemporal feature calculation
β’ Fast implementation of 2D correlation
Template matching
Correlation tracking
1D Linear Systems
β’ Linearity:
π ππ₯1 + ππ₯2 = ππ π₯1 + ππ π₯2 .
β’ Shift-Invariance:
π¦(π) = π[π₯(π)] β
π¦ π βπ = π π₯ π βπ .
LSI system convolution : π¦ π = β π β π₯ π .
π[π₯(π)]π₯(π) π¦(π)
1D Linear Systems
Linear 1D convolution
β’ The one-dimensional (linear) convolution of:
β’ an input signal π₯ of length πΏ and
β’ a convolution kernel β (filter mask, finite impulse response) of length π:
π¦ π = β π β π₯ π =
π=0
πβ1
β π π₯ π β π .
β’ For a convolution kernel centered around 0 and π = 2π£ + 1,
convolution takes the form:
π¦ π = β π β π₯ π =
π=βπ£
π£
β π π₯ π β π .
Linear 1D convolution
β’ The one-dimensional (linear) convolution is a linear operator.
β’ Signals π₯, β can be interchanged:
π¦ π = β π β π₯ π = π₯ π β β π .
β’ It can model any linear shift-invariant system.
β’ It has a geometrical interpretation:
β’ Signal is π₯ π flipped around π = 0 producing π₯ βπ .
β’ Flipped signal π₯ βπ is shifted by π samples, producing π₯ π β π .
β’ Products β π π₯ π β π , π = 0,β¦ ,π β 1 are calculated and summed.
β’ This is repeated for other temporal shifts π.
Linear 1D convolution β Example
Linear 1D convolution β Example
Linear 1D convolution β Example
Linear 1D convolution β Example
Linear 1D correlation
β’ Correlation of template signal β and input signal π₯ π (inner
product):
πβπ₯ π =
π=0
πβ1
β π π₯ π + π = πππ π .
β’ π = β 0 ,β¦ , β π β 1 π: template vector.
β’ π π = [π₯ π , ..., π₯ π + Ξ β 1 ]Ξ€ βΆ local signal vector.
Linear 1D correlation
Differences from convolution:
β’ π₯ π is not flipped around π = 0 .
β’ It is often confused with convolution: they are identical only
if β is centered at and is symmetric about π = 0.
β’ It is used for:
β’ 1D template matching;
β’ Time-delay estmation.
Linear 1D convolution: matrix
notation
β’ Vectorial convolution input/output, kernel representation:
π± = π₯ 0 ,β¦ , π₯(πΏ β 1) π: the first input vector.
π‘ = β 0 ,β¦ , β(π β 1) π: the second input vector.
π² = π¦ 0 ,β¦ , π¦ π β 1 π: the output vector, with π = πΏ +π β 1.
β’ 1D linear convolution between two discrete signals π±, π‘ can be
expressed as the matrix-vector product:π² = ππ±,
where π is a π Γ πΏ matrix.
Linear 1D convolution: matrix
notation
β’ π : a π Γ πΏ band matrix of the form:
π― =
β 0 0 β― 0β(1) β(0) β― β―
β― β― β― 0β(πβ1) β πβ1 β― β(0)
0 0 β― β(1)β― β― β― β―
0 0 β― β(πβ1)
.
β’ Alternative matrix notation: π² = ππ‘, where π is an N Γπ matrix.β’ Fast calculation of the product π² = ππ± using BLAS/cuBLAS.
Optimal algorithms: minimal
multiplicative complexityβ’ As multiplications are more expensive than additions, optimal algorithms
having minimal multiplicative complexity have been designed.
β’ Example: Complex number multiplication normally involves four
multiplications and two additions:
π + πΌπ = π + ππ π + ππ = ππ β ππ + ππ + ππ π.β’ Gauss trick: a way to reduce the number of multiplications to three
(optimal algorithm): ππ, ππ, π + π π + π ,as: πΌ = (π + π)(π + π) β ππ β ππ.
β’ Other formulation:
π1 = π π + π , π2= π π β π , π3 = π π + π ,π = π1 β π3,πΌ = π1 + π2.
Optimal Linear Convolution
Algorithmsβ’ Winograd theory for optimal linear convolution algorithms:
β’ Minimal number of multiplications.
β’ The computation of the first π outputs π¦ 0 ,β¦ , π¦(π β 1) of an π-tap filter
can be written as:
π¦ 0
π¦ 1β―
π¦(π β 1)
=
π₯ 0 π₯ 1 β― π₯(πβ1)
π₯ 1 π₯ 2 β― π₯(π)β― β― β―
π₯(πβ1) π₯ π β― π₯ π+πβ2
β 0β 1β―
β πβ1
.
β’ Notation consistent with literature.
Optimal Linear Convolution
Algorithms
β’ Linear convolution can be written as a product π² = ππ‘, where π is an π Γπ matrix:
πππ = π₯(π + π) , 0 β€ π β€ π β 1, 0 β€ π β€ π β 1.
β’ Example: for input π₯ π with size π = 4, kernel β π with size π = 3 and
output π¦ π with size π = 2, the matrix vector product:
π¦ 0
π¦ 1=
π₯ 0 π₯ 1 π₯ 2π₯ 1 π₯ 2 π₯ 3
β 0β 1β 2
,
requires 2 Γ 3 = 6 multiplications and 4 additions.
Optimal Linear Convolution
Algorithmsβ’ Example: optimal Winograd linear convolution algorithm.
β’ 8 additions and 4 multiplications.
β’ 3 additions on β can be pre-calculated:
π¦ 0
π¦ 1=
π₯ 0 π₯ 1 π₯ 2π₯ 1 π₯ 2 π₯ 3
β 0β 1β 2
=π1 +π2 +π3
π2 βπ3 βπ4,
where:
π1 = π₯ 0 β π₯ 2 β 0 , π2= π₯ 1 + π₯ 2β 0 +β 1 +β 2
2,
π4 = π₯ 1 β π₯ 3 β 2 , π3 = π₯ 2 β π₯ 1β 0 ββ 1 +β 2
2.
Optimal Linear Convolution
Algorithms
Intermediate addition result is used 2 times.
Optimal Linear Convolution
Algorithmsβ’ Winograd linear convolution algorithm requires π + π β 1multiplications,
π and π: lengths of π¦ and β, respectively.
β’ General form of optimal Winograd linear convolution algorithms:
π² = πT ππ‘ β πππ± ,
β indicates element-wise π + π β 1multiplications.
π±, π‘, π²: input signal, filter coefficient and output signal vectors.
Optimal Linear Convolution
Algorithms-ExampleInput signal, filter coefficient, output signal vectors for π = 3 and π = 2 :
π± = [π₯ 0 π₯ 1 π₯ 2 π₯ 3 ] π, π‘ = [β 0 β 1 β 2 ] π , π² = [π¦ 0 π¦ 1 ] π.
The involved matrices are:
πT =
1 00 1
β1 01 0
0 β10 1
1 00 β1
, π =
1 0 0Ξ€1 2 Ξ€1 2 Ξ€1 2Ξ€1 2 Ξ€β1 2 Ξ€1 20 0 1
,
πT =1 1 1 00 1 β1 β1
.
Cyclic 1D convolution
β’ One-dimensional cyclic convolution of length π :
π¦ π = π₯ π β β π =
π=0
πβ1
β π π₯ π β π π ,
(π)π= π mod π.
β’ It is of no use in modeling linear systems.
β’ Important use: Embedding linear convolution in a fast cyclic convolution
π¦ π = π₯ π β β π of length π β₯ πΏ +π β 1 and then performing a cyclic
convolution of length π.
Linear convolution embeddingβ’ Zero-padding signal vector π± to length π: π±p = π₯ 0 ,β¦ , π₯ π β 1 , 0, β¦ , 0 π.
β’ Zero-padding convolution kernel vector π‘ to length π: π‘p = β 0 ,β¦ , β πΏ β 1 , 0,β¦ , 0 π.
β’ Linear convolution embedding in a cyclic convolution of length π = πΏ +π β 1:
Cyclic 1D convolution - Example
Cyclic convolution of
π₯(π) = [1 2 0] and
β π = 3 5 4 .
Clock-wise
Anticlock-wise
Folded sequence
0 π ππππ 1 π πππ 2 π ππππ
1
20
3
5 4
3
5 4 3
5
4
4
3 50 0 02 22
1 1 1
π¦ 0 = 1 Γ 3 + 2 Γ 4 + 0 Γ 5 π¦ 1 = 1 Γ 5 + 2 Γ 3 + 0 Γ 4 π¦ 2 = 1 Γ 4 + 2 Γ 5 + 0 Γ 3
Cyclic 1D convolution: matrix
notationβ’ Cyclic convolution definition as matrix-vector product:
π² = ππ±,
where:
π± = π₯ 0 ,β¦ , π₯ π β 1 π: the input vector.
π² = π¦ 0 ,β¦ , π¦ π β 1 π: the output vector.
π : a N Γ N Toeplitz matrix of the form:
π― =
β(0) β(1) β(2)β(π β 1) β(0) β(1)β(π β 2) β(π β 1) β(0)
β―β―β―
β(π β 1)β(π β 2)β(π β 3)
β― β― β― β― β―β(1) β(2) β(3) β― β(0)
.
Cyclic Convolution via DFT
β’ Cyclic convolution calculation using 1D Discrete Fourier Transform (DFT):
π² = πΌπ·πΉπ π·πΉπ π± βπ·πΉπ π‘ .
β’ Fast calculation of DFT, IDFT through an FFT algorithm.
π·πΉπ
π·πΉπ
πΌπ·πΉπ
π₯(π)
β(π)
π(π)
π»(π)
Y(π) π¦(π)
1D FFT
β’ There are various FFT algorithms to speed up the calculation of DFT.
β’ The best known is the radix-2 decimation-in-time (DIT) Fast FourierTransform (FFT) (Cooley-Tuckey).
1. DFT of a sequence π₯ π of length π (π = 0,β¦ ,π β 1):
π π = Οπ=0πβ1 π₯ π πβ
2ππ
πππ , π = 0,β¦ ,π β 1.
π-th complex roots of unity: πππ = πβ
2ππ
ππ, π = 0,β¦ ,π β 1.
1D FFT2. The Radix-2 DIT algorithm rearranges the DFT:
π(π) =
π=0
π2β1
π₯ 2π πβ2πππ2
ππ
π·πΉπ ππ ππ£ππβπππππ₯ππ ππππ‘ ππ π₯(π)
+πβ2ππππ
π=0
π2β1
π₯ 2π + 1 πβ2πππ2
ππ
π·πΉπ ππ πππβπππππ₯ππ ππππ‘ ππ π₯(π)
= πΈ(π) + πβ2πππππ(π).
3. The complex exponential is periodic which means that:
π(π +π
2) =
π=0
π2β1
π₯ 2π πβ2ππΞ€π 2π(π+
π2) +πβ
2πππ (π+
π2)
π=0
π2β1
π₯ 2π + 1 πβ2ππΞ€π 2π(π+
π2)
=
π=0
π2β1
π₯ 2π πβ2ππΞ€π 2πππβ2πππ +πβ
2πππ π πβππ
π=0
π2β1
π₯ 2π + 1 πβ2ππΞ€π 2πππβ2πππ
=
π=0
π2β1
π₯ 2π πβ2ππΞ€π 2ππ
βπβ2πππ π
π=0
π2β1
π₯ 2π + 1 πβ2ππΞ€π 2ππ
= πΈ(π) β πβ2πππ ππ(π).
1D FFT
4. We can rewrite π as :
π π = πΈ π + πβ2πππ ππ π .
π π +π
2= πΈ π β πβ
2πππ ππ π .
5. By inserting a recursion, we can compute in that way π(π) and π(π +π
2) as well.
The algorithm gains its speed by re-using the results of intermediate computations to
compute multiple DFT outputs.
1D FFTβ’ radix-2 FFT breaks a length-N DFT into many size-2 DFTs called
"butterfly" operations.
β’ There are log2N FFT
stages.
Radix-2 1D FFT properties
β’ DFT length π should be a power of 2.
β’ The nice FFT structure is based on the properties of the π-th
complex roots of unity πππ = πβ
2ππ
ππ, π = 0,β¦ , π β 1.
β’ Computational complexity of the 1D FFT is π ππππ2π .
β’ Computational complexity of the 1D FFT is π(π2).
β’ Other FFT algorithms: Radix-4 FFT (π is power of 4),
Prime Factor Algorithm (PFA) π = ππ, π, π co-prime numbers.
Cyclic 1D convolution
computation
β’ Its computation by definition requires:
β’ Two nested for loops
β’ Computational complexity (number of multiplications) is π(π2).
β’ Computational complexity through two 1D FFTs and one inverse 1D
FFT is π ππππ2π .
β’ Fast linear convolution calculation through cyclic convolution
embedding.
π transform
π π§ =
π=0
πβ1
π₯ π π§βπ .
The π transform of a discrete signal π₯(π) having domain [0, β¦ , π] is
given by:
The domain of π transform is the complex plane, since π§ is a complex
number.
Convolution property of the π transform (polynomial product π(π§)π»(π§)):
π¦(π) = π₯(π) β β(π) β π(π§) = π(π§)π»(π§).
Cyclic convolution and π transform
π¦ π = π₯ π β β π =
π=0
πβ1
β π π₯ π β π π ,
where : (π)π = π mod π.
π¦ π = π₯ π β β π < => π π§ = π π§ π» π§ mod π§π β 1.
Polynomial product form of the 1D cyclic convolution:
Convolutions as polynomial
products-examplesβ’ 1D linear convolution π¦ π = β π β π₯ π of length πΏ = π = 3:
β’ 1D cyclic convolution π¦ π = π₯ π β β πof length π = 3:
π π§ = π π§ π» π§ mod π§3 β 1= π₯ 0 + π₯ 1 π§ + π₯ 2 π§2 β 0 + β 1 π§ + β 2 π§2 mod π§3 β 1.
Factorization: π§3 β 1 = π1 π§ π2 π§ = (π§ β 1)(π§2 + π§ + 1)(π£ = 2 factors).
π π§ = π π§ π» π§ = π₯ 0 + π₯ 1 π§ + π₯ 2 π§2 β 0 + β 1 π§ + β 2 π§2 .
Chinese Remainder Theorem (CRT)If we have a polynomial Y(z) and:
π π§ = π§π β 1, where the Greatest Common Divider πΊπΆπ· ππ π§ , ππ π§ = 1 for π β π.
π π§ =ΰ·
π=1
Ξ½
ππ π§ .
ππ π§ = π π§ mod ππ(π§).
0 < πππ π π§ β€ Οπ=1π πππ[ππ(π§)] , where deg is the degree of a polynomial.
π π π§ = πππ mod ππ(π§), 1 β€ π β€ π.
Then the polynomial π π§ is given by: π π§ = Οπ=1Ξ½ π π π§ ππ π§ mod π§π β 1
From CRT to Winograd
convolution algorithm
π(π§) = π»(π§)π(π§) mod π§π β 1.
A 'divide-and-conquer' approach can be used for creating fast convolution
algorithms, by factorizing polynomial π§π β 1 to its prime factors ππ π§ .
Then, we calculate first the items:
ππ π§ = π π§ mod ππ π§ , π = 1,β¦ , ππ¨π π§ = π¨ π§ mod ππ π§ , π = 1,β¦ , π
and their products:
ππ π§ = ππ π§ π»π π§ mod ππ π§ , π = 1,β¦ , π.
CRT helps us to create fast convolution algorithms by embedding linear
convolutions in cyclic ones:
Winograd algorithm
Fast 1D cyclic convolution with
minimal complexityβ’ Winograd convolution algorithms or fast
filtering algorithms:
π² = π ππ±β¨ππ‘ .
β’ They require only 2π β π£ multiplications in
their middle vector product, thus having
minimal complexity.
β’ π: number of cyclotomic polynomial
factors of polynomial π§π β 1 over the
rational numbers π.
Example: Winograd Cyclic
convolution Algorithm for N=3
( )( )3
3 3 2
1 3
/
1 ( ) 1 ( ) ( ) 1 1 1N
N
d
c N
z c z z c z c z z z z z=
β = β = β = β + +
( ) ( )3
3( ) ( ) ( ) mod 1 ( ) ( ) ( ) mod 1N
NY z H z X z z Y z H z X z z=
= β = β
1 232
0 1 2
0 0
( ) ( ) ( )N N
n n
n n
n m
X z x z X z x z X z x x z x zβ =
= =
= = = + +
1 232
0 1 2
0 0
( ) ( ) ( )N N
n n
n n
n m
H z h z H z h z H z h h z h zβ =
= =
= = = + +
( )( )2( ) ( ) ( ) mod 1 1Y z H z X z z z z= β + +
I. Definitions:
Example: Winograd Cyclic
convolution Algorithm for N=3
( ) ( ) ( )1
2
1 0 1 2 1 0 1 2mod mod 1i
i iX X z P X x x z x z z X x x x=
= = + + β = + +
( ) ( ) ( ) ( )2
2 2
2 0 1 2 2 0 2 1 2mod 1i
X x x z x z z z X x x x x z=
= + + + + = β + β
( ) ( )1 0 1 2 0 2 0 2 1 2 1 2Similarly for H : and i H h h h b H h h h h z b b z= + + = = β + β = +
0 1 2 3 1 0 2 2 1 2Let , and :a x x x a x x a x x= + + = β = β 1 0 2 1 2 and X a X a a z= = +
II. Computing the reduced Polynomials ππ , π»π:
Example: Winograd Cyclic
convolution Algorithm for N=3
( ) ( ) ( ) ( )1
1 0 0 1 0 0mod ( ) mod 1i
i i i iY X z H z P z Y a b z Y a b=
= = β =
( ) ( )
( ) ( ) ( )( ) ( )
2
2 2 2 2
2
2 1 2 1 2
2 1 1 2 2 1 2 2 1 2 2
mod ( )
mod 1
i
Y X z H z P z
Y a a z b b z z z
Y a b a b z a b a b a b
=
=
= + + + +
= β + + β
( ) ( )1 2 2 1 2 2 1 2 1 2 1 1 2One can see, that: , thus, Y becomes: a b a b a b a a b b a b+ β = β β β +
( ) ( ) ( )2 1 1 2 2 1 2 1 2 1 1Y a b a b z a a b b a b= β + β β β +
( ) ( )0 0 0 1 1 1 2 2 2 1 2 1 2 1 1Let c , c and ca b a b a b a a b b a b= = β = β β β +
1 0 2 1 2 and Y c Y c c z= = +
III. Computing the products ππ:
6 multiplications
4 multiplications
Example: Winograd Cyclic
convolution Algorithm for N=3
( ) ( ) 233
1 1 1
3 3 1
3 3
ppp c z c z z z
R R Rp
=β β β β β= = =
( ) ( ) ( ) ( )0 1 2
2 32
0 1 2 0 1 2 0 1 2
1
1mod 1 2 2
3
NN
i i
iy y y
Y RY z Y c c c c c c z c c c z=
=
= β = = + β + β + + β β
IV. Using the Chinese Remainder Theorem (CRT) to Compute π:
( ) ( ) 233
2 2 2
1
3 3
ppc z c z z z
R R Rp
= + += = =
Example: Winograd Cyclic
convolution Algorithm for N=3V. Finding Matrices π΄ and π΅:
0 0 1 20
0
1 0 21
1
2 1 22
2
3 0 11 2
1 1 1
1 0 1
0 1 1
1 1 0
x x x xax
x x xaA A x A
x x xax
x x xa a
+ + β β = = = β β
ββ β
Similarly, we get exactly the same result for matrix π΅!
Example: Winograd Cyclic
convolution Algorithm for N=3
( )
( ) ( )
( ) ( )
( ) ( )
1 0 1 2 0 0 1 1 2 2 1 2 1 2
2 0 1 2 0 0 1 1 2 2 1 2 1 23 1
3 0 1 2 0 0 1 1 2 2 1 2 1 2
2 2
2 2
2
y c c c a b a b a b a a b b
Y y c c c a b a b a b a a b b
y c c c a b a b a b a a b b
+ β + β + β β
= = β + = + + β β β β β β + + β β
( ) ( ) ( )
( )( )
0 0
0 0
1 1
1 14 1
2 2
2 2
1 2 1 2
1 1 1 1 1 1
1 0 1 1 0 1
0 1 1 0 1 1
1 1 0 1 1 0
a bx h
a bM A x B h x h
a bx h
a a b b
β β = = = β β β ββ β
VI. Finding matrix πΆ:
Example: Winograd Cyclic
convolution Algorithm for N=3
( ) ( ) ( )
( ) ( )
( ) ( )
( ) ( )( )( )
3 1 3 4 4 1
0 0
0 0 1 1 2 2 1 2 1 2 00 01 02 03
1 1
0 0 1 1 2 2 1 2 1 2 10 11 12 13
2 2
0 0 1 1 2 2 1 2 1 2 20 21 22 23
1 2 1 2
2
2
2
1 1 2 1
1 1
Y C M
a ba b a b a b a a b b C C C C
a ba b a b a b a a b b C C C C
a ba b a b a b a a b b C C C C
a a b b
C
=
+ β + β β
+ + β β β = β + + β β β β
β
= 1 2
1 2 1 1
β β
VI. Finding matrix πΆ:
Theoretically minimal number of multiplications: π!
Winograd algorithm version
with minimal computational
complexity β’ Can be equivalently expressed as:
π² = ππT ππ±β¨πTππ‘ .
β’ Matrices π, π typically have elements 0,+1,β1.
β’ Multiplications πTππ‘, ππTπ²β² are done only by
additions/subtractions.
β’ π is an π Γ π permutation matrix.
β’ πh can be precomputed.
Block diagram: Winograd Cyclic
convolution Algorithm for N=3
Block diagram: Winograd Cyclic
convolution Algorithm for N=3
Winograd algorithm
Fast 1D cyclic convolution with
minimal complexity
β’ Winograd algorithm works on small blocks of the input signal.
β’ The input block and filter are transformed.
β’ The outputs of the transform are multiplied together in an element-
wise fashion.
β’ The result is transformed back to obtain the outputs of the
convolution.
β’ GEneral Matrix Multiplication (GEMM) BLAS or CUBLAS routines can
be used.
Nested convolutions
β’ Winograd algorithms exist for relatively short convolution lengths,
e.g.: π = 3, 5, 7.
β’ Use of efficient short-length convolution algorithms iteratively to
build long convolutions.
β’ Does not achieve minimal multiplication complexity.
β’ Good balance between multiplications and additions.
Decomposition of 1D convolution into a 2D convolution:
β’ 1D convolution of length: π = π1π2β’ with π1, π2 co-prime integers, π1, π2 = 1
β’ results into a 2D π1 Γ π2 convolution.
Nested convolutions
β’ Remember circular convolution definition:
π¦(π) =
π=0
πβ1
β π π₯( π β π π), π = 0, . . , π β 1.
β’ If π = π1π2, π and π can be analyzed as:
π β‘ π1π2 + π2π1 mod π1π2 π1, π1 = 0,β¦ ,π1 β 1.π β‘ π1π2 + π2π1 mod π1π2 π2, π2 = 0,β¦ ,π2 β 1.
β’ 1D circular convolution becomes now a 2D π1 Γ π2 convolution:
π¦(π1π2 + π2π1) =
π1=0
π1β1
π2=0
π2β1
β π1π2 +π2π1 π₯(π1(π2βπ2) + π2(π1 β π1)).
Nested convolutions
β’ The number of multiplications required are π1π2, where, π1 and
π2 is the number of multiplications for convolutions π1 and π2accordingly.
β’ The same applies for the reverse decomposition: π2 Γ π1.
β’ However, the number of additions varies over the nesting order :π1 Γ π2: π΄1π2 +π1π΄2π2 Γ π1: π΄2π1 +π2π΄1,
with π΄1 and π΄2 being the number of additions required for π1 and π2accordingly.
β’ Additions must be calculated beforehand, so as to choose the
nesting order.
Nested convolutionsFor example, lets assume we have the circular convolution
of length π = 12 and π1 = 3,π2 = 4 :
then,π1, π1 = 0,1,2π2, π2 = [0, 1, 2, 3]
and
π0 π§ = π₯0 + π₯3π§ + π₯6π§2 + π₯9π§
3
π1 π§ = π₯4 + π₯7π§ + π₯10π§2 + π₯1π§
3
π2 π§ = π₯8 + π₯11π§ + π₯2π§2 + π₯5π§
3
π»0 π§ = β0 + β3π§ + β6π§2 + β9π§
3
π»1 π§ = β4 + β7π§ + β10π§2 + β1π§
3
π»2 π§ = β8 + β11π§ + β2π§2 + β5π§
3
Nested convolutions
π and π can be defined now as 3Γ4 matrices
with the polynomial terms as elements:
πΏ =
π₯0 π₯3 π₯6 π₯9π₯4 π₯7 π₯10 π₯1π₯8 π₯11 π₯2 π₯5
π― =
β0 β3 β6 β9β4 β7 β10 β1β8 β11 β2 β5
Nested convolutions
β’ We use Winograd 3-point circular convolution C, A, and B matrices at
the outer nesting layer:
π² = π ππβ¨ππ
β’ The matrix-vector products ππ and ππ, are replaced by matrix
multiplication ππ and ππ respectively, each one resulting to a 4Γ4
matrix.
β’ We use a fast 4-point Winograd routine to calculate the 4 circular
convolutions ππ β¨ ππ (one per row). The 4-point routine is βnestedβ
inside the 3-point routine.
β’ The correct order of the elements in the result can be found based on
the nesting formulation:
π¦π1π2+π2π1 .
Block Based convolutions
β’ Input signal x π is split in overlapping/non-overlapping blocks.
β’ Blocks are convolved independently.
β’ Great parallelism is achieved.
β’ Two block-based convolution methods:
β’ Overlap-add method
β’ Overlap-save method.
Overlap-add method
It uses the following signal processing principles:
β’ The linear convolution of two discrete-time signals of length πΏ and πresults in a discrete-time convolved signal of length π = πΏ +π β 1.
β’ Additivity:
[π₯1 π + π₯2 π ] β β π = π₯1 π β β π + π₯2 π β β π .
Overlap-add method
1. Break the input signal π₯(π) into non-overlapping blocks π₯π π of length πΏ.
2. Zero pad β π to be of length π = πΏ +π β 1.
Calculate the π·πΉπ π»(π),π = 0,1,β¦ ,π β 1.
3. For each block π:
β’ Zero pad π₯π π to be of length π = πΏ +π β 1. Calculate the DFT ππ π .
β’ Use the IDFT of π» π ππ π for block output π¦π π ,π = 0,1,β¦ ,π β 1.
4. Form overall output π¦ π by overlapping the last π β 1 samples of π¦π π with the
first π β 1 samples π¦π+1 π and adding the result.
Overlap-add methodπΏ πΏπΏ
π₯2 π
π₯3 π
π₯1 π
π¦2 π
π¦1 π
π¦3 π
(π β 1) zeros
(π β 1) zeros
(π β 1) zeros
(π β 1) points
(π β 1) points
(π β 1) points
(π β 1) points
(π β 1) points
Overlap-add method
Overlap-add method
Overlap-save method
It uses the following signal processing principles:
β’ The linear convolution of two discrete-time signals of length π and πresults in a discrete-time convolved signal of length π = πΏ +π β 1.
β’ Time-Domain Aliasing:
π₯π π =
π=ββ
β
π₯πΏ π β ππ , π = 0,1, β¦ , π β 1.
Overlap-save method
Overlap-Save method: 1. Pad π β 1 zeros at the beginning of the input sequence π₯(π).2. Break the padded input signal into overlapping blocks π₯π π of length π = πΏ +π β 1, where the overlap length is π β 1.
3. Zero-pad β(π) to be of length π = πΏ +π β 1.
Calculate the π·πΉπ π»(π),π = 0,1,β¦ ,π β 1.
4. For each block π:β’ Calculate the π·πΉπ ππ π ,π = 0,1,β¦ ,π β 1.
β’ Calculate the IDFT of: ππ π = ππ π π»(π), π = 0,1,β¦ ,π β 1 to get the block output
π¦π π ,π = 0,1,β¦ ,π β 1.
5. Discard the first π β 1 points of each output block π¦π π . Concatenate the remaining (i.e., last) πΏ samples of each block π¦π π to form the output π¦ π .
Overlap-save method
πΏ πΏ πΏ
(π β 1) zeros(π β 1) point overlap
(π β 1) point overlap
πΌπππ’π‘ π πππππ ππππππ
ππ’π‘ππ’π‘ π πππππ ππππππ
Discard (π β 1) points
Overlap-save method
Overlap-save method
Applications
β’ Convolutional neural networks
β’ Signal processing
Signal filtering
Signal restoration
Signal deconvolution
β’ Signal analysis
Time delay estimation
Distance calculation (e.g., sonar)
1D template matching
Convolutional Neural Networks
β’ Two step architecture:
β’ First layers with sparse NN connections: convolutions.
β’ Fully connected final layers.
β’ Need for fast convolution calculations.
Convergence of machine learning and signal processing
Deep Learning Frameworks
Image Source: Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Le - Performance Analysis of CNN Frameworks for GPUs
Convolutions in DL Frameworks
β’ All 5 frameworks work with cuDNN as backend.
β’ cuDNN unfortunately not open source.
β’ cuDNN supports FFT and Winograd convolutions.
Image Source: Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Le - Performance Analysis of CNN Frameworks for GPUs
Basic Exercises
1. Implement the definition of the 1D linear convolution algorithm
in C/C++ or Python.
2. Implement the definition of the 1D DFT algorithm in C/C++ or
Python.
3. Implement the 1D cyclic convolution algorithm for length π =2π in C/C++ using FFT.
4. Derive analytically the 1D Winograd cyclic convolution
algorithm for length π = 3.
5. Implement the 1D Winograd cyclic convolution algorithm for
length π = 3 in C/C++ or Python.
Advanced Exercises
1. Implement the 1D linear convolution for filter length M and
signal length πΏ in C/C++ using your own myFFT routine.
2. Implement a nested convolution algorithm for length π = π1π2in C/C++.
3. Study the properties of cyclotomic polynomials. Derive
analytically the 1D Winograd cyclic convolution algorithm for
length π = 4,π = 6 or π = 9.
4. Implement the 1D linear convolution algorithm for filter length
M and a large signal length πΏ using overlap-add in a) C/C++ or
b) CUDA.
References
β’ Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, p. 436, 2015.
β’ V. Oppenheim, Discrete-time signal processing, Pearson Education India, 1999.
β’ I. Pitas, Digital image processing algorithms and applications. John Wiley & Sons, 2000.
β’ S. Winograd, Arithmetic complexity of computations, vol. 33. Siam, 1980.
β’ J. H. McClellan and C. M. Rader, Number theory in digital signal processing," 1979.
β’ H. J. Nussbaumer, Fast Fourier transform and convolution algorithms.
β’ Springer Science & Business Media, 1981.
β’ I. Pitas and M. Strintzis, Multidimensional cyclic convolution algorithms with minimal multiplicative
complexity," IEEE transactions on acoustics, speech, and signal processing, vol. 35, no. 3, pp. 384-
390, 1987.
β’ A. Lavin and S. Gray, Fast algorithms for convolutional neural networks, Proceedings CVPR, pp.
4013-4021, 2016.