fast 1d convolution algorithms - drone research

P. Giannakeris, P. Bassia, N. Kilis, R.T. ChadoulisProf. Ioannis Pitas

Aristotle University of Thessaloniki

[email protected]

www.aiia.csd.auth.grVersion 2.6

Fast 1D Convolution

Algorithms

Outline

• Signals and systems

• 1D convolutions

Linear & Cyclic 1D convolutions

Discrete Fourier Transform, Fast Fourier Transform

Winograd algorithm

Nested convolutions

Block convolutions.

• Machine learning

Convolutional neural networks.

Motivation• Machine learning

Fast implementation of 1D/2D/3D convolutions in

Convolutional Neural Networks (CNNs).

• Fast implementation of 1D digital filters

1D signal filtering (e.g., audio/music, ECG, EEG)

1D Signal feature calculation

• Fast implementation of 1D correlation

1D template matching

Time-of-flight (distance) calculation (e.g., sonar)

Motivation• Fast implementation of 2D/3D convolutions

Image/video filtering

Image/video feature calculation

• Gabor filters

• Spatiotemporal feature calculation

• Fast implementation of 2D correlation

Template matching

Correlation tracking

1D Linear Systems

• Linearity:

𝑇 𝑎𝑥1 + 𝑏𝑥2 = 𝑎𝑇 𝑥1 + 𝑏𝑇 𝑥2 .

• Shift-Invariance:

𝑦(𝑛) = 𝑇[𝑥(𝑛)] ⇒

𝑦 𝑛 −𝑚 = 𝑇 𝑥 𝑛 −𝑚 .

LSI system convolution : 𝑦 𝑘 = ℎ 𝑘 ∗ 𝑥 𝑘 .

𝑇[𝑥(𝑛)]𝑥(𝑛) 𝑦(𝑛)

1D Linear Systems

Linear 1D convolution

• The one-dimensional (linear) convolution of:

• an input signal 𝑥 of length 𝐿 and

• a convolution kernel ℎ (filter mask, finite impulse response) of length 𝑀:

𝑦 𝑘 = ℎ 𝑘 ∗ 𝑥 𝑘 =

𝑖=0

𝑀−1

ℎ 𝑖 𝑥 𝑘 − 𝑖 .

• For a convolution kernel centered around 0 and 𝑀 = 2𝑣 + 1,

convolution takes the form:

𝑦 𝑘 = ℎ 𝑘 ∗ 𝑥 𝑘 =

𝑖=−𝑣

𝑣

ℎ 𝑖 𝑥 𝑘 − 𝑖 .

Linear 1D convolution

• The one-dimensional (linear) convolution is a linear operator.

• Signals 𝑥, ℎ can be interchanged:

𝑦 𝑘 = ℎ 𝑘 ∗ 𝑥 𝑘 = 𝑥 𝑘 ∗ ℎ 𝑘 .

• It can model any linear shift-invariant system.

• It has a geometrical interpretation:

• Signal is 𝑥 𝑖 flipped around 𝑖 = 0 producing 𝑥 −𝑖 .

• Flipped signal 𝑥 −𝑖 is shifted by 𝑘 samples, producing 𝑥 𝑘 − 𝑖 .

• Products ℎ 𝑖 𝑥 𝑘 − 𝑖 , 𝑖 = 0,… ,𝑀 − 1 are calculated and summed.

• This is repeated for other temporal shifts 𝑘.

Linear 1D convolution – Example

Linear 1D correlation

• Correlation of template signal ℎ and input signal 𝑥 𝑘 (inner

product):

𝑟ℎ𝑥 𝑘 =

𝑖=0

𝑀−1

ℎ 𝑖 𝑥 𝑘 + 𝑖 = 𝒉𝑇𝒙 𝑘 .

• 𝒉 = ℎ 0 ,… , ℎ 𝑀 − 1 𝑇: template vector.

• 𝒙 𝑘 = [𝑥 𝑘 , ..., 𝑥 𝑘 + Μ − 1 ]Τ ∶ local signal vector.

Linear 1D correlation

Differences from convolution:

• 𝑥 𝑘 is not flipped around 𝑖 = 0 .

• It is often confused with convolution: they are identical only

if ℎ is centered at and is symmetric about 𝑖 = 0.

• It is used for:

• 1D template matching;

• Time-delay estmation.

Linear 1D convolution: matrix

notation

• Vectorial convolution input/output, kernel representation:

𝐱 = 𝑥 0 ,… , 𝑥(𝐿 − 1) 𝑇: the first input vector.

𝐡 = ℎ 0 ,… , ℎ(𝑀 − 1) 𝑇: the second input vector.

𝐲 = 𝑦 0 ,… , 𝑦 𝑁 − 1 𝑇: the output vector, with 𝑁 = 𝐿 +𝑀 − 1.

• 1D linear convolution between two discrete signals 𝐱, 𝐡 can be

expressed as the matrix-vector product:𝐲 = 𝐇𝐱,

where 𝐇 is a 𝑁 × 𝐿 matrix.

Linear 1D convolution: matrix

notation

• 𝐇 : a 𝑁 × 𝐿 band matrix of the form:

𝑯 =

ℎ 0 0 ⋯ 0ℎ(1) ℎ(0) ⋯ ⋯

⋯ ⋯ ⋯ 0ℎ(𝑀−1) ℎ 𝑀−1 ⋯ ℎ(0)

0 0 ⋯ ℎ(1)⋯ ⋯ ⋯ ⋯

0 0 ⋯ ℎ(𝑀−1)

.

• Alternative matrix notation: 𝐲 = 𝐗𝐡, where 𝐗 is an N ×𝑀 matrix.• Fast calculation of the product 𝐲 = 𝐇𝐱 using BLAS/cuBLAS.

Optimal algorithms: minimal

multiplicative complexity• As multiplications are more expensive than additions, optimal algorithms

having minimal multiplicative complexity have been designed.

• Example: Complex number multiplication normally involves four

multiplications and two additions:

𝑅 + 𝐼𝑖 = 𝑎 + 𝑏𝑖 𝑐 + 𝑑𝑖 = 𝑎𝑐 − 𝑏𝑑 + 𝑏𝑐 + 𝑎𝑑 𝑖.• Gauss trick: a way to reduce the number of multiplications to three

(optimal algorithm): 𝑎𝑐, 𝑏𝑑, 𝑎 + 𝑏 𝑐 + 𝑑 ,as: 𝐼 = (𝑎 + 𝑏)(𝑐 + 𝑑) − 𝑎𝑐 − 𝑏𝑑.

• Other formulation:

𝑘1 = 𝑐 𝑎 + 𝑏 , 𝑘2= 𝑎 𝑑 − 𝑐 , 𝑘3 = 𝑏 𝑐 + 𝑑 ,𝑅 = 𝑘1 − 𝑘3,𝐼 = 𝑘1 + 𝑘2.

Optimal Linear Convolution

Algorithms• Winograd theory for optimal linear convolution algorithms:

• Minimal number of multiplications.

• The computation of the first 𝑚 outputs 𝑦 0 ,… , 𝑦(𝑚 − 1) of an 𝑛-tap filter

can be written as:

𝑦 0

𝑦 1⋯

𝑦(𝑚 − 1)

=

𝑥 0 𝑥 1 ⋯ 𝑥(𝑛−1)

𝑥 1 𝑥 2 ⋯ 𝑥(𝑛)⋯ ⋯ ⋯

𝑥(𝑚−1) 𝑥 𝑚 ⋯ 𝑥 𝑚+𝑛−2

ℎ 0ℎ 1⋯

ℎ 𝑛−1

.

• Notation consistent with literature.


Algorithms

• Linear convolution can be written as a product 𝐲 = 𝐗𝐡, where 𝐗 is an 𝑚 ×𝑛 matrix:

𝑋𝑖𝑗 = 𝑥(𝑖 + 𝑗) , 0 ≤ 𝑖 ≤ 𝑚 − 1, 0 ≤ 𝑗 ≤ 𝑛 − 1.

• Example: for input 𝑥 𝑘 with size 𝑛 = 4, kernel ℎ 𝑘 with size 𝑚 = 3 and

output 𝑦 𝑘 with size 𝑟 = 2, the matrix vector product:

𝑦 0

𝑦 1=

𝑥 0 𝑥 1 𝑥 2𝑥 1 𝑥 2 𝑥 3

ℎ 0ℎ 1ℎ 2

,

requires 2 × 3 = 6 multiplications and 4 additions.


Algorithms• Example: optimal Winograd linear convolution algorithm.

• 8 additions and 4 multiplications.

• 3 additions on ℎ can be pre-calculated:

𝑦 0

𝑦 1=

𝑥 0 𝑥 1 𝑥 2𝑥 1 𝑥 2 𝑥 3

ℎ 0ℎ 1ℎ 2

=𝑚1 +𝑚2 +𝑚3

𝑚2 −𝑚3 −𝑚4,

where:

𝑚1 = 𝑥 0 − 𝑥 2 ℎ 0 , 𝑚2= 𝑥 1 + 𝑥 2ℎ 0 +ℎ 1 +ℎ 2

2,

𝑚4 = 𝑥 1 − 𝑥 3 ℎ 2 , 𝑚3 = 𝑥 2 − 𝑥 1ℎ 0 −ℎ 1 +ℎ 2

2.


Algorithms

Intermediate addition result is used 2 times.


Algorithms• Winograd linear convolution algorithm requires 𝑚 + 𝑟 − 1multiplications,

𝑚 and 𝑟: lengths of 𝑦 and ℎ, respectively.

• General form of optimal Winograd linear convolution algorithms:

𝐲 = 𝐀T 𝐇𝐡 ⊗ 𝐁𝐓𝐱 ,

⊗ indicates element-wise 𝑚 + 𝑟 − 1multiplications.

𝐱, 𝐡, 𝐲: input signal, filter coefficient and output signal vectors.


Algorithms-ExampleInput signal, filter coefficient, output signal vectors for 𝑚 = 3 and 𝑟 = 2 :

𝐱 = [𝑥 0 𝑥 1 𝑥 2 𝑥 3 ] 𝑇, 𝐡 = [ℎ 0 ℎ 1 ℎ 2 ] 𝑇 , 𝐲 = [𝑦 0 𝑦 1 ] 𝑇.

The involved matrices are:

𝐁T =

1 00 1

−1 01 0

0 −10 1

1 00 −1

, 𝐇 =

1 0 0Τ1 2 Τ1 2 Τ1 2Τ1 2 Τ−1 2 Τ1 20 0 1

,

𝐀T =1 1 1 00 1 −1 −1

.

Cyclic 1D convolution

• One-dimensional cyclic convolution of length 𝑁 :

𝑦 𝑘 = 𝑥 𝑘 ⊛ ℎ 𝑘 =

𝑖=0

𝑁−1

ℎ 𝑖 𝑥 𝑘 − 𝑖 𝑁 ,

(𝑘)𝑁= 𝑘 mod 𝑁.

• It is of no use in modeling linear systems.

• Important use: Embedding linear convolution in a fast cyclic convolution

𝑦 𝑛 = 𝑥 𝑛 ⊛ ℎ 𝑛 of length 𝑁 ≥ 𝐿 +𝑀 − 1 and then performing a cyclic

convolution of length 𝑁.

Linear convolution embedding• Zero-padding signal vector 𝐱 to length 𝑁: 𝐱p = 𝑥 0 ,… , 𝑥 𝛭 − 1 , 0, … , 0 𝑇.

• Zero-padding convolution kernel vector 𝐡 to length 𝑁: 𝐡p = ℎ 0 ,… , ℎ 𝐿 − 1 , 0,… , 0 𝑇.

• Linear convolution embedding in a cyclic convolution of length 𝑁 = 𝐿 +𝑀 − 1:

Cyclic 1D convolution - Example

Cyclic convolution of

𝑥(𝑛) = [1 2 0] and

ℎ 𝑛 = 3 5 4 .

Clock-wise

Anticlock-wise

Folded sequence

0 𝑠𝑝𝑖𝑛𝑠 1 𝑠𝑝𝑖𝑛 2 𝑠𝑝𝑖𝑛𝑠

1

20

3

5 4

3

5 4 3

5

4

4

3 50 0 02 22

1 1 1

𝑦 0 = 1 × 3 + 2 × 4 + 0 × 5 𝑦 1 = 1 × 5 + 2 × 3 + 0 × 4 𝑦 2 = 1 × 4 + 2 × 5 + 0 × 3

Cyclic 1D convolution: matrix

notation• Cyclic convolution definition as matrix-vector product:

𝐲 = 𝐇𝐱,

where:

𝐱 = 𝑥 0 ,… , 𝑥 𝑁 − 1 𝑇: the input vector.

𝐲 = 𝑦 0 ,… , 𝑦 𝑁 − 1 𝑇: the output vector.

𝐇 : a N × N Toeplitz matrix of the form:

𝑯 =

ℎ(0) ℎ(1) ℎ(2)ℎ(𝑁 − 1) ℎ(0) ℎ(1)ℎ(𝑁 − 2) ℎ(𝑁 − 1) ℎ(0)

⋯⋯⋯

ℎ(𝑁 − 1)ℎ(𝑁 − 2)ℎ(𝑁 − 3)

⋯ ⋯ ⋯ ⋯ ⋯ℎ(1) ℎ(2) ℎ(3) ⋯ ℎ(0)

.

Cyclic Convolution via DFT

• Cyclic convolution calculation using 1D Discrete Fourier Transform (DFT):

𝐲 = 𝐼𝐷𝐹𝑇 𝐷𝐹𝑇 𝐱 ⊗𝐷𝐹𝑇 𝐡 .

• Fast calculation of DFT, IDFT through an FFT algorithm.

𝐷𝐹𝑇

𝐷𝐹𝑇

𝐼𝐷𝐹𝑇

𝑥(𝑛)

ℎ(𝑛)

𝑋(𝑘)

𝐻(𝑘)

Y(𝑘) 𝑦(𝑛)

1D FFT

• There are various FFT algorithms to speed up the calculation of DFT.

• The best known is the radix-2 decimation-in-time (DIT) Fast FourierTransform (FFT) (Cooley-Tuckey).

1. DFT of a sequence 𝑥 𝑛 of length 𝑁 (𝑛 = 0,… ,𝑁 − 1):

𝑋 𝑘 = σ𝑛=0𝑁−1 𝑥 𝑛 𝑒−

2𝜋𝑖

𝑁𝑛𝑘 , 𝑘 = 0,… ,𝑁 − 1.

𝑁-th complex roots of unity: 𝑊𝑁𝑛 = 𝑒−

2𝜋𝑖

𝑁𝑛, 𝑛 = 0,… ,𝑁 − 1.

1D FFT2. The Radix-2 DIT algorithm rearranges the DFT:

𝑋(𝑘) =

𝑚=0

𝑁2−1

𝑥 2𝑚 𝑒−2𝜋𝑖𝑁2

𝑚𝑘

𝐷𝐹𝑇 𝑜𝑓 𝑒𝑣𝑒𝑛−𝑖𝑛𝑑𝑒𝑥𝑒𝑑 𝑝𝑎𝑟𝑡 𝑜𝑓 𝑥(𝑛)

+𝑒−2𝜋𝑖𝑁𝑘

𝑚=0

𝑁2−1

𝑥 2𝑚 + 1 𝑒−2𝜋𝑖𝑁2

𝑚𝑘

𝐷𝐹𝑇 𝑜𝑓 𝑜𝑑𝑑−𝑖𝑛𝑑𝑒𝑥𝑒𝑑 𝑝𝑎𝑟𝑡 𝑜𝑓 𝑥(𝑛)

= 𝐸(𝑘) + 𝑒−2𝜋𝑖𝑁𝑘𝑂(𝑘).

3. The complex exponential is periodic which means that:

𝑋(𝑘 +𝑁

2) =

𝑚=0

𝑁2−1

𝑥 2𝑚 𝑒−2𝜋𝑖Τ𝑁 2𝑚(𝑘+

𝑁2) +𝑒−

2𝜋𝑖𝑁 (𝑘+

𝑁2)

𝑚=0

𝑁2−1

𝑥 2𝑚 + 1 𝑒−2𝜋𝑖Τ𝑁 2𝑚(𝑘+

𝑁2)

=

𝑚=0

𝑁2−1

𝑥 2𝑚 𝑒−2𝜋𝑖Τ𝑁 2𝑚𝑘𝑒−2𝜋𝑚𝑖 +𝑒−

2𝜋𝑖𝑁 𝑘 𝑒−𝜋𝑖

𝑚=0

𝑁2−1

𝑥 2𝑚 + 1 𝑒−2𝜋𝑖Τ𝑁 2𝑚𝑘𝑒−2𝜋𝑚𝑖

=

𝑚=0

𝑁2−1

𝑥 2𝑚 𝑒−2𝜋𝑖Τ𝑁 2𝑚𝑘

−𝑒−2𝜋𝑖𝑁 𝑘

𝑚=0

𝑁2−1

𝑥 2𝑚 + 1 𝑒−2𝜋𝑖Τ𝑁 2𝑚𝑘

= 𝐸(𝑘) − 𝑒−2𝜋𝑖𝑁 𝑘𝑂(𝑘).

1D FFT

4. We can rewrite 𝑋 as :

𝑋 𝑘 = 𝐸 𝑘 + 𝑒−2𝜋𝑖𝑁 𝑘𝑂 𝑘 .

𝑋 𝑘 +𝑁

2= 𝐸 𝑘 − 𝑒−

2𝜋𝑖𝑁 𝑘𝑂 𝑘 .

5. By inserting a recursion, we can compute in that way 𝑋(𝑘) and 𝑋(𝑘 +𝑁

2) as well.

The algorithm gains its speed by re-using the results of intermediate computations to

compute multiple DFT outputs.

1D FFT• radix-2 FFT breaks a length-N DFT into many size-2 DFTs called

"butterfly" operations.

• There are log2N FFT

stages.

Radix-2 1D FFT properties

• DFT length 𝑁 should be a power of 2.

• The nice FFT structure is based on the properties of the 𝑁-th

complex roots of unity 𝑊𝑁𝑛 = 𝑒−

2𝜋𝑖

𝑁𝑛, 𝑛 = 0,… , 𝑁 − 1.

• Computational complexity of the 1D FFT is 𝑂 𝑁𝑙𝑜𝑔2𝑁 .

• Computational complexity of the 1D FFT is 𝑂(𝑁2).

• Other FFT algorithms: Radix-4 FFT (𝑁 is power of 4),

Prime Factor Algorithm (PFA) 𝑁 = 𝑃𝑄, 𝑃, 𝑄 co-prime numbers.

Cyclic 1D convolution

computation

• Its computation by definition requires:

• Two nested for loops

• Computational complexity (number of multiplications) is 𝑂(𝑁2).

• Computational complexity through two 1D FFTs and one inverse 1D

FFT is 𝑂 𝑁𝑙𝑜𝑔2𝑁 .

• Fast linear convolution calculation through cyclic convolution

embedding.

𝒁 transform

𝑋 𝑧 =

𝑛=0

𝑁−1

𝑥 𝑛 𝑧−𝑛 .

The 𝑍 transform of a discrete signal 𝑥(𝑛) having domain [0, … , 𝑁] is

given by:

The domain of 𝑍 transform is the complex plane, since 𝑧 is a complex

number.

Convolution property of the 𝑍 transform (polynomial product 𝑋(𝑧)𝐻(𝑧)):

𝑦(𝑛) = 𝑥(𝑛) ∗ ℎ(𝑛) ⇔ 𝑌(𝑧) = 𝑋(𝑧)𝐻(𝑧).

Cyclic convolution and 𝒁 transform

𝑦 𝑘 = 𝑥 𝑘 ⊛ ℎ 𝑘 =

𝑖=0

𝑁−1

ℎ 𝑖 𝑥 𝑘 − 𝑖 𝑁 ,

where : (𝑘)𝑁 = 𝑘 mod 𝑁.

𝑦 𝑘 = 𝑥 𝑘 ⊛ ℎ 𝑘 < => 𝑌 𝑧 = 𝑋 𝑧 𝐻 𝑧 mod 𝑧𝑁 − 1.

Polynomial product form of the 1D cyclic convolution:

Convolutions as polynomial

products-examples• 1D linear convolution 𝑦 𝑘 = ℎ 𝑘 ∗ 𝑥 𝑘 of length 𝐿 = 𝑀 = 3:

• 1D cyclic convolution 𝑦 𝑘 = 𝑥 𝑘 ⊛ ℎ 𝑘of length 𝑁 = 3:

𝑌 𝑧 = 𝑋 𝑧 𝐻 𝑧 mod 𝑧3 − 1= 𝑥 0 + 𝑥 1 𝑧 + 𝑥 2 𝑧2 ℎ 0 + ℎ 1 𝑧 + ℎ 2 𝑧2 mod 𝑧3 − 1.

Factorization: 𝑧3 − 1 = 𝑃1 𝑧 𝑃2 𝑧 = (𝑧 − 1)(𝑧2 + 𝑧 + 1)(𝑣 = 2 factors).

𝑌 𝑧 = 𝑋 𝑧 𝐻 𝑧 = 𝑥 0 + 𝑥 1 𝑧 + 𝑥 2 𝑧2 ℎ 0 + ℎ 1 𝑧 + ℎ 2 𝑧2 .

Chinese Remainder Theorem (CRT)If we have a polynomial Y(z) and:

𝑃 𝑧 = 𝑧𝑁 − 1, where the Greatest Common Divider 𝐺𝐶𝐷 𝑃𝑖 𝑧 , 𝑃𝑗 𝑧 = 1 for 𝑖 ≠ 𝑗.

𝑃 𝑧 =ෑ

𝑖=1

ν

𝑃𝑖 𝑧 .

𝑌𝑖 𝑧 = 𝑌 𝑧 mod 𝑃𝑖(𝑧).

0 < 𝑑𝑒𝑔 𝑌 𝑧 ≤ σ𝑖=1𝜈 𝑑𝑒𝑔[𝑃𝑖(𝑧)] , where deg is the degree of a polynomial.

𝑅𝑖 𝑧 = 𝑑𝑖𝑗 mod 𝑃𝑖(𝑧), 1 ≤ 𝑖 ≤ 𝜈.

Then the polynomial 𝑌 𝑧 is given by: 𝑌 𝑧 = σ𝑖=1ν 𝑅𝑖 𝑧 𝑌𝑖 𝑧 mod 𝑧𝑁 − 1

From CRT to Winograd

convolution algorithm

𝑌(𝑧) = 𝐻(𝑧)𝑋(𝑧) mod 𝑧𝑁 − 1.

A 'divide-and-conquer' approach can be used for creating fast convolution

algorithms, by factorizing polynomial 𝑧𝑁 − 1 to its prime factors 𝑃𝑖 𝑧 .

Then, we calculate first the items:

𝑋𝑖 𝑧 = 𝑋 𝑧 mod 𝑃𝑖 𝑧 , 𝑖 = 1,… , 𝜈𝛨𝑖 𝑧 = 𝛨 𝑧 mod 𝑃𝑖 𝑧 , 𝑖 = 1,… , 𝜈

and their products:

𝑌𝑖 𝑧 = 𝑋𝑖 𝑧 𝐻𝑖 𝑧 mod 𝑃𝑖 𝑧 , 𝑖 = 1,… , 𝜈.

CRT helps us to create fast convolution algorithms by embedding linear

convolutions in cyclic ones:

Winograd algorithm

Fast 1D cyclic convolution with

minimal complexity• Winograd convolution algorithms or fast

filtering algorithms:

𝐲 = 𝐂 𝐀𝐱⨂𝐁𝐡 .

• They require only 2𝑁 − 𝑣 multiplications in

their middle vector product, thus having

minimal complexity.

• 𝜈: number of cyclotomic polynomial

factors of polynomial 𝑧𝑁 − 1 over the

rational numbers 𝑄.

Example: Winograd Cyclic

convolution Algorithm for N=3

( )( )3

3 3 2

1 3

/

1 ( ) 1 ( ) ( ) 1 1 1N

N

d

c N

z c z z c z c z z z z z=

− = − = − = − + +

( ) ( )3

3( ) ( ) ( ) mod 1 ( ) ( ) ( ) mod 1N

NY z H z X z z Y z H z X z z=

= − = −

1 232

0 1 2

0 0

( ) ( ) ( )N N

n n

n n

n m

X z x z X z x z X z x x z x z− =

= =

= = = + +

1 232

0 1 2

0 0

( ) ( ) ( )N N

n n

n n

n m

H z h z H z h z H z h h z h z− =

= =

= = = + +

( )( )2( ) ( ) ( ) mod 1 1Y z H z X z z z z= − + +

I. Definitions:



( ) ( ) ( )1

2

1 0 1 2 1 0 1 2mod mod 1i

i iX X z P X x x z x z z X x x x=

= = + + − = + +

( ) ( ) ( ) ( )2

2 2

2 0 1 2 2 0 2 1 2mod 1i

X x x z x z z z X x x x x z=

= + + + + = − + −

( ) ( )1 0 1 2 0 2 0 2 1 2 1 2Similarly for H : and i H h h h b H h h h h z b b z= + + = = − + − = +

0 1 2 3 1 0 2 2 1 2Let , and :a x x x a x x a x x= + + = − = − 1 0 2 1 2 and X a X a a z= = +

II. Computing the reduced Polynomials 𝑋𝑖 , 𝐻𝑖:



( ) ( ) ( ) ( )1

1 0 0 1 0 0mod ( ) mod 1i

i i i iY X z H z P z Y a b z Y a b=

= = − =

( ) ( )

( ) ( ) ( )( ) ( )

2

2 2 2 2

2

2 1 2 1 2

2 1 1 2 2 1 2 2 1 2 2

mod ( )

mod 1

i

Y X z H z P z

Y a a z b b z z z

Y a b a b z a b a b a b

=

=

= + + + +

= − + + −

( ) ( )1 2 2 1 2 2 1 2 1 2 1 1 2One can see, that: , thus, Y becomes: a b a b a b a a b b a b+ − = − − − +

( ) ( ) ( )2 1 1 2 2 1 2 1 2 1 1Y a b a b z a a b b a b= − + − − − +

( ) ( )0 0 0 1 1 1 2 2 2 1 2 1 2 1 1Let c , c and ca b a b a b a a b b a b= = − = − − − +

1 0 2 1 2 and Y c Y c c z= = +

III. Computing the products 𝑌𝑖:

6 multiplications

4 multiplications



( ) ( ) 233

1 1 1

3 3 1

3 3

ppp c z c z z z

R R Rp

=− − − − −= = =

( ) ( ) ( ) ( )0 1 2

2 32

0 1 2 0 1 2 0 1 2

1

1mod 1 2 2

3

NN

i i

iy y y

Y RY z Y c c c c c c z c c c z=

=

= − = = + − + − + + − −

IV. Using the Chinese Remainder Theorem (CRT) to Compute 𝑌:

( ) ( ) 233

2 2 2

1

3 3

ppc z c z z z

R R Rp

= + += = =


convolution Algorithm for N=3V. Finding Matrices 𝐴 and 𝐵:

0 0 1 20

0

1 0 21

1

2 1 22

2

3 0 11 2

1 1 1

1 0 1

0 1 1

1 1 0

x x x xax

x x xaA A x A

x x xax

x x xa a

+ + − − = = = − −

−− −

Similarly, we get exactly the same result for matrix 𝐵!



( )

( ) ( )

( ) ( )

( ) ( )

1 0 1 2 0 0 1 1 2 2 1 2 1 2

2 0 1 2 0 0 1 1 2 2 1 2 1 23 1

3 0 1 2 0 0 1 1 2 2 1 2 1 2

2 2

2 2

2

y c c c a b a b a b a a b b

Y y c c c a b a b a b a a b b

y c c c a b a b a b a a b b

+ − + − + − −

= = − + = + + − − − − − − + + − −

( ) ( ) ( )

( )( )

0 0

0 0

1 1

1 14 1

2 2

2 2

1 2 1 2

1 1 1 1 1 1

1 0 1 1 0 1

0 1 1 0 1 1

1 1 0 1 1 0

a bx h

a bM A x B h x h

a bx h

a a b b

− − = = = − − − −− −

VI. Finding matrix 𝐶:



( ) ( ) ( )

( ) ( )

( ) ( )

( ) ( )( )( )

3 1 3 4 4 1

0 0

0 0 1 1 2 2 1 2 1 2 00 01 02 03

1 1

0 0 1 1 2 2 1 2 1 2 10 11 12 13

2 2

0 0 1 1 2 2 1 2 1 2 20 21 22 23

1 2 1 2

2

2

2

1 1 2 1

1 1

Y C M

a ba b a b a b a a b b C C C C



a a b b

C

=

+ − + − −

+ + − − − = − + + − − − −

−

= 1 2

1 2 1 1

− −

VI. Finding matrix 𝐶:

Theoretically minimal number of multiplications: 𝟒!

Winograd algorithm version

with minimal computational

complexity • Can be equivalently expressed as:

𝐲 = 𝐑𝐁T 𝐀𝐱⨂𝐂T𝐑𝐡 .

• Matrices 𝐀, 𝐁 typically have elements 0,+1,−1.

• Multiplications 𝐂T𝐑𝐡, 𝐑𝐁T𝐲′ are done only by

additions/subtractions.

• 𝐑 is an 𝑁 × 𝑁 permutation matrix.

• 𝐑h can be precomputed.

Block diagram: Winograd Cyclic


Winograd algorithm

Fast 1D cyclic convolution with

minimal complexity

• Winograd algorithm works on small blocks of the input signal.

• The input block and filter are transformed.

• The outputs of the transform are multiplied together in an element-

wise fashion.

• The result is transformed back to obtain the outputs of the

convolution.

• GEneral Matrix Multiplication (GEMM) BLAS or CUBLAS routines can

be used.

Nested convolutions

• Winograd algorithms exist for relatively short convolution lengths,

e.g.: 𝑁 = 3, 5, 7.

• Use of efficient short-length convolution algorithms iteratively to

build long convolutions.

• Does not achieve minimal multiplication complexity.

• Good balance between multiplications and additions.

Decomposition of 1D convolution into a 2D convolution:

• 1D convolution of length: 𝑁 = 𝑁1𝑁2• with 𝑁1, 𝑁2 co-prime integers, 𝑁1, 𝑁2 = 1

• results into a 2D 𝑁1 × 𝑁2 convolution.

Nested convolutions

• Remember circular convolution definition:

𝑦(𝑙) =

𝑛=0

𝑁−1

ℎ 𝑛 𝑥( 𝑙 − 𝑛 𝑁), 𝑙 = 0, . . , 𝑁 − 1.

• If 𝑁 = 𝑁1𝑁2, 𝑙 and 𝑛 can be analyzed as:

𝑙 ≡ 𝑁1𝑙2 + 𝑁2𝑙1 mod 𝑁1𝑁2 𝑙1, 𝑛1 = 0,… ,𝑁1 − 1.𝑛 ≡ 𝑁1𝑛2 + 𝑁2𝑛1 mod 𝑁1𝑁2 𝑙2, 𝑛2 = 0,… ,𝑁2 − 1.

• 1D circular convolution becomes now a 2D 𝑁1 × 𝑁2 convolution:

𝑦(𝑁1𝑙2 + 𝑁2𝑙1) =

𝑛1=0

𝑁1−1

𝑛2=0

𝑁2−1

ℎ 𝑁1𝑛2 +𝑁2𝑛1 𝑥(𝑁1(𝑙2−𝑛2) + 𝑁2(𝑙1 − 𝑛1)).

Nested convolutions

• The number of multiplications required are 𝑀1𝑀2, where, 𝑀1 and

𝑀2 is the number of multiplications for convolutions 𝑁1 and 𝑁2accordingly.

• The same applies for the reverse decomposition: 𝑁2 × 𝑁1.

• However, the number of additions varies over the nesting order :𝑁1 × 𝑁2: 𝐴1𝑁2 +𝑀1𝐴2𝑁2 × 𝑁1: 𝐴2𝑁1 +𝑀2𝐴1,

with 𝐴1 and 𝐴2 being the number of additions required for 𝑁1 and 𝑁2accordingly.

• Additions must be calculated beforehand, so as to choose the

nesting order.

Nested convolutionsFor example, lets assume we have the circular convolution

of length 𝑁 = 12 and 𝑁1 = 3,𝑁2 = 4 :

then,𝑙1, 𝑛1 = 0,1,2𝑙2, 𝑛2 = [0, 1, 2, 3]

and

𝑋0 𝑧 = 𝑥0 + 𝑥3𝑧 + 𝑥6𝑧2 + 𝑥9𝑧

3

𝑋1 𝑧 = 𝑥4 + 𝑥7𝑧 + 𝑥10𝑧2 + 𝑥1𝑧

3

𝑋2 𝑧 = 𝑥8 + 𝑥11𝑧 + 𝑥2𝑧2 + 𝑥5𝑧

3

𝐻0 𝑧 = ℎ0 + ℎ3𝑧 + ℎ6𝑧2 + ℎ9𝑧

3

𝐻1 𝑧 = ℎ4 + ℎ7𝑧 + ℎ10𝑧2 + ℎ1𝑧

3

𝐻2 𝑧 = ℎ8 + ℎ11𝑧 + ℎ2𝑧2 + ℎ5𝑧

3

Nested convolutions

𝐗 and 𝐇 can be defined now as 3×4 matrices

with the polynomial terms as elements:

𝑿 =

𝑥0 𝑥3 𝑥6 𝑥9𝑥4 𝑥7 𝑥10 𝑥1𝑥8 𝑥11 𝑥2 𝑥5

𝑯 =

ℎ0 ℎ3 ℎ6 ℎ9ℎ4 ℎ7 ℎ10 ℎ1ℎ8 ℎ11 ℎ2 ℎ5

Nested convolutions

• We use Winograd 3-point circular convolution C, A, and B matrices at

the outer nesting layer:

𝐲 = 𝐂 𝐀𝐗⨂𝐁𝐇

• The matrix-vector products 𝐀𝒙 and 𝐁𝒉, are replaced by matrix

multiplication 𝐀𝐗 and 𝐁𝐇 respectively, each one resulting to a 4×4

matrix.

• We use a fast 4-point Winograd routine to calculate the 4 circular

convolutions 𝐀𝐗 ⨂ 𝐁𝐇 (one per row). The 4-point routine is “nested”

inside the 3-point routine.

• The correct order of the elements in the result can be found based on

the nesting formulation:

𝑦𝑁1𝑙2+𝑁2𝑙1 .

Block Based convolutions

• Input signal x 𝑛 is split in overlapping/non-overlapping blocks.

• Blocks are convolved independently.

• Great parallelism is achieved.

• Two block-based convolution methods:

• Overlap-add method

• Overlap-save method.

Overlap-add method

It uses the following signal processing principles:

• The linear convolution of two discrete-time signals of length 𝐿 and 𝑀results in a discrete-time convolved signal of length 𝑁 = 𝐿 +𝑀 − 1.

• Additivity:

[𝑥1 𝑛 + 𝑥2 𝑛 ] ∗ ℎ 𝑛 = 𝑥1 𝑛 ∗ ℎ 𝑛 + 𝑥2 𝑛 ∗ ℎ 𝑛 .

Overlap-add method

1. Break the input signal 𝑥(𝑛) into non-overlapping blocks 𝑥𝑚 𝑛 of length 𝐿.

2. Zero pad ℎ 𝑛 to be of length 𝑁 = 𝐿 +𝑀 − 1.

Calculate the 𝐷𝐹𝑇 𝐻(𝑘),𝑘 = 0,1,… ,𝑁 − 1.

3. For each block 𝑚:

• Zero pad 𝑥𝑚 𝑛 to be of length 𝑁 = 𝐿 +𝑀 − 1. Calculate the DFT 𝑋𝑚 𝑘 .

• Use the IDFT of 𝐻 𝑘 𝑋𝑚 𝑘 for block output 𝑦𝑚 𝑛 ,𝑛 = 0,1,… ,𝑁 − 1.

4. Form overall output 𝑦 𝑛 by overlapping the last 𝑀 − 1 samples of 𝑦𝑚 𝑛 with the

first 𝑀 − 1 samples 𝑦𝑚+1 𝑛 and adding the result.

Overlap-add method𝐿 𝐿𝐿

𝑥2 𝑛

𝑥3 𝑛

𝑥1 𝑛

𝑦2 𝑛

𝑦1 𝑛

𝑦3 𝑛

(𝑀 − 1) zeros

(𝑀 − 1) zeros

(𝑀 − 1) zeros

(𝑀 − 1) points

(𝑀 − 1) points

(𝑀 − 1) points

(𝑀 − 1) points

(𝑀 − 1) points

Overlap-add method

Overlap-save method

It uses the following signal processing principles:

• The linear convolution of two discrete-time signals of length 𝑁 and 𝑀results in a discrete-time convolved signal of length 𝑁 = 𝐿 +𝑀 − 1.

• Time-Domain Aliasing:

𝑥𝑐 𝑛 =

𝑙=−∞

∞

𝑥𝐿 𝑛 − 𝑙𝑁 , 𝑛 = 0,1, … , 𝑁 − 1.

Overlap-save method

Overlap-Save method: 1. Pad 𝑀 − 1 zeros at the beginning of the input sequence 𝑥(𝑛).2. Break the padded input signal into overlapping blocks 𝑥𝑚 𝑛 of length 𝑁 = 𝐿 +𝑀 − 1, where the overlap length is 𝑀 − 1.

3. Zero-pad ℎ(𝑛) to be of length 𝑁 = 𝐿 +𝑀 − 1.

Calculate the 𝐷𝐹𝑇 𝐻(𝑘),𝑘 = 0,1,… ,𝑁 − 1.

4. For each block 𝒎:• Calculate the 𝐷𝐹𝑇 𝑋𝑚 𝑘 ,𝑘 = 0,1,… ,𝑁 − 1.

• Calculate the IDFT of: 𝑌𝑚 𝑘 = 𝑋𝑚 𝑘 𝐻(𝑘), 𝑘 = 0,1,… ,𝑁 − 1 to get the block output

𝑦𝑚 𝑛 ,𝑛 = 0,1,… ,𝑁 − 1.

5. Discard the first 𝑀 − 1 points of each output block 𝑦𝑚 𝑛 . Concatenate the remaining (i.e., last) 𝐿 samples of each block 𝑦𝑚 𝑛 to form the output 𝑦 𝑛 .

Overlap-save method

𝐿 𝐿 𝐿

(𝑀 − 1) zeros(𝑀 − 1) point overlap

(𝑀 − 1) point overlap

𝐼𝑛𝑝𝑢𝑡 𝑠𝑖𝑔𝑛𝑎𝑙 𝑏𝑙𝑜𝑐𝑘𝑠

𝑂𝑢𝑡𝑝𝑢𝑡 𝑠𝑖𝑔𝑛𝑎𝑙 𝑏𝑙𝑜𝑐𝑘𝑠

Discard (𝑀 − 1) points

Overlap-save method

Applications

• Convolutional neural networks

• Signal processing

Signal filtering

Signal restoration

Signal deconvolution

• Signal analysis

Time delay estimation

Distance calculation (e.g., sonar)

1D template matching

Convolutional Neural Networks

• Two step architecture:

• First layers with sparse NN connections: convolutions.

• Fully connected final layers.

• Need for fast convolution calculations.

Convergence of machine learning and signal processing

Deep Learning Frameworks

Image Source: Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Le - Performance Analysis of CNN Frameworks for GPUs

Convolutions in DL Frameworks

• All 5 frameworks work with cuDNN as backend.

• cuDNN unfortunately not open source.

• cuDNN supports FFT and Winograd convolutions.

Image Source: Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Le - Performance Analysis of CNN Frameworks for GPUs

Basic Exercises

1. Implement the definition of the 1D linear convolution algorithm

in C/C++ or Python.

2. Implement the definition of the 1D DFT algorithm in C/C++ or

Python.

3. Implement the 1D cyclic convolution algorithm for length 𝑁 =2𝑛 in C/C++ using FFT.

4. Derive analytically the 1D Winograd cyclic convolution

algorithm for length 𝑁 = 3.

5. Implement the 1D Winograd cyclic convolution algorithm for

length 𝑁 = 3 in C/C++ or Python.

Advanced Exercises

1. Implement the 1D linear convolution for filter length M and

signal length 𝐿 in C/C++ using your own myFFT routine.

2. Implement a nested convolution algorithm for length 𝑁 = 𝑁1𝑁2in C/C++.

3. Study the properties of cyclotomic polynomials. Derive

analytically the 1D Winograd cyclic convolution algorithm for

length 𝑁 = 4,𝑁 = 6 or 𝑁 = 9.

4. Implement the 1D linear convolution algorithm for filter length

M and a large signal length 𝐿 using overlap-add in a) C/C++ or

b) CUDA.

References

• Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, p. 436, 2015.

• V. Oppenheim, Discrete-time signal processing, Pearson Education India, 1999.

• I. Pitas, Digital image processing algorithms and applications. John Wiley & Sons, 2000.

• S. Winograd, Arithmetic complexity of computations, vol. 33. Siam, 1980.

• J. H. McClellan and C. M. Rader, Number theory in digital signal processing," 1979.

• H. J. Nussbaumer, Fast Fourier transform and convolution algorithms.

• Springer Science & Business Media, 1981.

• I. Pitas and M. Strintzis, Multidimensional cyclic convolution algorithms with minimal multiplicative

complexity," IEEE transactions on acoustics, speech, and signal processing, vol. 35, no. 3, pp. 384-

390, 1987.

• A. Lavin and S. Gray, Fast algorithms for convolutional neural networks, Proceedings CVPR, pp.

4013-4021, 2016.

Q & A

Thank you very much for your attention!

Contact: Prof. I. Pitas

[email protected]

fast 1d convolution algorithms - drone research

Documents