URBAN SOUNDS CLASSIFICATION USING DEEP LEARNING

A Degree Thesis

Submitted to the Faculty of the

Escola Tècnica d'Enginyeria de Telecomunicació de

Barcelona

Universitat Politècnica de Catalunya

by

Martí Bolet Boixeda

In partial fulfilment

of the requirements for the degree in

TELECOMMUNICATIONS ENGINEERING

Advisor: Elisa Sayrol

Barcelona, June 2019

Abstract

The main goal of this project is to develop a computationally efficient system that classifies audio recordings from an urban sound context. Two types of Machine Learning algorithms are used: Deep Learning, in the form of Convolutional Neural Networks (CNN), and a Support Vector Machine (SVM), which combines the CNN features with handcrafted audio features to produce the final classification.

Different Deep Learning architectures are tested and tuned to obtain the context-aware deep learning features. Combined with the handcrafted features and a tuned SVM model, the complete system performs well for our case.

Resum

This project set out to develop a computationally efficient audio classification system based on Machine Learning techniques. It proposes a system based on Deep Learning with Convolutional Neural Networks (CNN), with the addition of parameters extracted directly from the audio. These are combined with the CNN features to achieve a better result, using a Support Vector Machine (SVM) for the final classification.

Different Deep Learning models have been tested and tuned in order to obtain the context-aware deep learning features. Together with the handcrafted features, and after tuning the SVM model, the results obtained are fairly good for our case.

Resumen

In this project a computationally efficient audio classification system based on Machine Learning techniques has been developed. It proposes a joint system based on Deep Learning with Convolutional Neural Networks (CNN), together with parameters extracted directly from the audio. The two sets of results are combined to achieve a better result, using a Support Vector Machine (SVM) for the final classification.

Different Deep Learning models have been tested and tuned in order to obtain the context-aware deep learning features. Together with the handcrafted features, and after tuning the SVM model, the final results of the system are fairly good for our case.

Acknowledgements

First of all, I want to thank my advisor, Elisa Sayrol, for trusting me to carry out this project and for helping and guiding me throughout it.

I also want to thank J. Adrian Rodriguez for providing me with initial code and for giving me a hand with audio-related problems.

Finally, thanks to Albert Gil and Josep Pujal for helping and assisting me with any kind of server computing problem on the GPI Development Platform of UPC.

Revision history and approval record

Revision Date Purpose

0 26/04/2019 Document creation

1 21/06/2018 Document revision

2 24/06/2019 Document revision

DOCUMENT DISTRIBUTION LIST

Name e-mail

Martí Bolet Boixeda [email protected]

Elisa Sayrol Cols [email protected]

Manuel Dominguez [email protected]

Written by: Reviewed and approved by:

Date 26/04/2019 Date 24/06/2019

Name Martí Bolet Name Elisa Sayrol

Position Project Author Position Project Supervisor

Table of contents

Abstract
Resum
Resumen
Acknowledgements
Revision history and approval record
Table of contents
List of Figures
List of Tables:
1. Introduction
1.1. Requirements and specifications
1.2. Project Background
1.3. Work Plan
1.4. Incidences
2. State of the art of the technology used or applied in this thesis:
2.1. Machine Learning
2.2. Support Vector Machines
2.2.1. Cost
2.2.2. Kernel
2.3. Deep Learning
2.3.1. Loss
2.3.2. Optimizer
2.3.3. Epochs
2.3.4. Batch Size
2.3.5. Dropout
2.3.6. Convolutional Neural Networks
2.3.6.1. Convolutional Layers
2.3.6.2. Pooling layer
2.3.7. Fully Connected Layer
2.3.8. Architectures of Convolutional Neural Networks
2.3.9. Techniques
2.3.9.1. Transfer Learning
2.3.9.2. Data augmentation
2.4. Deep context-aware and Handcrafted Features
2.4.1. Handcrafted Features
2.4.2. Context-aware deep learning features
2.4.2.1. Short Time Fourier Transform and Spectrogram
2.5. Data Augmentation and CNN classification
3. Methodology / project development:
3.1. Development of the system
4. Results
4.1.1. Fine Tuning
4.1.2. Architectures and hyper parameters
4.1.3. Spectrogram Optimization
4.1.4. Data Augmentation
4.1.5. SVM
4.1.6. Final Results
4.1.7. Results Conclusions
5. Budget
6. Conclusions and future development:
Bibliography:
Glossary

List of Figures

Figure 1: Histogram of classes
Figure 2: Influence of Cost
Figure 3: Kernel trick
Figure 4: Artificial neuron
Figure 5: Artificial neural network
Figure 6: Activation functions
Figure 7: Convolutional neural network
Figure 8: 2D Convolution
Figure 9: Pooling layer
Figure 10: Fully-connected layer
Figure 11: CNN layer structure
Figure 12: Structure of proposed method
Figure 13: Mel spectrogram from dog barking
Figure 14: Plot of grid validation
Figure 15: Increment of accuracy relative to each class and transform
Figure 16: Plot comparison of Fine Tuning and Random Initialization
Figure 17: Best training plots
Figure 18: Plot of grid validation

List of Tables:

Table 1: Handcrafted features
Table 2: Initial spectrogram parameters
Table 3: Initial training parameters
Table 4: FT vs Random Initialization
Table 5: Different architectures comparison
Table 6: Number of parameters and FLOPS
Table 7: Spectrogram parameters optimization
Table 8: Data Augmentation results
Table 9: SVM results
Table 10: Results of baseline paper
Table 11: Final results

1. Introduction

The project was carried out at the Department of Teoria del Senyal i Comunicacions (TSC) of the Universitat Politècnica de Catalunya (UPC).

The main purpose of this project is to develop and improve a system based on Deep Learning, specifically on Convolutional Neural Networks (CNN), that is able to classify audio recordings from an urban context into different classes. The original idea was to distinguish the sound of breaking glass from that of other materials, and to implement the classifier in an embedded system placed in glass containers. Because no database of this type was available, the project uses an urban sounds database instead. In keeping with the initial idea, the best CNN could later be integrated into an embedded system, so computational cost is a very important factor to keep in mind. The system is therefore first developed, tested and validated. The following step would be to adapt the CNN to an embedded system, which requires that the CNN does not have a high computational cost.

The main goals of the project are:

1. Learn about Deep Learning audio classification methods.

2. Review the state of the art in the literature in order to implement new CNN techniques.

3. Develop a deep learning system using a Deep Learning framework.

1.1. Requirements and specifications

The project requires some knowledge of Deep Learning and how it works. The system is developed in Python using the PyTorch framework, which provides two main advantages: tensor computing with strong acceleration via graphics processing units (GPU), and deep neural networks built on a tape-based autodiff system. To perform calculations considerably faster than on a normal CPU, we use a server with GPUs and high computational resources. To train the system we need the UrbanSound8K (U8K) database, which provides labelled classes [1]. This dataset contains 8732 labelled sound excerpts (of 4 seconds or less) of urban sounds from 10 different classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music.

Figure 1: Histogram of classes in Database

https://serv.cusp.nyu.edu/projects/urbansounddataset/urbansound8k.html



The specification for the system is to develop a sound classifier based on deep neural networks and adapted to hardware with low computational power, so the computational cost has to be as low as possible while keeping similar results. The neural network should reach a reasonable accuracy compared with the literature explored at the beginning of the project.

1.2. Project Background

This project starts from the idea of developing an automatic sound classification embedded system for glass containers. It is carried out as an academic work. The project supervisor adapted the original idea to a different database, and development started from scratch because no similar project existed within the academic scope of the UPC.

1.3. Work Plan

Work Packages

Project: State of Art WP ref: 1

Major constituent: Research Sheet 1 of 5

Short description: Research and learn about Deep Learning

techniques and CNN architectures. Learn to program DL

structures with PyTorch framework.

Planned start date: 15/02/2019

Planned end date: 15/03/2019

Start event: 16/02/2019

End event:15/03/2019

Internal task T1: Read and understand papers related with CNN

Internal task T2: Stanford online course lectures about Deep

Learning [2].

Internal task T3: PyTorch framework approach and tutorials [3].

Internal task T4: Introduction to remote server computing in the

GPI.

Deliverables: Dates:

Project: Features Extraction WP ref: 2

Major constituent: SW Sheet 2 of 5

Short description: Split the dataset. To obtain the best results we need to extract the spectrogram with the optimal parameters, to make pattern extraction easier for the CNN.




End event: 12/04/2019

Internal task T1: Adaptation of the Database and ‘CSV’ files.

Internal task T2: Develop the feature extraction method using

some python library.

Internal task T3: Research for the best parameters to extract

features.

Internal task T4: Extract the spectrograms from the database and

compress to train it.


Project: Architecture design WP ref: 3


Short description: Develop the Deep Learning system and SVM

system. Try the best parameters and different models for the

training of the CNN and improve the accuracy. Do the same

thing with the SVM.




End event:

Internal task T1: Implementation of the system.

Internal task T2: Implementation of the model.

Internal task T3: Tests with different parameters.

Internal task T4: Evaluate results and restart from T2.


Project: Computational cost WP ref: 4


Short description: An important part is to reduce the

computational cost and memory used for the system.



Start event:

End event:

Internal task T1: Search for a method to compute the cost and

evaluate different cases

Internal task T2: Implement the reduction and evaluate it.


Project: Documentation WP ref: 5

Major constituent: Sheet 5 of 5

Short description:

Documentation about the process of developing the project and

the final report of the whole project



Start event:

End event:

Internal task T1: Project Proposal and Work plan

Internal task T2: Critical Review

Internal task T3: Final Project

Deliverables:

Task T1

Task T2

Task T3

Dates:

10/03/2019

07/05/2019

28/06/2019

1.4. Incidences

One of the biggest delays of the project was due to an audio reading problem. The initial method, which used Sound eXchange (SoX) to read the raw audio files and convert them all to the same format, did not work. After changing the way the audio files were read, another error occurred: a few files failed to open because of an unsupported library. After some unsuccessful research, the same code turned out to work on the server platform.

Another delay came from working with the ‘libsvm’ library [4] for Support Vector Machines across different versions of Python. The library was written for Python 2, whereas the whole project was developed in Python 3, so some libraries had to be adapted to a different Python version.

2. State of the art of the technology used or applied in this thesis:

The classification methodology is based on Machine Learning techniques such as Convolutional Neural Networks (CNN), chosen for their high performance in image applications. In this project we use the mel-spectrogram to transform the audio into a 2D image that gives a time-frequency representation. In addition, a Support Vector Machine (SVM) is applied together with handcrafted audio features to complement the CNN and achieve the best results.

2.1. Machine Learning

Machine learning (ML) is the scientific study of algorithms and statistical models that

computer systems use to effectively perform a specific task without using explicit

instructions, relying on patterns and inference instead. It is seen as a subset of artificial

intelligence. Machine learning algorithms are used in a wide variety of applications, such

as email filtering, and computer vision, where it is infeasible to develop an algorithm of

specific instructions for performing the task. In this project the system is developed using

Support Vector Machines (SVM) and Deep Learning, two ML types of algorithms.

The types of machine learning algorithms differ in their approach, the type of data they take as input and produce as output, and the type of task or problem that they are intended to solve. Some of them are supervised learning, unsupervised learning, semi-supervised learning and feature learning, among others.

Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. The process consists of three main phases: training, validation and test.

To adjust the mathematical model, the problem is formulated as the minimization of some loss function over a training set of examples; this is the training phase. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances. In our case (classification), the system has to assign a label to each instance, so models are trained to correctly predict the pre-assigned labels of a set of examples.

2.2. Support Vector Machines

Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a gap, called the margin, that is as wide as possible; the decision boundary lies in the middle of this gap. New examples are then mapped into that same space and predicted to belong to a category based on which side of the boundary they fall on.

The parameters to adjust the decision boundary are the following:

2.2.1. Cost

SVM models use the Cost parameter (C) to control the trade-off between a smooth decision boundary and classifying the training examples correctly. It modifies the optimization problem so that it simultaneously fits the boundary to the data and penalizes the samples that fall inside the margin, where C defines how much those samples contribute to the overall error. Consequently, C lets you adjust how hard or soft the large-margin classification should be.

Figure 2: Influence of Cost [5]
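The role of C can be made concrete with the standard soft-margin objective, 0.5·||w||² + C·Σ max(0, 1 − y(w·x + b)). The following is a minimal sketch in plain Python, for illustration only: the function name, the toy points and the weights are assumptions and are not taken from this project.

```python
def soft_margin_objective(w, b, points, labels, C):
    """Soft-margin SVM objective for a linear boundary w.x + b = 0."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Zero when the sample is on the correct side and outside the margin.
        hinge_total += max(0.0, 1.0 - y * score)
    return margin_term + C * hinge_total

# Toy data (hypothetical): the last point lies inside the margin.
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

soft = soft_margin_objective(w, b, points, labels, C=0.1)   # margin violation is cheap
hard = soft_margin_objective(w, b, points, labels, C=10.0)  # margin violation is expensive
```

With a small C the optimizer can tolerate the point inside the margin; with a large C the same point dominates the objective, which pushes the solution towards a harder margin.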

2.2.2. Kernel

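An SVM separates the data using hyperplanes. When the classes are not linearly separable in the original feature space, the kernel trick is very useful: it implicitly maps the data into a higher-dimensional space in which a separating hyperplane fits the data better, without computing that mapping explicitly. There are many types of kernel. Several of the common ones use a gamma parameter that defines how far the influence of a single training example reaches, and therefore how closely the decision boundary adapts to each example.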

Figure 3: Kernel trick [6]
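As an illustration of the gamma parameter, the radial basis function (RBF) kernel is one widely used kernel that depends on it: K(x, y) = exp(−gamma·||x − y||²). The sketch below is plain Python with hypothetical points; it is not the kernel configuration used in this project.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = (0.0, 0.0), (1.0, 1.0)
wide = rbf_kernel(a, b, gamma=0.1)     # small gamma: influence reaches far
narrow = rbf_kernel(a, b, gamma=10.0)  # large gamma: influence is very local
```

A small gamma makes distant examples still look similar, which gives a smoother boundary; a large gamma makes each example matter only in its immediate neighbourhood.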

2.3. Deep Learning

Deep Learning is a class of machine learning algorithms that use Artificial Neural Networks with multiple layers to progressively extract higher-level features, usually from raw input. The layers are composed of neurons, each of which applies a linear or non-linear transformation to its input signal to generate an output signal. Each neuron can be connected to the neurons of other layers, which forms the network.

Figure 4: Artificial Neuron Figure 5: Artificial Neural Network [7]

An artificial neuron, or perceptron, has multiple inputs (x), each scaled by a weight (w) that raises or reduces the value of the input to represent its importance. The output value of the perceptron depends on the weighted inputs and on the activation function, a non-linear function that transforms the activation level of a unit into an output signal.

Some common activation functions are the sigmoid, the hyperbolic tangent (tanh), ReLU and softmax. Nowadays ReLU is the most widely used activation function, and softmax is normally used in the last layer to turn the output vector into a probability vector.

Figure 6: Activation functions [7]
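A single neuron and the activation functions named above can be sketched in a few lines of plain Python. The input values, weights and bias are hypothetical; a framework such as PyTorch would provide these operations in practice.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(values):
    # Subtract the maximum for numerical stability.
    largest = max(values)
    shifted = [math.exp(v - largest) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

relu_out = neuron([1.0, 2.0], [0.5, -1.0], bias=0.25, activation=relu)
sigmoid_out = neuron([1.0, 2.0], [0.5, -1.0], bias=0.25, activation=sigmoid)
tanh_out = neuron([1.0, 2.0], [0.5, -1.0], bias=0.25, activation=math.tanh)
probabilities = softmax([2.0, 1.0, 0.1])  # sums to 1, largest score wins
```

Here the weighted sum is negative, so ReLU outputs zero while the sigmoid and tanh outputs stay small but non-zero.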

Neural networks have to be adapted to the training data and to each application, so there are parameters that must be tuned to obtain the best results. The most important ones for training the model are described below:

2.3.1. Loss

The loss function estimates the error between the prediction and the true value. Training aims to drive the loss towards zero, at which point the estimated and expected values would match perfectly on the training data. To reduce the loss, the weights of the neurons are adjusted by an optimization algorithm until the predictions improve.
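As a small illustration, the sketch below computes the cross-entropy loss for one example. Cross-entropy is assumed here only because it is a common choice for classification; this section does not state which loss the project uses, and the probability vectors are made up.

```python
import math

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

# Hypothetical softmax outputs for a 3-class problem, true class = 0.
confident = cross_entropy([0.9, 0.05, 0.05], true_class=0)
wrong = cross_entropy([0.1, 0.8, 0.1], true_class=0)
```

The loss is close to zero when the network assigns high probability to the correct class and grows quickly as that probability shrinks.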

2.3.2. Optimizer

The optimizer is the optimization algorithm that minimizes the loss function by changing and adapting the values of the weights and biases of the network. There are many different types, such as Stochastic Gradient Descent, Adam, Adamax and RMSprop.
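The basic update rule shared by these optimizers can be sketched with plain gradient descent on a toy one-parameter loss. The loss function, learning rate and number of steps are assumptions chosen for illustration; in Stochastic Gradient Descent the gradient would be estimated from a batch of training examples rather than computed exactly.

```python
def sgd_step(weight, gradient, learning_rate):
    """Move the weight a small step against the gradient of the loss."""
    return weight - learning_rate * gradient

def loss(w):
    return (w - 3.0) ** 2        # toy loss with its minimum at w = 3

def loss_gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0
for _ in range(50):
    w = sgd_step(w, loss_gradient(w), learning_rate=0.1)
```

After the loop, w has moved from 0 to very close to 3, the value that minimizes the toy loss.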

2.3.3. Epochs

An epoch is one complete pass of the entire dataset forward and backward through the neural network. Training a model normally takes more than one epoch: as the number of epochs increases, the weights are updated more times, and the model moves from underfitting to an optimal fit, or even to overfitting.

2.3.4. Batch Size

Since the databases are large, they are split into batches. The number of training examples in each split is the batch size. A batch is the input to the neural network in a single iteration. For every batch, a forward pass and a backward pass are performed, and the weights are updated so that the predictions move towards the true labels.
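The relation between epochs, batch size and iterations can be sketched as follows. The dataset size, batch size and number of epochs are hypothetical, and the loop body is left as a comment because the actual training step depends on the framework.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))    # stand-in for 10 training examples
epochs, batch_size = 3, 4

iterations = 0
for epoch in range(epochs):
    for batch in iterate_batches(samples, batch_size):
        # A real loop would run the forward pass, compute the loss,
        # backpropagate and update the weights here.
        iterations += 1
```

With 10 examples and a batch size of 4, each epoch takes 3 iterations (the last batch holds only 2 examples), so 3 epochs take 9 iterations.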

2.3.5. Dropout

The term dropout refers to dropping out units (neurons) in a neural network. During the training phase, neurons are randomly discarded with a certain probability, which is the parameter we can tune. This strategy is used to prevent overfitting: it forces the neural network to learn more robust features that are useful in conjunction with different random subsets of the other neurons.
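A minimal sketch of dropout on a list of activations is shown below. It uses the common "inverted dropout" convention, in which the surviving activations are rescaled by 1/(1 − p) during training so that nothing needs to change at test time. The function name, the seed and the input values are assumptions for illustration.

```python
import random

def dropout(activations, drop_probability, rng, training=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training or drop_probability == 0.0:
        return list(activations)
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, drop_probability=0.5, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
```

Roughly half of the units are zeroed on each call, and a different random subset is dropped every time, which is what prevents neurons from relying on specific partners.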

2.3.6. Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and image classification. In our case, because the audio is converted to a spectrogram, a CNN is expected to give better results than a plain fully connected neural network.

The CNN is able to successfully capture the spatial dependencies in an image through

the application of relevant filters. The architecture performs a better fitting to the image

dataset due to the reduction in the number of parameters involved and reusability of

weights. The role of the CNN is to reduce the images into a form which is easier to

process, with the most important features for getting a good prediction.

Figure 7: Convolutional Neural Network structure

The types of layers used in CNNs are explained in the following sections.

2.3.6.1. Convolutional Layers

A convolutional layer applies rectangular filters to its input through a 2D convolution and can detect visual features in images such as edges, lines, colour changes, etc. This is a very interesting property because, once the layer has learned a characteristic at a specific point in the image, it can recognize it later in any other part of it. Another important feature is that convolutional layers can learn spatial hierarchies of patterns by preserving spatial relationships.

Figure 8: 2D convolution layer [8]
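The operation performed by one filter can be sketched in plain Python. This is the "valid" sliding-window form used in CNNs (stride 1, no padding, no kernel flip); the image and the filter are hypothetical and chosen so that the filter responds to a vertical edge.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (no kernel flip, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        output.append(row)
    return output

# 4x4 image with a vertical edge between the 2nd and 3rd columns.
image = [[0, 0, 1, 1] for _ in range(4)]
# Horizontal-difference filter: responds where intensity rises left to right.
kernel = [[-1, 1]]
response = conv2d(image, kernel)
```

The same filter weights are reused at every position, so the edge produces the same response in every row, wherever it appears.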

2.3.6.2. Pooling layer

The pooling layer is responsible for reducing the spatial size of the convolved feature. This dimensionality reduction decreases the computational power required to process the data. Furthermore, it helps to extract dominant features that are robust to small changes in position and rotation, which keeps the training of the model effective. There are two popular types: max pooling, which returns the maximum value of the portion of the input covered by the kernel, and average pooling, which returns the average of all the values in that portion.

Figure 9: Pooling layer [9]
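Both pooling types can be sketched with one helper that slides a non-overlapping window over the feature map. The 4x4 feature map and the 2x2 window size are hypothetical values for illustration.

```python
def pool2d(feature_map, size, reduce):
    """Non-overlapping pooling with a size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

feature_map = [[1, 3, 2, 4],
               [5, 7, 6, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
max_pooled = pool2d(feature_map, 2, max)       # [[7, 8], [9, 6]]
avg_pooled = pool2d(feature_map, 2, average)   # [[4.0, 5.0], [4.5, 3.0]]
```

In both cases the 4x4 map is reduced to 2x2, which is the reduction in spatial size described above.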

2.3.7. Fully Connected Layer

The fully connected (FC) layer in the CNN represents the feature vector for the input. This feature vector holds the information that is most relevant about the input. Once the network is trained, this feature vector is used for classification or regression, or as input to another model such as an SVM that translates it into another type of output. The convolutional layers before the FC layer hold information about local features in the input image such as edges, blobs, shapes, etc. Each convolutional layer holds several filters, each of which represents one of the local features. The FC layer holds composite and aggregated information from all the convolutional layers, keeping what matters the most.

Figure 10: Fully-connected layer [10]

2.3.8. Architectures of Convolutional Neural Networks

Neural network structures, also called architectures, differ in the number of layers, the types of layers, the activation functions, the layer sizes, etc. The structure is a very important factor in our project: it determines the quality of the results and also the number of operations and parameters, which should be kept low to save computational cost.

Figure 11: CNN layer structure [15]

In the previous figure, the bottleneck is a type of convolutional block proposed for MobileNetV2.

Some of the architectures used in this project are VGG (Very Deep Convolutional Networks) [11], ResNet [12], DenseNet [13], SqueezeNet [14] and MobileNetV2 [15].

2.3.9. Techniques

2.3.9.1. Transfer Learning

When a model is created, its weights and biases are normally initialized randomly and training starts from scratch. With transfer learning, we take the values of the weights and biases from a similar case, i.e. another classification problem, so we start with a model that already has a reasonable accuracy. Using pre-trained models as the starting point is a popular approach in deep learning. It has been observed that the parameters of the first layers of classification models are very similar, so changing only the last layers can reduce the computational cost and save the time of the initial epochs.

2.3.9.2. Data augmentation

Data augmentation is a technique that consists in generating new synthetic samples for training. The new synthetic samples are generated by applying small modifications to the original samples. For images, typical transformations are cropping, rotating or adding noise. For audio signals, transformations can consist in changing the pitch or the dynamic range, adding white noise or time stretching. The transforms that will be tried are explained in more detail in later sections. Data augmentation is useful because it provides more samples to feed the model and makes the system more robust to samples that differ from those seen during training.
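One of the audio transformations mentioned above, adding white noise, can be sketched as follows. The noise level is set through a signal-to-noise ratio (SNR) in decibels; the function name, the 20 dB value, the seed and the synthetic tone are assumptions for illustration and are not parameters used in this project.

```python
import math
import random

def add_white_noise(signal, snr_db, rng):
    """Return a copy of the signal with Gaussian noise at the given SNR."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(0)
# One second of a 440 Hz tone at 8 kHz (hypothetical stand-in for a recording).
clean = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
noisy = add_white_noise(clean, snr_db=20.0, rng=rng)
```

Each call with a different seed or SNR yields a new training sample that keeps the label of the original recording.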

2.4. Deep context-aware and Handcrafted Features

The starting point of this project is a technique described in [16]. That paper proposes combining two types of audio features and two types of machine learning algorithms to classify urban sound recordings. On the one hand there are the handcrafted features; on the other hand, the context-aware deep learning features obtained through a CNN fed with the spectrogram. Finally, the two feature sets are combined in an SVM to obtain the final classification.

Figure 12: Structure of the proposed method [16]

2.4.1. Handcrafted Features

These features are obtained in the time domain of the audio. They are listed in the following table:

Table 2: Handcrafted Features [16]

The features are computed on short-term audio slices, which gives one feature vector per slice. The mean and standard deviation of each feature are then computed over all the slices, yielding a single vector of length 68. For more about audio signals and representations, see [17].
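The aggregation step can be sketched as below. A final length of 68 from a mean and a standard deviation per feature implies 34 short-term features per slice. The frame values are made up, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

def aggregate_features(frames):
    """Concatenate per-feature means and standard deviations over all frames."""
    columns = list(zip(*frames))                  # one tuple per feature
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means + stds

# Hypothetical input: 5 short-term frames with 34 features each
# (34 is implied by the final length of 68 = 2 x 34).
frames = [[float(t + f) for f in range(34)] for t in range(5)]
vector = aggregate_features(frames)
```

Whatever the duration of the recording, and therefore the number of frames, the result is always one fixed-length vector, which is what the SVM needs as input.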

2.4.2. Context-aware deep learning features

Context-aware deep learning features (CadF) are obtained from the values of the last layer of a CNN. The CNN is trained using mel-spectrograms:

2.4.2.1. Short Time Fourier Transform and Spectrogram

The Short Time Fourier Transform (STFT) is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In practice, the procedure for computing an STFT is to divide a longer time signal into shorter segments of equal length and then compute the Fourier transform separately on each shorter segment. This reveals the Fourier spectrum of each shorter segment. Some parameters of the STFT are:

Window length: Length of the time segments.

Window stride: Distance between the start of one window and the start of the next. To reduce the effect of the windowing in digital signals, the stride is usually shorter than the window length so that consecutive windows overlap.

Window type: The window can be rectangular, triangular, sine, etc., but to reduce the previously mentioned effect the usual choice is the Hamming window.

A spectrogram is an image of the time-frequency domain, in which we can see how the frequency components evolve over the duration of the audio. Another parameter to take into account is the number of frequency bins, that is, the number of frequency bands that we want to compute. These features are obtained using the librosa library [18]: first the Short Time Fourier Transform is computed to go from the time domain to the frequency domain, and then the resulting spectrogram is normalized so that all the audio files are standardized.
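To make the window length, stride and window type concrete, the sketch below computes a magnitude spectrogram with a Hamming window using only the Python standard library. It is a slow, didactic version with a direct DFT; the mel filter bank and the normalization step are omitted, and the test signal and parameter values are hypothetical. The project itself relies on librosa for this step.

```python
import cmath
import math

def hamming(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def stft_magnitude(signal, win_length, stride):
    """Magnitude spectrum of each windowed segment (one list per frame)."""
    window = hamming(win_length)
    frames = []
    for start in range(0, len(signal) - win_length + 1, stride):
        segment = [signal[start + n] * window[n] for n in range(win_length)]
        spectrum = []
        for k in range(win_length // 2 + 1):      # non-negative frequencies
            coefficient = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / win_length)
                for n in range(win_length))
            spectrum.append(abs(coefficient))
        frames.append(spectrum)
    return frames

# Test tone that completes exactly 4 cycles per 32-sample window.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spectrogram = stft_magnitude(signal, win_length=32, stride=16)
```

With a stride of half the window length, consecutive windows overlap by 50 %. Each frame has 17 frequency bins, and the energy of the test tone concentrates around bin 4.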

Figure 13: Mel spectrogram from dog barking

The spectrogram is then processed by the neural network, whose output is a vector of length 10 with one value per class. The position of the maximum value is the prediction. When this is used in conjunction with HaF, the paper cuts the input spectrogram into N slices of 200 ms and feeds all of them to the neural network to obtain N outputs of the last layer, each of size 512. The mean of the N vectors is the CadF. Finally, the two feature vectors are appended and an SVM is trained to classify into the 10 classes.
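The slicing and averaging step of the baseline paper can be sketched as below. This is an illustration under stated assumptions: `fake_cnn` is a hypothetical stand-in (a fixed random projection) for the real network's last layer, the slices are taken without overlap, and 20 frames per slice corresponds to 200 ms only if the window stride is 10 ms.

```python
import numpy as np

def cadf_from_slices(spectrogram, frames_per_slice, embed):
    """Average the embeddings of consecutive, non-overlapping slices."""
    n_slices = spectrogram.shape[1] // frames_per_slice
    slices = [spectrogram[:, i * frames_per_slice:(i + 1) * frames_per_slice]
              for i in range(n_slices)]
    return np.mean([embed(s) for s in slices], axis=0)

rng = np.random.default_rng(0)
projection = rng.normal(size=(512, 128 * 20))  # stand-in for the CNN weights

def fake_cnn(spec_slice):
    """Hypothetical embedding: maps one slice to a 512-length vector."""
    return projection @ spec_slice.reshape(-1)

spectrogram = rng.random((128, 100))           # 128 mel bands x 100 frames
cadf = cadf_from_slices(spectrogram, frames_per_slice=20, embed=fake_cnn)
haf = rng.random(68)                           # placeholder handcrafted vector
combined = np.concatenate([haf, cadf])         # input vector for the SVM
print(cadf.shape, combined.shape)              # (512,) (580,)
```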


2.5. Data Augmentation and CNN classification

Another paper consulted to improve the initial system [17] applies data augmentation to create new samples for further training. It uses data augmentation for the same Urban Sounds classification task and proposes applying it directly to the audio, modifying the following parameters:

Time Stretch (TS): The original audio is stretched at a constant rate to change its duration. The paper proposes 4 stretch factors, 0.81, 1.27,

Pitch Shift (PS): This transformation raises or lowers the pitch of the audio by some semitones. It is one of the best transforms, as can be seen in the table below. PS1 corresponds to shifts of -2, -1, 1 and 2 semitones and PS2 to shifts of -3.5, -2.5, 2.5 and 3.5 semitones.

Dynamic Range Compression (DRC): The dynamic range is the difference between the maximum and the minimum value of the signal. The proposed types are music standard, film standard, speech and radio.

Background Noise (BG): This consists of adding background noise to the original sample. They mix in three urban street recordings that do not contain any of the 10 classes.
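To give an idea of the size of the pitch shifts listed above, the following sketch converts the PS1 and PS2 semitone values into frequency ratios using the equal-temperament relation ratio = 2^(n/12). The function name `semitones_to_ratio` is hypothetical and the snippet only does the arithmetic; it does not transform any audio.

```python
PS1 = [-2, -1, 1, 2]            # semitone shifts of the PS1 set
PS2 = [-3.5, -2.5, 2.5, 3.5]    # semitone shifts of the PS2 set

def semitones_to_ratio(n_semitones):
    """Frequency ratio of a shift of n semitones (12 semitones = one octave)."""
    return 2.0 ** (n_semitones / 12.0)

for name, shifts in (("PS1", PS1), ("PS2", PS2)):
    ratios = [round(semitones_to_ratio(s), 3) for s in shifts]
    print(name, ratios)
```

A shift of 2 semitones raises every frequency by roughly 12 %, so the largest PS2 shifts move the content noticeably while remaining well below an octave.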

This table shows the results, in terms of accuracy gain or loss, for each transform and class.

Figure 14: Increment of accuracy relative to each class and transform [17]


3. Methodology / project development:

The first step was to become familiar with the recent techniques and methods for audio classification using CNNs. At the beginning we could have tried classifying with a CNN alone, but we decided to test the idea of the baseline paper [16]. Other papers on environmental sound classification used Boltzmann Machines with unsupervised learning, and others explain the parameters used to obtain the spectrogram. In conclusion, all of them use spectrograms with CNNs.

3.1. Development of the system

The system is developed with PyTorch [3], a Python framework with an easy API for implementing DL systems and for tensor computing on GPU. Another important library is libsvm [4], written in C++ but with a compiled version for Python. These two are the most important libraries for working with the ML algorithms.

Other important used python libraries are:

Librosa: To read the audio and extract the spectrogram. [18]

pyAudioAnalysis: To extract the features from the time domain. [20]

Scikit-learn: To perform the PCA reduction. [21]

Many PyTorch tutorials were used to develop the system. Because PyTorch is widely used, there is a lot of information online and the official page is well documented. The official forum can also be very helpful. PyTorch additionally provides some useful DL models and their trained weights for a quick implementation. The MobileNetV2 model was taken from [22].

Starting from an initial code idea, the whole system was developed. It is organized in a code structure that makes it easy to change parameters and save the experiment results.

With the whole system developed, the tests and improvements can start. The metric used to measure the performance of the system is the accuracy. The main goal of this phase of the project is to improve the accuracy of the system as much as possible. The results of the first test give a starting point from which to keep improving. All the tests and decisions made are explained in the next section. The tests were made by changing values explained in previous sections, such as hyperparameters and spectrogram creation parameters. The following block diagram shows the tests made for each part of the project, specifying the parameters to be changed.

Figure 15: Methodology scheme
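Since accuracy is the only metric used, a minimal sketch of how it can be computed from the network outputs is given here. The function names `predict_class` and `accuracy` and the toy scores are hypothetical; the sketch only assumes, as described for the CNN output, that the position of the maximum value is the predicted class.

```python
def predict_class(scores):
    """Return the index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example with 3 classes instead of 10.
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
labels = [1, 0, 2, 1]
predictions = [predict_class(s) for s in scores]
acc = accuracy(predictions, labels)
print(acc)  # 0.75
```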


4. Results

The first results were obtained with a 16-layer VGG with randomly initialized weights. The spectrogram was extracted with the parameters typically used for speech recognition.

The spectrogram was extracted with:

Window size | Window stride | Window type | N mels | Low Freq | High Freq | Max len
20 ms | 10 ms | Hamming | 32 | 20 Hz | 22100 Hz | 97

Table 2: Initial spectrogram parameters

Epochs | Learning rate | Optimizer | Batch size | Dropout
25 | 0.01 | Adam | 64 | None

Table 3: Initial training parameters

The result was 40.8 % accuracy on the test set and 54.7 % at the best validation epoch.

4.1.1. Fine Tuning

Applying transfer learning with the weights from the ImageNet classification problem and comparing with the initial VGG, we can see that a lot of time is saved because fewer training epochs are needed. The results are also better than with randomly initialized weights.

VGG | Best Validation | Test Results
Random initialization | 52.8 % | 40.8 %
Fine tuning last layer | 54.7 % | 41.2 %
Fine tuning all model | 56.3 % | 43.5 %

Table 4: FT vs Random Initialization

Figure 16: Plot comparison of Fine Tuning vs Random Initialization


4.1.2. Architectures and hyper parameters

Next, different architectures were tested to obtain the best results: ResNet with 34 layers, VGG with 16 layers, DenseNet with 121 layers, SqueezeNet and MobileNet.

Architecture | Best Validation | Test Results
VGG 16 | 56.3 % | 43.5 %
Resnet 34 | 52.5 % | 50.7 %
Densenet 121 | 56.4 % | 49.8 %
Squeezenet | 60.9 % | 49.6 %
Mobilenet | 65.5 % | 54.3 %

Table 5: Different architectures comparison

Architecture | Param memory | FLOPS
VGG 16 | 528 MB | 16 GFLOPS
Resnet 34 | 83 MB | 4 GFLOPS
Densenet 121 | 31 MB | 3 GFLOPS
Squeezenet | 5 MB | 360 MFLOPS
Mobilenet | 16 MB | 579 MFLOPS

Table 6: Number of parameters and FLOPS

The number of parameters is taken from the original papers and the FLOPS from MatConvNet for an input of 224x224 [23].

Considering the test results and the estimated computational cost, the architecture chosen to continue the development is MobileNet.
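The parameter memory column of Table 6 can be related to the number of parameters with a short calculation. The sketch assumes 32-bit floating point weights (4 bytes each) and memory expressed in units of 2^20 bytes; the parameter count of about 138.4 million for VGG 16 is the commonly reported figure for the 1000-class model and is an assumption here, not a number measured in this project. The function name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4

def param_memory_mb(n_parameters, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory needed to store the weights, in units of 2**20 bytes."""
    return n_parameters * bytes_per_param / 2 ** 20

vgg16_params = 138_357_544   # assumed count for VGG 16 with a 1000-class head
print(round(param_memory_mb(vgg16_params)))  # 528
```

The same relation shows why quantizing weights to fewer bits reduces the model size proportionally.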

4.1.3. Spectrogram Optimization

The following table shows the changes applied in each test.

Changes | Validation | Test
Initial parameters | 65.5 % | 54.3 %
Low Freq: 0, High Freq: | 67.1 % | 56.3 %
Low Freq: 0, High Freq: sampling rate/2 | 66.1 % | 56.9 %
N mels: 128 | 69 % | 58.4 %
Window size: 0.23 ms, Window stride: 0.23 ms | 70.2 % | 57.3 %
Max len: 128 | 67.8 % | 63.1 %

Table 7: Spectrogram parameters optimization


Figure 17: Best training plots

4.1.4. Data Augmentation

When the system could not be improved further, Data Augmentation was applied to create new samples and train the system further. The new audios were created following the factors of the paper, except for the Background Noise transform, where normally distributed white noise was added instead of a background urban recording. This multiplies the number of files by 10 and consequently increases the computing time considerably, to a total of 2 days and 14 hours.

The results are not as expected: according to the paper, the accuracy of the system should increase at least slightly, but the 66.1 % validation and 59.8 % test accuracy obtained are not as good as reported there.

To repeat the test, the number of transformations was reduced and the system was tested again with fewer audios.

Data Augmentation | Validation accuracy | Test accuracy
All transformations | 88.96 % | 89.06 %
Only 3 transformations | 92.77 % | 93.69 %

Table 8: Data augmentation results
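The white noise variant of the Background Noise transform can be sketched as follows. The noise level used in the project is not stated, so the target signal-to-noise ratio `snr_db`, the function name `add_white_noise`, the 8 kHz sampling rate and the 440 Hz test tone are all assumptions made for illustration.

```python
import math
import random

def add_white_noise(signal, snr_db, seed=0):
    """Add zero-mean Gaussian white noise at the requested SNR (in dB)."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)           # standard deviation of the noise
    return [x + rng.gauss(0.0, sigma) for x in signal]

sample_rate = 8000                            # assumed sampling rate
clean = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
noisy = add_white_noise(clean, snr_db=20.0)   # 20 dB is an arbitrary example value
```

Fixing the seed makes the augmented file reproducible, which helps when the same experiment has to be repeated with a different number of transformations.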


4.1.5. SVM

Once the CNN stops improving, the next step is to test whether combining Handcrafted Features and Context-aware deep Features in an SVM improves the accuracy of the system. Since the last layer of MobileNet has length 1240, this part also tries to reduce the dimensionality of the CadF by applying Principal Component Analysis (PCA).
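The PCA reduction of the CadF can be sketched with NumPy as below. The project uses scikit-learn for this step [21]; the SVD-based function here is a stand-in that shows the same idea, and the function name `pca_reduce`, the random input data and the choice of 64 components are assumptions made for illustration.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-vector features onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
cadf = rng.normal(size=(200, 1240))           # 200 audios x 1240 deep features
reduced = pca_reduce(cadf, n_components=64)   # 64 is a hypothetical choice
print(reduced.shape)  # (200, 64)
```

In a real experiment the projection must be fitted on the training set only and then applied unchanged to the validation and test sets.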

To train the SVM, the parameters tuned are the cost (C) and the gamma of the kernel. The kernel used is the Radial Basis Function (RBF). The next plot shows the influence of each parameter on the validation accuracy:

Figure 18: Plot of grid validation
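To make the role of gamma concrete, the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) can be written in a few lines. This is a plain Python sketch, not the libsvm implementation; the function name `rbf_kernel` and the two example points are hypothetical, while the gamma values 0.5 and 0.01 are the ones that appear in the results.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = [0.0, 0.0], [1.0, 1.0]                 # squared distance = 2
for gamma in (0.5, 0.01):
    print(gamma, round(rbf_kernel(a, b, gamma), 4))
```

A larger gamma makes the similarity decay faster with distance, so each support vector influences a smaller region, while the cost C controls how strongly training errors are penalized.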

The results are the following:

Method | Best parameters | Validation accuracy | Test accuracy
Only HaF | C = 14, gamma = 0.5 | 88.96 % | 89.06 %
HaF + CadF | C = 10, gamma = 0.01 | 92.77 % | 93.69 %
PCA reduction | C = 10, gamma = 0.01 | 92.69 % | 78.35 %
CadF with 512 fully connected layer | C = 10, gamma = 0.01 | 70.02 % | 71.36 %

Table 9: SVM results


4.1.6. Final Results

To conclude, the results of the baseline paper and our results are compared:

Table 10: Results of baseline paper [16]

Method | U-8K
HaF | 89.06 %
CadF | 63.10 %
HaF + CadF | 93.69 %

Table 11: Final results

4.1.7. Results Conclusions

To summarize, the developed system works as expected: combining the two types of features, HaF and CadF, improves the accuracy reasonably.

Furthermore, the system has a higher accuracy than the baseline paper. The increase can be explained by the differences between the proposed methodology and the baseline paper: while they slice the spectrogram into chunks of 200 ms, our system works with the whole spectrogram. Also, the dimensions of the CNN are bigger, which suggests that the features are more relevant.

Surprisingly, the results with only HaF are much better than expected, and the computational cost saved by avoiding the CNN is a very important factor to take into account.


5. Budget

The budget of the project is based on the hours worked on it. The whole project lasts 5 months, from February 2019 to June 2019. Considering 20 h/week of work by a Junior Engineer and assuming a mean salary of 1.800 €/month for a full-time position, the total amount is around 4.500 €.

In addition, the weekly meetings with the senior engineer over these 5 months add up to 20 meetings; at a price of 150 €/h, this results in 3.000 €.

Finally, hosting a GPU server on a platform like ‘LeaderGPU’, with 4 x 1080 GPUs and 64 GB of RAM, costs 484 €/month.

All the software, programs and libraries used are open-source and free.

A small office in Barcelona near the UPC is considered, with a monthly rental of 250 €.

Name | Cost/week | N. of weeks | Total
Junior Engineer | 225 € | 20 | 4.500 €
Senior Engineer | 150 € | 20 | 3.000 €
GPU Server | 121 € | 20 | 2.420 €
Office | 250 € | 20 | 5.000 €
Total | | | 14.920 €

Table 12: Table of costs


6. Conclusions and future development:

Nowadays CNNs are widely used for image classification and image recognition, and their popularity and the research around them keep growing. Their performance is proven every day in many different ways and forms.

However, this thesis shows that an SVM can increase the accuracy of the CNN, and that with the Handcrafted Features the system can work on its own, drastically reducing the computational cost. The main goal of this project was to use a CNN to obtain the best results, but in this case the best option may be to do without it. For audio classification, the spectrogram and the Handcrafted Features are two viable options; in our case, where the costs should be reduced as much as possible, one option is to use only Handcrafted Features and an SVM, which reduces the number of models and operations.

Finally, for future development, a full cost analysis should be done to obtain the real numbers for each methodology and compare them with the capacity of embedded systems, in order to predict the performance. Another important step will be to compress the model size with the techniques explained in [X], where quantizing the weights can reduce the size by a factor of ten. If the computational cost were not a concern, other methodologies could be applied to aim for better accuracy.


Bibliography:

[1] “Urbansound8k Dataset” [Online] Available: https://urbansounddataset.weebly.com/urbansound8k.html

[2] Stanford Course: “Convolutional Neural Networks for Visual Recognition” (CS231n) [Online] Available: http://cs231n.stanford.edu

[3] PyTorch: https://pytorch.org/

[4] libSVM Documentation [Online] Available: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

[5] Figure 2 image [Online] Available: https://www.quora.com/What-are-C-and-gamma-with-regards-to-a-support-vector-machine

[6] Figure 3 image [Online] Available: https://www.hackerearth.com/blog/developers/simple-tutorial-svm-parameter-tuning-python-r/

[7] Figure 5 image [Online] Available: https://www.datacamp.com/community/tutorials/neural-network-models-r

[8] Figure 7 image [Online] Available: https://lilly021.com/convolutional-neural-networks/

[9] Figure 8 image [Online] Available: http://cs231n.github.io/convolutional-networks/

[10] Figure 9 image [Online] Available: https://towardsdatascience.com/the-sparse-future-of-deep-learning-bce05e8e094a

[11] K. Simonyan, A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. arXiv:1409.1556v6 [cs.CV] Apr 2015

[12] K. He, X. Zhang, S. Ren, J. Sun. “Deep Residual Learning for Image Recognition”. arXiv:1512.03385 [cs.CV] Dec 2015

[13] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger. “Densely Connected Convolutional Networks”. arXiv:1608.06993 [cs.CV] Jan 2018

[14] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”. arXiv:1602.07360 [cs.CV] Feb 2016

[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen. “MobileNetV2: Inverted Residuals and Linear Bottlenecks”. arXiv:1801.04381 [cs.CV] Jan 2018

[16] T. Giannakopoulos, S. Perantonis. “Recognition of Urban Sound Events using Deep Context-Aware Feature Extractors and Handcrafted Features”. Preprints 2018, 2018110509 (doi: 10.20944/preprints201811.0509.v1)

[17] J. Salamon, J. P. Bello. “Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification”. arXiv:1608.04363 [cs.SD] Aug 2016

[18] librosa python library documentation [Online] Available: https://librosa.github.io/librosa/

[19] R. Serizel, V. Bisot, S. Essid, G. Richard. “Acoustic Features for Environmental Sound Analysis”. In T. Virtanen, M. D. Plumbley, D. Ellis (eds.), Computational Analysis of Sound Scenes and Events, Springer, pp. 71-101, 2017. ISBN 978-3-319-63449-4. doi: 10.1007/978-3-319-63450-0_4. hal-01575619

[20] pyAudioAnalysis python library. [Online] Available: https://github.com/tyiannak/pyAudioAnalysis

[21] scikit-learn python library documentation. [Online] Available: https://scikit-learn.org/stable/

[22] J. Lin: “MobileNetV2 model for PyTorch and pretrained weights” [Online] Available: https://github.com/tonylins/pytorch-mobilenet-v2

[23] MatConvNet: [Online] Available: http://www.vlfeat.org/matconvnet/pretrained/


Glossary

ML: Machine Learning

CNN: Convolutional Neural Networks

SVM: Support Vector Machines

HaF: Handcrafted Features

CadF: Context-aware deep learning Features

CPU: Central Processing Unit

GPU: Graphics Processing Unit

VGG: Very Deep Convolutional Model architecture

FT: Fine Tuning

RBF: Radial Basis Function

FLOPS: Floating point Operations Per Second
