Hybrid System for Automatic Music Transcription
Vasco Salema Cordeiro Aboim de Barros
Thesis to obtain the Master of Science Degree in
Electrotechnical and Computer Engineering
Supervisor: Professor Rodrigo Martins de Matos Ventura
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Rodrigo Martins de Matos Ventura
Member of the Committee: Professor Pedro Manuel Quintas Aguiar
May 2017
Acknowledgments
I would like to thank my supervisor Prof. Rodrigo Ventura, for supporting me in exploring the somewhat different
yet interesting topic of Automatic Music Transcription.
I express my gratitude to Emmanouil Benetos for the aid and guidance provided in the implementation of his
method.
Finally, I would also like to thank my family and friends for the motivation and support provided during this the-
sis, especially to Duarte Rondão and Bernardo Marchante, my colleagues who have accompanied me on this
journey.
Lisbon, Portugal
15/04/2017
Vasco Barros
Resumo
Automatically transcribing a piece of music is a very challenging task. It requires a perception and interpretation of sound and music that has proven difficult to replicate in a machine. Nevertheless, several methods already exist that solve sub-problems of this task. In this thesis a hybrid system for Automatic Music Transcription is proposed, combining two distinct Machine Learning techniques. A spectrogram factorization method based on the Probabilistic Latent Component Analysis technique is implemented. This method uses a library of pre-extracted instrument and note templates, and this library has a great impact on the transcription process. As such, a Deep Neural Network is developed and trained to identify the instruments contained in a given sound file. By combining the two methods mentioned above, a hybrid system is created that eliminates the need to manually determine the correct size of the template library when transcribing a given sound file. This hybrid system demonstrates that, by combining distinct Machine Learning methods, it is possible to grant greater autonomy to the transcription process. In this case, the proposed system preserves the transcription accuracy of the Probabilistic Latent Component Analysis method while acquiring greater autonomy in the transcription, since the trained neural network automatically identifies the musical instruments present in the piece to be transcribed.
Keywords - Automatic Music Transcription, Machine Learning, Probabilistic Latent Component Analysis, Deep Learning, Convolutional Neural Networks, Hybrid system
Abstract
The task of automatically transcribing a piece of music is a very challenging one. It requires a level of sound and music
perception that has proven hard to replicate in machines. There are multiple methods to address
sub-problems within this task, achieving successful results. In this thesis a hybrid system for Automatic Music
Transcription is proposed, combining two distinct Machine Learning techniques. A state-of-the-art spectrogram
factorization technique based on Probabilistic Latent Component Analysis is implemented. This method uses a
pre-extracted template library of instruments and their notes to perform the transcription. The template library
greatly impacts the transcription process. As such, to automatically determine the correct library size to be
used, a Deep Neural Network was trained as a classifier, to identify instruments performing in a sound file.
By combining both techniques, a hybrid transcription system is created that eliminates the need for
manual instrument identification for each considered sound file. This hybrid system shows that, by combining
distinct Machine Learning methods, it is possible to improve the transcription process, granting it more autonomy.
In this case, the proposed system ensures the same transcription accuracy as the Probabilistic Latent Component
Analysis method, while adding a higher degree of autonomy in the process, obtained through the automatic
instrument identification performed by the trained neural network.
Keywords - Automatic Music Transcription, Machine Learning, Probabilistic Latent Component Analysis, Deep
Learning, Convolutional Neural Networks, Hybrid system
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables xi
List of Figures xiii
List of Acronyms xv
List of Symbols xvii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 State-of-the-Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Multi-Pitch Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Note Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.3 Instrument Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Theoretical Background 6
2.1 Sound Perception and Musical Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Loudness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Spatial location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.5 Pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.6 Timbre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Constant-Q Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 CQT Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Probabilistic Latent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Multi-Sample Shift Invariant Probabilistic Latent Component Analysis 20
3.1 MSSIPLCA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Unknown Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Template Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.5 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Convolutional Neural Network 34
4.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 CNN layers and architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 CNN in Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Network’s Architecture and Learning Process . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.3 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Hybrid System 44
5.1 System description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusion 50
6.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A Musical Notes 53
Bibliography 57
List of Tables
3.1 MSSIPLCA model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Generic instruments of the Template Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Template library considered in Module 1 performance evaluation . . . . . . . . . . . . . . . . . . 29
3.4 Module 1 evaluation test results: Percentage of notes correctly transcribed and resulting error . . . 31
4.1 CNN classifier’s layers and filter sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Instruments considered in the classification task . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1 Numeric results of the provided transcription examples . . . . . . . . . . . . . . . . . . . . . . . . 49
A.1 Notes, frequencies and wavelengths with the correspondent MIDI scale number . . . . . . . . . . 53
List of Figures
2.1 CQT of 2 notes, played by 2 different instruments . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Artificial neural networks, and its nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Example of a neural network with the notation used in Section 2.4.2 applied . . . . . . . . . . . . 17
3.1 Shift-Invariant PLCA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Example of a CQT spectrogram of a piano. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Diagram of the System’s Module 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Module 1 evaluation test results: Graphic of the percentage of notes correctly transcribed . . . . . 30
3.5 Module 1 evaluation test results: Graphic of the percentage of false positive notes transcribed . . . . . . . . . 31
3.6 Example of transcription results with different template library sizes . . . . . . . . . . . . . . . . . 33
4.1 Example of input volume and Neuron arrangement in a convolutional layer . . . . . . . . . . . . . 36
4.2 Example of Max Pooling on an input depth level . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Diagram of the implemented CNN’s architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Diagram of the developed Module 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Module 2 performance evaluation: Graphic of the classification accuracy . . . . . . . . . . . . . . 42
4.6 Intermediate steps of a classification performed by module 2 . . . . . . . . . . . . . . . . . . . . 43
5.1 Diagram of the proposed hybrid system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 System’s performance evaluation graphic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Log-spectrogram of an example input file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Intermediate steps of a classification performed by module 2 . . . . . . . . . . . . . . . . . . . . 48
5.5 System’s performance evaluation: Transcription results with µ = 0.005 . . . . . . . . . . . . . . . 48
5.6 System’s performance evaluation: Transcription results with µ = 0.020 . . . . . . . . . . . . . . . 49
5.7 System’s performance evaluation: Transcription results with µ = 0.035 . . . . . . . . . . . . . . . 49
List of Acronyms and Abbreviations
AMT Automatic Music Transcription
CNN Convolutional Neural Networks
CQT Constant-Q Transform
DBN Deep Belief Network
DFT Discrete Fourier Transform
EM Expectation-Maximization
MFCC Mel-frequency Cepstral Coefficients
MIDI Musical Instrument Digital Interface
MIR Music Information Retrieval
MIREX Music Information Retrieval Evaluation eXchange
ML Maximum Likelihood
MLP Multilayer Perceptrons
MSSIPLCA Multi-Sample Shift Invariant Probabilistic Latent
Component Analysis
NMF Non-Negative Matrix Factorization
PLCA Probabilistic Latent Component Analysis
PLSI Probabilistic Latent Semantic Indexing
ReLU Rectified Linear Units
SGD Stochastic Gradient Descent
SVM Support Vector Machines
List of Symbols
Greek letters
α Sparsity parameter for MSSIPLCA method
η Learning rate parameter for Gradient Descent
γ Momentum parameter for Stochastic Gradient Descent
λ Instrument index utilized in the MSSIPLCA Module
µ Classification threshold parameter utilized in the CNN Module
∇ Gradient operator
ω Log-frequency
σ Transcription threshold parameter
Roman letters
Bt Batch size in a CNN’s learning process
fs Sample Frequency
K Number of filters in a Convolutional layer
P Zero-Padding added in a CNN layer
S Stride in a Convolutional layer
t Time
Chapter 1
Introduction
1.1 Motivation
Musical transcription is the process of converting a piece of music into some form of musical notation, which
will display the musical notes played across time. Some examples of musical notations are scores, piano-roll
representations or rhythmic sequences of chords [1]. Even for those with musical training, listening to a piece of
music and manually trying to transcribe it is a very challenging task. There are several obstacles
one may encounter while performing this task, such as detecting which instrument plays each note or detecting the
tempo/beat of each note, but the main challenge is to detect the note's pitch.
In the late 1970s, audio researchers such as James Moorer, Martin Piszczalski and Bernard Galler dedicated their
research to musical signal analysis. In 1977 Piszczalski and Galler introduced the concept of Automatic Music
Transcription (AMT) [1]. This is the process of automatically converting a musical sound signal into its represen-
tation as musical notation, through digital analysis of the musical signal. This process has been the target of
much research since it was introduced, and nowadays it covers a wide range of subtasks. These subtasks are
representative of the challenges a human transcriber faces, with the added difficulty of removing human intuition and perception
from the equation. As such, AMT can be viewed as one of the main technology-enabling concepts in music signal
processing [2].
In the past decades, another field of study has made great achievements in having computers learning from
representations of our world, trying to mimic human learning processes. This field of study, Machine Learning,
focuses on adding cognitive skills to machines by modelling learning processes [3]. AMT and Machine Learning
are intrinsically connected, in the sense that the high-level goal is to make computers perceive and interpret music
on their own. As such, many methods developed to address AMT tasks are based on Machine Learning algorithms.
Solving the automatic transcription challenge will allow any group of musicians to play or improvise freely without
the fear of losing their creations, by keeping a record of each note played and by whom. Other applications of these
methods include the recording of music genres where no score exists, such as traditional oral music or jazz, and
enabling machine participation in live musical performances. A transcription algorithm could be applied to a large
library of musical pieces, which would be out of reach for a manual approach. This could enable musical search
through an audio input. Lastly, an AMT algorithm could also integrate a musical tutoring platform granting the
platform the ability to interpret and correct the user when necessary.
1.2 Goals
In this thesis the task of automatically transcribing a music signal is addressed. An automatic transcription system
is fully designed and implemented. This system will generate a piano-roll representation of notes of one or more
instruments performing in a given input file. This system aims to provide support to every musician in their
frequent transcription tasks, by automatically registering their music pieces.
The proposed system will combine distinct methods applied to the AMT process, to generate a fully functional
system. This hybrid approach aims to explore the benefit of combining distinct approaches to specific AMT tasks,
in order to improve the transcription process. Thus, in this thesis a hybrid system is proposed to automatically
transcribe recorded music fragments. This hybrid system will combine two distinct Machine Learning methods,
each one addressing a distinct AMT subtask: Multi-Pitch Estimation and Note Tracking, and Instrument Identifi-
cation.
1.3 State-of-the-Art
As was mentioned above, AMT can be divided into multiple subtasks, each one with different proposed methods
and approaches. Multi-pitch estimation is considered the fundamental subtask of AMT, as it focuses on identifying
and distinguishing concurrent pitches. In order to correctly detect a note's pitch, the note event has to be identified
in time; as such, the Note tracking subtask focuses on providing a temporal representation of the notes being played.
To properly detect when a note event starts or finishes, the previous subtasks utilize methods from the Onset
and Offset detection task. As music is usually performed by several instruments, it is often necessary to detect
which instrument is playing which note; this is the focus of the Instrument identification subtask. Lastly, there are
additional and more specific subtasks, such as Key detection, which detects the musical key of the whole music piece,
and Beat detection, which temporally characterizes the analysed music piece [2].
In this section a review of three main subtasks (Multi-Pitch estimation, Note tracking and Instrument identification)
and their proposed methods is presented. An extensive state-of-the-art review can be found in [2], which served
as a guide for the following review.
1.3.1 Multi-Pitch Estimation
When AMT arose, Piszczalski and Galler focused on transcribing monophonic pieces of music (signals from
one instrument source only [1]), but after three decades of research this problem is considered solved [2]. The
current challenge resides in automatically transcribing polyphonic music signals, i.e. music with several instruments.
As such, in polyphonic music we are interested in detecting concurrent pitches, from the same instrument
or from multiple distinct instruments. This challenge is referred to as Multi-Pitch Estimation.
There are three main groups of techniques dedicated to this matter: Feature-based models, Statistical Model-
based Estimation and Spectrogram Factorization. Feature-based models are fundamentally signal processing
methods, where the notes are detected by extracting audio features of the signal’s time-frequency representation.
When using Statistical Model-based Estimation, the problem is formulated as a Maximum A Posteriori Estimation
Problem, where all combinations of fundamental frequencies are considered in order to compute a final estimate.
Spectrogram Factorization appeared in the most recent literature and has been gaining a lot of attention. It
consists of decomposing a spectrogram of the input signal into two components relative to each tone: the
spectral base and the temporal activity [2].
A large subset of the state-of-the-art Multi-Pitch Estimation methods focuses on two techniques: Non-Negative
Matrix Factorization (NMF) and Probabilistic Latent Component Analysis (PLCA). Both of these methods can be
included in the Spectrogram Factorization group. NMF is a matrix factorization technique where the matrices have
no negative values, a characteristic that is exploited in the factorization process. It is a robust and computationally
inexpensive method [4]. NMF algorithms can be implemented as spectral decomposition models applied to
musical signals [5], taking advantage of the non-negativity of a spectrogram. PLCA takes a probabilistic approach
in the spectral factorization task, having achieved state-of-the-art results. In [6] Benetos, Ewert and Weyde
propose a PLCA-based model for jointly transcribing pitched and unpitched sounds (the latter can be viewed
as percussive sounds), showing the effectiveness of this technique on regular western music inputs. In [7] an
algorithm for Shift-Invariant PLCA is presented; the implemented method can tolerate variations of the spectral
envelopes (tuning deviations). The aforementioned model is implemented using monophonic signals, but the
authors prove that it can be extended to polyphonic signals.
Every year, a contest named Music Information Retrieval Evaluation eXchange (MIREX) takes place, where the
contestants submit their methods to solve certain tasks within the scope of Music Information Retrieval (MIR). One
of these tasks is named Multiple-F0 Estimation, which corresponds to Multi-Pitch Detection. In 2014 Elowsson
and Friberg proposed a method which turned out to be the most accurate [8]. This method included Deep Layered
Learning techniques in MIR tasks, showing once again the benefits of using state-of-the-art Machine Learning
methods in music signal analysis.
1.3.2 Note Tracking
In order to correctly analyse a time-frequency representation for further pitch estimation, one must detect where
the note starts and ends (onset and offset time respectively). This processing stage is defined as Note Tracking.
As its definition implies, it is closely related to Multi-Pitch Estimation. There are several methods utilized in
Note Tracking: Hidden Markov Models may be considered in a post-processing stage for temporal smoothing
[9], Dynamic Bayesian Networks can be applied to address this task [10], as can the simpler Minimum
Duration Pruning technique [11].
The large majority of the approaches perform Note Tracking and Multi-Pitch Detection jointly. As such, despite
Note Tracking being considered an additional processing stage in some cases, in this thesis it will be considered
an implicit step in the Multi-Pitch Detection process.
1.3.3 Instrument Identification
Given a polyphonic music piece, where multiple instruments play at the same time, the task of identifying which
instrument is playing constitutes one of the main challenges within the scope of AMT. Traditional MIR methods
focus mainly on two stages: feature extraction and semantic interpretation. Extracting good features is very time
consuming, but ultimately it will lead to a good representation of the input signal. These feature extraction approaches
tend to be task-specific and hard to optimize. As such, MIR researchers tend to adopt more powerful semantic
interpretation strategies, like Multilayer Perceptrons (MLP) and Support Vector Machines (SVM) [12, 13]. In
Instrument Identification (and in other main AMT tasks), multiple feature extraction approaches were implemented
and perfected in order to achieve better data representations. This is the case of the widely utilized Mel-frequency
Cepstral Coefficients (MFCC) [14], which consist of an attempt to define and characterize the timbre of an
instrument. Combining these extracted features with the previously mentioned semantic interpreters achieved
satisfactory results [15].
However, recent studies show that combining traditional shallow methods with Deep Learning techniques, thus
obtaining deeper architectures, allows better high-level representations and, in the end, better results [16]. Deep
Learning is a Machine Learning technique, based on Neural Networks, that provides high-level concept learning
through multiple layers of learning (hence deep). The layers are hierarchically stacked, and the high-level learned
concepts are inferred from the concepts learnt by the lower layers of the hierarchy, granting new levels of understanding and
abstraction [17].
Deep Learning techniques are gaining ground in Instrument Identification, with several methods proving
to be more accurate than traditional shallow approaches. Hamel, Wood and Eck presented in [18] a comparison
between a Deep Belief Network (DBN), a MLP and a SVM, on Instrument class classification. The first is a
Deep Learning technique where a Neural Network is pre-trained in an unsupervised manner in order to represent
the input data more efficiently, and then trained in a supervised manner to tune the network to the desired
classification. The remaining methods are Machine Learning techniques that can be viewed as low-level layers
of a deep neural network. This comparison showed that the DBN performs as well as the other methods, and
outperforms them when the feature set and the instrument classes are limited. Another approach using
DBN in Classification and MIR tasks is presented in [19], this time applied to music genre classification. In this
paper feature extraction is performed using a DBN, yielding better results than the standard MFCC feature-based
approach.
Convolutional Neural Networks (CNN) are a specific type of Neural Network that is widely used in Image Recognition,
due to its great performance in this task. These Neural Nets exploit the properties of the convolution operation
in order to reduce memory usage and improve performance. Li, Chan, and Chun stated that musical patterns
can be captured using CNN due to the similarities between musical data and image data [20]. The authors im-
plemented a CNN for Music Genre Classification. Their implementation required minimal prior knowledge to be
constructed and was complemented with the usage of classic features like MFCC.
1.4 Thesis Outline
In this thesis, a hybrid system is developed to address the AMT task. This system has two modules: in the first,
Benetos, Ewert and Weyde's PLCA-based method [6] is implemented for Multi-Pitch Estimation and Note Tracking;
in the second, a complementary CNN Classifier is designed to perform Instrument
Identification. The thesis is organized as follows:
In Chapter 2 a description of harmonic sound signals and their properties is detailed, as well as a signal trans-
form which is utilized to obtain a suitable time-frequency representation for musical signal analysis. Also the
basic techniques and models utilized by the two aforementioned modules are summarized (PLCA and Neural
Networks).
In Chapter 3 the first module of the hybrid system is addressed. The implementation of the PLCA-based method
is detailed: the mathematical model is explored, as well as the implementation process. This chapter also explains
the need for adding a classification module to the designed system.
In Chapter 4 the classification module of the system is detailed. Convolutional Neural Networks and their specific
characteristics are addressed. The CNN Classifier will also be presented, including its design, implementation,
training phase and limitations.
In Chapter 5 the integration of the distinct methods resulting in the developed system is detailed. In this chapter
the interaction between the two modules is explored, and the overall performance of the system is evaluated. Its
achievements and limitations are presented, as well as an end-to-end transcription example.
In Chapter 6 the thesis’s conclusion is presented, and possible future work directions are provided.
Chapter 2
Theoretical Background
In this chapter a theoretical introduction is made regarding the fundamental methods and models exploited by
this thesis. In Section 2.1, an introductory explanation of the particularities of musical sound signals is presented.
In Section 2.2 the Constant-Q Transform, a time-frequency representation suitable for musical signal analysis, is
addressed. In Section 2.3 the original PLCA method is summarized. Lastly, in Section 2.4 an introduction to
Neural Networks and their learning algorithms can be found.
2.1 Sound Perception and Musical Characteristics
In physics, sound can be defined as mechanical pressure waves that propagate through a compressible medium
(e.g. air or water). In order to be heard, this wave must reach the ear. After reaching the ear, it can be ignored
or it can be processed and perceived by the brain. Thus, hearing can be defined as the perception of a sound by
the brain. Concerning harmonic sounds, there are several characteristics that allow the brain to perceive them:
loudness, duration, texture, pitch, spatial location and timbre [21]. In this section these characteristics and their
impact on musical sound signal analysis will be presented. This review will mainly focus on pitch and timbre, as
an extensive review of the remaining elements and their influence on signal analysis is out of the scope of this
thesis.
2.1.1 Loudness
The loudness of a sound is related to the physical strength of the sound (the amplitude of the signal). It refers to
how loud or how quiet the sound appears to the receptor. It is a subjective measure and, as such, it is not solely
related to the amplitude of the sound signal. However, for the sake of simplicity it will be interpreted as such for
the remainder of this thesis.
2.1.2 Duration
A sound’s duration is related to how long a sound takes from the moment it is noticed until the moment it dis-
sipates. Duration is also a subjective measure, since noise and attention can affect greatly the perception of a
sound’s duration. In music, the duration of a sound, can affect the beat and the rhythm of a piece of music. In this
6
thesis, duration is interpreted as the time interval from the sound’s start to it’s dissipation (onset and offset times
respectively).
2.1.3 Spatial location
Spatial location is the perception of the spatial placement of the sound source in the acoustic environment
(physical distance). In this thesis spatial location will not be considered; the sound source will be constant, as all
sound files used in the experiments are monaural.
2.1.4 Texture
Sound texture is a very wide concept and, as such, it has several definitions. In [22] it is defined that "a sound
texture should exhibit similar characteristics over time. It can have local structure and randomness but the char-
acteristics of the fine structure must remain constant on the large scale". The number of instruments, their
characteristics and the acoustic environment are all factors that define a sound texture. For example, the sound heard
in a cafeteria has a different texture than the sound of two individuals speaking in a living room, and the texture of
an orchestra is different from the texture of a rock concert. The notion of sound texture will not be considered in the
transcription process, as the signals utilized consist solely of digital instruments performing, as will be seen in
further sections.
2.1.5 Pitch
Fourier’s Theorem states that a steady-state wave is composed by a series of sinusoidal components — har-
monics. Thus, sound as a wave can be described by the amplitude, phase and frequency of its harmonics. The
fundamental frequency is the lowest frequency of the harmonics. The remainder of the harmonics vibrate at (in-
teger) multiples of this fundamental frequency. In real conditions, when a musician plays an instrument, distortion
is added to the resulting signal through performance nuances like tuning deviations or vibratos. These make the
estimation of the fundamental frequency even more difficult in such conditions [23].
Pitch is a subjective measure, strongly related with the perception of the fundamental frequency of the sound. It
implies a scale ordered from low to high in which the sounds can be placed hierarchically. Identifying the pitch of a
sound is a major step towards distinguishing different sound sources. Pitch is also important information
when trying to group the individual harmonics of the same vibrating source [24].
In western music an octave is used as a pitch interval and it is split into 12 individual notes. The tuning system
convention is the equal temperament, in which the frequency of each note, $P_i$, is obtained by multiplying the
frequency of the previous note by the twelfth root of 2, leading to the expression $P_i = 2^{1/12} P_{i-1}$ [25]. Also, a note in
one octave has exactly double the frequency of the same note in the previous octave. As was mentioned before,
pitch has a strong relation with the fundamental frequency of the sound, as the latter often corresponds to the
pitch of the note. However, the fundamental frequency does not have to be the strongest harmonic of the sound
[23]. A list of the notes considered in this thesis, with the respective frequencies and wavelengths, can be found in
Appendix A.
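As an illustration of the equal-temperament relation $P_i = 2^{1/12} P_{i-1}$, the short Python sketch below (not part of the thesis implementation) generates the note frequencies of one octave starting from the common reference pitch A4 = 440 Hz.

```python
import numpy as np

# Equal temperament: each semitone multiplies the frequency by 2**(1/12),
# so P_i = 2**(1/12) * P_(i-1) and a full octave doubles the frequency.
A4 = 440.0  # standard reference pitch in Hz
names = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A"]

freqs = A4 * 2.0 ** (np.arange(13) / 12.0)
for name, f in zip(names, freqs):
    print(f"{name:>2}: {f:8.2f} Hz")

# The last note is exactly one octave above A4, i.e. twice its frequency.
assert np.isclose(freqs[-1], 2 * A4)
```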
2.1.6 Timbre
The aforementioned sound characteristics are all closely related to some physical property of the sound, making
them measurable to a certain extent. Timbre, on the other hand, is not. It is hard to define, and the existing
definitions are purely subjective, thus making the measurement of timbre an arduous process.
The American National Standards Institute defines timbre as an "attribute of auditory sensation in terms of which
a listener can judge two sounds similarly presented and having the same loudness and pitch as being dissimilar"
[26]. With this definition, one can understand why timbre is usually referred to as the colour of a sound. Devel-
oping methods to evaluate timbre became a crucial task in Instrument Classification. In order to describe timbre
and to identify a musical instrument, several timbre features were developed [23].
Temporal Features
Temporal features mainly focus on measuring the energy of a sound across time, which generates a represen-
tation of the temporal shape of a sound. Calculating the root mean square of a temporal envelope generates
a feature that allows measuring the energy present on each note. It provides information regarding the attack
and release time of a note, which is different for each instrument. Temporal Residual Envelope is obtained by
the difference between the original temporal envelope and the root mean squared temporal envelope. It displays
smaller amplitude variations, which provides information on instrument noise and on the player’s technique (e.g.
vibrato).
Spectral Features
Spectral features describe the fluctuations of the sound in terms of frequency. These features base their descrip-
tions on time-frequency representations — sound spectra, that can be obtained from Fourier transformations
such as the Discrete Fourier Transform (DFT). Harmonic sounds have an important property: the log-frequency
distance between the harmonics is constant and independent of the fundamental frequency [27]. This property will be
further explored in Section 2.2.
An instrument with a rich timbre contains more harmonics than a pure tone. Considering this fact, measuring
the Number of Spectral Peaks is a feature that can be used to differentiate two distinct instruments, with distinct
timbres. Centroid Envelope is a measure of the physical distribution of power in the frequency frames. It provides
information on where in the frequency spectrum a sound has most of its power, which is characteristic of each
instrument.
The previously mentioned MFCC features have been utilized in speech recognition and more recently in music
analysis. They are a set of coefficients that are calculated through a series of steps [14], and that classify the
sound in terms of a Mel scale [28]. This scale is based on the human perception of distance between pitches and, as
such, is rooted in human hearing.
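As a rough illustration of this kind of feature extraction, the sketch below computes MFCCs with the librosa Python library; the file name is hypothetical and this is not the feature pipeline used in this thesis.

```python
import librosa

# Hypothetical monaural input file; any short recording would do.
y, sr = librosa.load("example.wav", sr=44100, mono=True)

# 13 Mel-frequency cepstral coefficients per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```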
2.2 Constant-Q Transform
As mentioned in Section 2.1.6, sounds composed of harmonic frequency components have a distinct property:
the distances between these components are constant and independent of the fundamental frequency when plotted
against log-frequency [27]. Their overall position depends on the fundamental frequency, but their positions
relative to each other are the same. The first distance (between the first two harmonic components) is $\log 2$, while
the next distance is $\log(3/2)$, and this pattern is maintained for all the harmonics. The pattern formed by the harmonics
and their amplitudes will differ, reflecting different timbres, thus this pattern is useful for describing timbre and
consequently identifying an instrument [23].
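A small numeric check of this property (an illustrative sketch, independent of the thesis implementation): the log-frequency spacing between successive harmonics is the same for any fundamental frequency, so changing the fundamental only shifts the whole pattern.

```python
import numpy as np

def log_harmonics(f0, n=6):
    """Log-frequencies of the first n harmonics of a fundamental f0."""
    return np.log(f0 * np.arange(1, n + 1))

# Distances between successive harmonics: log 2, log(3/2), log(4/3), ...
d_220 = np.diff(log_harmonics(220.0))
d_440 = np.diff(log_harmonics(440.0))

print(np.allclose(d_220, d_440))                      # True: spacing independent of f0
print(log_harmonics(440.0) - log_harmonics(220.0))    # constant shift of log 2
```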
In Signal Processing, the Constant-Q Transform (CQT) is a transform that generates a time-frequency repre-
sentation from a time-domain signal. It falls under the same category as the well-known DFT. The difference
between these two transforms is that, while the DFT gives us a linear frequency representation, the CQT maps the
signal onto a log-frequency scale [27]. With this scale the transform behaves similarly to the human ear:
it has higher frequency resolution at low frequencies and higher temporal resolution at high frequencies. This
particularity makes the CQT well suited to deal with musical sound signals. The aforementioned harmonic sound
property will be evidently displayed in a log-frequency scale, allowing a description of the existing sound timbres
(different harmonic patterns).
In this thesis, the CQT is the chosen method to obtain the spectrograms from the input signals. As will be
seen in further chapters, all the techniques developed or implemented in the context of this thesis require a log-
spectrogram to perform correctly.
2.2.1 Mathematical Model
As presented in [29], the CQT $X^{CQ}(k, n)$ of a discrete time-domain signal $x(n)$ is calculated by
X^{CQ}(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2) \qquad (2.1)
where $k = 1, 2, \ldots, K$ indexes the frequency bins of the transform, $\lfloor \cdot \rfloor$ is the floor operator, which returns the
highest integer lower than or equal to its argument, and $a_k^*(n)$ is the complex conjugate of $a_k(n)$. The latter are
referred to as time-frequency atoms and are complex-valued waveforms. They are defined by
a_k(n) = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) e^{-i 2\pi n \frac{f_k}{f_s}} \qquad (2.2)
where $f_k$ is the center frequency of bin $k$, $f_s$ is the sampling rate and $w(t)$ is a continuous window function
sampled at points determined by $t$ (zero-valued outside the range $t \in [0, 1]$). The Q-factor can be defined as the
ratio of the center frequency to the bandwidth, and for it to be constant in every bin the window lengths $N_k \in \mathbb{R}$
are inversely proportional to $f_k$.
The CQT presented in this article [27] has its center frequencies fk placed according to the following rule:
f_k = f_1\, 2^{\frac{k-1}{B}} \qquad (2.3)
where f1 denotes the center frequency of the lowest frequency bin and B is the number of bins per octave. This
parameter B will determine the time-frequency resolution.
The Q-factor is constant for each bin, and can be calculated as follows:
Q = \frac{f_k}{\Delta f_k} = \frac{N_k f_k}{\Delta\omega\, f_s} \qquad (2.4)
where $\Delta f_k$ denotes the -3dB bandwidth of the frequency response of the atom $a_k(n)$, and $\Delta\omega$ is the -3dB
bandwidth of the mainlobe of the spectrum of the window function $w(t)$.
In order to reduce frequency smearing it is desirable to make the bandwidth $\Delta f_k$ as low as possible. This is
achieved by having a large Q-factor. However, the Q-factor cannot be made arbitrarily large, as that would exclude
portions of the spectrum between bins from the analysis. A value of $Q$ that allows signal reconstruction while introducing
minimal frequency smearing is given by
Q = \frac{q}{\Delta\omega\,(2^{1/B} - 1)} \qquad (2.5)
where $q \in [0, 1]$ is a scaling factor, typically set as $q \approx 1$. Setting $q$ to values smaller than 1 will improve the
time resolution while decreasing the frequency resolution. With Equations (2.4) and (2.5) we now have:
N_k = \frac{q f_s}{f_k\,(2^{1/B} - 1)} \qquad (2.6)
where the dependency on $\Delta\omega$ disappears.
To reduce the computational effort of the CQT while allowing signal reconstruction from the CQT coefficients, the
atoms can be placed $H_k$ samples apart, where $H_k$ is referred to as the hop size. Typical values for the hop size are
$0 < H_k \lesssim \frac{1}{2} N_k$.
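To make Equations (2.3) to (2.6) concrete, the sketch below (illustrative only; the thesis relies on the MATLAB toolbox of [29]) computes the center frequencies and window lengths of a CQT covering A0 to C8, with the number of bins per octave chosen here as an assumption.

```python
import numpy as np

fs = 44100.0   # sampling rate (Hz), as used throughout the thesis
f1 = 27.5      # lowest center frequency: note A0
fmax = 4186.0  # highest note considered: C8
B = 12         # bins per octave (an assumed value for this example)
q = 0.8        # scaling factor, as in Equation (2.5)

# Equation (2.3): geometrically spaced center frequencies f_k = f1 * 2**((k - 1) / B).
K = int(np.ceil(B * np.log2(fmax / f1))) + 1
k = np.arange(1, K + 1)
f_k = f1 * 2.0 ** ((k - 1) / B)

# Equation (2.6): window lengths N_k, inversely proportional to f_k so that
# the Q-factor stays constant across bins.
N_k = q * fs / (f_k * (2.0 ** (1.0 / B) - 1.0))

# Q-factor as in Equation (2.5), omitting the window-dependent term delta-omega.
Q = q / (2.0 ** (1.0 / B) - 1.0)
print(f"{K} bins, Q = {Q:.1f}, N_1 = {N_k[0]:.0f} samples, N_K = {N_k[-1]:.0f} samples")
```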
2.2.2 CQT Application
Due to the varying number of samples considered in each frequency bin, the CQT is hard to calculate efficiently.
In [29], Schörkhuber and Klapuri propose an efficient method to calculate the CQT. This method is based on the
algorithm proposed by Brown and Puckette [30]. It is a less computationally expensive algorithm and it allows the
calculation of the inverse CQT (which was not possible with the Brown and Puckette solution).
Schörkhuber and Klapuri also developed a MatLab Toolbox with their algorithm implemented. This toolbox will
be used to compute the CQT; as such, the computation of the CQT is out of the scope of this thesis and can
be further explored in [29]. The CQT will always be computed with the same fixed parameters. The minimum
and maximum frequencies considered correspond to the notes A0 and C8 respectively, as these are
the lowest and highest notes considered. 60 frequency bins are considered, and a sample frequency of
$f_s = 44100$ Hz is set. The hop size $H_k$ is set at 0.3, while the scaling factor $q$ is set at 0.8.
As will be explained in Subsection 3.2.3, the resulting transform can be further manipulated to represent the pitch
across time over a MIDI scale. Musical Instrument Digital Interface (MIDI) is a technical standard that allows
manipulation and control over digital instruments. A MIDI file contains information about which notes are played,
when they are played, and with which pitch. It can be applied to any digital instrument, and
through the use of digital audio manipulation programs, an audio file can be created with the digital instrument
performing as indicated in the MIDI file. A MIDI scale can be seen as a zero-one representation of the notes
that are being played across time, giving a close approximation of a piano-roll representation. In Appendix A the
MIDI scale is presented for the notes considered, along with the scale used to identify the notes throughout this
thesis.
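The zero-one piano-roll representation mentioned above can be sketched as follows; the note events and time step are hypothetical and only illustrate the data structure.

```python
import numpy as np

STEP = 0.04  # time resolution in seconds (40 ms, matching the CQT frames used below)

# Hypothetical note events: (MIDI note number, onset in s, offset in s).
notes = [(69, 0.0, 0.5),   # A4
         (72, 0.5, 1.0),   # C5
         (76, 0.5, 1.0)]   # E5

def piano_roll(notes, n_pitches=128, duration=1.0, step=STEP):
    """Binary matrix: rows are MIDI pitches, columns are time frames."""
    roll = np.zeros((n_pitches, int(np.ceil(duration / step))), dtype=np.uint8)
    for pitch, onset, offset in notes:
        roll[pitch, int(onset / step):int(offset / step)] = 1
    return roll

roll = piano_roll(notes)
print(roll.shape)  # (128, 25): one row per MIDI pitch, one column per frame
```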
(a) Oboe’s log-spectrograms. (b) Violin’s log-spectrograms.
Figure 2.1: CQT of a Violin and an Oboe playing an A3 and a C4 note.
To demonstrate the aforementioned property of harmonic signals, the CQT of a Violin and an Oboe playing the
A3 and C4 notes for 1 second was calculated (with 40ms steps). The result can be observed in Figure 2.1,
where in Figure 2.1a the resulting log-spectrograms for the oboe are displayed and in Figure 2.1b the resulting
log-spectrograms for the violin are displayed. By inspecting these figures, it can be seen that each instrument
produces its own pattern of harmonics, with different intensities and amplitudes. This forms a pattern that can be
interpreted as timbre. On another note, it can also be seen that the harmonics do not correspond exactly to the
theoretical harmonic notes due to tuning deviations, thus allowing the usage of different temperaments. Different
temperaments consider different intervals between notes, creating different tuning systems.
2.3 Probabilistic Latent Component Analysis
PLCA is a statistical model utilized for acoustic spectra decomposition; it falls into the category of Spectrogram
Factorization techniques. It was first introduced by Smaragdis, Raj and Shashanka [31] as an extension of another
technique utilized in text and language analysis for automatic document indexing: Probabilistic Latent Semantic
Indexing (PLSI) [32]. This method defines the fundamentals of the implemented technique, Multi-Sample Shift
Invariant Probabilistic Latent Component Analysis (MSSIPLCA), which will be further explored in Chapter 3.
As described in [31], the base model for PLCA is defined as
P(\mathbf{x}) = \sum_z P(z) \prod_{j=1}^{N} P(x_j \mid z) \qquad (2.7)
where $P(\mathbf{x})$ is an N-dimensional distribution of the random variable $\mathbf{x} = x_1, x_2, \ldots, x_N$, the $P(x_j \mid z)$ are
one-dimensional distributions and $z$ is a latent variable. Latent variables (or hidden variables) are variables that cannot
be directly observed and that are inferred from observable variables. Thus, this model aims to approximate an
N-dimensional distribution with a product of marginal distributions.
To estimate the marginal distributions, this technique uses the Expectation-Maximization (EM) algorithm [33].
This algorithm introduces hidden variables (unobserved variables) in a Maximum Likelihood (ML) Estimation,
defining unobserved data. ML Estimation computes parameters that maximize the probability of occurrence of a
given measurement of a random variable distributed by a probability density function [34]. The Likelihood function
can be defined as
L(\Theta) = p(\mathbf{y} \mid \Theta) \qquad (2.8)
where $\mathbf{y} = (y_1, \ldots, y_N)^T$ is a measurement vector of a random variable $Y$ and $\Theta$ is a parameter that defines the
probability density function. It is common to maximize the log-Likelihood function, which can be easier to compute
and provides the same result, since the logarithm is a strictly increasing function.
The EM algorithm considers the log-Likelihood of the complete data $\mathbf{x}$, which consists of the incomplete observed
data $\mathbf{y}$ and the unobserved data $\mathbf{z}$:
\mathbf{x} = (\mathbf{y}^T, \mathbf{z}^T)^T \qquad (2.9)
Resulting in:
L(\Theta) = p(\mathbf{x} \mid \Theta) \qquad (2.10)
This is an iterative algorithm that is divided into two distinct steps: an Expectation and a Maximization step,
which are alternated. In the Expectation step, the contribution of the latent variable $z$ is estimated, producing an estimate of the
log-Likelihood function [34]. This estimation is computed as:
E[L(\Theta) \mid \mathbf{y}, \Theta^{(i)}] \qquad (2.11)
where $E[x]$ denotes the expected value of $x$. The expected value of a random variable $X$ with a probability density
function $f(x)$ can be calculated as:
E[X] = \int_{-\infty}^{\infty} x f(x)\, dx \qquad (2.12)
In the Maximization step, the previously obtained estimation is maximized through the following equation:
\Theta^{(i+1)} = \arg\max_{\Theta} E[L(\Theta) \mid \mathbf{y}, \Theta^{(i)}] \qquad (2.13)
To perform this iteration, an initial estimate $\Theta^{(0)}$ must be provided. These two steps are alternated iteratively
until a stopping criterion is reached. The algorithm can converge to an optimal solution or get
stuck in local minima; the number of iterations, as well as the stopping criterion, must be fine-tuned to provide better
results [34].
As in [31], applying EM to the PLCA method yields the following equations:
R(\mathbf{x}, z) = \frac{P(z) \prod_{j=1}^{N} P(x_j \mid z)}{\sum_{z'} P(z') \prod_{j=1}^{N} P(x_j \mid z')} \qquad (2.14)
P(z) = \int P(\mathbf{x}) R(\mathbf{x}, z)\, d\mathbf{x} \qquad (2.15)
P(x_j \mid z) = \frac{\int \ldots \int P(\mathbf{x}) R(\mathbf{x}, z)\, dx_k,\ \forall k \neq j}{P(z)} \qquad (2.16)
Equation (2.14) corresponds to the expectation step, estimating the contribution of each latent component, while
Equations (2.15) and (2.16) correspond to the maximization step, re-estimating the latent prior and the marginal distributions.
As mentioned above, by alternating these steps repeatedly the estimates will converge to an approximate solution.
In the end it will generate a good approximation for $P(x_j \mid z)$, which represents a latent marginal distribution
across the dimension of the variable $x_j$, and for $P(z)$, which contains the latent variable prior.
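To make the EM updates concrete, the following sketch implements the two-dimensional special case of this model, in which a normalized magnitude spectrogram $P(f, t)$ is approximated by $\sum_z P(z) P(f \mid z) P(t \mid z)$; this is a minimal illustration only and not the MSSIPLCA implementation of Chapter 3.

```python
import numpy as np

def plca_2d(V, n_z=2, n_iter=200, seed=0):
    """EM for the 2-D PLCA model P(f,t) ~= sum_z P(z) P(f|z) P(t|z)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    P = V / V.sum()                                    # treat the spectrogram as a distribution
    Pz = np.full(n_z, 1.0 / n_z)                       # latent prior P(z)
    Pf = rng.random((F, n_z)); Pf /= Pf.sum(axis=0)    # marginals P(f|z)
    Pt = rng.random((T, n_z)); Pt /= Pt.sum(axis=0)    # marginals P(t|z)

    for _ in range(n_iter):
        # E-step: posterior contribution of each latent component (cf. Equation (2.14)).
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt[None, :, :]   # shape (F, T, z)
        R = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)

        # M-step: re-estimate the prior and the marginals (cf. Equations (2.15) and (2.16)).
        W = P[:, :, None] * R
        Pz = W.sum(axis=(0, 1))
        Pf = W.sum(axis=1) / (Pz + 1e-12)
        Pt = W.sum(axis=0) / (Pz + 1e-12)
    return Pz, Pf, Pt

# Toy "spectrogram" built from two components, which the model should recover.
V = np.outer([1, 0, 2, 0], [1, 1, 0, 0]) + np.outer([0, 3, 0, 1], [0, 0, 2, 1.0])
Pz, Pf, Pt = plca_2d(V)
print(Pz)  # mixing weights of the two recovered components
```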
The base model for PLCA is presented above and, as noted in [31], it can be extended to allow invariance to
transformations. This method and its properties will be further explored in Chapter 3.
2.4 Artificial Neural Networks
Artificial neural networks are a branch of Machine Learning methods and algorithms that are broadly utilized in
pattern recognition. These algorithms were inspired by the neural structure of the brain. They are based on
networks that contain a series of computational nodes or neurons (inspired by human neurons). To improve the
MSSIPLCA method implemented in Chapter 3, a CNN was developed. As such, in the remainder of this section
the fundamentals of Artificial Neural Networks will be explored; the developed CNN will be detailed in Chapter 4.
A node receives input data and combines it with its own set of coefficients, which can emphasize or lessen
the relevance of this data. This weighted input is then summed and passed through an activation function, yielding
the output of the node, as can be seen in Figure 2.2a. This output determines whether the input data should
contribute to the output of the net. The set of coefficients, or weights, of a node can be dynamically changed in
order to emphasize specific data, in a learning process. We can then combine several nodes into a layer, as in
Figure 2.2b. A neural network consists of one or more node layers. Given a specific data set, a neural network
can be trained to correctly classify input data [35].
(a) Neural node example (b) Artificial neural network example.
Figure 2.2: An example of a generic neural node is presented at the left [35]. At the right, a generic artificial neural network architecture is displayed.
There are multiple types of neurons. A perceptron is a simple neuron whose output is zero or one, depending on whether the
summed weighted input is below or above a given threshold. Although simple, the zero-or-one approach of
this neuron makes its learning process harder, because small changes in the input can cause drastic changes in
the output. To solve this issue sigmoid neurons are utilized. These neurons have sigmoid functions as activation
functions, thus removing the drastic response to small input changes [36].
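A minimal sketch of the two neuron types just described (illustrative only): both compute a weighted sum of the input, but the perceptron thresholds it to zero or one, while the sigmoid neuron responds smoothly.

```python
import numpy as np

def perceptron(x, w, b):
    # Hard threshold: the output jumps between 0 and 1.
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

def sigmoid_neuron(x, w, b):
    # Smooth activation: small input changes give small output changes.
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.7, 0.2])
w = np.array([0.5, -1.0])
print(perceptron(x, w, b=0.1), sigmoid_neuron(x, w, b=0.1))
```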
A neural network can have several architectures. Feed-forward networks are networks with more than one layer,
in which each layer receives as input the previous layer's output. This allows the network to infer higher
degrees of complexity, since each layer builds on the knowledge gained by the previous layers. An MLP is an
example of such a network. Deep neural networks are a type of artificial neural network in which multiple hidden
layers are stacked to create the network, hence deep. Thus, a deep neural network is capable of modelling
complex data with non-linear relationships [37].
Despite the vast combination of network types and architectures, all of them must undergo a learning process,
in order to learn from the training data. There are two types of learning: supervised and unsupervised. In
supervised learning the data set utilized is correctly labelled and identified. In unsupervised learning the data set
has no labels, so the network cannot compare its classification or prediction with the real one [37]. In this thesis
only supervised learning will be considered, and the learning algorithm used will be the Backpropagation
algorithm combined with the stochastic gradient descent algorithm with momentum.
2.4.1 Stochastic Gradient Descent
In machine learning, specifically in artificial neural networks, a cost function is a function that returns an indicator
(scalar) of the network’s performance. It compares the output of the network with the correct desired value. An
example of a cost function is the quadratic cost function or mean squared error:
C(\theta) = \frac{1}{2n} \sum_x \| y(x) - a \|^2 \qquad (2.17)
where $\theta$ is a parameter vector that includes the weights and biases of the network, $n$ is the number of training inputs,
$a$ is the output of the network when $x$ is the input, and $y(x)$ is the desired output. Minimizing the cost function is the
goal of the learning process. Calculating the gradient of a cost function is an important step towards minimizing
it. The gradient is a vector containing the partial derivatives of the considered function. Thus, the gradient of a
scalar cost function is defined as:
\nabla C(\theta) = \left( \frac{\partial C(\theta)}{\partial \theta_1}, \frac{\partial C(\theta)}{\partial \theta_2}, \ldots, \frac{\partial C(\theta)}{\partial \theta_i} \right)^T \qquad (2.18)
One way to interpret the gradient is through the variation $\Delta C$ caused by a very small variation $\Delta\theta$ in the parameters,
since $\Delta C \approx \nabla C(\theta) \cdot \Delta\theta$. The gradient therefore relates the variation in the parameters to
the variation in the cost function [38]. Then, to minimize the function $C(\theta)$ we want to make $\Delta C < 0$, decreasing the cost function value. This can be achieved by choosing
\Delta\theta = -\eta \nabla C(\theta) \qquad (2.19)
where $\eta$ is a small positive parameter called the learning rate. With Equation (2.19), a variation $\Delta\theta$ is
chosen which enforces $\Delta C < 0$. The parameter $\eta$ has to be small enough to keep the linear approximation
of $\Delta C$ valid, but not too small, otherwise it will generate very small variations $\Delta\theta$, making the minimization
process very slow. Thus, by iteratively applying Equation (2.19), we achieve a successively smaller value of $C(\theta)$
[36]. This iterative process is the Gradient Descent method, and it can be summed up in the following update
equation:
\theta_{i+1} = \theta_i - \eta \nabla C(\theta_i) \qquad (2.20)
As can be seen in Equation (2.17), the cost function is the average of the cost computed for every input.
As such, given a large data set (frequent in neural networks) the cost function gradient will have to be computed
for every input, which can be a slow process. An extension to the Gradient Descent algorithm was introduced to
address this issue. This extension is called the Stochastic Gradient Descent (SGD) algorithm. The idea behind
this method is to calculate the average gradient of a small set of input data, and use this average to estimate the
overall average of the input gradients [36]. The following update equation represents the SGD algorithm:
\theta_{i+1} = \theta_i - \eta\, \frac{\sum_{m=1}^{M} \nabla C_m}{M} \qquad (2.21)
where $m = 1, \ldots, M$ indexes the randomly chosen small set of input data and $\nabla C_m$ is the gradient of the cost
function for the input data $m$. To further improve the cost function minimization process, a momentum technique
can be included in the SGD algorithm. The momentum technique alters the update rule to account for the previous
update $\Delta\theta$, which can be interpreted as the "speed of the descent". A momentum parameter $\gamma$ is introduced,
allowing the next update to consider the previous update, thus maintaining the "speed" (hence momentum). This
technique can be summarized in the following equations:
\theta_{i+1} = \theta_i + \Delta\theta_i \qquad (2.22a)
\Delta\theta_i = -\eta \nabla C_i(\theta) + \gamma \Delta\theta_{i-1} \qquad (2.22b)
\theta_{i+1} = \theta_i - \eta \nabla C_i(\theta) + \gamma \Delta\theta_{i-1} \qquad (2.22c)
Equation (2.22c) defines the update rule for the SGD algorithm with momentum.
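A minimal sketch of the update rule in Equations (2.22), applied to a simple quadratic cost (illustrative only; in this thesis the gradients come from backpropagation, described in Section 2.4.2).

```python
import numpy as np

def grad_C(theta):
    # Gradient of the toy cost C(theta) = 0.5 * ||theta - target||^2.
    return theta - target

target = np.array([3.0, -2.0])
theta = np.zeros(2)
velocity = np.zeros(2)          # previous update, the "speed of the descent"
eta, gamma = 0.1, 0.9           # learning rate and momentum parameter

for _ in range(200):
    # Equation (2.22b): combine the new gradient with the previous update.
    velocity = -eta * grad_C(theta) + gamma * velocity
    # Equation (2.22a): apply the update.
    theta = theta + velocity

print(theta)  # close to the target, i.e. the minimizer of the toy cost
```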
2.4.2 Backpropagation
The Backpropagation algorithm was introduced in the 1970s, but it was not until 1986 that its importance in
neural network learning was fully appreciated [36]. In 1986, D. Rumelhart, G. Hinton and R. Williams argued that
the Backpropagation algorithm provided a faster learning process than earlier learning approaches [39].
Thus, this algorithm is the basis of modern neural network learning processes.
The goal of the Backpropagation algorithm is to calculate partial derivatives of a cost function with respect to
any weight or bias in the network. The algorithm provides insight into how changing the weights and biases of
the network impacts its overall output. It is utilized to calculate the necessary partial derivatives in order
to execute the SGD algorithm, which in turn will minimize the cost function. This interaction between both
aforementioned algorithms provides the necessary tools to train a neural network. In [36] a detailed explanation
of the Backpropagation algorithm is provided. It will be used as a guide through the remainder of this section.
Throughout the following explanation of the Backpropagation algorithm the following notation will be utilized:
• $w^l_{jk}$ denotes the weight of the connection from neuron $k$ of layer $(l-1)$ to neuron $j$ of layer $l$;
• $a^l_j$ denotes the activation of neuron $j$ in layer $l$;
• $b^l_j$ denotes the bias of neuron $j$ in layer $l$;
• matrices are written in bold upper-case letters, while vectors are written in bold lower-case letters.
In Figure 2.3 an example of the application of this notation is provided.
Figure 2.3: Example of a neural network with the notation used in Section 2.4.2 applied.
Using this notation, the activation of a neuron and the weighted input $z^l_j$ of a neuron in layer $l$ can be defined as:
a^l_j = \sigma\!\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) \qquad (2.23a)
z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j \qquad (2.23b)
where $\sigma(\cdot)$ represents a sigmoid function, as described at the beginning of Section 2.4. Rewriting these
equations in matrix form provides a better overall insight, as the equations become lighter due to fewer indices.
$$\mathbf{a}^l = \sigma(\mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l) \qquad (2.24a)$$
$$\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l \qquad (2.24b)$$
with the sigmoid function applied element-wise. Thus, a^l is the activation vector containing all activations a^l_j, b^l is the bias vector containing all biases b^l_j, W^l is the weight matrix of layer l containing all the weights of that layer, and z^l is the vector of weighted inputs to the neurons in layer l.
In order to be properly used by the Backpropagation algorithm, the cost function must satisfy two constraints. Since the algorithm calculates partial derivatives for individual training examples, it must be possible to write the cost function as an average of the cost functions of individual training samples and as a function of the outputs of the network. These constraints are presented in Equations (2.25).
$$C = \frac{1}{n} \sum_x C_x \qquad (2.25a)$$
$$C = C(\mathbf{a}^L) \qquad (2.25b)$$
where L is the output layer. With these definitions, the four main equations of the Backpropagation algorithm can be presented, where the operator ⊙ denotes the element-wise (Hadamard) product of two vectors.
$$\delta^L = \nabla_a C \odot \sigma'(\mathbf{z}^L) \qquad (2.26a)$$
$$\delta^l = \left((\mathbf{W}^{l+1})^T \delta^{l+1}\right) \odot \sigma'(\mathbf{z}^l) \qquad (2.26b)$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad (2.26c)$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \qquad (2.26d)$$
As mentioned above, the main goal of the algorithm is to calculate the quantities in Equations (2.26c) and (2.26d). To do so, the quantity δ^l_j is calculated, which is the error of neuron j in layer l.
In Equation (2.26a) the error in the output layer L is computed. The quantity ∇_a C is a vector containing all partial derivatives ∂C/∂a^L_j, which can be interpreted as the rate of change of the cost function with respect to the output activations. In σ'(z^L), the rate of change of the activation function is measured at z^L. Equation (2.26b) provides the means to calculate the error of a layer l from the error of the next layer l + 1. By taking the error in the next layer, δ^{l+1}, multiplying it by the transpose of the weight matrix of layer l + 1 and then performing an element-wise product with σ'(z^l), the error is passed backwards through the network, hence the name backpropagation.
Combining Equations (2.26a) and (2.26b), the error at every layer can now be computed. Starting by computing the error at the output layer (Equation (2.26a)), the error at layer L − 1 can be computed (Equation (2.26b)), and so on. With the error at each layer, the partial derivatives ∂C/∂b^l_j and ∂C/∂w^l_{jk} can be calculated through Equations (2.26c) and (2.26d), as intended. A proof of these four fundamental equations is provided in [36].
2.4.3 Learning Algorithm
As shown with the Backpropagation equations (Equations (2.26)), the partial derivatives of a cost function for an input example can be computed. To train a neural network, the Backpropagation algorithm is then combined with a learning algorithm such as SGD, where the partial derivatives for multiple training examples are calculated. The learning algorithm can now be defined [36]:
1. Input the training data
2. For each training example x:
(a) Activation: set the input activation a^{x,1}.
(b) Feedforward:
For each layer l ∈ {2, . . . , L} compute:
i. z^{x,l} = W^l a^{x,l−1} + b^l
ii. a^{x,l} = σ(z^{x,l})
(c) Output error:
Compute δ^{x,L} = ∇_a C_x ⊙ σ'(z^{x,L})
(d) Backpropagate the error:
For each layer l ∈ {L − 1, . . . , 2} compute δ^{x,l} = ((W^{l+1})^T δ^{x,l+1}) ⊙ σ'(z^{x,l})
3. Gradient Descent:
For each layer l ∈ {L, . . . , 2} update the weights and the biases according to the update rules defined in Equation (2.22)
As shown in the algorithm above, the error is propagated backwards through the network, as it is calculated from the last layer to the first. This provides the network with insight on how the input data affects the output. Thus, by selecting small sets of input data and iterating through this algorithm, the cost function is minimized and the network learns.
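The sketch below illustrates the algorithm above for one training example of a fully-connected network: feedforward, output error, backpropagation of the error and the resulting gradients used by the SGD update. It is a simplified stand-in written in Python/NumPy, not the implementation used in this thesis, and it assumes a quadratic cost C = ½‖a^L − y‖².

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return the gradients dC/dW and dC/db for one example (x, y)."""
    # Feedforward: store all weighted inputs z^l and activations a^l
    activation, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Output error: delta^L = (a^L - y) * sigma'(z^L)   (Equation (2.26a))
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    grad_w[-1] = np.outer(delta, activations[-2])
    grad_b[-1] = delta
    # Backpropagate: delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l)   (Equation (2.26b))
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = np.outer(delta, activations[-l - 1])   # Equation (2.26d)
        grad_b[-l] = delta                                   # Equation (2.26c)
    return grad_w, grad_b
```

Averaging these gradients over a mini-batch and applying the update rule of Equation (2.22) completes one iteration of the learning algorithm.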
Chapter 3
Multi-Sample Shift Invariant Probabilistic
Latent Component Analysis
In this Chapter, the MSSIPLCA module of the program developed in this thesis will be explored. This module consists of the implementation of the MSSIPLCA method developed by Benetos, Ewert and Weyde to perform automatic transcription of pitched and unpitched sounds [6]. In Section 3.1 the MSSIPLCA model is presented as an extension of the PLCA model summarized in Section 2.3. In Section 3.2 the implementation of this model is addressed, including the data pre-processing stage, the instrument template creation and the overall performance of the implemented module.
3.1 MSSIPLCA Model
To achieve the following model, several modifications and extensions to the base PLCA method described in Section 2.3 were proposed by Benetos et al. Shift-invariance across log-frequency was added in order to detect tuning changes and frequency modulations, multiple spectral templates per instrument and per pitch were used, and each source contribution was made time- and pitch-dependent [40]. A diagram of the model with these extensions is presented in Figure 3.1. Sparsity constraints were added to control the polyphony level and the instrument contribution to the resulting transcription, and the spectral templates were pre-extracted and pre-shifted across log-frequency to reduce the computational effort [41]. Benetos, Ewert and Weyde's proposed model adds the ability to detect unpitched sounds (sounds produced by percussive instruments such as drums) [6]. In this Section, the latter model will be described as in [6], along with the aforementioned properties.
3.1.1 Mathematical Model
The input to the model is a log-frequency spectrogram. In [6], it is interpreted as a probability distribution across log-frequency ω and across time t, which is a strong assumption, as it directly interprets the energy of a spectrogram as a probability value. The log-frequency spectrogram is represented as V_{ω,t} and the corresponding probability distribution as P(ω, t). The probability distribution is then decomposed into the known quantity of the frame probability P(t) and the conditional distribution over log-frequency bins P(ω|t) (resulting from dividing the entire log-frequency range into consecutive and non-overlapping frequency intervals):
Figure 3.1: Shift-Invariant PLCA model with support for multiple templates per instrument and per pitch, presented in [40].
$$V_{\omega,t} \approx P(\omega, t) = P(t)\, P(\omega|t) \qquad (3.1)$$
The conditional distribution over log-frequency bins, P(ω|t), is then decomposed into two components: a pitched component and an unpitched component. The resulting decomposition is described in the following Equation:

$$P(\omega|t) = P(r = h|t)\, P_h(\omega|t) + P(r = u|t)\, P_u(\omega|t) \qquad (3.2)$$

where P_h(ω|t) is the spectrogram approximation of the pitched component and P_u(ω|t) is the spectrogram approximation of the unpitched component. The probability P(r|t) weights the respective component over time, with r ∈ {h, u} for the pitched (h) and unpitched (u) components respectively.
Considering only the pitched component P_h(ω|t), as in [40], a latent variable p representing the pitch (on the MIDI scale) is added to the model. The resulting pitched component is:

$$P_h(\omega|t) = \sum_{p=p_{min}}^{p_{max}} P(\omega|p, t)\, P(p|t) \qquad (3.3)$$
Additionally, a latent variable s for the instrument source (the instrument index) and a latent variable f for pitch shifting across log-frequency (also referred to as the shifting parameter) are added to the model, obtaining:

$$P_h(\omega|t) = \sum_{p,s} P_h(\omega|s, p) *_{\omega} P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) \qquad (3.4)$$

where P_h(ω|s, p) represents the spectral template for a given pitch p and instrument s, P_h(f|p, t) represents the time-varying log-frequency shift per pitch, which is convolved with P_h(ω|s, p) across ω (operator ∗_ω), P_h(s|p, t) represents the instrument contribution per pitch across time, and P_h(p|t) is the pitch activation across time.
To obtain Equation (3.4) the chain rule is applied. This rule states how a joint probability distribution can be represented in terms of conditional probabilities. It is described in Equation (3.5a), and in Equation (3.5b) the decomposition of a 4-variable probability distribution is shown, resulting from repeatedly applying the chain rule to the final term of the decomposition.

$$P(A_n, \ldots, A_1) = P(A_n|A_{n-1}, \ldots, A_1) \cdot P(A_{n-1}, \ldots, A_1) \qquad (3.5a)$$
$$P(A_4, A_3, A_2, A_1) = P(A_4|A_3, A_2, A_1) \cdot P(A_3|A_2, A_1) \cdot P(A_2|A_1) \cdot P(A_1) \qquad (3.5b)$$
Finally, writing out the convolution operator in Equation (3.4) explicitly, we get the following model for the pitched component:

$$P_h(\omega|t) = \sum_{p,f,s} P_h(\omega - f|s, p)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) \qquad (3.6)$$
In order to reduce the computational effort of the parameter estimation steps that follow (Section 3.1.2), the use of pre-extracted and pre-shifted templates is introduced [6, 41]. With this modification, the proposed model for the pitched component is described as follows:

$$P_h(\omega|t) = \sum_{p,f,s} P_h(\omega|s, p, f)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) \qquad (3.7)$$

where P_h(ω|s, p, f) are the spectral templates per pitch p and instrument s, shifted across log-frequency according to f; P_h(f|p, t) represents the time-varying log-frequency shift per pitch; P_h(s|p, t) represents the instrument contribution per pitch across time; and P_h(p|t) is the pitch activation. The time-frequency representation, as in [6], has a spectral resolution of 5 bins per semi-tone, thus having f ∈ {1, . . . , 5}, allowing the templates to be shifted by ±1/2 semi-tone and having the ideal tuning position at f = 3.
For the unpitched component, 2 latent variables were added: d, which denotes the drum kit component used, and z, which is the index of the templates used for each component. Applying the same process as for the pitched component yields the following decomposition:

$$P_u(\omega|t) = \sum_{d,z} P_u(\omega|d, z)\, P_u(d|t)\, P_u(z|d, t) \qquad (3.8)$$

where P_u(ω|d, z) denotes the z-th spectral template for the drum component d, P_u(d|t) represents the drum component activation and P_u(z|d, t) denotes the template contribution per drum component over time.
The overall mathematical model is obtained when both components are considered (Equations (3.7) and (3.8)):

$$V_{\omega,t} \approx P(\omega, t) = P(t)\, P(r = h|t)\, P_h(\omega|t) + P(t)\, P(r = u|t)\, P_u(\omega|t)$$
$$= P(t)\, P(r = h|t) \sum_{p,f,s} P_h(\omega|s, p, f)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) + P(t)\, P(r = u|t) \sum_{d,z} P_u(\omega|d, z)\, P_u(d|t)\, P_u(z|d, t) \qquad (3.9)$$
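To make the factorization concrete, the sketch below reconstructs the pitched-component approximation of Equation (3.7) from the individual factor tensors. It is an illustrative re-statement in Python/NumPy (the array shapes are assumptions for this example), not the Matlab code actually used in this work.

```python
import numpy as np

def pitched_spectrogram(templates, Pf, Ps, Pp):
    """Approximate P_h(omega|t) as in Equation (3.7).

    templates : P_h(omega|s,p,f), shape (n_omega, n_s, n_p, n_f),
                pre-extracted templates, pre-shifted across log-frequency
    Pf        : P_h(f|p,t), shape (n_f, n_p, n_t), log-frequency shift per pitch
    Ps        : P_h(s|p,t), shape (n_s, n_p, n_t), instrument contribution per pitch
    Pp        : P_h(p|t),   shape (n_p, n_t),      pitch activation
    returns   : P_h(omega|t), shape (n_omega, n_t)
    """
    # Sum over pitch p, shift f and source s of the product of all factors
    return np.einsum('wspf,fpt,spt,pt->wt', templates, Pf, Ps, Pp)
```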
3.1.2 Unknown Parameter Estimation
The mathematical model presented in Equation (3.9) has several parameters, some of which are fixed and known while the others are unknown. The next step of the MSSIPLCA method proposed in [6] is to estimate these unknown parameters. In Table 3.1, the parameters of the previously mentioned model are detailed. As mentioned, the parameters P_h(ω|s, p, f) and P_u(ω|d, z) are fixed and known; they correspond to the pre-extracted and pre-shifted templates.
Table 3.1: Parameters used in the implemented MSSIPLCA model, proposed in [6].

Parameter | Component | State | Description
P(t) | — | known | Spectrogram energy
P(r|t) | — | unknown | Weights a component contribution over time
Ph(ω|s,p,f) | pitched | known | Spectral templates per pitch and instrument, shifted according to f
Ph(f|p,t) | pitched | unknown | Log-frequency shift per pitch, over time
Ph(s|p,t) | pitched | unknown | Instrument contribution per pitch, over time
Ph(p|t) | pitched | unknown | Pitch activation, over time
Pu(ω|d,z) | unpitched | known | Spectral template per drum component
Pu(z|d,t) | unpitched | unknown | Template contribution per drum component, over time
Pu(d|t) | unpitched | unknown | Drum component activation
To estimate the unknown parameters, the EM algorithm is used [33], as in Section 2.3. The model's log-likelihood is defined as:

$$\mathcal{L} = \sum_{\omega,t} V_{\omega,t} \log(P(\omega, t)) \qquad (3.10)$$
Again, the EM algorithm is divided into two distinct steps. In the Expectation step, the contribution of the latent variables is estimated by a weighting function. This process results in the following Equations, for the pitched and unpitched components respectively:

$$P(s, p, f, r = h|\omega, t) = \frac{P(r = h|t)\, P_h(\omega|s, p, f)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t)}{P(\omega|t)} \qquad (3.11a)$$
$$P(d, z, r = u|\omega, t) = \frac{P(r = u|t)\, P_u(\omega|d, z)\, P_u(d|t)\, P_u(z|d, t)}{P(\omega|t)} \qquad (3.11b)$$
In the Maximization step, the marginals are re-estimated, this time using the estimates calculated in the Expectation step, resulting in the following Equations for the pitched component:

$$P(r = h|t) \propto \sum_{s,p,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t) \qquad (3.12a)$$
$$P_h(f|p, t) = \frac{\sum_{\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)}{\sum_{\omega,s,f} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)} \qquad (3.12b)$$
$$P_h(s|p, t) = \frac{\sum_{f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)}{\sum_{f,\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)} \qquad (3.12c)$$
$$P_h(p|t) = \frac{\sum_{s,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)}{\sum_{s,f,\omega,p} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)} \qquad (3.12d)$$
and the following Equations for the unpitched component:

$$P(r = u|t) \propto \sum_{d,z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t) \qquad (3.13a)$$
$$P_u(d|t) = \frac{\sum_{z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)}{\sum_{z,\omega,d} V_{\omega,t}\, P(d, z, r = u|\omega, t)} \qquad (3.13b)$$
$$P_u(z|d, t) = \frac{\sum_{\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)}{\sum_{\omega,z} V_{\omega,t}\, P(d, z, r = u|\omega, t)} \qquad (3.13c)$$
In music, generally, only a few notes are active at the same time, and in a small time interval these notes are produced by few instrument sources. With the pre-extracted templates, the model described above has more information than its input requires. Thus, to control the polyphony level and the instrument contribution over time, sparsity is enforced [6, 40, 41]. Sparsity is enforced on the parameters through the use of a scaling factor α in the update Equations — Equations (3.12) and (3.13). When this scaling factor is greater than 1, the probability distributions are sharpened, pushing most weights towards 0 while only a few remain large, thus enforcing sparsity. Below are the new constrained update Equations:
$$P_h(f|p, t) = \frac{\left(\sum_{\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}}{\sum_f \left(\sum_{\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}} \qquad (3.14a)$$
$$P_h(s|p, t) = \frac{\left(\sum_{f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}}{\sum_s \left(\sum_{f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}} \qquad (3.14b)$$
$$P_h(p|t) = \frac{\left(\sum_{s,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}}{\sum_p \left(\sum_{s,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}} \qquad (3.14c)$$
$$P_u(d|t) = \frac{\left(\sum_{z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}}{\sum_d \left(\sum_{z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}} \qquad (3.14d)$$
$$P_u(z|d, t) = \frac{\left(\sum_{\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}}{\sum_z \left(\sum_{\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}} \qquad (3.14e)$$
In the overall model, only the pitch activation parameter P_h(p|t) is enforced with sparsity: the remaining parameters use α = 1 while P_h(p|t) uses α = 1.1, as in [6].
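The sketch below illustrates one EM iteration for the pitched component only: the E-step of Equation (3.11a) followed by the sparsity-constrained M-step of Equations (3.14a)-(3.14c). It is a simplified Python/NumPy illustration (the component weighting P(r|t) is omitted and the full joint tensor is materialized in memory), not the Matlab implementation used in this thesis.

```python
import numpy as np

def em_iteration(V, templates, Pf, Ps, Pp, alpha_p=1.1, eps=1e-12):
    """One EM iteration of the pitched-only model (sparsity only on P_h(p|t)).

    V         : log-frequency spectrogram, shape (n_omega, n_t)
    templates : P_h(omega|s,p,f), shape (n_omega, n_s, n_p, n_f)
    Pf, Ps    : shift and source distributions, shapes (n_f,n_p,n_t), (n_s,n_p,n_t)
    Pp        : pitch activation, shape (n_p, n_t)
    """
    # Product of all factors and the model approximation P(omega|t)
    # (memory-heavy; written this way only for clarity of the sketch)
    joint = np.einsum('wspf,fpt,spt,pt->wspft', templates, Pf, Ps, Pp)
    approx = joint.sum(axis=(1, 2, 3)) + eps                 # shape (n_omega, n_t)
    post = joint / approx[:, None, None, None, :]            # E-step, Eq. (3.11a)

    # Expected counts: weight the posterior by the observed spectrogram energy
    counts = V[:, None, None, None, :] * post                # (n_omega,n_s,n_p,n_f,n_t)

    # M-step with sparsity exponent alpha, Eq. (3.14a)-(3.14c)
    def normalize(num, axis, alpha=1.0):
        num = num ** alpha
        return num / (num.sum(axis=axis, keepdims=True) + eps)

    Pf_new = normalize(counts.sum(axis=(0, 1)).transpose(1, 0, 2), axis=0)   # (n_f,n_p,n_t)
    Ps_new = normalize(counts.sum(axis=(0, 3)), axis=0)                      # (n_s,n_p,n_t)
    Pp_new = normalize(counts.sum(axis=(0, 1, 3)), axis=0, alpha=alpha_p)    # (n_p,n_t)
    return Pf_new, Ps_new, Pp_new
```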
3.2 Implementation
This module was developed in Matlab due to its toolbox integration characteristics. To implement the mathematical model described in Section 3.1, several Matlab toolboxes were used. To implement the MSSIPLCA method, the toolbox provided in [6] was used (MSSIPLCA Toolbox). This toolbox contained a demo implementation of the aforementioned model, with a template library for the pitched and unpitched components. This demo was adapted to consider dynamic libraries and was parametrized with the desired sampling frequency, spectral resolution and audio input size. Furthermore, a template extraction module was developed, based on the demo code provided by Emmanouil Benetos, to create template libraries of the instruments considered in this thesis's audio data sets. To calculate the CQT, the Matlab toolbox provided in [29] is used (CQT Toolbox). This toolbox contains the tools to efficiently calculate a CQT spectrogram from a sound signal. Finally, to generate MIDI files for testing, the toolbox provided in [42] is used (MIDI Toolbox). This toolbox contains the tools to create, modify and extract information from MIDI files.
The developed module only concerns the pitched component of the MSSIPLCA method presented in [6]; unpitched sound signals are out of the scope of this thesis. Although the unpitched component is implemented and properly functioning, no tests or modifications were made to it. If an unpitched sound signal is provided, the module will transcribe it using the pre-extracted templates provided by the demo implementation of the MSSIPLCA Toolbox. Since no unpitched sounds will be provided to the module or to the overall system throughout this thesis, from now on no mention of the unpitched component will be made and the pitched component will be the only one considered.
3.2.1 Template Extraction
As mentioned in Section 3.1.1, the model uses pre-extracted and pre-shifted spectral templates per pitch and per
instrument. In order to grant instrument diversity to the transcription process, multiple templates were extracted.
Using a digital instrument library, 82 templates were extracted, combining 42 different instrument sources with 8 different playing characteristics (e.g. vibrato and staccato) and obtaining a total of 17 generic instruments (e.g. for a piano we can have distinct piano sources, generating one generic instrument). In this thesis an interval of 88 notes was considered, limiting the notes played to the interval p ∈ {21, . . . , 108}, in MIDI scale values. In Table 3.2 the generic instruments considered are displayed. For each of the above mentioned combinations,
an audio file was created with every note played individually, as in Figure 3.2.
To extract a pitched template from an audio file, a variant of the PLCA algorithm was used with only one latent variable component. This latent variable denotes the pitch, as in Equation (3.3). Applying this algorithm yields the same output as applying an NMF algorithm with beta divergence, as stated in [43]. Through the use of the NMFlib Toolbox [44], a direct implementation of the NMF algorithm with beta divergence is applied (based on the code provided by Emmanouil Benetos, the author of [6]). This method was applied to input files of the digital instruments playing their entire range of notes individually, following the note scale. The log-spectrogram of the audio signal was computed, again using the CQT, maintaining the same log-frequency resolution as in Section 3.2.2, but using a temporal step of 100 ms. The NMF algorithm was then applied, thus obtaining a pitch template of the audio file.
Finally, after extracting the pitched templates through the process described above, the templates were shifted across log-frequency, as described in Section 3.1.1, thus obtaining a pre-extracted and pre-shifted template per pitch and per instrument. After computing all the templates, they were saved in a matrix, creating a template library.
Table 3.2: Generic instruments present in the developed template library.
Instrument
Bass
Brass
Cello
Clarinet
Contra Bassoon
Double Bass
Flute
French Horn
Guitar
Oboe
Piano
Sax
Trombone
Trumpet
Tuba
Viola
Violin
3.2.2 Pre-processing
The first process executed in this module is the pre-processing stage, where the log-spectrogram of the input signal is obtained using the CQT. The CQT is performed with a log-frequency resolution of 60 bins per octave, yielding 545 frequency bins, and the log-spectrogram is sampled with a 40 ms step. As mentioned in Section 2.2, in this time-log-frequency representation the relative distance between the harmonics is constant, as can be seen in Figure 3.2. In this Figure, the log-spectrogram of a piano performing every note on its keyboard individually is displayed.
3.2.3 Post-processing
After applying the MSSIPLCA method and estimating the unknown parameters, the transcriptions can be extracted. The transcriptions are extracted on the MIDI pitch scale. The total pitched component transcription and the unpitched component transcription can be extracted as, respectively:

$$P_h(p, t) = P(t)\, P(r = h|t)\, P_h(p|t) \qquad (3.15a)$$
$$P_u(d, t) = P(t)\, P(r = u|t)\, P_u(d|t) \qquad (3.15b)$$

To extract the transcription of each instrument source, the latent variable s should be fixed to the target instrument index λ, and the following calculation should be performed:
Figure 3.2: Example of the CQT of a piano performing every note individually. Again it can be seen that the relative distance between the harmonics remains constant, as mentioned in Section 2.2.
$$P_h(s, p, t) = P(t)\, P(r = h|t)\, P_h(p|t)\, P_h(s = \lambda|p, t) \qquad (3.16)$$
Performing this calculation yields a piano-roll-like matrix, which contains a raw transcription output. In order to obtain a good piano-roll transcription, some post-processing steps are performed. The first step consists in normalizing the raw output. The normalization is performed with the following operation:

$$\frac{P_h(s, p, t)}{\max\left(\sum_{s,p,t} |P_h(s, p, t)|\right)} \qquad (3.17)$$
After obtaining a normalized raw transcription, thresholding is performed. Given a threshold parameter σ, the transcription is converted into a binary matrix: if the value of the raw transcription surpasses σ the output is 1, and 0 otherwise. Lastly, since the minimum duration of a note is defined to be 0.2 seconds, events with a duration shorter than 80 ms are removed in an effort to eliminate small transcription errors.
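A minimal sketch of these post-processing steps is shown below (Python/NumPy; parameter values and the simplification of the normalization in Equation (3.17) to a division by the maximum value are assumptions for illustration, not the behaviour of the Matlab module).

```python
import numpy as np

def postprocess(raw, sigma=0.1, min_frames=2):
    """Convert a raw piano-roll transcription into a binary one.

    raw        : raw transcription matrix, shape (n_pitches, n_frames)
    sigma      : detection threshold applied after normalization
    min_frames : minimum event length in frames (2 frames = 80 ms at a 40 ms step)
    """
    piano_roll = raw / (np.max(np.abs(raw)) + 1e-12)      # normalize (simplified)
    piano_roll = (piano_roll > sigma).astype(int)         # threshold to a 0/1 matrix
    # Remove note events shorter than min_frames
    for p in range(piano_roll.shape[0]):
        run_start = None
        for t in range(piano_roll.shape[1] + 1):
            active = t < piano_roll.shape[1] and piano_roll[p, t] == 1
            if active and run_start is None:
                run_start = t
            elif not active and run_start is not None:
                if t - run_start < min_frames:
                    piano_roll[p, run_start:t] = 0
                run_start = None
    return piano_roll
```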
3.2.4 Performance Evaluation
An overview of the implemented module can be seen in the system diagram presented in Figure 3.3. To evaluate
the module’s performance a test experiment was conducted. In this Section this experiment is described.
The aim of this experiment is to test the performance of the implemented module when facing different template library sizes. For each audio file considered, the model was executed while incrementing the template library size. The full library considered included 8 templates of 8 distinct instruments. In Table 3.3 these templates and their
Figure 3.3: System diagram of the developed Module 1, adapted from [40].
characteristics are presented. The template library is incremented by first choosing the templates of the instruments present in the audio file. For instance, if an audio file has instruments A and B performing, the library starts with a template of A or B, and then the template of B or A is added. Afterwards, the remaining templates are added randomly.
Test data
To create a plausible data set for testing, the audio files were created using random MIDI files. These MIDI files were generated by an auxiliary module, developed with the MIDI Toolbox mentioned at the beginning of the chapter. This module generates a random MIDI file under constraints of total file time, number of notes per file, minimum and maximum note duration, and polyphony control (whether notes are allowed to be played in the same time frame). The notes were randomly selected and spread across time. For this test the MIDI files used had the following characteristics: 30 seconds of total duration, note durations in [0.2 s, 4 s] and polyphony allowed.
For this test, 30 different audio files were generated. They were divided into 3 polyphony levels (3 sets of 10 files). In the first level, the audio files have only one channel, with one of the 8 instruments performing according to a random MIDI file with the characteristics presented above. In the second level, 2 channels were used, with two distinct instruments performing at the same time, each according to an individually assigned MIDI file. In the third level another distinct instrument was added, generating audio files with 3 channels.
Table 3.3: Template library considered in this experiment. The pitch activity range is represented in MIDI scale numbers.
Index Instrument Playing Style Pitch Activity (Range)
19 Bass Open [23 64]
30 French Horn Normal [29 65]
33 Trumpet Normal [40 76]
43 Sax Legato [21 96]
46 Bb Clarinet Normal [36 79]
58 Oboe Normal [63 99]
68 Cello Normal [24 67]
80 Violin Normal [43 89]
Metrics
In order to evaluate the accuracy of the transcription, the following metric was applied. The time interval of each note in the ground truth is inspected in the transcription result, with a tolerance of 40 ms. If a note is present in this time interval, with a duration of over δ (%) of the original interval, then this note is considered correctly transcribed. This process detects accurately transcribed notes as well as false negative transcriptions.
To detect false positive transcriptions, the same process was applied in the opposite direction, by inspecting the time interval of each detected note of the transcription result in the ground truth file. Through a fine-tuning process, the parameter δ was fixed at 75% for both metric processes in this experiment. This value of δ provides a plausible accuracy criterion, as it is neither excessively high, which would ignore notes not fully transcribed, nor excessively low, which would count notes containing transcription errors as correct.
To prevent false detections due to temporal synchronization issues, prior to the evaluation both the ground truth and the transcription are aligned temporally. This alignment is based on the onset times of notes with the same pitch: the onset differences between the ground truth and the transcription of all detected notes are processed to generate an overall shift value that aligns both time frames.
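A sketch of this note-matching procedure is given below (Python). It is illustrative only — details such as how overlapping detections are accumulated are assumptions, and the actual Matlab evaluation code may differ.

```python
def count_correct(ground_truth, transcription, delta=0.75, tol=0.04):
    """Count ground-truth notes that are covered by the transcription.

    ground_truth, transcription : lists of (pitch, onset, offset) in seconds
    delta : fraction of the reference duration that must be covered (75%)
    tol   : onset/offset tolerance in seconds (40 ms)
    Returns (correct, false_negatives). Running the same function with the two
    arguments swapped gives the false-positive count.
    """
    correct = 0
    for pitch, onset, offset in ground_truth:
        covered = 0.0
        for p, s, e in transcription:
            if p != pitch:
                continue
            # overlap with the reference interval, expanded by the tolerance
            overlap = min(offset + tol, e) - max(onset - tol, s)
            covered += max(0.0, overlap)
        if covered >= delta * (offset - onset):
            correct += 1
    return correct, len(ground_truth) - correct
```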
Results
After executing the model for all the audio files of the test data set, the results were grouped by the polyphony level of the data set. The percentage of correctly transcribed notes for each level and each library size, considering the number of notes in the ground truth and the number of correctly transcribed notes detected, is presented in Table 3.4. The corresponding plot can be observed in Figure 3.4. The percentage of false positives detected, relative to the total number of ground truth notes, can be seen in Figure 3.5.
Figure 3.4: Module 1 evaluation test results: percentage of correctly transcribed notes plotted against the size of the template library considered.
After inspecting the results in Table 3.4, it can be concluded that adding unnecessary templates to the library degrades the module's transcription performance. Visually this can be seen in the plot
Table 3.4: Module 1 evaluation test results: percentage of notes correctly transcribed and resulting transcription error.
Library size: 1 2 3 4 5 6 7 8 Average
Level 1: 0.7424 0.7264 0.6852 0.6835 0.6304 0.5941 0.5687 0.4848 0.6394
Level 2: 0.6044 0.6185 0.5760 0.5518 0.5319 0.4836 0.4398 0.4307 0.5295
Level 3: 0.5427 0.5423 0.5620 0.5222 0.4975 0.4464 0.4094 0.3636 0.4857
Accuracy: 0.6298 0.6290 0.6077 0.5858 0.5533 0.5080 0.4726 0.4264 0.5516
Error: 0.3702 0.3709 0.3923 0.4142 0.4467 0.4920 0.5274 0.5736 0.4484
in Figure 3.4, where the percentage of correctly transcribed notes decreases as templates of instruments not present in the audio file are added. On the other hand, if there are not enough templates to match the instruments present in the audio file, the transcription also performs worse. This can be seen by observing the curves for levels 2 and 3 where the library size is smaller than 2 and 3, respectively.
Another conclusion that can be inferred is that the overall performance of the module decreases as the polyphony of the audio file increases. Observing the ideal transcription for each level (level 1 with a library size of 1, level 2 with a library size of 2 and level 3 with a library size of 3), the average accuracy obtained is 74.24% for audio files with only one instrument. For audio files with 2 instruments performing, the average accuracy is 61.85%, and for audio files with 3 instruments performing at the same time the average accuracy is 56.20%.
Figure 3.5: Module 1 evaluation test results: percentage of false positive note transcriptions plotted against the template library size.
In the plot of Figure 3.5, the percentage of false positive notes detected with respect to the total number of notes in the ground truth can be observed. Inspecting this graph yields yet another conclusion: the number of false positives increases with the polyphony level when trying to transcribe a polyphonic piece with a template library smaller than the number of instruments performing. In the aforementioned graph, when the template library size equals the number of instruments, the percentage of false positives decreases significantly. This can be explained by the module's attempt to assign notes that are not performed by the instrument considered to the instruments existing in the template library. Thus, when the library is smaller than the existing number of instruments, the number of false positives can be higher than the number of notes that the instrument actually performed (as it takes into account notes from other instruments).
In order to visualize the conclusions inferred above, a transcription example is presented in Figure 3.6. In this
example the audio file has two instruments performing. In Figure 3.6a, the ground truth transcription for one of the
instruments is presented. In Figures 3.6b, 3.6c and 3.6d transcription results of that instrument are presented.
In Figure 3.6b, the template library has only the template of the instrument considered. As such, the resulting transcription includes notes that belong to the second instrument in the audio file (false positives). In Figure 3.6c
the template library has 2 templates, one for each instrument present in the audio file. The factorization performed
by the module now has all the factors it "needs", so the false positives disappear from the transcription of the
instrument considered. Finally, in Figure 3.6d, the template library has 8 templates including the correct ones.
Here we can visualize the impact of adding unnecessary templates to the library. The factorization performed
by the module tries to distribute the spectrogram energy among elements that did not contribute to it, damaging the transcription of the instruments that are in fact performing in the input audio file.
3.2.5 Proposed Solution
As seen above, the library size directly influences the performance of the module. Choosing a library that is too large or too small will result in higher transcription errors, which makes the module directly dependent on the library. In order to grant autonomy to the module and to remove human interaction from the transcription process, an instrument classifier was developed. This instrument classifier detects the instruments in an audio file, providing insight to this module on how many templates it should use. Thus, an attempt is made to automatically infer the proper size of the template library, granting autonomy to the module and removing human interaction from its process. The instrument classifier is implemented in a second module, described in Chapter 4. It consists of a Convolutional Neural Network that classifies the musical instruments detected in an input log-spectrogram.
(a) Ground truth. (b) Library size = 1 template.
(c) Library size = 2 templates. (d) Library size = 8 templates.
Figure 3.6: Example of transcription results with different library sizes, for an audio file with 2 instruments performing.
Chapter 4
Convolutional Neural Network
To address the limitation imposed by the usage of a static and user-defined template library in the MSSIPLCA module presented in Chapter 3, a second module was developed — the CNN module. It is a classification module that performs classification through a Convolutional Neural Network. In this Chapter this module will be detailed, as well as the fundamentals of CNNs. In Section 4.1 an explanation regarding this type of network is presented. In Section 4.2 the module's implementation specifics are presented, including the network's architecture, the training phase and its performance evaluation.
4.1 Convolutional Neural Networks
Convolutional Neural Networks are feed-forward deep neural networks. They are inspired by the architecture of the visual cortex of animals, namely the cat's visual cortex (as in [45]). CNNs are considered to be among the best pattern recognition systems [37]. This can be seen in the handwritten character recognition task, where in 1998 LeCun et al. developed a benchmark system with state-of-the-art performance [46].
Regular neural networks, such as the MLP, take the input data and, through its propagation through the hidden layers, generate an output. These hidden layers, as seen in Section 2.4, consist of several neurons (e.g. the perceptron). These neurons are fully connected to the neurons of the previous layer, as in Figure 2.2a. When the input of the network is an image, it is not hard to see that this fully-connected architecture does not scale properly, with the number of parameters adding up through the layers. This particularity makes training an arduous and computationally expensive process [47].
In CNNs the inputs are interpreted as images, having 3 dimensions: width, height and depth (with the latter corresponding to the red, green and blue channels when dealing with real images). Thus, the neurons in a CNN also have these 3 dimensions. The neurons also have the particularity of only being connected to a specific spatial region of the previous layer. This architecture scales well with input images and allows the network to be trained, unlike fully-connected networks, where training in these conditions would be very difficult or even impossible [37].
4.1.1 CNN layers and architecture
There are three main types of layers when building a CNN: Convolutional layers, Pooling layers and Fully-connected layers. In the following section these fundamental layers will be described, alongside other layer types that can be applied to CNNs. Stacking multiple layers in different type combinations generates a fully functioning CNN architecture.
Convolutional layers
Convolutional layers are the fundamental piece of CNNs. They are composed of a set of parameters consisting of a set of weights, often called filters or kernels. As in regular neural networks, these weights can be updated in a training process to learn different representations of the data. A filter has a small width and height compared to the input, but it has the same depth. In the forward pass of the learning process (Subsection 2.4.3), each filter is convolved across the width and the height of the input, hence the name convolutional network. As the filters pass over the input image, they are updated during training so as to activate when a certain feature arises in a specific spatial location. This process creates an activation map. These activation maps may also be followed by an element-wise activation function, such as the Rectified Linear Unit (ReLU), which will be explained below.
Each neuron's output can be interpreted as the result of a neuron analysing a small spatial location [47]. This spatial location is called the receptive field, which dictates the size of the spatial region to be analysed by the neuron. This denotes an important property of CNNs: the neurons are locally connected.
The full output of the layer consists of the stacked activation maps, creating a 3-dimensional output. The width and height of this output are given by the convolution operation between the filters and the input. The depth of this output is a chosen quantity: it denotes how many neurons are desired to analyse the same spatial location. This group of neurons can be interpreted as a depth column, as seen in grey in Figure 4.1.
Each depth column has an assigned spatial location. These spatial locations often overlap, causing different depth columns to analyse partially the same spatial location. This overlap is dictated by the stride. For example, if the stride is set to 1, a new depth column is assigned a spatial location 1 spatial unit apart from the previous one. As the convolution operation changes the size of the input image, zero-padding can be used to prevent this from happening. Zero-padding consists in adding zeros to the spatial borders of the input. It allows control over the output dimensions [47].
The previous quantities denote the hyper-parameters of a convolutional layer. A hyper-parameter can be interpreted as a high-level parameter that influences the model's performance. The output size can then be computed from the hyper-parameters and the filter information [48], as follows:

$$O_{h,w} = \frac{I_{h,w} - F_{h,w} + 2P}{S} + 1 \qquad (4.1a)$$
$$O_d = K \qquad (4.1b)$$

where O_{h,w} denotes the output height and width (which are calculated equally), O_d denotes the output depth, I_{h,w} is the input height and width, F_{h,w} is the filter's height and width, P is the amount of zero-padding used, S is the stride and K is the number of filters used.
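As a quick worked example of Equation (4.1a), the following Python snippet computes the output size along one spatial dimension (the input width of 32 used here is an arbitrary illustrative value, not a dimension from this thesis's network).

```python
def conv_output_size(i, f, p, s):
    """Output size along one spatial dimension, Equation (4.1a)."""
    return (i - f + 2 * p) // s + 1

# Example: a 32-unit-wide input, a filter of width 3, zero-padding of 1 and stride 2
print(conv_output_size(32, 3, 1, 2))  # -> 16
```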
Another important property of convolutional layers is their parameter sharing characteristic. This characteristic is based on the assumption that if it is useful to compute a set of features in one position, it is also useful to compute it in the remaining positions. This means that, for a fixed depth, all neurons share the same weights and bias, thus reducing the number of parameters and facilitating the learning process. In practice, during the Backpropagation phase each neuron calculates the gradient of its weights, but in the end all gradients are added up within each depth level. Since all the neurons in a depth level share the same weights, the forward pass of the layer at each depth level is the convolution between the neurons' weights (filters) and the input volume. This process results in an activation map, and the set of all activation maps for all depth levels creates the output volume of the layer [47].
Pooling layers
A Pooling layer is inserted with the intent of achieving spatial invariance by reducing the spatial size of its input; it downsamples the input resolution. A direct consequence of this resolution reduction is a smaller number of parameters, thus reducing the computational effort of training the network. Reducing the number of parameters also provides overfitting control [49]. Again, this reduction is independent for each depth level, maintaining the depth resolution. There are multiple types of pooling operations: Max Pooling, Subsampling and Average Pooling, to name a few.
Figure 4.2: Example of Max Pooling on an input depth level.
Due to its success in capturing invariances in image-like data, Max Pooling is the most commonly applied pooling operation [49]. Max pooling applies filters with a given stride to the input. The filters define the spatial region over which the maximum operator is applied. In Figure 4.2, an example of this operation is provided. It can be observed that (2 × 2) filters were used, with a stride of 2. The Max pooling operation then selects the maximum of each filter's spatial location. The output map has 3/4 fewer activations than the input map, thus reducing the resolution.
In Equations (4.2) the output size of the Pooling layer is computed.

$$O_{h,w} = \frac{I_{h,w} - F_{h,w}}{S} + 1 \qquad (4.2a)$$
$$O_d = I_d \qquad (4.2b)$$

As mentioned above, the output map is reduced, thus down-sampling the image (Equation (4.2a)), while its depth remains constant (Equation (4.2b)).
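A minimal NumPy sketch of 2 × 2 max pooling with stride 2 on a single depth slice, matching the operation of Figure 4.2 (the 4 × 4 test input is an arbitrary example):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on one depth slice (height and width assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # 2x2 output containing the maximum of each 2x2 block
```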
ReLU Layers
ReLUs are neurons with the non-saturating non-linear activation function f(x) = max(0, x), where x is the neuron's input [50]. These layers apply the mentioned non-linearity element-wise, thus leaving the input size unaltered. They are stacked after convolutional layers, as they provide a faster learning process than the typical sigmoid non-linearities [51].
Dropout Layers
One major concern when dealing with large and complex deep neural networks is overfitting. Overfitting occurs when the network has enough complexity to memorize the training data, losing its generalization capabilities. The probability of overfitting increases with the size of the network [52].
The dropout technique was introduced to address the overfitting concern in artificial neural networks [53]. With this technique, overfitting is prevented by temporarily removing some neurons and their connections during the training process. The dropped units are randomly chosen and each unit is retained with a fixed probability (usually 0.5), independently of the other units. After removing the units, the resulting network is a thinned sample of the original network. As this process repeats itself in each training iteration, these sampled thinner networks always consist of different neurons and connections of the original network, so each sampled network is trained very rarely. At the end of training, an averaging technique is performed: the outgoing weights of a retained unit are multiplied by the aforementioned fixed probability in order to combine all the sampled networks into one single network [53].
As stated in [53], neural networks using the dropout technique can be trained in a similar manner to regular neural networks. The only difference is that the forward and backpropagation passes are applied to the sampled networks instead of the original network. Using this technique leads to a significantly lower generalization error, thus preventing overfitting.
Fully-connected layers
Fully-connected layers, as the name implies, are layers whose neurons are fully connected to the neurons of the previous layer. These are the standard neural network layers seen in Section 2.4. They are usually employed in the last layers of a CNN architecture, to provide a high-level insight into the input data.
It is important to mention that the difference between a convolutional layer and a fully-connected layer lies in the local connectivity and parameter sharing properties of the convolutional layer. Setting the convolutional layer's filter size to match the spatial size of the input (height and width) results in an output of size (1 × 1 × K), where K is the number of filters. Thus, such a layer acts as a fully-connected layer with K neurons [47].
Softmax Loss Layers
A Loss layer is the last layer of a neural network architecture. It generates the final output, and thus the classification. It consists of a Fully-connected layer with an applied loss function. The standard function is the Softmax loss function [54]:

$$f(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \ldots, K \qquad (4.3)$$

The Softmax loss function (or cost function) provides a probabilistic insight into the resulting classification, as it outputs normalized class probabilities. This function is displayed in Equation (4.3). Given a K-dimensional vector z of arbitrary scores, it outputs a vector with the corresponding values in the [0, 1] interval, with the total sum of the output vector equal to 1 [47].
With a Softmax loss layer as the last network layer, the output of the CNN is a vector containing the probabilities
of each class, given the input data.
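A small, numerically stable implementation of Equation (4.3) (Python/NumPy; the example scores are arbitrary):

```python
import numpy as np

def softmax(z):
    """Softmax of a score vector z: outputs normalized class probabilities."""
    z = z - np.max(z)   # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1, largest score gets the highest probability
```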
4.1.2 CNN in Music
In recent years, Convolutional Neural Networks have been increasingly applied to music related tasks, mainly due to a set of properties that they possess. Their weight sharing property allows the training of deeper architectures with a high number of parameters, making them capable of modelling the complex data contained in a musical signal. Their shift-invariance allows pattern recognition across time and frequency, providing the capability of interpreting a time-frequency representation and recognizing a pattern along each of these dimensions.
This type of neural network has been applied to several tasks of the AMT process. CNNs achieved state-of-the-art results on onset detection, as in [55]. In source separation tasks CNNs again yielded positive results: in [56] a CNN was trained to analyse a spectrogram and automatically separate the vocals from a musical mixture. Considering the classification task, CNN classifiers also achieved great results [57, 58, 59]. These classifiers were trained to analyse musical features extracted from an input signal in order to classify its musical genre.
Specifically in instrument classification, CNNs have been applied successfully, as in [60]. In the latter proposed model, the classifier received as input not only the extracted features but also the signal's spectrogram. The aforementioned examples and the harmonic sound property mentioned in Section 2.1 were the basis for the decision to design a CNN classifier to address the problem of automatically choosing a library size for the proposed Module 1 (Chapter 3).
4.2 Implementation
This module was also developed in Matlab, to facilitate the integration with the previous module and due to Matlab's fast prototyping characteristics regarding neural networks. The CNN was developed and trained using the MatConvNet Toolbox [61]. Once again, the CQT Matlab toolbox provided in [29] was used to compute the log-spectrogram.
4.2.1 Network’s Architecture and Learning Process
A CNN classifier was trained to detect notes of one of three chosen instruments in log-spectrograms of 1.2 seconds. This classifier was then applied to a musical signal as a windowed function. After a normalization process, and given a classification threshold µ, an output classification vector is generated. This vector consists of three binary outputs, each corresponding to an instrument, taking the value 1 if the instrument is present in the input signal and 0 otherwise.
Figure 4.3: Diagram of the implemented CNN’s architecture.
Table 4.1: CNN classifier’s layers and filter sizes.
Layer: Filter size Stride Padding
Convolutional layer 1 2× 3 2 1
Convolutional layer 2 2× 5 2 1
Convolutional layer 3 2× 9 1 0
Max-pooling layer 1 2× 2 2 0
Convolutional layer 4 1× 5 1 0
Convolutional layer 5 1× 11 1 0
Max-pooling layer 2 2× 2 2 0
Convolutional layer 6 2× 25 1 0
Following what is reported by several authors, such as those in [60, 62], where Convolutional Neural Networks were trained to receive raw spectrograms and then classify them, the proposed CNN also receives only raw spectrograms as input. This proved to be a challenging task, as the input data is very complex. Several attempts to design a network were made. Shallow networks did not achieve good results, as they did not have enough capacity to learn the complex data. The network that achieved the best results, and that was chosen as the classifier, is presented in Figure 4.3. It has 12 layers: 3 Convolutional layers, 1 Max-pooling layer, 2 Convolutional layers, 1 Max-pooling layer, 1 Convolutional layer, 1 Fully-connected layer, 1 Dropout layer, 1 Fully-connected layer and finally 1 Softmax layer (layers listed from the shallowest to the deepest). All Convolutional and Fully-connected layers were each followed by a ReLU non-linearity.
In Figure 4.3 the feature map sizes can be seen, and in Table 4.1 the sizes of the filters used in the Convolutional and Pooling layers are presented. In an effort to provide a better classification, the filters considered are rectangular-shaped, to maintain a high frequency resolution, as can be seen in Table 4.1.
Training data, Validation data and Test data
Table 4.2: Instruments considered in the classification task.
Index Instrument Playing Style Pitch Activity Selected Range
19 Bass Open [23 64] [29 64]
58 Oboe Normal [63 99] [63 98]
80 Violin Normal [43 89] [43 78]
Since analysing raw spectrograms is a complex task, the classifier was trained to classify among only three instruments. These instruments were chosen for their digital audio quality, their sustain capability and their distinct sound characteristics. In Table 4.2, the selected instruments are presented.
To create the training dataset, 36 distinct notes from each instrument were selected (as can be seen in Table 4.2). These notes have a duration of 1 second and are correctly labelled. The CQT was then computed, creating a spectrogram of 1 second duration containing only the selected note. Through data augmentation tools, using minimal frequency and temporal shifts, each note was multiplied, creating a full data set of 3240 log-spectrograms, each with a duration of 1.2 seconds.
The full data set was then separated into a training data set and a validation data set, ensuring that in both cases the 3 classes were always equally represented. The training data set contained 2/3 of the full data set, and the validation set contained the remainder. Also, to create a test data set, for each instrument 2 notes from outside the selected range were chosen and underwent the same process mentioned above. This generated a test data set of 60 spectrograms.
Learning
The CNN was then submitted to a learning process. Through an extensive fine-tuning process, the batch size and learning rate were set to Bt = 100 and η = 0.002 respectively. The filters were randomly initialized, and the network passed through the training and validation data sets 10 times (10 iterations). Since the network is very deep, and thus capable of modelling very complex data, the number of iterations is kept relatively low to prevent overtraining, and a dropout layer was introduced to prevent overfitting. The learning process took approximately 6 hours to complete. The trained CNN was then used to classify the test data set composed of unseen notes from the three instruments considered. The resulting test error was 23.33%: the classifier correctly classified 46 of the 60 notes contained in the test data set.
4.2.2 Pre-processing
As in Section 3.2, the log-spectrogram of the input signal was obtained using the CQT. The frequency resolution was 60 bins per octave, yielding 545 frequency bins, and the log-spectrogram is again sampled with a 40 ms step. The input signal was then sampled in 1.2-second segments, which were fed to the classifier. For each segment the classifier outputs a classification probability for each of the three classes, acting as a windowed function. The overall output is a 3 × N matrix, where N is the number of segments. Each of the 3 rows of this matrix contains the presence probability over time of the respective instrument.
4.2.3 Post-processing
In a polyphonic input signal, the notes may (and probably will) overlap in a given time frame. This may lead the classifier into error, as it was trained to detect single notes. The output matrix of the classifier's windowed-function-like process is therefore normalized by subtracting the mean classification of each component. This process aims to enhance the occurrence of the highest-probability classifications, ignoring the average classification level, which could be misleading.
The normalized output is then submitted to a classification threshold µ. If the classification output of a given class surpasses this threshold µ, the instrument is considered to be present in the input signal.
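A sketch of this post-processing step (Python/NumPy) is shown below. It assumes one plausible reading of the decision rule — an instrument is considered present if its normalized probability exceeds µ in at least one segment — which may differ in detail from the Matlab module.

```python
import numpy as np

def classify_instruments(probs, mu):
    """Decide which instruments are present from the classifier's windowed output.

    probs : 3 x N matrix of per-segment class probabilities
    mu    : classification threshold
    Returns a binary vector with one entry per instrument (1 = present, 0 = absent).
    """
    normalized = probs - probs.mean(axis=1, keepdims=True)   # subtract each class's mean
    return (normalized.max(axis=1) > mu).astype(int)         # present if any segment exceeds mu
```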
4.2.4 Performance Evaluation
In Figure 4.4 the diagram of the implemented module can be observed. Once again, a test experiment was
conducted to evaluate the module’s performance. In this Section this experiment is described.
Figure 4.4: System diagram of the developed Module 2.
In this performance evaluation, the influence of the classification threshold parameter µ on the overall classification is studied. The module classifies a test data set with different values of µ.
Test Data
The data set for this experiment consists of 30 random sound files. These sound files were created according to the same procedure as in Subsection 3.2.4, using the auxiliary MIDI module. Three levels of polyphony were considered, with 1/3 of the data set corresponding to each level. This time the songs have a duration of 20 seconds, and only three instruments were considered. These instruments correspond to the instruments that the classifier was trained to identify, but no restrictions were placed on their pitch activity range. The note durations were random values in [0.2 s, 4 s].
Metrics
In this experiment the metric considered is strict: the output classification is considered correct only when it identifies all instruments present in the input file, without false positive or false negative classifications.
Results
After classifying all the test data files with different µ values from 0 to 0.05, the accuracy of the classifier was plotted against the µ value — Figure 4.5. In this plot the influence of the µ parameter on the classification task can be visualized: a lower value of µ represents a more permissive classifier, which detects the instruments even if their presence has a small probability value, while a high value of µ represents a more conservative classifier, which only detects instruments with a higher presence probability value. The best result was achieved with µ = 0.02, resulting in 96.67% correct classifications.
Figure 4.5: Module 2 performance evaluation: Graphic of the influence of µ in the classification accuracy.
Despite the input signal's complexity, the classifier achieved a high level of accuracy. One reason for this overall performance is related to the data used for training and for evaluation. All the sound files were generated with a digital instrument library and through MIDI files. In reality, an instrument playing a note will always sound slightly different each time, which does not happen with digital instruments. The digital instruments do not vary, always performing in the same manner. This eliminates performance dynamics and small tuning deviations from the data sets considered, thus decreasing the complexity of the classification task.
The considered instruments' ranges (pitch activity) may also affect the classification process. As can be seen in Table 4.2, the instruments have different ranges and these ranges do not span the same interval. This means that some notes can be played only by two of the considered instruments, or even exclusively by one. Once again, this reduces the complexity of the classification task.
Also, considering only three instruments and providing log-spectrograms of one note at a time in the training phase provided suitable conditions to facilitate the learning stage. In the first learning attempts, log-spectrograms with several notes and longer durations were provided to the classifier. Under these conditions the learning attempts were successively unsuccessful. Thus, to achieve the presented results, only three instruments were considered and the classifier was trained with short-duration log-spectrograms containing only one note at a time.
(a) Representation of the MIDI file, denoting the ground truth. Instrument 1 is displayed in green and instrument 2 in pink. (b) Output of using the classifier as a windowed function.
Figure 4.6: On the left, the MIDI file that originated the input sound file is presented. On the right, the output of using the classifier as a windowed function is presented.
To provide better insight into the classification process, Figure 4.6 displays an example of the intermediate steps of this process. In Figure 4.6a the MIDI file that originated the input sound signal is displayed. The notes performed by instrument 1 (Class 1) are presented in green, and the notes of instrument 2 (Class 2) are presented in pink. In Figure 4.6b the output of using the classifier as a windowed function is presented, after the normalization step. Inspecting this Figure provides insight into the classification performed by this module. Note that even though only 2 instruments are performing in the input music piece, the classifier is prepared to detect the presence of the 3 instruments learnt in the training process. Using only 2 instruments in the input music piece ensures a degree of uncertainty in the classification, forcing the classifier to detect which 2 instruments are playing among the 3 possible ones. As such, in the plot of Figure 4.6b, 3 classification results are presented, with Class 3 (corresponding to instrument 3, which is not performing) achieving a low presence value, as expected.
When only one note is detected (e.g. the note of instrument 1 in the first time steps), the corresponding class has higher normalized probability values. Also, when several notes overlap, it can be seen that the classifier detects not only the correct instrument but also some low-valued "noise" classifications for the remaining instruments (e.g. detecting instrument 3 in this input file). The effect of the normalization is clear in this figure: it enhances the detected classification, even when multiple notes are being played at the same time. Both the normalization and thresholding processes help the classifier ignore these misclassification events, thus considering only the correct instruments.
Chapter 5
Hybrid System
In this Chapter, the integration of the distinct methods resulting in the developed system is addressed. The overall transcription process is detailed, considering both modules developed in this thesis. In Section 5.1 the system is detailed and the transcription process is explained from beginning to end. In Section 5.2 the performance of the system is evaluated, followed by an example of a transcription process.
5.1 System description
The complete system is composed of both Modules addressed earlier. Given an input signal, its CQT is computed
and then analysed by the classifier in the CNN module. This classifier segments the log-spectrogram
produced and identifies which of the three learnt instruments is present in each segment. Acting as a windowed
function, this produces a matrix containing the probability of the presence of each instrument in each of the segments considered.
This matrix is then normalized by removing the mean probability for each instrument. Then, if the obtained value
exceeds the classification threshold µ, the instrument is considered present in the input file. The final output of this Module
is a binary vector of three values, one for each instrument, determining whether an instrument is present (valued 1) or
not (valued 0) in the input file.
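A minimal sketch of this decision rule is given below. It assumes probs is the (instruments × windows) probability matrix produced by the windowed classifier; taking the maximum of the mean-normalized scores as the value compared against µ is an assumption made here for illustration.

import numpy as np

# Sketch of the CNN module's post-processing: remove each instrument's mean
# probability, reduce the result to one score per instrument, and threshold it
# with mu. The choice of the maximum as the per-instrument score is an
# assumption for illustration only.
def presence_vector(probs: np.ndarray, mu: float) -> np.ndarray:
    normalized = probs - probs.mean(axis=1, keepdims=True)  # mean removal per instrument
    scores = normalized.max(axis=1)                         # one score per instrument
    return (scores > mu).astype(int)                        # 1 = present, 0 = absent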
The output of Module 2 and the log-spectrogram of the input file are both received by Module 1. This Module
then uses the classification vector to determine the size of its template library. The library will only contain
templates for the instruments classified as present in the input file. The transcription is then performed as in
Chapter 3, using this dynamically set template library.
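This selection amounts to a simple filtering of the library by the classification vector; the sketch below illustrates the idea, with the dictionary structure and instrument names as hypothetical placeholders rather than the actual library format.

# Sketch of restricting the template library to the instruments flagged by the
# CNN module. The dictionary-based library and the instrument names are
# hypothetical; only the filtering idea reflects the text above.
def select_templates(full_library: dict, instrument_names: list, presence) -> dict:
    return {name: full_library[name]
            for name, present in zip(instrument_names, presence) if present}

# Example: presence vector [1, 0, 1] keeps templates for instruments 1 and 3 only.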
A diagram of the overall system can be observed in Figure 5.1. The transcription performed by this system is
autonomous, and it only depends on the hyperparameters that affect both modules. After tuning these parameters,
no human interaction is needed to perform transcription.
5.2 Performance Evaluation
To evaluate the performance of the overall method another test experiment was conducted, as for the modules
presented in Chapters 3 and 4. The aim of this experiment is to evaluate the transcription performance when the
contribution of Module 2 is considered. Thus, the transcription process considers the same hyperparameters
as in Chapter 3, while the classification threshold is varied to assess its influence on the transcription process.
Figure 5.1: Diagram of the proposed hybrid system.
Test Data
The data set created for this experiment consists of 30 random sound files based on random MIDI files created
by the auxiliary random MIDI file generator module. Again, the same three levels of polyphony were considered:
sound files with 1 instrument, with 2 instruments and with 3 instruments. Each polyphony level represents 1/3 of the
data set. The instruments considered were the 3 instruments that the classifier was trained to identify, presented
in Table 4.2. The sound files had a duration of 20 seconds and each note's duration was a random value in
[0.2 s, 4 s].
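As a rough illustration of how such test files could be produced, the sketch below generates random MIDI files with these durations using the pretty_midi library; the program numbers, pitch range and file names are assumptions, since the thesis relies on its own auxiliary random MIDI file generator module.

import random
import pretty_midi

# Sketch of a random test-file generator: 20-second pieces whose note durations
# are drawn from [0.2 s, 4 s]. The MIDI program numbers, pitch range and file
# naming are illustrative assumptions, not the thesis's auxiliary module.
def random_midi(path: str, num_instruments: int, duration: float = 20.0) -> None:
    pm = pretty_midi.PrettyMIDI()
    for program in random.sample([0, 40, 73], num_instruments):  # e.g. piano, violin, flute
        inst = pretty_midi.Instrument(program=program)
        t = 0.0
        while t < duration:
            note_len = random.uniform(0.2, 4.0)
            pitch = random.randint(48, 84)
            inst.notes.append(pretty_midi.Note(velocity=100, pitch=pitch,
                                               start=t, end=min(t + note_len, duration)))
            t += note_len
        pm.instruments.append(inst)
    pm.write(path)

# One file per polyphony level (1, 2 and 3 instruments).
for level in (1, 2, 3):
    random_midi(f"random_level{level}.mid", num_instruments=level)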
Metrics
To evaluate the system, the calculations made in the performance evaluation of Chapter 3 were again applied.
The transcription's false negative, false positive and correctly transcribed notes were calculated, using the same
parameter δ = 75%. To evaluate the overall performance, the sum of false negatives, ∑FN, and false positives,
∑FP, was divided by the total number of existing notes, N, thus generating an error measure, ε.
ε = (∑FN + ∑FP) / N        (5.1)
This formula for measuring error (Equation 5.1) is a simple arithmetic computation which
takes into account both types of errors, false negatives and false positives, and provides the ratio of the total errors
that occurred to the number of existing notes to be detected in the input music piece.
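A minimal sketch of this computation is given below, assuming the false negative and false positive counts have already been obtained with the δ = 75% duration criterion of Chapter 3; the duration check shown is only one illustrative reading of that criterion.

# Sketch of the error measure of Equation 5.1 and of the delta = 75% duration
# criterion used to decide whether a transcribed note counts as correct. The
# matching of notes (by pitch and onset) is assumed to have been done already.
def note_is_correct(true_duration: float, transcribed_duration: float, delta: float = 0.75) -> bool:
    return transcribed_duration >= delta * true_duration

def transcription_error(false_negatives: int, false_positives: int, total_notes: int) -> float:
    return (false_negatives + false_positives) / total_notes

# Example: 10 false negatives and 0 false positives among 22 existing notes
# give an error of (10 + 0) / 22 ~= 0.455, i.e. 45.5%.
print(transcription_error(10, 0, 22))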
Results
The experiment ran for the 30 files with different values of µ ∈ [0, 0.05] (with µ denoting the classification
threshold parameter introduced in the CNN Module). The overall result can be observed in Figure 5.2. Inspecting
the graph in Figure 5.2b, which plots the mean transcription error against the different µ values, it can be seen that
the best result was obtained for µ = 0.02, corresponding to an error of ε = 0.433. This µ value is similar to the
optimal µ value obtained in the performance evaluation performed in Chapter 4, as expected.
(a) Transcription error for the 3 polyphony levels considered. (b) Mean Transcription error obtained for all polyphony levels.
Figure 5.2: System’s performance evaluation: graph of the transcription error plotted against the parameter µ.
In Figure 5.2a, the mean transcription error is presented for each polyphony level. As mentioned
earlier, a low value of µ provides a permissive classifier. Observing the aforementioned graph, it can be seen that
a permissive classifier highly impacts the transcription process, especially for sound files with only one instrument.
A low value of µ will consider instruments with low probability values, which leads to considering instruments that
have been misclassified and have a residual probability value. Considering these instruments adds falsely detected
notes to the final transcription, as seen in Chapter 3. Due to the chosen metric, these falsely detected notes
are counted as errors (false positives), as well as the remaining errors generated during the transcription of the
instruments that actually are present in the music piece (with high probability values). Thus, the ratio between
all the errors considered and the number of existing notes can exceed 1. For the level 2
sound files, the error is smaller as fewer non-existing instruments are considered. For the level 3 sound files, a low
value of µ provides the lowest transcription error as it considers all three instruments, which are all performing in
the sound file.
Setting a high value of µ provides a conservative classifier that will only consider instruments with high probability
values. Once again, inspecting the graph in Figure 5.2a, the impact of a conservative classifier on the transcription
process can be observed. A high value of µ will also impact the transcription negatively. It will discard instruments
that, although having a probability value lower than µ, are present in the sound file. Discarding existing instruments
increases the number of false negative notes, as all the notes performed by the discarded instrument are not
transcribed, as seen in Chapter 3. This effect can be seen especially in level 3 sound files. The high value of µ
forces the classifier to discard multiple existing instruments, thus disregarding all their notes. These disregarded
notes, plus the transcription errors of the considered instruments, cause an error of over 100%. Again, in level 2
sound files, this impact is not so substantial, as fewer performing instruments are disregarded. For level 1 sound
files, this allows a low error, as it only considers instruments with high probability values. In these sound files,
as no other instrument is performing besides the one considered, its classification process results in a high
probability value. On the other hand, a higher value of µ can even disregard all instruments and consider no
instrument in the transcription process, resulting in the increasing error obtained for the higher values of µ, even
for level 1 sound files.
As seen in Chapter 3, using an implementation of the state-of-the-art MSSIPLCA algorithm shows that transcribing
polyphonic signals, even when the instruments are known a priori, is a very challenging task. In the performance
evaluation of the algorithm, the lowest transcription error obtained for a 1-instrument music piece was
25.76%. For 2-instrument music pieces it was 38.15% and for 3-instrument music
pieces it was 43.80%, averaging a 35.90% transcription error, always considering that the instruments being
played are known. In the proposed hybrid system no prior information regarding which instrument is playing was
used. Instead, the CNN module detects which instruments are playing, choosing the corresponding instruments
from its instrument library. Thus, this improvement, while adding autonomy to the system, also adds uncertainty.
As mentioned above, the best overall result was obtained for µ = 0.02, which can be seen in Figure 5.2b. The
obtained error of ε = 0.433 is close to the mean transcription error obtained in Chapter 3. Although
not as low as the best average error obtained (35.90%), the system can now detect which instruments are
playing among its template library, increasing the average error by 7.4 percentage points. Thus, the hybrid system achieved a
close mean transcription error, but with the added feature of automatically detecting which instruments are
performing in the input file and adapting the template library to them.
As in Chapters 3 and 4, to provide better insight into the overall transcription process of the hybrid system, an
example will be provided. In Figure 5.3, the log-spectrogram obtained by computing the CQT for the
chosen example input file is presented. This input file consists of a performance of two instruments: instrument
1 and instrument 3.
Figure 5.3: Log-spectrogram of the input file considered in the following example.
The notes played by each instrument can be seen in Figure 5.4a, where a representation of the MIDI file used
to create this sound file is displayed. The notes played by instrument 1 are displayed in green, and the notes
played by instrument 3 are displayed in pink. In Figure 5.4b the classification matrix can be observed. It can be
easily inferred that a value of µ = 0.02 corresponds to a correct classification, as only instruments 1 and 3
have probability values above this µ value.
(a) Representation of the MIDI file, denoting the ground truth. (b) Output of using the classifier as a windowed function.
Figure 5.4: On the left, the MIDI file that originated the input sound file is presented. Instrument 1 is displayed in green and instrument 3 in pink. On the right, the output of using the classifier as a windowed function is presented.
(a) On the left the ground truth and on the right the transcription obtained for instrument 1.
(b) On the left the ground truth and on the right the transcription obtained for instrument 2.
(c) On the left the ground truth and on the right the transcription obtained for instrument 3.
Figure 5.5: Transcription results for the three instruments considered with µ = 0.005.
Different transcription results with different µ values will now be presented, as a visual example of the results described
above. In Figure 5.5 the transcription results using a permissive classifier with µ = 0.005 are
displayed. In Figure 5.5a the transcription result for instrument 1 is presented, with the ground truth on the left
and the transcription output on the right. The transcription result for instrument 2 is presented in Figure 5.5b and
for instrument 3 in Figure 5.5c, both with the same layout as the results presented for instrument 1.
With a low value of µ, all three instruments are considered in the transcription process. This can be observed
by inspecting Figure 5.5b, where instrument 2 is wrongly considered, generating multiple false positive notes. These
false positives correspond to a wrong attempt to assign notes played by instrument 3 to instrument 2. As these
notes are wrongly considered, their transcription is not accurate, and instead of one long note, it creates small
segmented notes. Thus, one note wrongly assigned to an instrument can create multiple false positives. This
explains the strong negative impact of a permissive classifier on the transcription process.
In Figure 5.6 the transcription results using µ = 0.02 are displayed. As mentioned above, this value of µ ensures
that only the correct instruments are considered in the transcription process. This ensures that no false positives
are created due to wrongly assigning notes to an instrument, as can be seen in Figure 5.6b. The error in the
transcription corresponds to the false negatives created in the regular transcription process (Figures 5.6a and
5.6c).
Finally, in Figure 5.7 the transcription results using µ = 0.035 are displayed.
(a) On the left the ground truth and on the right the transcription obtained for instrument 1.
(b) On the left the ground truth and on the right the transcription obtained for instrument 2.
(c) On the left the ground truth and on the right the transcription obtained for instrument 3.
Figure 5.6: Transcription results for the three instruments considered with µ = 0.020.
This value of µ represents an excessively conservative classifier. It ignores instrument 3 in the transcription process, as can be seen in
Figure 5.7c. Although instrument 2 is correctly not considered (Figure 5.7b), ignoring instrument 3 adds
multiple false negative notes to the overall transcription.
(a) On the left the ground truth and on the right the transcription obtained for instrument 1.
(b) On the left the ground truth and on the right the transcription obtained for instrument 2.
(c) On the left the ground truth and on the right the transcription obtained for instrument 3.
Figure 5.7: Transcription results for the three instruments considered with µ = 0.035.
In Table 5.1, the numeric results of the examples presented are displayed. It can be observed that the best
transcription result is obtained with µ = 0.02, as expected. This leads to an overall transcription error of
45.5% (by Equation 5.1, (10 + 0)/22 ≈ 0.455), with 12 of the 22 considered notes being correctly transcribed. Again, this result was achieved with
δ = 75%. Thus, notes that are transcribed but do not have a duration of at least 75% of the original note are
considered wrongly transcribed.
Table 5.1: Numeric results of the provided transcription examples
µ value: 0.005 0.020 0.035
Existing Notes 22 22 22
Positive Transcriptions 7 12 8
False Negative Transcriptions 15 10 14
False Positive Transcriptions 22 0 0
Error 168.2% 45.5% 63.6%
Chapter 6
Conclusion
6.1 Achievements
In this master's thesis an Automatic Music Transcription system is proposed. This system consists of a hybrid
implementation of two distinct methods. The first method implemented is a state-of-the-art spectrogram factorization
technique developed by Benetos et al. [6], named Multi Sample Shift Invariant Probabilistic Latent Component
Analysis. This method uses a pre-extracted template library (of instruments and their notes) to perform Multi-
Pitch Detection as well as Note Tracking. The method is successfully implemented in the system's MSSIPLCA
module.
After evaluating the performance of the aforementioned module, it was found that the size of the template library
considered in the transcription process impacts the resulting transcription. Given a sound file, considering
more or fewer instruments than the existing ones (adding or removing templates) results in a worse
transcription. To address this issue and to automatically select the appropriate templates, a classifier was
designed to perform instrument identification.
The designed classifier is a Convolutional Neural Network, a Machine Learning technique. CNNs are a Deep
Learning method that excels in classification tasks and were chosen to address this task due to their shift-
invariance and shared-weights properties. Thus, a CNN was designed with 12 layers and was successfully
trained to identify individual notes of 3 distinct instruments. The proposed system was then assembled, using a
module to perform Multi-Pitch Detection and Note Tracking (MSSIPLCA module), but this time with a template
library defined by the classification output of another module containing the developed CNN (CNN module). The
system's overall result is a transcription error of 43.30%. Using only an implementation of the state-of-the-art
MSSIPLCA algorithm, with prior information regarding which instruments are present in the considered music piece,
a mean transcription error of 35.90% was achieved, showing the difficulty of transcribing polyphonic music signals.
The proposed module removes the need for this prior information regarding which instruments are playing,
while increasing the average transcription error by a small amount (7.40 percentage points).
Thus, the proposed hybrid system successfully performs an automatic transcription of a given input file. It
achieves a transcription error comparable to the transcription error presented by the method of Benetos et al.
[6]. Although it does not particularly improve the transcription error of this method, it additionally performs
Instrument Identification via a CNN. With this new task considered, the hybrid system combines two distinct
methods in order to improve the transcription process. It can now automatically determine the proper size of the
template library, by identifying the performing instruments in the input file. There is no longer the need for the
system's user to define a static template library. The system can now decide on its own which instruments are to be
considered, providing a more automatic transcription process. This shows that a hybrid approach to the AMT
task is able to improve the overall transcription process.
6.2 Future Work
Despite the successful implementation of the two aforementioned Machine Learning methods for Automatic Music
Transcription, there is still a large margin for improvement. Since it was shown that different methods can
be combined to provide better transcriptions, methods other than the ones considered in this thesis can be
combined in order to improve not only the transcription process but also the transcription error. In the author's
opinion, future improvements should focus on the classifier.
The proposed classification module is trained to identify only three instruments, due to the complexity of the data.
With more computational power, more instruments could be considered in this classification. This would allow
the system to be evaluated with increasing levels of polyphony. Another characteristic of the classifier is that it is trained
to identify isolated notes of each instrument. An approach to identifying overlapping notes of the same or distinct
instruments would be an interesting feature to add to the classifier, making its classification process more robust.
This would also remove the necessity of using the classifier as a windowed function. Considering an analog
(real-recording) data set could also be an interesting approach. As mentioned above, the digital data set is composed of digital
instruments, which perform in the exact same way every time. This removes the artist's performance skills from
the scope of the classifier's analysis, as well as small frequency changes due to different tunings. Considering an
analog data set would provide insight into the classifier's capability of dealing with real data.
Appendix A
Musical Notes
Below, the notes, their frequencies and wavelengths are displayed. This corresponds to an equal temperament
with a tuning of A4 = 440 Hz. Also, the corresponding MIDI scale number is indicated alongside the scale
considered in this thesis. The number in the note's name corresponds to the octave to which the note belongs.
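The entries in Table A.1 follow from the equal-temperament relation f(m) = 440 · 2^((m−69)/12), where m is the MIDI number; the sketch below reproduces the table, assuming a speed of sound of 345 m/s (the value consistent with the tabulated wavelengths) and mapping the considered scale to m − 20, so that A0 (MIDI 21) becomes 1.

# Sketch reproducing the values of Table A.1 under equal temperament with
# A4 = 440 Hz. The speed of sound (345 m/s) and the mapping of the
# "considered scale" to the MIDI number minus 20 are inferred from the table.
NOTE_NAMES = ["C", "C#/Db", "D", "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]
SPEED_OF_SOUND_CM_PER_S = 34500.0

def note_row(midi_number: int):
    frequency = 440.0 * 2.0 ** ((midi_number - 69) / 12.0)   # Hz
    wavelength = SPEED_OF_SOUND_CM_PER_S / frequency         # cm
    octave = midi_number // 12 - 1
    name = "/".join(f"{part}{octave}" for part in NOTE_NAMES[midi_number % 12].split("/"))
    return name, round(frequency, 2), round(wavelength, 2), midi_number, midi_number - 20

for m in range(21, 109):  # A0 (MIDI 21) up to C8 (MIDI 108)
    print(note_row(m))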
Table A.1: Notes, frequencies and wavelengths with the correspondent MIDI scale number.
Note name Frequency (Hz) Wavelength (cm) MIDI scale number Considered scale
A0 27.50 1254.55 21 1
A#0/Bb0 29.14 1184.13 22 2
B0 30.87 1117.67 23 3
C1 32.70 1054.94 24 4
C#1/Db1 34.65 995.73 25 5
D1 36.71 939.85 26 6
D#1/Eb1 38.89 887.10 27 7
E1 41.20 837.31 28 8
F1 43.65 790.31 29 9
F#1/Gb1 46.25 745.96 30 10
G1 49.00 704.09 31 11
G#1/Ab1 51.91 664.57 32 12
A1 55.00 627.27 33 13
A#1/Bb1 58.27 592.07 34 14
B1 61.74 558.84 35 15
C2 65.41 527.47 36 16
C#2/Db2 69.30 497.87 37 17
D2 73.42 469.92 38 18
D#2/Eb2 77.78 443.55 39 19
E2 82.41 418.65 40 20
F2 87.31 395.16 41 21
F#2/Gb2 92.50 372.98 42 22
G2 98.00 352.04 43 23
G#2/Ab2 103.83 332.29 44 24
A2 110.00 313.64 45 25
A#2/Bb2 116.54 296.03 46 26
B2 123.47 279.42 47 27
C3 130.81 263.74 48 28
C#3/Db3 138.59 248.93 49 29
D3 146.83 234.96 50 30
D#3/Eb3 155.56 221.77 51 31
E3 164.81 209.33 52 32
F3 174.61 197.58 53 33
F#3/Gb3 185.00 186.49 54 34
G3 196.00 176.02 55 35
G#3/Ab3 207.65 166.14 56 36
A3 220.00 156.82 57 37
A#3/Bb3 233.08 148.02 58 38
B3 246.94 139.71 59 39
C4 261.63 131.87 60 40
C#4/Db4 277.18 124.47 61 41
D4 293.66 117.48 62 42
D#4/Eb4 311.13 110.89 63 43
E4 329.63 104.66 64 44
F4 349.23 98.79 65 45
F#4/Gb4 369.99 93.24 66 46
G4 392.00 88.01 67 47
G#4/Ab4 415.30 83.07 68 48
A4 440.00 78.41 69 49
A#4/Bb4 466.16 74.01 70 50
B4 493.88 69.85 71 51
C5 523.25 65.93 72 52
C#5/Db5 554.37 62.23 73 53
D5 587.33 58.74 74 54
D#5/Eb5 622.25 55.44 75 55
E5 659.25 52.33 76 56
F5 698.46 49.39 77 57
F#5/Gb5 739.99 46.62 78 58
G5 783.99 44.01 79 59
G#5/Ab5 830.61 41.54 80 60
A5 880.00 39.20 81 61
A#5/Bb5 932.33 37.00 82 62
B5 987.77 34.93 83 63
C6 1046.50 32.97 84 64
C#6/Db6 1108.73 31.12 85 65
D6 1174.66 29.37 86 66
D#6/Eb6 1244.51 27.72 87 67
E6 1318.51 26.17 88 68
F6 1396.91 24.70 89 69
F#6/Gb6 1479.98 23.31 90 70
G6 1567.98 22.00 91 71
G#6/Ab6 1661.22 20.77 92 72
A6 1760.00 19.60 93 73
A#6/Bb6 1864.66 18.50 94 74
B6 1975.53 17.46 95 75
C7 2093.00 16.48 96 76
C#7/Db7 2217.46 15.56 97 77
D7 2349.32 14.69 98 78
D#7/Eb7 2489.02 13.86 99 79
E7 2637.02 13.08 100 80
F7 2793.83 12.35 101 81
F#7/Gb7 2959.96 11.66 102 82
G7 3135.96 11.00 103 83
G#7/Ab7 3322.44 10.38 104 84
A7 3520.00 9.80 105 85
A#7/Bb7 3729.31 9.25 106 86
B7 3951.07 8.73 107 87
C8 4186.01 8.24 108 88
Bibliography
[1] M. Piszczalski and B. A. Galler, “Automatic music transcription,” Computer Music Journal, vol. 1, no. 4, pp.
24–31, 1977.
[2] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic music transcription: challenges
and future directions,” Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, 2013.
[3] J. Carbonell, R. Michalski, and T. Mitchell, An Overview of Machine Learning. Springer, 1983.
[4] N. Bertin, R. Badeau, and G. Richard, “Blind signal decompositions for automatic transcription of polyphonic
music: NMF and K-SVD on the benchmark,” ICASSP, IEEE International Conference on Acoustics, Speech
and Signal Processing - Proceedings, vol. 1, 2007.
[5] E. Vincent, N. Bertin, and R. Badeau, “Adaptive harmonic spectral decomposition for multiple pitch estima-
tion,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
[6] E. Benetos, S. Ewert, and T. Weyde, “Automatic transcription of pitched and unpitched sounds from poly-
phonic music,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Pro-
ceedings, no. May, pp. 3107–3111, 2014.
[7] B. Fuentes, R. Badeau, and G. Richard, “Adaptive harmonic time-frequency decomposition of audio using
shift-invariant PLCA,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
- Proceedings, no. 1, pp. 401–404, 2011.
[8] A. Elowsson and A. Friberg, “Polyphonic Transcription with Deep Layered Learning,” MIREX 2014, no. of 52,
pp. 25–26, 2014.
[9] G. E. Poliner and D. P. W. Ellis, “A discriminative model for polyphonic piano transcription,” Eurasip Journal
on Advances in Signal Processing, vol. 2007, pp. 1–16, 2007.
[10] S. A. Raczyński, N. Ono, and S. Sagayama, “Note detection with dynamic bayesian networks as a post-
analysis step for nmf-based multiple pitch estimation techniques,” in 2009 IEEE Workshop on Applications
of Signal Processing to Audio and Acoustics, Oct 2009, pp. 49–52.
[11] A. Dessein, A. Cont, and G. Lemaitre, “Real-time polyphonic music transcription with non-negative matrix
factorization and beta-divergence,” International Conference on Music Information Retrieval, no. 5, pp. 3–5,
2010.
[12] J. Shen, J. Shepherd, and A. H. H. Ngu, “Towards effective content-based music retrieval with multiple
acoustic feature combination,” IEEE Transactions on Multimedia, vol. 8, no. 6, pp. 1179–1189, Dec 2006.
[13] C. N. S. Jr., A. L. Koerich, and C. A. A. Kaestner, “Feature selection in automatic music genre classification,”
in Multimedia, 2008. ISM 2008. Tenth IEEE International Symposium on, Dec 2008, pp. 39–44.
[14] F. Zheng, G. Zhang, and Z. Song, “Comparison of different implementations of MFCC,” Journal of Computer
Science and Technology, vol. 16, no. 6, pp. 582–589, 2001.
[15] S. Essid, G. Richard, and B. David, “Musical instrument recognition by pairwise classification strategies,”
IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1401–1412, 2006.
[16] E. J. Humphrey, J. P. Bello, and Y. LeCun, “Moving Beyond Feature Design: Deep Architectures and Auto-
matic Feature Learning in Music Informatics,” International Society for Music Information Retrieval Confer-
ence (ISMIR), pp. 403–408, 2012.
[17] L. Deng and D. Yu, “Deep learning: Methods and applications,” Foundations and Trends in Signal Process-
ing, vol. 7, no. 3-4, pp. 197–387, 2013.
[18] P. Hamel, S. Wood, and D. Eck, “Automatic Identification of Instrument Classes in Polyphonic and Poly-
Instrument Audio.” International Society for Music Information Retrieval Conference (ISMIR), pp. 399–404,
2009.
[19] P. Hamel and D. Eck, “Learning Features from Music Audio with Deep Belief Networks,” International Society
for Music Information Retrieval Conference (ISMIR), pp. 339–344, 2010.
[20] T. L. H. Li, A. B. Chan, and A. H. W. Chun, “Automatic Musical Pattern Feature Extraction Using Convolutional
Neural Network,” Proceedings of the International MultiConference of Engineers and Computer Scientists,
vol. I, no. November, pp. 546–550, 2010.
[21] R. Burton, “The elements of music: What are they and who cares?” [Online]. Available:
http://asme2015.com.au/the-elements-of-music-what-are-the-and-who-cares/
[22] N. Saint-arnaud and K. Popat, “Analysis and Synthesis of Sound Textures,” Readings in Computational
Auditory Scene Analysis, pp. 125–131, 1995.
[23] B. L. Róisín, “Musical Instrument Identification with Feature Selection Using Evolutionary Methods,” Ph.D.
dissertation, University of Limerick, 2009.
[24] C. J. Plack, R. R. Fay, A. J. Oxenham, and A. N. Popper, Pitch: Neural Coding and Perception. Springer,
2005, vol. 24.
[25] N. Lenssen and D. Needell, “An Introduction to Fourier Analysis with Applications to Music,” Journal of
Humanistic Mathematics, vol. 4, no. 1, pp. 72–91, 2014.
[26] A. N. S. Institute, M. Sonn, and A. S. of America, American National Standard Psychoacoustical Terminol-
ogy. American National Standards Institute, 1973.
[27] J. C. Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of
America, vol. 89, no. January 1991, p. 425, 1991.
[28] S. S. Stevens and J. Volkmann, “The relation of pitch to frequency: A revised scale,” The American Journal of Psychology,
vol. 53, no. 3, pp. 329–353, 1940.
[29] C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox for music processing,” 7th Sound and Music
Computing Conference, no. JANUARY, pp. 3–64, 2010.
[30] J. C. Brown, “An efficient algorithm for the calculation of a constant Q transform,” The Journal of the Acous-
tical Society of America, vol. 92, no. 5, p. 2698, 1992.
[31] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for acoustic modeling,” Ad-
vances in models for acoustic . . . , no. 1, 2006.
[32] T. Hofmann, “Probabilistic latent semantic indexing,” Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in information retrieval, pp. 50–57, 1999.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM
algorithm,” Journal of the Royal Statistical Society Series B Methodological, vol. 39, no. 1, pp. 1–38, 1977.
[34] M. Blume, “Expectation maximization: A gentle introduction,” Technical University of Munich-Institute for
Computer Science Press: Munich, Germany, 2002.
[35] DL4J, “Introduction to deep neural networks.” [Online]. Available: http://deeplearning4j.org/
neuralnet-overview.html#element
[36] M. Nielsen, “Neural networks and deep learning.” [Online]. Available: http://neuralnetworksanddeeplearning.
com/
[37] Y. Bengio, Learning Deep Architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1.
[38] U. de Montréal, “Introduction to gradient-based learning.” [Online]. Available: http://www.iro.umontreal.ca/
~pift6266/H10/notes/gradient.html
[39] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,”
Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[40] E. Benetos and S. Dixon, “A shift-invariant latent variable model for automatic music transcription,” Computer
Music Journal, vol. 36, no. 4, pp. 81–94, 2012.
[41] E. Benetos, S. Cherla, and T. Weyde, “An Efficient Shift-Invariant Model for Polyphonic Music Transcription,”
Proceedings of the 6th International Workshop on Machine Learning and Music, 2013.
[42] K. Schutte, “Midi toolbox.” [Online]. Available: http://kenschutte.com/midi
[43] M. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic latent variable models as nonnegative factorizations.”
Computational intelligence and neuroscience, vol. 2008, p. 947438, 2008.
[44] G. Grindlay, “Nmflib toolbox.” [Online]. Available: http://www.ee.columbia.edu/~grindlay/code.html#NMFlib
[45] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s
visual cortex,” The Journal of Physiology, vol. 160, no. 1, pp. 106–154, 1962.
[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[47] Stanford CS class, “Cs231n convolutional neural networks for visual recognition.” [Online]. Available:
http://cs231n.github.io/neural-networks-1/
[48] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, High Performance
Convolutional Neural Networks for Image Classification,” Ijcai, pp. 1237–1242, 2011.
[49] D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for
object recognition,” in International Conference on Artificial Neural Networks. Springer, 2010, pp. 92–101.
[50] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” Proceedings of
the 27th International Conference on Machine Learning, no. 3, pp. 807–814, 2010.
[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural net-
works,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[52] I. V. Tetko, D. J. Livingstone, and A. I. Luik, “Neural network studies. 1. comparison of overfitting and over-
training,” Journal of Chemical Information and Computer Sciences, vol. 35, no. 5, pp. 826–833, 1995.
[53] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout : A Simple Way
to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research (JMLR), vol. 15, pp.
1929–1958, 2014.
[54] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets.” in AISTATS, vol. 2, no. 3,
2015, p. 6.
[55] J. Schlüter and S. Böck, “Improved Musical Onset Detection with Convolutional Neural Networks,”
Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP 2014), 2014.
[56] A. J. R. Simpson, G. Roma, and M. D. Plumbley, “Deep karaoke: Extracting vocals from musical mixtures
using a convolutional deep neural network,” Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9237, pp. 429–436, 2015.
[57] T. Nakashika, C. Garcia, T. Takiguchi, and I. D. Lyon, “Local-feature-map Integration Using Convolutional
Neural Networks for Music Genre Classification,” Interspeech, pp. 1–4, 2012.
[58] S. Dieleman, P. Brakel, and B. Schrauwen, “Audio-based music classification with a pretrained convolutional
network,” . . . International Society for Music . . . , pp. 669–674, 2011.
[59] T. L. H. Li, A. B. Chan, and A. H. W. Chun, “Automatic Musical Pattern Feature Extraction Using Convolutional
Neural Network,” Proceedings of the International MultiConference of Engineers and Computer Scientists,
vol. I, no. November, pp. 546–550, 2010.
[60] T. Park and T. Lee, “Musical instrument sound classification with deep convolutional neural network using
feature fusion approach,” arXiv:1512.07370 [cs], 2015.
[61] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd
ACM international conference on Multimedia. ACM, 2015, pp. 689–692.
[62] D. Nouri, “Using deep learning to listen for whales.” [Online]. Available: http://danielnouri.org/notes/2014/01/
10/using-deep-learning-to-listen-for-whales/