Hybrid System for Automatic Music Transcription
Vasco Salema Cordeiro Aboim de Barros
Thesis to obtain the Master of Science Degree in
Electrotechnical and Computer Engineering
Supervisor: Professor Rodrigo Martins de Matos Ventura
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Rodrigo Martins de Matos Ventura
Member of the Committee: Professor Pedro Manuel Quintas Aguiar
May 2017
Acknowledgments
I would like to thank my supervisor Prof. Rodrigo Ventura, for supporting me in exploring the somewhat different
yet interesting topic of Automatic Music Transcription.
I express my gratitude to Emmanouil Benetos for the aid and guidance provided in the implementation of his
method.
Finally, I would also like to thank my family and friends for the motivation and support provided during this the-
sis, especially to Duarte Rondão and Bernardo Marchante, my colleagues who have accompanied me on this
journey.
Lisbon, Portugal
15/04/2017
Vasco Barros
Resumo
Automatically transcribing a piece of music is a very challenging task. It requires a perception and interpretation of sound and music that has proven difficult to replicate in a machine. Nevertheless, several methods already exist that solve sub-problems of this task. In this thesis a hybrid system for Automatic Music Transcription is proposed, combining two distinct Machine Learning techniques. A spectrogram factorization method based on the Probabilistic Latent Component Analysis technique is implemented. This method uses a library of pre-extracted instrument and note templates, and this library has a great impact on the transcription process. As such, a Deep Neural Network is developed and trained to identify the instruments contained in a given sound file. By combining the two methods mentioned above, a hybrid system is created that eliminates the need to manually determine the correct size of the template library when transcribing a given sound file. This hybrid system demonstrates that, by combining distinct Machine Learning methods, it is possible to grant greater autonomy to the transcription process. In this case, the proposed system preserves the transcription accuracy of the Probabilistic Latent Component Analysis method while acquiring greater autonomy in the transcription, since the trained neural network automatically identifies the musical instruments present in the piece to be transcribed.
Keywords - Automatic Music Transcription, Machine Learning, Probabilistic Latent Component Analysis, Deep Learning, Convolutional Neural Networks, Hybrid system
Abstract
The task of automatically transcribing a piece of music is a very challenging one. It requires a level of sound and music
perception that has proven hard to replicate in machines. There are multiple methods to address
sub-problems within this task, achieving successful results. In this thesis a hybrid system for Automatic Music
Transcription is proposed, combining two distinct Machine Learning techniques. A state-of-the-art spectrogram
factorization technique based on Probabilistic Latent Component Analysis is implemented. This method uses a
pre-extracted template library of instruments and their notes to perform the transcription. The template library
greatly impacts the transcription process. As such, to automatically determine the correct library size to be
used, a Deep Neural Network was trained as a classifier, to identify instruments performing in a sound file.
By combining both techniques, a hybrid transcription system is created that eliminates the need for
manual instrument identification for each considered sound file. This hybrid system shows that, by combining
distinct Machine Learning methods, it is possible to improve the transcription process, granting it more autonomy.
In this case, the proposed system ensures the same transcription accuracy as the Probabilistic Latent Component
Analysis method, while adding a higher degree of autonomy in the process, obtained through the automatic
instrument identification performed by the trained neural network.
Keywords - Automatic Music Transcription, Machine Learning, Probabilistic Latent Component Analysis, Deep
Learning, Convolutional Neural Networks, Hybrid system
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables xi
List of Figures xiii
List of Acronyms xv
List of Symbols xvii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 State-of-the-Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Multi-Pitch Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Note Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.3 Instrument Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Theoretical Background 6
2.1 Sound Perception and Musical Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Loudness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Spatial location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.5 Pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.6 Timbre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Constant-Q Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 CQT Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Probabilistic Latent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Multi-Sample Shift Invariant Probabilistic Latent Component Analysis 20
3.1 MSSIPLCA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Unknown Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Template Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.5 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Convolutional Neural Network 34
4.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 CNN layers and architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 CNN in Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.1 Network’s Architecture and Learning Process . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.3 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Hybrid System 44
5.1 System description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Conclusion 50
6.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A Musical Notes 53
Bibliography 57
List of Tables
3.1 MSSIPLCA model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Generic instruments of the Template Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Template library considered in Module 1 performance evaluation . . . . . . . . . . . . . . . . . . 29
3.4 Module 1 evaluation test results: Percentage of notes correctly transcribed and resulting error . . . 31
4.1 CNN classifier’s layers and filter sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Instruments considered in the classification task . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1 Numeric results of the provided transcription examples . . . . . . . . . . . . . . . . . . . . . . . . 49
A.1 Notes, frequencies and wavelengths with the correspondent MIDI scale number . . . . . . . . . . 53
List of Figures
2.1 CQT of 2 notes, played by 2 different instruments . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Artificial neural networks, and its nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Example of a neural network with the notation used in Section 2.4.2 applied . . . . . . . . . . . . 17
3.1 Shift-Invariant PLCA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Example of a CQT spectrogram of a piano. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Diagram of the System’s Module 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Module 1 evaluation test results: Graphic of the percentage of notes correctly transcribed . . . . . 30
3.5 Module 1 evaluation test results: Graphic of the percentage of false positive notes transcribed . . . . . . . . . 31
3.6 Example of transcription results with different template library sizes . . . . . . . . . . . . . . . . . 33
4.1 Example of input volume and Neuron arrangement in a convolutional layer . . . . . . . . . . . . . 36
4.2 Example of Max Pooling on an input depth level . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Diagram of the implemented CNN’s architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Diagram of the developed Module 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Module 2 performance evaluation: Graphic of the classification accuracy . . . . . . . . . . . . . . 42
4.6 Intermediate steps of a classification performed by module 2 . . . . . . . . . . . . . . . . . . . . 43
5.1 Diagram of the proposed hybrid system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 System’s performance evaluation graphic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Log-spectrogram of an example input file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Intermediate steps of a classification performed by module 2 . . . . . . . . . . . . . . . . . . . . 48
5.5 System’s performance evaluation: Transcription results with µ = 0.005 . . . . . . . . . . . . . . . 48
5.6 System’s performance evaluation: Transcription results with µ = 0.020 . . . . . . . . . . . . . . . 49
5.7 System’s performance evaluation: Transcription results with µ = 0.035 . . . . . . . . . . . . . . . 49
List of Acronyms and Abbreviations
AMT Automatic Music Transcription
CNN Convolutional Neural Networks
CQT Constant-Q Transform
DBN Deep Belief Network
DFT Discrete Fourier Transform
EM Expectation-Maximization
MFCC Mel-frequency Cepstral Coefficients
MIDI Musical Instrument Digital Interface
MIR Music Information Retrieval
MIREX Music Information Retrieval Evaluation eXchange
ML Maximum Likelihood
MLP Multilayer Perceptrons
MSSIPLCA Multi-Sample Shift Invariant Probabilistic Latent
Component Analysis
NMF Non-Negative Matrix Factorization
PLCA Probabilistic Latent Component Analysis
PLSI Probabilistic Latent Semantic Indexing
ReLU Rectified Linear Units
SGD Stochastic Gradient Descent
SVM Support Vector Machines
List of Symbols
Greek letters
α Sparsity parameter for MSSIPLCA method
η Learning rate parameter for Gradient Descent
γ Momentum parameter for Stochastic Gradient Descent
λ Instrument index utilized in the MSSIPLCA Module
µ Classification threshold parameter utilized in the CNN Module
∇ Gradient operator
ω Log-frequency
σ Transcription threshold parameter
Roman letters
Bt Batch size in a CNN’s learning process
fs Sample Frequency
K Number of filters in a Convolutional layer
P Zero-Padding added in a CNN layer
S Stride in a Convolutional layer
t Time
Chapter 1
Introduction
1.1 Motivation
Musical transcription is the process of converting a piece of music into some form of musical notation, which
will display the musical notes played across time. Some examples of musical notations are scores, piano-roll
representations or rhythmic sequences of chords [1]. Even for those with musical training, listening to a piece of
music and manually trying to transcribe it is a very challenging task. There are several obstacles
one may encounter while performing this task, such as detecting which instrument plays each note or detecting the
tempo/beat of each note, but the main challenge is to detect the note's pitch.
In the late 1970s, audio researchers such as James Moorer, Martin Piszczalski and Bernard Galler dedicated their
research to musical signal analysis. In 1977 Piszczalski and Galler introduced the concept of Automatic Music
Transcription (AMT) [1]. This is the process of automatically converting a musical sound signal into its represen-
tation as musical notation, through digital analysis of the musical signal. This process has been the target of
much research since it was introduced, and nowadays it covers a wide range of subtasks. These subtasks are
representative of the challenges a human transcriber faces, with the added difficulty of removing human intuition and perception
from the equation. As such, AMT can be viewed as one of the main technology-enabling concepts in music signal
processing [2].
In the past decades, another field of study has made great achievements in having computers learning from
representations of our world, trying to mimic human learning processes. This field of study, Machine Learning,
focuses on adding cognitive skills to machines by modelling learning processes [3]. AMT and Machine Learning
are intrinsically connected, in the sense that the high-level goal is to make computers perceive and interpret music
on their own. As such, many methods developed to address AMT tasks are based on Machine Learning algorithms.
Solving the automatic transcription challenge will allow any group of musicians to play or improvise freely without
the fear of losing their creations, by keeping a record of each note played and by whom. Other applications of these
methods include the recording of music genres where no score exists, such as traditional oral music or jazz, and
enabling machine participation in live musical performances. A transcription algorithm could be applied to a large
library of musical pieces, which would be out of reach for a manual approach. This could enable musical search
through an audio input. Lastly, an AMT algorithm could also integrate a musical tutoring platform granting the
platform the ability to interpret and correct the user when necessary.
1.2 Goals
In this thesis the task of automatically transcribing a music signal is addressed. An automatic transcription system
is fully designed and implemented. This system will generate a piano-roll representation of notes of one or more
instruments performing in a given input file. This system aims to provide support to every musician in their
frequent transcription tasks, by automatically registering their music pieces.
The proposed system will combine distinct methods applied to the AMT process, to generate a fully functional
system. This hybrid approach aims to explore the benefit of combining distinct approaches to specific AMT tasks,
in order to improve the transcription process. Thus, in this thesis a hybrid system is proposed to automatically
transcribe recorded music fragments. This hybrid system will combine two distinct Machine Learning methods,
each one addressing a distinct AMT subtask: Multi-Pitch Estimation and Note Tracking, and Instrument Identifi-
cation.
1.3 State-of-the-Art
As was mentioned above, AMT can be divided into multiple subtasks, each one with different proposed methods
and approaches. Multi-pitch estimation is considered the fundamental subtask of AMT, as it focuses on identifying
and distinguishing concurrent pitches. In order to correctly detect a note's pitch, the note event has to be identified
in time; as such, the Note tracking subtask focuses on providing a temporal representation of the notes being played.
To properly detect when a note event starts or finishes, the previous subtasks utilize methods from the Onset
and Offset detection task. As music is usually performed by several instruments, it is often necessary to detect
which instrument is playing which note; this is the focus of the Instrument identification subtask. Lastly, there are
additional and more specific subtasks, such as Key detection, which detects the musical key of the whole music piece,
and Beat detection, which temporally characterizes the analysed music piece [2].
In this section a review of three main subtasks (Multi-Pitch estimation, Note tracking and Instrument identification)
and their proposed methods is presented. An extensive state-of-the-art review can be found in [2], which served
as a guide for the following review.
1.3.1 Multi-Pitch Estimation
When AMT arose, Piszczalski and Galler focused on transcribing monophonic pieces of music (signals from
one instrument source only [1]), but after three decades of research this problem is considered solved [2]. The
current challenge resides in automatically transcribing polyphonic music signals, i.e. music with several instruments.
As such, in polyphonic music we are interested in detecting concurrent pitches, from the same instrument
or from multiple distinct instruments. This challenge is referred to as Multi-Pitch Estimation.
There are three main groups of techniques dedicated to this matter: Feature-based models, Statistical Model-
based Estimation and Spectrogram Factorization. Feature-based models are fundamentally signal processing
methods, where the notes are detected by extracting audio features of the signal’s time-frequency representation.
When using Statistical Model-based Estimation, the problem is formulated as a Maximum A Posteriori Estimation
Problem, where all combinations of fundamental frequencies are considered in order to compute a final estimate.
Spectrogram Factorization appeared in the most recent literature and has been gaining a lot of attention. It
consists of decomposing a spectrogram of the input signal into two components relative to each tone: the
spectral base and the temporal activity [2].
A large subset of the state-of-the-art Multi-Pitch Estimation methods focuses on two techniques: Non-Negative
Matrix Factorization (NMF) and Probabilistic Latent Component Analysis (PLCA). Both of these methods can be
included in the Spectrogram Factorization group. NMF is a matrix factorization technique where the matrices have
no negative values, a characteristic that is exploited in the factorization process. It is a robust and computationally
inexpensive method [4]. NMF algorithms can be implemented as spectral decomposition models applied to
musical signals [5], taking advantage of the non-negativity of a spectrogram. PLCA takes a probabilistic approach
in the spectral factorization task, having achieved state-of-the-art results. In [6] Benetos, Ewert and Weyde
propose a PLCA-based model for jointly transcribing pitched and unpitched sounds (the latter can be viewed
as percussive sounds), showing the effectiveness of this technique on regular western music inputs. In [7] an
algorithm for Shift-Invariant PLCA is presented; the implemented method can tolerate variations of the spectral
envelopes (tuning deviations). The aforementioned model is implemented using monophonic signals, but the
authors prove that it can be extended to polyphonic signals.
Every year, a contest named Music Information Retrieval Evaluation eXchange (MIREX) takes place, where the
contestants submit their methods to solve certain tasks within the scope of Music Information Retrieval (MIR). One
of these tasks is named Multiple-F0 Estimation, which corresponds to Multi-Pitch Detection. In 2014 Elowsson
and Friberg proposed a method which turned out to be the most accurate [8]. This method included Deep Layered
Learning techniques in MIR tasks, showing once again the benefits of using state-of-the-art Machine Learning
methods in music signal analysis.
1.3.2 Note Tracking
In order to correctly analyse a time-frequency representation for further pitch estimation, one must detect where
the note starts and ends (onset and offset time respectively). This processing stage is defined as Note Tracking.
As its definition implies, it is closely related to Multi-Pitch Estimation. There are several methods utilized in
Note Tracking: Hidden Markov Models may be considered in a post-processing stage for temporal smoothing
[9], Dynamic Bayesian Networks can be applied to address this task [10], as can the simpler Minimum
Duration Pruning technique [11].
The large majority of the approaches perform Note Tracking and Multi-Pitch Detection jointly. As such, despite
Note Tracking being considered an additional processing stage in some cases, in this thesis it will be considered
an implicit step in the Multi-Pitch Detection process.
1.3.3 Instrument Identification
Given a polyphonic music piece, where multiple instruments play at the same time, the task of identifying which
instrument is playing constitutes one of the main challenges within the scope of AMT. Traditional MIR methods
focus mainly on two stages: feature extraction and semantic interpretation. Extracting good features is very time
consuming, but ultimately it will lead to a good representation of the input signal. These feature extraction approaches
tend to be task-specific and hard to optimize. As such, MIR researchers tend to adopt more powerful semantic
interpretation strategies, like Multilayer Perceptrons (MLP) and Support Vector Machines (SVM) [12, 13]. In
Instrument Identification (and in other main AMT tasks), multiple feature extraction approaches were implemented
and perfected in order to achieve better data representations. This is the case of the widely utilized Mel-frequency
Cepstral Coefficients (MFCC) [14], which consist of an attempt to define and characterize the timbre of an
instrument. Combining these extracted features with the previously mentioned semantic interpreters achieved
satisfactory results [15].
However, recent studies show that combining traditional shallow methods with Deep Learning techniques, thus
obtaining deeper architectures, allows better high-level representations and, in the end, better results [16]. Deep
Learning is a Machine Learning technique, based on Neural Networks, that provides high-level concept learning
through multiple layers of learning (hence deep). The layers are hierarchically stacked, and the high-level learned
concepts are inferred from the concepts learnt by the lower layers of the hierarchy, granting new levels of understanding and
abstraction [17].
Deep Learning techniques are gaining ground in Instrument Identification, with several methods proving
to be more accurate than traditional shallow approaches. Hamel, Wood and Eck presented in [18] a comparison
between a Deep Belief Network (DBN), a MLP and a SVM, on Instrument class classification. The first is a
Deep Learning technique where a Neural Network is pre-trained in an unsupervised manner in order to represent
the input data more efficiently, and then trained in a supervised manner to tune the network to the desired
classification. The remaining methods are Machine Learning techniques that can be viewed as low-level layers
of a deep neural network. This comparison showed that the DBN performs as well as the other methods, and
outperforms them when the feature set and the instrument classes are limited. Another approach using
DBN in Classification and MIR tasks is presented in [19], this time applied to music genre classification. In this
paper feature extraction is performed using a DBN, yielding better results than the standard MFCC feature-based
approach.
Convolutional Neural Networks (CNN) are a specific type of Neural Network that is widely used in Image Recognition,
due to its great performance in this task. These Neural Nets exploit the properties of the convolution operation
in order to reduce memory usage and improve performance. Li, Chan, and Chun stated that musical patterns
can be captured using CNN due to the similarities between musical data and image data [20]. The authors im-
plemented a CNN for Music Genre Classification. Their implementation required minimal prior knowledge to be
constructed and was complemented with the usage of classic features like MFCC.
1.4 Thesis Outline
In this thesis, a hybrid system is developed to address the AMT task. This system has two modules: in the first,
Benetos, Ewert and Weyde's PLCA-based method [6] is implemented for Multi-Pitch Estimation and Note Tracking;
in the second, a complementary CNN Classifier is designed to perform Instrument
Identification. The thesis is organized as follows:
In Chapter 2 a description of harmonic sound signals and their properties is detailed, as well as a signal trans-
form which is utilized to obtain a suitable time-frequency representation for musical signal analysis. Also the
basic techniques and models utilized by the two aforementioned modules are summarized (PLCA and Neural
Networks).
In Chapter 3 the first module of the hybrid system is addressed. The implementation of the PLCA-based method
is detailed: the mathematical model is explored, as well as the implementation process. This chapter also explains
the need for adding a classification module to the designed system.
In Chapter 4 the classification module of the system is detailed. Convolutional Neural Networks and their specific
characteristics are addressed. The CNN Classifier will also be presented, including its design, implementation,
training phase and limitations.
In Chapter 5 the integration of the distinct methods resulting in the developed system is detailed. In this chapter
the interaction between the two modules is explored, and the overall performance of the system is evaluated. Its
achievements and limitations are presented, as well as an end-to-end transcription example.
In Chapter 6 the thesis’s conclusion is presented, and possible future work directions are provided.
Chapter 2
Theoretical Background
In this chapter a theoretical introduction is made regarding the fundamental methods and models exploited by
this thesis. In Section 2.1, an introductory explanation of the particularities of musical sound signals is presented.
In Section 2.2 the Constant-Q Transform, a time-frequency representation suitable for musical signal analysis, is
addressed. In Section 2.3 the original PLCA method is summarized. Lastly, in Section 2.4 an introduction to
Neural Networks and their learning algorithms can be found.
2.1 Sound Perception and Musical Characteristics
In physics, sound can be defined as mechanical pressure waves that propagate through a compressible medium
(e.g. air or water). In order to be heard, this wave must reach the ear. After reaching the ear, it can be ignored
or it can be processed and perceived by the brain. Thus, hearing can be defined as the perception of a sound by
the brain. Concerning harmonic sounds, there are several characteristics that allow the brain to perceive them:
loudness, duration, texture, pitch, spatial location and timbre [21]. In this section these characteristics and their
impact on musical sound signal analysis will be presented. This review will mainly focus on pitch and timbre, as
an extensive review of the remaining elements and their influence on signal analysis is out of the scope of this
thesis.
2.1.1 Loudness
The loudness of a sound is related to the physical strength of the sound (the amplitude of the signal). It refers to
how loud or how quiet the sound appears to the receptor. It is a subjective measure and, as such, it is not solely
related to the amplitude of the sound signal. However, for the sake of simplicity it will be interpreted as such for
the remainder of this thesis.
2.1.2 Duration
A sound’s duration is related to how long a sound takes from the moment it is noticed until the moment it dis-
sipates. Duration is also a subjective measure, since noise and attention can affect greatly the perception of a
sound’s duration. In music, the duration of a sound, can affect the beat and the rhythm of a piece of music. In this
6
thesis, duration is interpreted as the time interval from the sound’s start to it’s dissipation (onset and offset times
respectively).
2.1.3 Spatial location
Spatial location is the perception of the spatial placement of the sound source in the acoustic environment
(physical distance). In this thesis spatial location will not be considered; the sound source will be constant, as all
sound files used in the experiments are monaural.
2.1.4 Texture
Sound texture is a very wide concept and, as such, it has several definitions. In [22] it is defined that "a sound
texture should exhibit similar characteristics over time. It can have local structure and randomness but the char-
acteristics of the fine structure must remain constant on the large scale". The number of instruments, their
characteristics and the acoustic environment are all factors that define a sound texture. For example, the sound heard
in a cafeteria has a different texture than the sound of two individuals speaking in a living room, and the texture of
an orchestra is different from the texture of a rock concert. The notion of sound texture will not be considered in the
transcription process, as the signals utilized consist solely of digital instruments performing, as will be seen in
further sections.
2.1.5 Pitch
Fourier’s Theorem states that a steady-state wave is composed by a series of sinusoidal components — har-
monics. Thus, sound as a wave can be described by the amplitude, phase and frequency of its harmonics. The
fundamental frequency is the lowest frequency of the harmonics. The remainder of the harmonics vibrate at (in-
teger) multiples of this fundamental frequency. In real conditions, when a musician plays an instrument, distortion
is added to the resulting signal through performance nuances like tuning deviations or vibratos. These make the
estimation of the fundamental frequency even more difficult in such conditions [23].
Pitch is a subjective measure, strongly related with the perception of the fundamental frequency of the sound. It
implies a scale ordered from low to high in which the sounds can be placed hierarchically. Identifying the pitch of a
sound is a major step towards distinguishing different sound sources. Pitch is also important information
when trying to group the individual harmonics of the same vibrating source [24].
In western music an octave is used as a pitch interval and it is split into 12 individual notes. The tuning system
convention is the equal temperament, in which the frequency of each note, $P_i$, is obtained by multiplying the
frequency of the previous note by the twelfth root of 2, leading to the expression $P_i = 2^{1/12} P_{i-1}$ [25]. Also, a note in
one octave has exactly double the frequency of the same note in the previous octave. As was mentioned before,
pitch has a strong relation with the fundamental frequency of the sound, as the latter often corresponds to the
pitch of the note. However, the fundamental frequency does not have to be the strongest harmonic of the sound
[23]. A list of the notes considered in this thesis, with the respective frequencies and wavelengths, can be found in
Appendix A.
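As an illustration of the equal-temperament relation $P_i = 2^{1/12} P_{i-1}$, the short Python sketch below (not part of the thesis implementation) generates the note frequencies of one octave starting from the common reference pitch A4 = 440 Hz.

```python
import numpy as np

# Equal temperament: each semitone multiplies the frequency by 2**(1/12),
# so P_i = 2**(1/12) * P_(i-1) and a full octave doubles the frequency.
A4 = 440.0  # standard reference pitch in Hz
names = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A"]

freqs = A4 * 2.0 ** (np.arange(13) / 12.0)
for name, f in zip(names, freqs):
    print(f"{name:>2}: {f:8.2f} Hz")

# The last note is exactly one octave above A4, i.e. twice its frequency.
assert np.isclose(freqs[-1], 2 * A4)
```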
2.1.6 Timbre
The aforementioned sound characteristics are all closely related to some physical property of the sound, making
them measurable to a certain extent. Timbre, on the other hand, is not. It is hard to define, and the existing
definitions are purely subjective, thus making the measurement of timbre an arduous process.
The American National Standards Institute defines timbre as an "attribute of auditory sensation in terms of which
a listener can judge two sounds similarly presented and having the same loudness and pitch as being dissimilar"
[26]. With this definition, one can understand why timbre is usually referred to as the colour of a sound. Devel-
oping methods to evaluate timbre became a crucial task in Instrument Classification. In order to describe timbre
and to identify a musical instrument, several timbre features were developed [23].
Temporal Features
Temporal features mainly focus on measuring the energy of a sound across time, which generates a represen-
tation of the temporal shape of a sound. Calculating the root mean square of a temporal envelope generates
a feature that allows measuring the energy present on each note. It provides information regarding the attack
and release time of a note, which is different for each instrument. Temporal Residual Envelope is obtained by
the difference between the original temporal envelope and the root mean squared temporal envelope. It displays
smaller amplitude variations, which provides information on instrument noise and on the player’s technique (e.g.
vibrato).
Spectral Features
Spectral features describe the fluctuations of the sound in terms of frequency. These features base their descrip-
tions on time-frequency representations — sound spectra, that can be obtained from Fourier transformations
such as the Discrete Fourier Transform (DFT). Harmonic sounds have an important property: the log-frequency
distance between the harmonics is constant and independent of the fundamental frequency [27]. This property will be
further explored in Section 2.2.
An instrument with a rich timbre contains more harmonics than a pure tone. Considering this fact, measuring
the Number of Spectral Peaks is a feature that can be used to differentiate two distinct instruments, with distinct
timbres. Centroid Envelope is a measure of the physical distribution of power in the frequency frames. It provides
information on where in the frequency spectrum a sound has most of its power, which is characteristic of each
instrument.
The previously mentioned MFCC features have been utilized in speech recognition and more recently in music
analysis. They are a set of coefficients that are calculated through a series of steps [14], and that classify the
sound in terms of a Mel scale [28]. This scale is based on the human perception of distance between pitches and, as
such, is rooted in human hearing.
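As a rough illustration of this kind of feature extraction, the sketch below computes MFCCs with the librosa Python library; the file name is hypothetical and this is not the feature pipeline used in this thesis.

```python
import librosa

# Hypothetical monaural input file; any short recording would do.
y, sr = librosa.load("example.wav", sr=44100, mono=True)

# 13 Mel-frequency cepstral coefficients per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```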
2.2 Constant-Q Transform
As mentioned in Section 2.1.6, sounds composed of harmonic frequency components have a distinct property:
the distances between these components are constant and independent of the fundamental frequency when plotted
against log-frequency [27]. Their overall position depends on the fundamental frequency, but their positions
relative to each other are the same. The first distance (between the first two harmonic components) is $\log 2$, while
the next distance is $\log(3/2)$, and this pattern is maintained for all the harmonics. The pattern formed by the harmonics
and their amplitudes will differ, reflecting different timbres, thus this pattern is useful for describing timbre and
consequently identifying an instrument [23].
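A small numeric check of this property (an illustrative sketch, independent of the thesis implementation): the log-frequency spacing between successive harmonics is the same for any fundamental frequency, so changing the fundamental only shifts the whole pattern.

```python
import numpy as np

def log_harmonics(f0, n=6):
    """Log-frequencies of the first n harmonics of a fundamental f0."""
    return np.log(f0 * np.arange(1, n + 1))

# Distances between successive harmonics: log 2, log(3/2), log(4/3), ...
d_220 = np.diff(log_harmonics(220.0))
d_440 = np.diff(log_harmonics(440.0))

print(np.allclose(d_220, d_440))                      # True: spacing independent of f0
print(log_harmonics(440.0) - log_harmonics(220.0))    # constant shift of log 2
```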
In Signal Processing, the Constant-Q Transform (CQT) is a transform that generates a time-frequency repre-
sentation from a time-domain signal. It falls under the same category as the well-known DFT. The difference
between these two transforms is that, while the DFT gives us a linear frequency representation, the CQT maps the
signal onto a log-frequency scale [27]. With this scale the transform behaves similarly to the human ear:
it has higher frequency resolution at low frequencies and higher temporal resolution at high frequencies. This
particularity makes the CQT well suited to deal with musical sound signals. The aforementioned harmonic sound
property will be evidently displayed in a log-frequency scale, allowing a description of the existing sound timbres
(different harmonic patterns).
In this thesis, the CQT is the chosen method to obtain the spectrograms from the input signals. As will be
seen in further chapters, all the techniques developed or implemented in the context of this thesis require a log-
spectrogram to perform correctly.
2.2.1 Mathematical Model
As presented in [29], the CQT $X^{CQ}(k, n)$ of a discrete time-domain signal $x(n)$ is calculated by
X^{CQ}(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j)\, a_k^*(j - n + N_k/2) \qquad (2.1)
where $k = 1, 2, \ldots, K$ indexes the frequency bins of the transform, $\lfloor \cdot \rfloor$ is the floor operator, which returns the
highest integer lower than or equal to its argument, and $a_k^*(n)$ is the complex conjugate of $a_k(n)$. The latter are
referred to as time-frequency atoms and are complex-valued waveforms. They are defined by
a_k(n) = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) e^{-i 2\pi n \frac{f_k}{f_s}} \qquad (2.2)
where $f_k$ is the center frequency of bin $k$, $f_s$ is the sampling rate and $w(t)$ is a continuous window function
sampled at points determined by $t$ (zero-valued outside the range $t \in [0, 1]$). The Q-factor can be defined as the
ratio of the center frequency to the bandwidth, and for it to be constant in every bin the window lengths $N_k \in \mathbb{R}$
are inversely proportional to $f_k$.
The CQT presented in this article [27] has its center frequencies fk placed according to the following rule:
f_k = f_1\, 2^{\frac{k-1}{B}} \qquad (2.3)
where f1 denotes the center frequency of the lowest frequency bin and B is the number of bins per octave. This
parameter B will determine the time-frequency resolution.
The Q-factor is constant for each bin, and can be calculated as follows:
Q = \frac{f_k}{\Delta f_k} = \frac{N_k f_k}{\Delta\omega\, f_s} \qquad (2.4)
where $\Delta f_k$ denotes the -3dB bandwidth of the frequency response of the atom $a_k(n)$, and $\Delta\omega$ is the -3dB
bandwidth of the mainlobe of the spectrum of the window function $w(t)$.
In order to reduce frequency smearing it is desirable to make the bandwidth $\Delta f_k$ as low as possible. This is
achieved by having a large Q-factor. However, the Q-factor cannot be made arbitrarily large, as that would exclude
portions of the spectrum between bins from the analysis. A value of $Q$ that allows signal reconstruction while introducing
minimal frequency smearing is given by
Q = \frac{q}{\Delta\omega\,(2^{1/B} - 1)} \qquad (2.5)
where $q \in [0, 1]$ is a scaling factor, typically set as $q \approx 1$. Setting $q$ to values smaller than 1 will improve the
time resolution while decreasing the frequency resolution. With Equations (2.4) and (2.5) we now have:
N_k = \frac{q f_s}{f_k\,(2^{1/B} - 1)} \qquad (2.6)
where the dependency on $\Delta\omega$ disappears.
To reduce the computational effort of the CQT while allowing signal reconstruction from the CQT coefficients, the
atoms can be placed $H_k$ samples apart, where $H_k$ is referred to as the hop size. Typical values for the hop size are
$0 < H_k \lesssim \frac{1}{2} N_k$.
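To make Equations (2.3) to (2.6) concrete, the sketch below (illustrative only; the thesis relies on the MATLAB toolbox of [29]) computes the center frequencies and window lengths of a CQT covering A0 to C8, with the number of bins per octave chosen here as an assumption.

```python
import numpy as np

fs = 44100.0   # sampling rate (Hz), as used throughout the thesis
f1 = 27.5      # lowest center frequency: note A0
fmax = 4186.0  # highest note considered: C8
B = 12         # bins per octave (an assumed value for this example)
q = 0.8        # scaling factor, as in Equation (2.5)

# Equation (2.3): geometrically spaced center frequencies f_k = f1 * 2**((k - 1) / B).
K = int(np.ceil(B * np.log2(fmax / f1))) + 1
k = np.arange(1, K + 1)
f_k = f1 * 2.0 ** ((k - 1) / B)

# Equation (2.6): window lengths N_k, inversely proportional to f_k so that
# the Q-factor stays constant across bins.
N_k = q * fs / (f_k * (2.0 ** (1.0 / B) - 1.0))

# Q-factor as in Equation (2.5), omitting the window-dependent term delta-omega.
Q = q / (2.0 ** (1.0 / B) - 1.0)
print(f"{K} bins, Q = {Q:.1f}, N_1 = {N_k[0]:.0f} samples, N_K = {N_k[-1]:.0f} samples")
```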
2.2.2 CQT Application
Due to the varying number of samples considered in each frequency bin, the CQT is hard to calculate efficiently.
In [29], Schörkhuber and Klapuri propose an efficient method to calculate the CQT. This method is based on the
algorithm proposed by Brown and Puckette [30]. It is a less computationally expensive algorithm and it allows the
calculation of the inverse CQT (which was not possible with the Brown and Puckette solution).
Schörkhuber and Klapuri also developed a MatLab Toolbox with their algorithm implemented. This toolbox will
be used to compute the CQT; as such, the computation of the CQT is out of the scope of this thesis and can
be further explored in [29]. The CQT will always be computed with the same fixed parameters. The minimum
and maximum frequencies considered correspond to the notes A0 and C8 respectively, as these are
the lowest and highest notes considered. 60 frequency bins are considered, and a sample frequency of
$f_s = 44100$ Hz is set. The hop size $H_k$ is set at 0.3, while the scaling factor $q$ is set at 0.8.
As will be explained in Subsection 3.2.3, the resulting transform can be further manipulated to represent the pitch
across time over a MIDI scale. Musical Instrument Digital Interface (MIDI) is a technical standard that allows
manipulation and control over digital instruments. A MIDI file contains information about which notes are played,
when they are played, and with which pitch. It can be applied to any digital instrument, and
through the use of digital audio manipulation programs, an audio file can be created with the digital instrument
performing as indicated in the MIDI file. A MIDI scale can be seen as a zero-one representation of the notes
that are being played across time, giving a close approximation of a piano-roll representation. In Appendix A the
MIDI scale is presented for the notes considered, along with the scale used to identify the notes throughout this
thesis.
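The zero-one piano-roll representation mentioned above can be sketched as follows; the note events and time step are hypothetical and only illustrate the data structure.

```python
import numpy as np

STEP = 0.04  # time resolution in seconds (40 ms, matching the CQT frames used below)

# Hypothetical note events: (MIDI note number, onset in s, offset in s).
notes = [(69, 0.0, 0.5),   # A4
         (72, 0.5, 1.0),   # C5
         (76, 0.5, 1.0)]   # E5

def piano_roll(notes, n_pitches=128, duration=1.0, step=STEP):
    """Binary matrix: rows are MIDI pitches, columns are time frames."""
    roll = np.zeros((n_pitches, int(np.ceil(duration / step))), dtype=np.uint8)
    for pitch, onset, offset in notes:
        roll[pitch, int(onset / step):int(offset / step)] = 1
    return roll

roll = piano_roll(notes)
print(roll.shape)  # (128, 25): one row per MIDI pitch, one column per frame
```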
(a) Oboe’s log-spectrograms. (b) Violin’s log-spectrograms.
Figure 2.1: CQT of a Violin and an Oboe playing an A3 and a C4 note.
To demonstrate the aforementioned property of harmonic signals, the CQT of a Violin and an Oboe playing the
A3 and C4 notes for 1 second was calculated (with 40ms steps). The result can be observed in Figure 2.1,
where in Figure 2.1a the resulting log-spectrograms for the oboe are displayed and in Figure 2.1b the resulting
log-spectrograms for the violin are displayed. By inspecting these figures, it can be seen that each instrument
produces its own pattern of harmonics, with different intensities and amplitudes. This forms a pattern that can be
interpreted as timbre. On another note, it can also be seen that the harmonics do not correspond exactly to the
theoretical harmonic notes due to tuning deviations, thus allowing the usage of different temperaments. Different
temperaments consider different intervals between notes, creating different tuning systems.
2.3 Probabilistic Latent Component Analysis
PLCA is a statistical model utilized for acoustic spectra decomposition; it falls into the category of Spectrogram
Factorization techniques. It was first introduced by Smaragdis, Raj and Shashanka [31] as an extension of another
technique utilized in text and language analysis for automatic document indexing: Probabilistic Latent Semantic
Indexing (PLSI) [32]. This method defines the fundamentals of the implemented technique, Multi-Sample Shift
Invariant Probabilistic Latent Component Analysis (MSSIPLCA), which will be further explored in Chapter 3.
As described in [31], the base model for PLCA is defined as
P(\mathbf{x}) = \sum_z P(z) \prod_{j=1}^{N} P(x_j \mid z) \qquad (2.7)
where $P(\mathbf{x})$ is an N-dimensional distribution of the random variable $\mathbf{x} = x_1, x_2, \ldots, x_N$, the $P(x_j \mid z)$ are
one-dimensional distributions and $z$ is a latent variable. Latent variables (or hidden variables) are variables that cannot
be directly observed and that are inferred from observable variables. Thus, this model aims to approximate an
N-dimensional distribution with a product of marginal distributions.
To estimate the marginal distributions, this technique uses the Expectation-Maximization (EM) algorithm [33].
This algorithm introduces hidden variables (unobserved variables) in a Maximum Likelihood (ML) Estimation,
defining unobserved data. ML Estimation computes parameters that maximize the probability of occurrence of a
given measurement of a random variable distributed by a probability density function [34]. The Likelihood function
can be defined as
L(\Theta) = p(\mathbf{y} \mid \Theta) \qquad (2.8)
where $\mathbf{y} = (y_1, \ldots, y_N)^T$ is a measurement vector of a random variable $Y$ and $\Theta$ is a parameter that defines the
probability density function. It is common to maximize the log-Likelihood function, which can be easier to compute
and provides the same result, since the logarithm is a strictly increasing function.
The EM algorithm considers the log-Likelihood of the complete data $\mathbf{x}$, which consists of the incomplete observed
data $\mathbf{y}$ and the unobserved data $\mathbf{z}$:
\mathbf{x} = (\mathbf{y}^T, \mathbf{z}^T)^T \qquad (2.9)
Resulting in:
L(\Theta) = p(\mathbf{x} \mid \Theta) \qquad (2.10)
This is an iterative algorithm that is divided into two distinct steps: an Expectation and a Maximization step,
which are alternated. In the Expectation step, the contribution of the latent variable $z$ is estimated, producing an estimate of the
log-Likelihood function [34]. This estimation is computed as:
E[L(\Theta) \mid \mathbf{y}, \Theta^{(i)}] \qquad (2.11)
where $E[x]$ denotes the expected value of $x$. The expected value of a random variable $X$ with a probability density
function $f(x)$ can be calculated as:
E[X] = \int_{-\infty}^{\infty} x f(x)\, dx \qquad (2.12)
In the Maximization step, the previously obtained estimation is maximized through the following equation:
\Theta^{(i+1)} = \arg\max_{\Theta} E[L(\Theta) \mid \mathbf{y}, \Theta^{(i)}] \qquad (2.13)
To perform this iteration, an initial estimate $\Theta^{(0)}$ must be provided. These two steps are alternated iteratively
until a stopping criterion is reached. The algorithm can converge to an optimal solution or get
stuck in local minima; the number of iterations, as well as the stopping criterion, must be fine-tuned to provide better
results [34].
As in [31], applying EM to the PLCA method yields the following equations:
R(\mathbf{x}, z) = \frac{P(z) \prod_{j=1}^{N} P(x_j \mid z)}{\sum_{z'} P(z') \prod_{j=1}^{N} P(x_j \mid z')} \qquad (2.14)
P(z) = \int P(\mathbf{x}) R(\mathbf{x}, z)\, d\mathbf{x} \qquad (2.15)
P(x_j \mid z) = \frac{\int \ldots \int P(\mathbf{x}) R(\mathbf{x}, z)\, dx_k,\ \forall k \neq j}{P(z)} \qquad (2.16)
Equation (2.14) corresponds to the expectation step, estimating the contribution of each latent component, while
Equations (2.15) and (2.16) correspond to the maximization step, re-estimating the latent prior and the marginal distributions.
As mentioned above, by alternating these steps repeatedly the estimates will converge to an approximate solution.
In the end it will generate a good approximation for $P(x_j \mid z)$, which represents a latent marginal distribution
across the dimension of the variable $x_j$, and for $P(z)$, which contains the latent variable prior.
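To make the EM updates concrete, the following sketch implements the two-dimensional special case of this model, in which a normalized magnitude spectrogram $P(f, t)$ is approximated by $\sum_z P(z) P(f \mid z) P(t \mid z)$; this is a minimal illustration only and not the MSSIPLCA implementation of Chapter 3.

```python
import numpy as np

def plca_2d(V, n_z=2, n_iter=200, seed=0):
    """EM for the 2-D PLCA model P(f,t) ~= sum_z P(z) P(f|z) P(t|z)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    P = V / V.sum()                                    # treat the spectrogram as a distribution
    Pz = np.full(n_z, 1.0 / n_z)                       # latent prior P(z)
    Pf = rng.random((F, n_z)); Pf /= Pf.sum(axis=0)    # marginals P(f|z)
    Pt = rng.random((T, n_z)); Pt /= Pt.sum(axis=0)    # marginals P(t|z)

    for _ in range(n_iter):
        # E-step: posterior contribution of each latent component (cf. Equation (2.14)).
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt[None, :, :]   # shape (F, T, z)
        R = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)

        # M-step: re-estimate the prior and the marginals (cf. Equations (2.15) and (2.16)).
        W = P[:, :, None] * R
        Pz = W.sum(axis=(0, 1))
        Pf = W.sum(axis=1) / (Pz + 1e-12)
        Pt = W.sum(axis=0) / (Pz + 1e-12)
    return Pz, Pf, Pt

# Toy "spectrogram" built from two components, which the model should recover.
V = np.outer([1, 0, 2, 0], [1, 1, 0, 0]) + np.outer([0, 3, 0, 1], [0, 0, 2, 1.0])
Pz, Pf, Pt = plca_2d(V)
print(Pz)  # mixing weights of the two recovered components
```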
The base model for PLCA is presented above and, as noted in [31], it can be extended to allow invariance to
transformations. This method and its properties will be further explored in Chapter 3.
2.4 Artificial Neural Networks
Artificial neural networks are a branch of Machine Learning methods and algorithms that are broadly utilized in
pattern recognition. These algorithms were inspired by the neural structure of the brain. They are based on
networks that contain a series of computational nodes or neurons (inspired by human neurons). To improve the
MSSIPLCA method implemented in Chapter 3, a CNN was developed. As such, in the remainder of this section
the fundamentals of Artificial Neural Networks will be explored; the developed CNN will be detailed in Chapter 4.
A node receives input data and combines it with its own set of coefficients, which can emphasize or lessen
the relevance of this data. This weighted input is then summed and passed through an activation function, yielding
the output of the node, as can be seen in Figure 2.2a. This output determines whether the input data should
contribute to the output of the net. The set of coefficients, or weights, of a node can be dynamically changed in
order to emphasize specific data, in a learning process. We can then combine several nodes into a layer, as in
Figure 2.2b. A neural network consists of one or more node layers. Given a specific data set, a neural network
can be trained to correctly classify input data [35].
(a) Neural node example (b) Artificial neural network example.
Figure 2.2: An example of a generic neural node is presented at the left [35]. At the right, a generic artificial neural network architecture is displayed.
There are multiple types of neurons. A perceptron is a simple neuron whose output is zero or one, depending on whether the
summed weighted input is below or above a given threshold. Although simple, the zero-or-one approach of
this neuron makes its learning process harder, because small changes in the input can cause drastic changes in
the output. To solve this issue sigmoid neurons are utilized. These neurons have sigmoid functions as activation
functions, thus removing the drastic response to small input changes [36].
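A minimal sketch of the two neuron types just described (illustrative only): both compute a weighted sum of the input, but the perceptron thresholds it to zero or one, while the sigmoid neuron responds smoothly.

```python
import numpy as np

def perceptron(x, w, b):
    # Hard threshold: the output jumps between 0 and 1.
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

def sigmoid_neuron(x, w, b):
    # Smooth activation: small input changes give small output changes.
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.7, 0.2])
w = np.array([0.5, -1.0])
print(perceptron(x, w, b=0.1), sigmoid_neuron(x, w, b=0.1))
```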
A neural network can have several architectures. Feed-forward networks are networks with more than one layer,
in which each layer receives as input the previous layer's output. This allows the network to infer higher
degrees of complexity, since each layer builds on the knowledge gained by the previous layers. An MLP is an
example of such a network. Deep neural networks are a type of artificial neural network in which multiple hidden
layers are stacked to create the network, hence deep. Thus, a deep neural network is capable of modelling
complex data with non-linear relationships [37].
Despite the vast combination of network types and architectures, all of them must undergo a learning process,
in order to learn from the training data. There are two types of learning: supervised and unsupervised. In
supervised learning the data set utilized is correctly labelled and identified. In unsupervised learning the data set
has no labels, so the network cannot compare its classification or prediction with the real one [37]. In this thesis
only supervised learning will be considered, and the learning algorithm used will be the Backpropagation
algorithm combined with the stochastic gradient descent algorithm with momentum.
2.4.1 Stochastic Gradient Descent
In machine learning, specifically in artificial neural networks, a cost function is a function that returns an indicator
(scalar) of the network’s performance. It compares the output of the network with the correct desired value. An
example of a cost function is the quadratic cost function or mean squared error:
C(\theta) = \frac{1}{2n} \sum_x \| y(x) - a \|^2 \qquad (2.17)
where $\theta$ is a parameter vector that includes the weights and biases of the network, $n$ is the number of training inputs,
$a$ is the output of the network when $x$ is the input, and $y(x)$ is the desired output. Minimizing the cost function is the
goal of the learning process. Calculating the gradient of a cost function is an important step towards minimizing
it. The gradient is a vector containing the partial derivatives of the considered function. Thus, the gradient of a
scalar cost function is defined as:
\nabla C(\theta) = \left( \frac{\partial C(\theta)}{\partial \theta_1}, \frac{\partial C(\theta)}{\partial \theta_2}, \ldots, \frac{\partial C(\theta)}{\partial \theta_i} \right)^T \qquad (2.18)
One way to interpret the gradient is through the variation $\Delta C$ caused by a very small variation $\Delta\theta$ in the parameters,
since $\Delta C \approx \nabla C(\theta) \cdot \Delta\theta$. The gradient therefore relates the variation in the parameters to
the variation in the cost function [38]. Then, to minimize the function $C(\theta)$ we want to make $\Delta C < 0$, decreasing the cost function value. This can be achieved by choosing
\Delta\theta = -\eta \nabla C(\theta) \qquad (2.19)
where $\eta$ is a small positive parameter called the learning rate. With Equation (2.19), a variation $\Delta\theta$ is
chosen which enforces $\Delta C < 0$. The parameter $\eta$ has to be small enough to keep the linear approximation
of $\Delta C$ valid, but not too small, otherwise it will generate very small variations $\Delta\theta$, making the minimization
process very slow. Thus, by iteratively applying Equation (2.19), we achieve a successively smaller value of $C(\theta)$
[36]. This iterative process is the Gradient Descent method, and it can be summed up in the following update
equation:
\theta_{i+1} = \theta_i - \eta \nabla C(\theta_i) \qquad (2.20)
As can be seen in Equation (2.17), the cost function is the average of the cost computed for every input.
As such, given a large data set (frequent in neural networks) the cost function gradient will have to be computed
for every input, which can be a slow process. An extension to the Gradient Descent algorithm was introduced to
address this issue. This extension is called the Stochastic Gradient Descent (SGD) algorithm. The idea behind
this method is to calculate the average gradient of a small set of input data, and use this average to estimate the
overall average of the input gradients [36]. The following update equation represents the SGD algorithm:
\theta_{i+1} = \theta_i - \eta\, \frac{\sum_{m=1}^{M} \nabla C_m}{M} \qquad (2.21)
where $m = 1, \ldots, M$ indexes the randomly chosen small set of input data and $\nabla C_m$ is the gradient of the cost
function for the input data $m$. To further improve the cost function minimization process, a momentum technique
can be included in the SGD algorithm. The momentum technique alters the update rule to account for the previous
update $\Delta\theta$, which can be interpreted as the "speed of the descent". A momentum parameter $\gamma$ is introduced,
allowing the next update to consider the previous update, thus maintaining the "speed" (hence momentum). This
technique can be summarized in the following equations:
\theta_{i+1} = \theta_i + \Delta\theta_i \qquad (2.22a)
\Delta\theta_i = -\eta \nabla C_i(\theta) + \gamma \Delta\theta_{i-1} \qquad (2.22b)
\theta_{i+1} = \theta_i - \eta \nabla C_i(\theta) + \gamma \Delta\theta_{i-1} \qquad (2.22c)
Equation (2.22c) defines the update rule for the SGD algorithm with momentum.
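A minimal sketch of the update rule in Equations (2.22), applied to a simple quadratic cost (illustrative only; in this thesis the gradients come from backpropagation, described in Section 2.4.2).

```python
import numpy as np

def grad_C(theta):
    # Gradient of the toy cost C(theta) = 0.5 * ||theta - target||^2.
    return theta - target

target = np.array([3.0, -2.0])
theta = np.zeros(2)
velocity = np.zeros(2)          # previous update, the "speed of the descent"
eta, gamma = 0.1, 0.9           # learning rate and momentum parameter

for _ in range(200):
    # Equation (2.22b): combine the new gradient with the previous update.
    velocity = -eta * grad_C(theta) + gamma * velocity
    # Equation (2.22a): apply the update.
    theta = theta + velocity

print(theta)  # close to the target, i.e. the minimizer of the toy cost
```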
2.4.2 Backpropagation
The Backpropagation algorithm was introduced in the 1970s, but it was not until 1986 that its importance in
neural network learning was fully appreciated [36]. In 1986, D. Rumelhart, G. Hinton and R. Williams argued that
the Backpropagation algorithm provided a faster learning process than earlier learning approaches [39].
Thus, this algorithm is the basis of modern neural network learning processes.
The goal of the Backpropagation algorithm is to calculate partial derivatives of a cost function with respect to
any weight or bias in the network. The algorithm provides insight into how changing the weights and biases of
the network impacts its overall output. It is utilized to calculate the necessary partial derivatives in order
to execute the SGD algorithm, which in turn will minimize the cost function. This interaction between both
aforementioned algorithms provides the necessary tools to train a neural network. In [36] a detailed explanation
of the Backpropagation algorithm is provided. It will be used as a guide through the remainder of this section.
Throughout the following explanation of the Backpropagation algorithm the following notation will be utilized:
• $w^l_{jk}$ denotes the weight of the connection from neuron $k$ of layer $(l-1)$ to neuron $j$ of layer $l$;
• $a^l_j$ denotes the activation of neuron $j$ in layer $l$;
• $b^l_j$ denotes the bias of neuron $j$ in layer $l$;
• matrices are written in bold upper-case letters, while vectors are written in bold lower-case letters.
In Figure 2.3 an example of the application of this notation is provided.
Figure 2.3: Example of a neural network with the notation used in Section 2.4.2 applied.
Using this notation, the activation of a neuron and the weighted input $z^l_j$ of a neuron in layer $l$ can be defined as:
a^l_j = \sigma\!\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) \qquad (2.23a)
z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j \qquad (2.23b)
where $\sigma(\cdot)$ represents a sigmoid function, as described at the beginning of Section 2.4. Rewriting these
equations in matrix form provides a better overall insight, as the equations become lighter due to fewer indices.
$$\mathbf{a}^l = \sigma(\mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l) \qquad (2.24a)$$
$$\mathbf{z}^l = \mathbf{W}^l \mathbf{a}^{l-1} + \mathbf{b}^l \qquad (2.24b)$$
with the sigmoid function applied element-wise. Thus, a^l is the activation vector containing all activations a^l_j, b^l is the bias vector containing all biases b^l_j, W^l is the weight matrix of layer l containing all the weights of that layer, and z^l is the vector of weighted inputs to the neurons in layer l.
In order to be properly used by the Backpropagation algorithm, the cost function must satisfy two constraints. Since the algorithm calculates partial derivatives for individual training examples, it must be possible to write the cost function as an average of the cost functions of individual training samples and as a function of the outputs of the network. These constraints are presented in Equations (2.25).
$$C = \frac{1}{n} \sum_x C_x \qquad (2.25a)$$
$$C = C(\mathbf{a}^L) \qquad (2.25b)$$
where L is the output layer. With these definitions, the four main equations of the Backpropagation algorithm can be presented, where the operator ⊙ denotes the element-wise (Hadamard) product of two vectors.
$$\delta^L = \nabla_a C \odot \sigma'(\mathbf{z}^L) \qquad (2.26a)$$
$$\delta^l = \left((\mathbf{W}^{l+1})^T \delta^{l+1}\right) \odot \sigma'(\mathbf{z}^l) \qquad (2.26b)$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad (2.26c)$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \qquad (2.26d)$$
As mentioned above, the main goal of the algorithm is to calculate the quantities in Equations (2.26c) and (2.26d). To do so, the quantity δ^l_j is calculated, which is the error of neuron j in layer l.
In Equation (2.26a) the error in the output layer L is computed. The quantity ∇_a C is a vector containing all partial derivatives ∂C/∂a^L_j, which can be interpreted as the rate of change of the cost function with respect to the output activations. In σ'(z^L), the rate of change of the activation function is measured at z^L. Equation (2.26b) provides the means to calculate the error of a layer l from the error of the next layer l + 1. By taking the error in the next layer, δ^{l+1}, multiplying it by the transpose of the weight matrix of layer l + 1 and then performing an element-wise product with σ'(z^l), the error is passed backwards through the network, hence the name backpropagation.
Combining Equations (2.26a) and (2.26b), the error at every layer can now be computed. Starting by computing the error at the output layer (Equation (2.26a)), the error at layer L − 1 can be computed (Equation (2.26b)), and so on. With the error at each layer, the partial derivatives ∂C/∂b^l_j and ∂C/∂w^l_{jk} can be calculated through Equations (2.26c) and (2.26d), as intended. A proof of these four fundamental equations is provided in [36].
2.4.3 Learning Algorithm
As shown with the Backpropagation equations (Equations (2.26)), the partial derivatives of a cost function for an input example can be computed. To train a neural network, the Backpropagation algorithm is then combined with a learning algorithm such as SGD, where the partial derivatives for multiple training examples are calculated. The learning algorithm can now be defined [36]:
1. Input the training data
2. For each training example x:
(a) Activation: set the input activation a^{x,1}.
(b) Feedforward:
For each layer l ∈ {2, . . . , L} compute:
i. z^{x,l} = W^l a^{x,l−1} + b^l
ii. a^{x,l} = σ(z^{x,l})
(c) Output error:
Compute δ^{x,L} = ∇_a C_x ⊙ σ'(z^{x,L})
(d) Backpropagate the error:
For each layer l ∈ {L − 1, . . . , 2} compute δ^{x,l} = ((W^{l+1})^T δ^{x,l+1}) ⊙ σ'(z^{x,l})
3. Gradient Descent:
For each layer l ∈ {L, . . . , 2} update the weights and the biases according to the update rules defined in Equation (2.22)
As shown in the algorithm above, the error is propagated backwards through the network, as it is calculated from the last layer to the first. This provides the network with insight on how the input data affects the output. Thus, by selecting small sets of input data and iterating through this algorithm, the cost function is minimized and the network learns.
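The sketch below illustrates the algorithm above for one training example of a fully-connected network: feedforward, output error, backpropagation of the error and the resulting gradients used by the SGD update. It is a simplified stand-in written in Python/NumPy, not the implementation used in this thesis, and it assumes a quadratic cost C = ½‖a^L − y‖².

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return the gradients dC/dW and dC/db for one example (x, y)."""
    # Feedforward: store all weighted inputs z^l and activations a^l
    activation, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Output error: delta^L = (a^L - y) * sigma'(z^L)   (Equation (2.26a))
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    grad_w[-1] = np.outer(delta, activations[-2])
    grad_b[-1] = delta
    # Backpropagate: delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l)   (Equation (2.26b))
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = np.outer(delta, activations[-l - 1])   # Equation (2.26d)
        grad_b[-l] = delta                                   # Equation (2.26c)
    return grad_w, grad_b
```

Averaging these gradients over a mini-batch and applying the update rule of Equation (2.22) completes one iteration of the learning algorithm.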
Chapter 3
Multi-Sample Shift Invariant Probabilistic
Latent Component Analysis
In this Chapter, the MSSIPLCA module of the program developed in this thesis will be explored. This module consists of the implementation of the MSSIPLCA method developed by Benetos, Ewert and Weyde to perform automatic transcription of pitched and unpitched sounds [6]. In Section 3.1 the MSSIPLCA model is presented as an extension of the PLCA model summarized in Section 2.3. In Section 3.2 the implementation of this model is addressed, including the data pre-processing stage, the instrument template creation and the overall performance of the implemented module.
3.1 MSSIPLCA Model
To achieve the following model, several modifications and extensions to the base PLCA method described in Section 2.3 were proposed by Benetos et al. Shift-invariance across log-frequency was added in order to detect tuning changes and frequency modulations, multiple spectral templates per instrument and per pitch were used, and each source contribution was made time- and pitch-dependent [40]. A diagram of the model with these extensions is presented in Figure 3.1. Sparsity constraints were added to control the polyphony level and the instrument contribution to the resulting transcription, and the spectral templates were pre-extracted and pre-shifted across log-frequency to reduce the computational effort [41]. Benetos, Ewert and Weyde's proposed model adds the ability to detect unpitched sounds (sounds produced by percussive instruments such as drums) [6]. In this Section, the latter model will be described as in [6], along with the aforementioned properties.
3.1.1 Mathematical Model
The input to the model is a log-frequency spectrogram. In [6], it is interpreted as a probability distribution across log-frequency ω and across time t, which is a strong assumption, as it directly interprets the energy of a spectrogram as a probability value. The log-frequency spectrogram is represented as V_{ω,t} and the corresponding probability distribution as P(ω, t). The probability distribution is then decomposed into the known quantity of the frame probability P(t) and the conditional distribution over log-frequency bins P(ω|t) (resulting from dividing the entire log-frequency range into consecutive and non-overlapping frequency intervals):
Figure 3.1: Shift-Invariant PLCA model with support for multiple templates per instrument and per pitch, presented in [40].
$$V_{\omega,t} \approx P(\omega, t) = P(t)\, P(\omega|t) \qquad (3.1)$$
The conditional distribution over log-frequency bins, P(ω|t), is then decomposed into two components: a pitched component and an unpitched component. The resulting decomposition is described in the following Equation:

$$P(\omega|t) = P(r = h|t)\, P_h(\omega|t) + P(r = u|t)\, P_u(\omega|t) \qquad (3.2)$$

where P_h(ω|t) is the spectrogram approximation of the pitched component and P_u(ω|t) is the spectrogram approximation of the unpitched component. The probability P(r|t) weights the respective component over time, with r ∈ {h, u} for the pitched (h) and unpitched (u) components respectively.
Considering only the pitched component P_h(ω|t), as in [40], a latent variable p representing the pitch (on the MIDI scale) is added to the model. The resulting pitched component is:

$$P_h(\omega|t) = \sum_{p=p_{min}}^{p_{max}} P(\omega|p, t)\, P(p|t) \qquad (3.3)$$
Additionally, a latent variable s for the instrument source (the instrument index) and a latent variable f for pitch shifting across log-frequency (also referred to as the shifting parameter) are added to the model, obtaining:

$$P_h(\omega|t) = \sum_{p,s} P_h(\omega|s, p) *_{\omega} P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) \qquad (3.4)$$

where P_h(ω|s, p) represents the spectral template for a given pitch p and instrument s, P_h(f|p, t) represents the time-varying log-frequency shift per pitch, which is convolved with P_h(ω|s, p) across ω (operator ∗_ω), P_h(s|p, t) represents the instrument contribution per pitch across time, and P_h(p|t) is the pitch activation across time.
To obtain Equation (3.4) the chain rule is applied. This rule states how a joint probability distribution can be represented in terms of conditional probabilities. It is described in Equation (3.5a), and in Equation (3.5b) the decomposition of a 4-variable probability distribution is shown, resulting from repeatedly applying the chain rule to the final term of the decomposition.

$$P(A_n, \ldots, A_1) = P(A_n|A_{n-1}, \ldots, A_1) \cdot P(A_{n-1}, \ldots, A_1) \qquad (3.5a)$$
$$P(A_4, A_3, A_2, A_1) = P(A_4|A_3, A_2, A_1) \cdot P(A_3|A_2, A_1) \cdot P(A_2|A_1) \cdot P(A_1) \qquad (3.5b)$$
Finally, writing out the convolution operator in Equation (3.4) explicitly, we get the following model for the pitched component:

$$P_h(\omega|t) = \sum_{p,f,s} P_h(\omega - f|s, p)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) \qquad (3.6)$$
In order to reduce the computational effort of the parameter estimation steps that follow (Section 3.1.2), the use of pre-extracted and pre-shifted templates is introduced [6, 41]. With this modification, the proposed model for the pitched component is described as follows:

$$P_h(\omega|t) = \sum_{p,f,s} P_h(\omega|s, p, f)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) \qquad (3.7)$$

where P_h(ω|s, p, f) are the spectral templates per pitch p and instrument s, shifted across log-frequency according to f; P_h(f|p, t) represents the time-varying log-frequency shift per pitch; P_h(s|p, t) represents the instrument contribution per pitch across time; and P_h(p|t) is the pitch activation. The time-frequency representation, as in [6], has a spectral resolution of 5 bins per semi-tone, thus having f ∈ {1, . . . , 5}, allowing the templates to be shifted by ±1/2 semi-tone and having the ideal tuning position at f = 3.
For the unpitched component, 2 latent variables were added: d, which denotes the drum kit component used, and z, which is the index of the templates used for each component. Applying the same process as for the pitched component yields the following decomposition:

$$P_u(\omega|t) = \sum_{d,z} P_u(\omega|d, z)\, P_u(d|t)\, P_u(z|d, t) \qquad (3.8)$$

where P_u(ω|d, z) denotes the z-th spectral template for the drum component d, P_u(d|t) represents the drum component activation and P_u(z|d, t) denotes the template contribution per drum component over time.
The overall mathematical model is obtained when both components are considered (Equations (3.7) and (3.8)):

$$V_{\omega,t} \approx P(\omega, t) = P(t)\, P(r = h|t)\, P_h(\omega|t) + P(t)\, P(r = u|t)\, P_u(\omega|t)$$
$$= P(t)\, P(r = h|t) \sum_{p,f,s} P_h(\omega|s, p, f)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t) + P(t)\, P(r = u|t) \sum_{d,z} P_u(\omega|d, z)\, P_u(d|t)\, P_u(z|d, t) \qquad (3.9)$$
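To make the factorization concrete, the sketch below reconstructs the pitched-component approximation of Equation (3.7) from the individual factor tensors. It is an illustrative re-statement in Python/NumPy (the array shapes are assumptions for this example), not the Matlab code actually used in this work.

```python
import numpy as np

def pitched_spectrogram(templates, Pf, Ps, Pp):
    """Approximate P_h(omega|t) as in Equation (3.7).

    templates : P_h(omega|s,p,f), shape (n_omega, n_s, n_p, n_f),
                pre-extracted templates, pre-shifted across log-frequency
    Pf        : P_h(f|p,t), shape (n_f, n_p, n_t), log-frequency shift per pitch
    Ps        : P_h(s|p,t), shape (n_s, n_p, n_t), instrument contribution per pitch
    Pp        : P_h(p|t),   shape (n_p, n_t),      pitch activation
    returns   : P_h(omega|t), shape (n_omega, n_t)
    """
    # Sum over pitch p, shift f and source s of the product of all factors
    return np.einsum('wspf,fpt,spt,pt->wt', templates, Pf, Ps, Pp)
```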
3.1.2 Unknown Parameter Estimation
The mathematical model presented in Equation (3.9) has several parameters, some of which are fixed and known while the others are unknown. The next step of the MSSIPLCA method proposed in [6] is to estimate these unknown parameters. In Table 3.1, the parameters of the previously mentioned model are detailed. As mentioned, the parameters P_h(ω|s, p, f) and P_u(ω|d, z) are fixed and known; they correspond to the pre-extracted and pre-shifted templates.
Table 3.1: Parameters used in the implemented MSSIPLCA model, proposed in [6].

Parameter | Component | State | Description
P(t) | — | known | Spectrogram energy
P(r|t) | — | unknown | Weights a component contribution over time
Ph(ω|s,p,f) | pitched | known | Spectral templates per pitch and instrument, shifted according to f
Ph(f|p,t) | pitched | unknown | Log-frequency shift per pitch, over time
Ph(s|p,t) | pitched | unknown | Instrument contribution per pitch, over time
Ph(p|t) | pitched | unknown | Pitch activation, over time
Pu(ω|d,z) | unpitched | known | Spectral template per drum component
Pu(z|d,t) | unpitched | unknown | Template contribution per drum component, over time
Pu(d|t) | unpitched | unknown | Drum component activation
To estimate the unknown parameters, the EM algorithm is used [33], as in Section 2.3. The model's log-likelihood is defined as:

$$\mathcal{L} = \sum_{\omega,t} V_{\omega,t} \log(P(\omega, t)) \qquad (3.10)$$
Again, the EM algorithm is divided into two distinct steps. In the Expectation step, the contribution of the latent variables is estimated by a weighting function. This process results in the following Equations, for the pitched and unpitched components respectively:

$$P(s, p, f, r = h|\omega, t) = \frac{P(r = h|t)\, P_h(\omega|s, p, f)\, P_h(f|p, t)\, P_h(s|p, t)\, P_h(p|t)}{P(\omega|t)} \qquad (3.11a)$$
$$P(d, z, r = u|\omega, t) = \frac{P(r = u|t)\, P_u(\omega|d, z)\, P_u(d|t)\, P_u(z|d, t)}{P(\omega|t)} \qquad (3.11b)$$
In the Maximization step, the marginals are re-estimated, this time using the estimates calculated in the Expectation step, resulting in the following Equations for the pitched component:

$$P(r = h|t) \propto \sum_{s,p,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t) \qquad (3.12a)$$
$$P_h(f|p, t) = \frac{\sum_{\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)}{\sum_{\omega,s,f} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)} \qquad (3.12b)$$
$$P_h(s|p, t) = \frac{\sum_{f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)}{\sum_{f,\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)} \qquad (3.12c)$$
$$P_h(p|t) = \frac{\sum_{s,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)}{\sum_{s,f,\omega,p} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)} \qquad (3.12d)$$
and the following Equations for the unpitched component:

$$P(r = u|t) \propto \sum_{d,z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t) \qquad (3.13a)$$
$$P_u(d|t) = \frac{\sum_{z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)}{\sum_{z,\omega,d} V_{\omega,t}\, P(d, z, r = u|\omega, t)} \qquad (3.13b)$$
$$P_u(z|d, t) = \frac{\sum_{\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)}{\sum_{\omega,z} V_{\omega,t}\, P(d, z, r = u|\omega, t)} \qquad (3.13c)$$
In music, generally, only a few notes are active at the same time, and in a small time interval these notes are produced by few instrument sources. With the pre-extracted templates, the model described above has more information than its input requires. Thus, to control the polyphony level and the instrument contribution over time, sparsity is enforced [6, 40, 41]. Sparsity is enforced on the parameters through the use of a scaling factor α in the update Equations — Equations (3.12) and (3.13). When this scaling factor is greater than 1, the probability distributions are sharpened, pushing most weights towards 0 while only a few remain large, thus enforcing sparsity. Below are the new constrained update Equations:
$$P_h(f|p, t) = \frac{\left(\sum_{\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}}{\sum_f \left(\sum_{\omega,s} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}} \qquad (3.14a)$$
$$P_h(s|p, t) = \frac{\left(\sum_{f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}}{\sum_s \left(\sum_{f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}} \qquad (3.14b)$$
$$P_h(p|t) = \frac{\left(\sum_{s,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}}{\sum_p \left(\sum_{s,f,\omega} V_{\omega,t}\, P(s, p, f, r = h|\omega, t)\right)^{\alpha}} \qquad (3.14c)$$
$$P_u(d|t) = \frac{\left(\sum_{z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}}{\sum_d \left(\sum_{z,\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}} \qquad (3.14d)$$
$$P_u(z|d, t) = \frac{\left(\sum_{\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}}{\sum_z \left(\sum_{\omega} V_{\omega,t}\, P(d, z, r = u|\omega, t)\right)^{\alpha}} \qquad (3.14e)$$
In the overall model, only the pitch activation parameter P_h(p|t) is enforced with sparsity: the remaining parameters use α = 1 while P_h(p|t) uses α = 1.1, as in [6].
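The sketch below illustrates one EM iteration for the pitched component only: the E-step of Equation (3.11a) followed by the sparsity-constrained M-step of Equations (3.14a)-(3.14c). It is a simplified Python/NumPy illustration (the component weighting P(r|t) is omitted and the full joint tensor is materialized in memory), not the Matlab implementation used in this thesis.

```python
import numpy as np

def em_iteration(V, templates, Pf, Ps, Pp, alpha_p=1.1, eps=1e-12):
    """One EM iteration of the pitched-only model (sparsity only on P_h(p|t)).

    V         : log-frequency spectrogram, shape (n_omega, n_t)
    templates : P_h(omega|s,p,f), shape (n_omega, n_s, n_p, n_f)
    Pf, Ps    : shift and source distributions, shapes (n_f,n_p,n_t), (n_s,n_p,n_t)
    Pp        : pitch activation, shape (n_p, n_t)
    """
    # Product of all factors and the model approximation P(omega|t)
    # (memory-heavy; written this way only for clarity of the sketch)
    joint = np.einsum('wspf,fpt,spt,pt->wspft', templates, Pf, Ps, Pp)
    approx = joint.sum(axis=(1, 2, 3)) + eps                 # shape (n_omega, n_t)
    post = joint / approx[:, None, None, None, :]            # E-step, Eq. (3.11a)

    # Expected counts: weight the posterior by the observed spectrogram energy
    counts = V[:, None, None, None, :] * post                # (n_omega,n_s,n_p,n_f,n_t)

    # M-step with sparsity exponent alpha, Eq. (3.14a)-(3.14c)
    def normalize(num, axis, alpha=1.0):
        num = num ** alpha
        return num / (num.sum(axis=axis, keepdims=True) + eps)

    Pf_new = normalize(counts.sum(axis=(0, 1)).transpose(1, 0, 2), axis=0)   # (n_f,n_p,n_t)
    Ps_new = normalize(counts.sum(axis=(0, 3)), axis=0)                      # (n_s,n_p,n_t)
    Pp_new = normalize(counts.sum(axis=(0, 1, 3)), axis=0, alpha=alpha_p)    # (n_p,n_t)
    return Pf_new, Ps_new, Pp_new
```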
3.2 Implementation
This module was developed in Matlab due to its toolbox integration characteristics. To implement the mathematical model described in Section 3.1, several Matlab toolboxes were used. To implement the MSSIPLCA method, the toolbox provided in [6] was used (MSSIPLCA Toolbox). This toolbox contained a demo implementation of the aforementioned model, with a template library for the pitched and unpitched components. This demo was adapted to consider dynamic libraries and was parametrized with the desired sampling frequency, spectral resolution and audio input size. Furthermore, a template extraction module was developed, based on the demo code provided by Emmanouil Benetos, to create template libraries of the instruments considered in this thesis's audio data sets. To calculate the CQT, the Matlab toolbox provided in [29] is used (CQT Toolbox). This toolbox contains the tools to efficiently calculate a CQT spectrogram from a sound signal. Finally, to generate MIDI files for testing, the toolbox provided in [42] is used (MIDI Toolbox). This toolbox contains the tools to create, modify and extract information from MIDI files.
The developed module only concerns the pitched component of the MSSIPLCA method presented in [6]; unpitched sound signals are out of the scope of this thesis. Although the unpitched component is implemented and properly functioning, no tests or modifications were made to it. If an unpitched sound signal is provided, the module will transcribe it using the pre-extracted templates provided by the demo implementation of the MSSIPLCA Toolbox. Since no unpitched sounds will be provided to the module or to the overall system throughout this thesis, from now on no mention of the unpitched component will be made and the pitched component will be the only one considered.
3.2.1 Template Extraction
As mentioned in Section 3.1.1, the model uses pre-extracted and pre-shifted spectral templates per pitch and per
instrument. In order to grant instrument diversity to the transcription process, multiple templates were extracted.
Using a digital instrument library, 82 templates were extracted, combining 42 different instrument sources with 8 different playing characteristics (e.g. vibrato and staccato) and obtaining a total of 17 generic instruments (e.g. for a piano we can have distinct piano sources, generating one generic instrument). In this thesis an interval of 88 notes was considered, limiting the notes played to the interval p ∈ {21, . . . , 108}, in MIDI scale values. In Table 3.2 the generic instruments considered are displayed. For each of the above mentioned combinations,
an audio file was created with every note played individually, as in Figure 3.2.
To extract a pitched template from an audio file, a variant of the PLCA algorithm was used with only one latent variable component. This latent variable denotes the pitch, as in Equation (3.3). Applying this algorithm yields the same output as applying an NMF algorithm with beta divergence, as stated in [43]. Through the use of the NMFlib Toolbox [44], a direct implementation of the NMF algorithm with beta divergence is applied (based on the code provided by Emmanouil Benetos, the author of [6]). This method was applied to input files of the digital instruments playing their entire range of notes individually, following the note scale. The log-spectrogram of the audio signal was computed, again using the CQT, maintaining the same log-frequency resolution as in Section 3.2.2, but using a temporal step of 100 ms. The NMF algorithm was then applied, thus obtaining a pitch template of the audio file.
Finally, after extracting the pitched templates through the process described above, the templates were shifted across log-frequency, as described in Section 3.1.1, thus obtaining a pre-extracted and pre-shifted template per pitch and per instrument. After computing all the templates, they were saved in a matrix, creating a template library.
Table 3.2: Generic instruments present in the developed template library.
Instrument
Bass
Brass
Cello
Clarinet
Contra Bassoon
Double Bass
Flute
French Horn
Guitar
Oboe
Piano
Sax
Trombone
Trumpet
Tuba
Viola
Violin
3.2.2 Pre-processing
The first process executed in this module is the pre-processing stage, where the log-spectrogram of the input signal is obtained using the CQT. The CQT is performed with a log-frequency resolution of 60 bins per octave, yielding 545 frequency bins, and the log-spectrogram is sampled with a 40 ms step. As mentioned in Section 2.2, in this time-log-frequency representation the relative distance between the harmonics is constant, as can be seen in Figure 3.2. In this Figure, the log-spectrogram of a piano performing every note on its keyboard individually is displayed.
3.2.3 Post-processing
After applying the MSSIPLCA method and estimating the unknown parameters, the transcriptions can be extracted. The transcriptions are extracted on the MIDI pitch scale. The total pitched component transcription and the unpitched component transcription can be extracted as, respectively:

$$P_h(p, t) = P(t)\, P(r = h|t)\, P_h(p|t) \qquad (3.15a)$$
$$P_u(d, t) = P(t)\, P(r = u|t)\, P_u(d|t) \qquad (3.15b)$$

To extract the transcription of each instrument source, the latent variable s should be fixed to the target instrument index λ, and the following calculation should be performed:
Figure 3.2: Example of the CQT of a piano performing every note individually. Again it can be seen that the relative distance between the harmonics remains constant, as mentioned in Section 2.2.
$$P_h(s, p, t) = P(t)\, P(r = h|t)\, P_h(p|t)\, P_h(s = \lambda|p, t) \qquad (3.16)$$
Performing this calculation yields a piano-roll-like matrix, which contains a raw transcription output. In order to obtain a good piano-roll transcription, some post-processing steps are performed. The first step consists in normalizing the raw output. The normalization is performed with the following operation:

$$\frac{P_h(s, p, t)}{\max\left(\sum_{s,p,t} |P_h(s, p, t)|\right)} \qquad (3.17)$$
After obtaining a normalized raw transcription, thresholding is performed. Given a threshold parameter σ, the transcription is converted into a binary matrix: if the value of the raw transcription surpasses σ the output is 1, and 0 otherwise. Lastly, since the minimum duration of a note is defined to be 0.2 seconds, events with a duration shorter than 80 ms are removed in an effort to eliminate small transcription errors.
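A minimal sketch of these post-processing steps is shown below (Python/NumPy; parameter values and the simplification of the normalization in Equation (3.17) to a division by the maximum value are assumptions for illustration, not the behaviour of the Matlab module).

```python
import numpy as np

def postprocess(raw, sigma=0.1, min_frames=2):
    """Convert a raw piano-roll transcription into a binary one.

    raw        : raw transcription matrix, shape (n_pitches, n_frames)
    sigma      : detection threshold applied after normalization
    min_frames : minimum event length in frames (2 frames = 80 ms at a 40 ms step)
    """
    piano_roll = raw / (np.max(np.abs(raw)) + 1e-12)      # normalize (simplified)
    piano_roll = (piano_roll > sigma).astype(int)         # threshold to a 0/1 matrix
    # Remove note events shorter than min_frames
    for p in range(piano_roll.shape[0]):
        run_start = None
        for t in range(piano_roll.shape[1] + 1):
            active = t < piano_roll.shape[1] and piano_roll[p, t] == 1
            if active and run_start is None:
                run_start = t
            elif not active and run_start is not None:
                if t - run_start < min_frames:
                    piano_roll[p, run_start:t] = 0
                run_start = None
    return piano_roll
```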
3.2.4 Performance Evaluation
An overview of the implemented module can be seen in the system diagram presented in Figure 3.3. To evaluate
the module’s performance a test experiment was conducted. In this Section this experiment is described.
The aim of this experiment is to test the performance of the implemented module when facing different template library sizes. For each audio file considered, the model was executed while incrementing the template library size. The full library considered included 8 templates of 8 distinct instruments. In Table 3.3 these templates and their
Figure 3.3: System diagram of the developed Module 1, adapted from [40].
characteristics are presented. The template library is incremented by first choosing the templates of the instruments present in the audio file. For instance, if an audio file has instruments A and B performing, the library starts with a template of A or B, and then the template of B or A is added. Afterwards, the remaining templates are added randomly.
Test data
To create a plausible data set for testing, the audio files were created using random MIDI files. These MIDI files were generated by an auxiliary module, developed with the MIDI Toolbox mentioned at the beginning of the chapter. This module generates a random MIDI file under constraints of total file time, number of notes per file, minimum and maximum note duration, and polyphony control (whether notes are allowed to be played in the same time frame). The notes were randomly selected and spread across time. For this test the MIDI files used had the following characteristics: 30 seconds of total duration, note durations in [0.2 s, 4 s] and polyphony allowed.
For this test, 30 different audio files were generated. They were divided into 3 polyphony levels (3 sets of 10 files). In the first level, the audio files have only one channel, with one of the 8 instruments performing according to a random MIDI file with the characteristics presented above. In the second level, 2 channels were used, with two distinct instruments performing at the same time, each according to an individually assigned MIDI file. In the third level another distinct instrument was added, generating audio files with 3 channels.
Table 3.3: Template library considered in this experiment. The pitch activity range is represented in MIDI scale numbers.
Index Instrument Playing Style Pitch Activity (Range)
19 Bass Open [23 64]
30 French Horn Normal [29 65]
33 Trumpet Normal [40 76]
43 Sax Legato [21 96]
46 Bb Clarinet Normal [36 79]
58 Oboe Normal [63 99]
68 Cello Normal [24 67]
80 Violin Normal [43 89]
Metrics
In order to evaluate the accuracy of the transcription, the following metric was applied. The time interval of each note in the ground truth is inspected in the transcription result, with a tolerance of 40 ms. If a note is present in this time interval, with a duration of over δ (%) of the original interval, then this note is considered correctly transcribed. This process detects accurately transcribed notes as well as false negative transcriptions.
To detect false positive transcriptions, the same process was applied in the opposite direction, by inspecting the time interval of each detected note of the transcription result in the ground truth file. Through a fine-tuning process, the parameter δ was fixed at 75% for both metric processes in this experiment. This value of δ provides a plausible accuracy criterion, as it is neither excessively high, which would ignore notes not fully transcribed, nor excessively low, which would count notes containing transcription errors as correct.
To prevent false detections due to temporal synchronization issues, prior to the evaluation both the ground truth and the transcription are aligned temporally. This alignment is based on the onset times of notes with the same pitch: the onset differences between the ground truth and the transcription of all detected notes are processed to generate an overall shift value that aligns both time frames.
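A sketch of this note-matching procedure is given below (Python). It is illustrative only — details such as how overlapping detections are accumulated are assumptions, and the actual Matlab evaluation code may differ.

```python
def count_correct(ground_truth, transcription, delta=0.75, tol=0.04):
    """Count ground-truth notes that are covered by the transcription.

    ground_truth, transcription : lists of (pitch, onset, offset) in seconds
    delta : fraction of the reference duration that must be covered (75%)
    tol   : onset/offset tolerance in seconds (40 ms)
    Returns (correct, false_negatives). Running the same function with the two
    arguments swapped gives the false-positive count.
    """
    correct = 0
    for pitch, onset, offset in ground_truth:
        covered = 0.0
        for p, s, e in transcription:
            if p != pitch:
                continue
            # overlap with the reference interval, expanded by the tolerance
            overlap = min(offset + tol, e) - max(onset - tol, s)
            covered += max(0.0, overlap)
        if covered >= delta * (offset - onset):
            correct += 1
    return correct, len(ground_truth) - correct
```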
Results
After executing the model for all the audio files of the test data set, the results were grouped by the polyphony level of the data set. The percentage of correctly transcribed notes for each level and each library size, considering the number of notes in the ground truth and the number of correctly transcribed notes detected, is presented in Table 3.4. The corresponding plot can be observed in Figure 3.4. The percentage of false positives detected, relative to the total number of ground truth notes, can be seen in Figure 3.5.
Figure 3.4: Module 1 evaluation test results: percentage of correctly transcribed notes plotted against the size of the template library considered.
After inspecting the results in Table 3.4, it can be concluded that adding unnecessary templates to the library degrades the module's transcription performance. Visually this can be seen in the plot
Table 3.4: Module 1 evaluation test results: percentage of notes correctly transcribed and resulting transcription error.
Library size: 1 2 3 4 5 6 7 8 Average
Level 1: 0.7424 0.7264 0.6852 0.6835 0.6304 0.5941 0.5687 0.4848 0.6394
Level 2: 0.6044 0.6185 0.5760 0.5518 0.5319 0.4836 0.4398 0.4307 0.5295
Level 3: 0.5427 0.5423 0.5620 0.5222 0.4975 0.4464 0.4094 0.3636 0.4857
Accuracy: 0.6298 0.6290 0.6077 0.5858 0.5533 0.5080 0.4726 0.4264 0.5516
Error: 0.3702 0.3709 0.3923 0.4142 0.4467 0.4920 0.5274 0.5736 0.4484
in Figure 3.4, where the percentage of correctly transcribed notes decreases as templates of instruments not present in the audio file are added. On the other hand, if there are not enough templates to match the instruments present in the audio file, the transcription also performs worse. This can be seen by observing the curves for levels 2 and 3 where the library size is smaller than 2 and 3, respectively.
Another conclusion that can be inferred is that the overall performance of the module decreases as the polyphony of the audio file increases. Observing the ideal transcription for each level (level 1 with a library size of 1, level 2 with a library size of 2 and level 3 with a library size of 3), the average accuracy obtained is 74.24% for audio files with only one instrument. For audio files with 2 instruments performing, the average accuracy is 61.85%, and for audio files with 3 instruments performing at the same time the average accuracy is 56.20%.
Figure 3.5: Module 1 evaluation test results: percentage of false positive note transcriptions plotted against the template library size.
In the plot of Figure 3.5, the percentage of false positive notes detected with respect to the total number of notes in the ground truth can be observed. Inspecting this graph yields yet another conclusion: the number of false positives increases with the polyphony level when trying to transcribe a polyphonic piece with a template library smaller than the number of instruments performing. In the aforementioned graph, when the template library size equals the number of instruments, the percentage of false positives decreases significantly. This can be explained by the module's attempt to assign notes that are not performed by the instrument considered to the instruments existing in the template library. Thus, when the library is smaller than the existing number of instruments, the number of false positives can be higher than the number of notes that the instrument actually performed (as it takes into account notes from other instruments).
In order to visualize the conclusions inferred above, a transcription example is presented in Figure 3.6. In this
example the audio file has two instruments performing. In Figure 3.6a, the ground truth transcription for one of the
instruments is presented. In Figures 3.6b, 3.6c and 3.6d transcription results of that instrument are presented.
In Figure 3.6b, the template library has only the template of the instrument considered. As such, the resulting transcription includes notes that belong to the second instrument in the audio file (false positives). In Figure 3.6c
the template library has 2 templates, one for each instrument present in the audio file. The factorization performed
by the module now has all the factors it "needs", so the false positives disappear from the transcription of the
instrument considered. Finally, in Figure 3.6d, the template library has 8 templates including the correct ones.
Here we can visualize the impact of adding unnecessary templates to the library. The factorization performed
by the module tries to distribute the spectrogram energy among elements that did not contribute to it, damaging the transcription of the instruments that are in fact performing in the input audio file.
3.2.5 Proposed Solution
As seen above, the library size directly influences the performance of the module. Choosing a library that is too large or too small will result in higher transcription errors, which makes the module directly dependent on the library. In order to grant autonomy to the module and to remove human interaction from the transcription process, an instrument classifier was developed. This instrument classifier detects the instruments in an audio file, providing insight to this module on how many templates it should use. Thus, an attempt is made to automatically infer the proper size of the template library, granting autonomy to the module and removing human interaction from its process. The instrument classifier is implemented in a second module, described in Chapter 4. It consists of a Convolutional Neural Network that classifies the musical instruments detected in an input log-spectrogram.
(a) Ground truth. (b) Library size = 1 template.
(c) Library size = 2 templates. (d) Library size = 8 templates.
Figure 3.6: Example of transcription results with different library sizes, for an audio file with 2 instruments performing.
Chapter 4
Convolutional Neural Network
To address the limitation imposed by the usage of a static and user-defined template library in the MSSIPLCA module presented in Chapter 3, a second module was developed — the CNN module. It is a classification module that performs classification through a Convolutional Neural Network. In this Chapter this module will be detailed, as well as the fundamentals of CNNs. In Section 4.1 an explanation regarding this type of network is presented. In Section 4.2 the module's implementation specifics are presented, including the network's architecture, the training phase and its performance evaluation.
4.1 Convolutional Neural Networks
Convolutional Neural Networks are feed-forward deep neural networks. They are inspired by the architecture of the visual cortex of animals, namely the cat's visual cortex (as in [45]). CNNs are considered to be among the best pattern recognition systems [37]. This can be seen in the handwritten character recognition task, where in 1998 LeCun et al. developed a benchmark system with state-of-the-art performance [46].
Regular neural networks, such as the MLP, take the input data and, through its propagation through the hidden layers, generate an output. These hidden layers, as seen in Section 2.4, consist of several neurons (e.g. the perceptron). These neurons are fully connected to the neurons of the previous layer, as in Figure 2.2a. When the input of the network is an image, it is not hard to see that this fully-connected architecture does not scale properly, with the number of parameters adding up through the layers. This particularity makes training an arduous and computationally expensive process [47].
In CNNs the inputs are interpreted as images, having 3 dimensions: width, height and depth (with the latter corresponding to the red, green and blue channels when dealing with real images). Thus, the neurons in a CNN also have these 3 dimensions. The neurons also have the particularity of only being connected to a specific spatial region of the previous layer. This architecture scales well with input images and allows the network to be trained, unlike fully-connected networks, where training in these conditions would be very difficult or even impossible [37].
4.1.1 CNN layers and architecture
There are three main types of layers when building a CNN: Convolutional layers, Pooling layers and Fully-connected layers. In the following section these fundamental layers will be described, alongside other layer types that can be applied to CNNs. Stacking multiple layers in different type combinations generates a fully functioning CNN architecture.
Convolutional layers
Convolutional layers are the fundamental piece of CNNs. They are composed of a set of parameters consisting of a set of weights, often called filters or kernels. As in regular neural networks, these weights can be updated in a training process to learn different representations of the data. A filter has a small width and height compared to the input, but it has the same depth. In the forward pass of the learning process (Subsection 2.4.3), each filter is convolved across the width and the height of the input, hence the name convolutional network. As the filters pass over the input image, they are updated during training so as to activate when a certain feature arises in a specific spatial location. This process creates an activation map. These activation maps may also be followed by an element-wise activation function, such as the Rectified Linear Unit (ReLU), which will be explained below.
Each neuron's output can be interpreted as the result of a neuron analysing a small spatial location [47]. This spatial location is called the receptive field, which dictates the size of the spatial region to be analysed by the neuron. This denotes an important property of CNNs: the neurons are locally connected.
The full output of the layer consists of the stacked activation maps, creating a 3-dimensional output. The width and height of this output are given by the convolution operation between the filters and the input. The depth of this output is a chosen quantity: it denotes how many neurons are desired to analyse the same spatial location. This group of neurons can be interpreted as a depth column, as seen in grey in Figure 4.1.
Each depth column has an assigned spatial location. These spatial locations often overlap, causing different depth columns to analyse partially the same spatial location. This overlap is dictated by the stride. For example, if the stride is set to 1, a new depth column is assigned a spatial location 1 spatial unit apart from the previous one. As the convolution operation changes the size of the input image, zero-padding can be used to prevent this from happening. Zero-padding consists in adding zeros to the spatial borders of the input. It allows control over the output dimensions [47].
The previous quantities denote the hyper-parameters of a convolutional layer. A hyper-parameter can be interpreted as a high-level parameter that influences the model's performance. The output size can then be computed from the hyper-parameters and the filter information [48], as follows:

$$O_{h,w} = \frac{I_{h,w} - F_{h,w} + 2P}{S} + 1 \qquad (4.1a)$$
$$O_d = K \qquad (4.1b)$$

where O_{h,w} denotes the output height and width (which are calculated equally), O_d denotes the output depth, I_{h,w} is the input height and width, F_{h,w} is the filter's height and width, P is the amount of zero-padding used, S is the stride and K is the number of filters used.
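As a quick worked example of Equation (4.1a), the following Python snippet computes the output size along one spatial dimension (the input width of 32 used here is an arbitrary illustrative value, not a dimension from this thesis's network).

```python
def conv_output_size(i, f, p, s):
    """Output size along one spatial dimension, Equation (4.1a)."""
    return (i - f + 2 * p) // s + 1

# Example: a 32-unit-wide input, a filter of width 3, zero-padding of 1 and stride 2
print(conv_output_size(32, 3, 1, 2))  # -> 16
```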
Another important property of convolutional layers is their parameter sharing characteristic. This characteristic is based on the assumption that if it is useful to compute a set of features in one position, it is also useful to compute it in the remaining positions. This means that, for a fixed depth, all neurons share the same weights and bias, thus reducing the number of parameters and facilitating the learning process. In practice, during the Backpropagation phase each neuron calculates the gradient of its weights, but in the end all gradients are added up within each depth level. Since all the neurons in a depth level share the same weights, the forward pass of the layer at each depth level is the convolution between the neurons' weights (filters) and the input volume. This process results in an activation map, and the set of all activation maps for all depth levels creates the output volume of the layer [47].
Pooling layers
A Pooling layer is inserted with the intent of achieving spatial invariance by reducing the spatial size of its input; it downsamples the input resolution. A direct consequence of this resolution reduction is a smaller number of parameters, thus reducing the computational effort of training the network. Reducing the number of parameters also provides overfitting control [49]. Again, this reduction is independent for each depth level, maintaining the depth resolution. There are multiple types of pooling operations: Max Pooling, Subsampling and Average Pooling, to name a few.
Figure 4.2: Example of Max Pooling on an input depth level.
Due to its success in capturing invariances in image-like data, Max Pooling is the most commonly applied pooling operation [49]. Max pooling applies filters with a given stride to the input. The filters define the spatial region over which the maximum operator is applied. In Figure 4.2, an example of this operation is provided. It can be observed that (2 × 2) filters were used, with a stride of 2. The Max pooling operation then selects the maximum of each filter's spatial location. The output map has 3/4 fewer activations than the input map, thus reducing the resolution.
In Equations (4.2) the output size of the Pooling layer is computed.

$$O_{h,w} = \frac{I_{h,w} - F_{h,w}}{S} + 1 \qquad (4.2a)$$
$$O_d = I_d \qquad (4.2b)$$

As mentioned above, the output map is reduced, thus down-sampling the image (Equation (4.2a)), while its depth remains constant (Equation (4.2b)).
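A minimal NumPy sketch of 2 × 2 max pooling with stride 2 on a single depth slice, matching the operation of Figure 4.2 (the 4 × 4 test input is an arbitrary example):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on one depth slice (height and width assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # 2x2 output containing the maximum of each 2x2 block
```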
ReLU Layers
ReLUs are neurons with the non-saturating non-linear activation function f(x) = max(0, x), where x is the neuron's input [50]. These layers apply the mentioned non-linearity element-wise, thus leaving the input size unaltered. They are stacked after convolutional layers, as they provide a faster learning process than the typical sigmoid non-linearities [51].
Dropout Layers
One major concern when dealing with large and complex deep neural networks is overfitting. Overfitting occurs when the network has enough complexity to memorize the training data, losing its generalization capabilities. The probability of overfitting increases with the size of the network [52].
The dropout technique was introduced to address the overfitting concern in artificial neural networks [53]. With this technique, overfitting is prevented by temporarily removing some neurons and their connections during the training process. The dropped units are randomly chosen and each unit is retained with a fixed probability (usually 0.5), independently of the other units. After removing the units, the resulting network is a thinned sample of the original network. As this process repeats itself in each training iteration, these sampled thinner networks always consist of different neurons and connections of the original network, so each sampled network is trained very rarely. At the end of training, an averaging technique is performed: the outgoing weights of a retained unit are multiplied by the aforementioned fixed probability in order to combine all the sampled networks into one single network [53].
As stated in [53], neural networks using the dropout technique can be trained in a similar manner to regular neural networks. The only difference is that the forward and backpropagation passes are applied to the sampled networks instead of the original network. Using this technique leads to a significantly lower generalization error, thus preventing overfitting.
Fully-connected layers
Fully-connected layers, as the name implies, are layers whose neurons are fully connected to the neurons of the previous layer. These are the standard neural network layers seen in Section 2.4. They are usually employed in the last layers of a CNN architecture, to provide a high-level insight into the input data.
It is important to mention that the difference between a convolutional layer and a fully-connected layer lies in the local connectivity and parameter sharing properties of the convolutional layer. Setting the convolutional layer's filter size to match the spatial size of the input (height and width) results in an output of size (1 × 1 × K), where K is the number of filters. Thus, such a layer acts as a fully-connected layer with K neurons [47].
Softmax Loss Layers
A Loss layer is the last layer of a neural network architecture. It generates the final output, and thus the classification. It consists of a Fully-connected layer with an applied loss function. The standard function is the Softmax loss function [54]:

$$f(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \ldots, K \qquad (4.3)$$

The Softmax loss function (or cost function) provides a probabilistic insight into the resulting classification, as it outputs normalized class probabilities. This function is displayed in Equation (4.3). Given a K-dimensional vector z of arbitrary scores, it outputs a vector with the corresponding values in the [0, 1] interval, with the total sum of the output vector equal to 1 [47].
With a Softmax loss layer as the last network layer, the output of the CNN is a vector containing the probabilities
of each class, given the input data.
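A small, numerically stable implementation of Equation (4.3) (Python/NumPy; the example scores are arbitrary):

```python
import numpy as np

def softmax(z):
    """Softmax of a score vector z: outputs normalized class probabilities."""
    z = z - np.max(z)   # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1, largest score gets the highest probability
```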
4.1.2 CNN in Music
In recent years, Convolutional Neural Networks have been increasingly applied to music related tasks, mainly due to a set of properties that they possess. Their weight sharing property allows the training of deeper architectures with a high number of parameters, making them capable of modelling the complex data contained in a musical signal. Their shift-invariance allows pattern recognition across time and frequency, providing the capability of interpreting a time-frequency representation and recognizing a pattern along each of these dimensions.
This type of neural network has been applied to several tasks of the AMT process. CNNs achieved state-of-the-art results on onset detection, as in [55]. In source separation tasks CNNs again yielded positive results: in [56] a CNN was trained to analyse a spectrogram and automatically separate the vocals from a musical mixture. Considering the classification task, CNN classifiers also achieved great results [57, 58, 59]. These classifiers were trained to analyse musical features extracted from an input signal in order to classify its musical genre.
Specifically in instrument classification, CNNs have been applied successfully, as in [60]. In the latter proposed model, the classifier received as input not only the extracted features but also the signal's spectrogram. The aforementioned examples and the harmonic sound property mentioned in Section 2.1 were the basis for the decision to design a CNN classifier to address the problem of automatically choosing a library size for the proposed Module 1 (Chapter 3).
4.2 Implementation
This module was also developed in Matlab, to facilitate the integration with the previous module and due to Matlab's fast prototyping characteristics regarding neural networks. The CNN was developed and trained using the MatConvNet Toolbox [61]. Once again, the CQT Matlab toolbox provided in [29] was used to compute the log-spectrogram.
4.2.1 Network’s Architecture and Learning Process
A CNN classifier was trained to detect notes of one of three chosen instruments in log-spectrograms of 1.2 seconds. This classifier was then applied to a musical signal as a windowed function. After a normalization process, and given a classification threshold µ, an output classification vector is generated. This vector consists of three binary outputs, each corresponding to an instrument, taking the value 1 if the instrument is present in the input signal and 0 otherwise.
Figure 4.3: Diagram of the implemented CNN’s architecture.
Table 4.1: CNN classifier’s layers and filter sizes.
Layer: Filter size Stride Padding
Convolutional layer 1 2× 3 2 1
Convolutional layer 2 2× 5 2 1
Convolutional layer 3 2× 9 1 0
Max-pooling layer 1 2× 2 2 0
Convolutional layer 4 1× 5 1 0
Convolutional layer 5 1× 11 1 0
Max-pooling layer 2 2× 2 2 0
Convolutional layer 6 2× 25 1 0
Following what is reported by several authors, such as those in [60, 62], where Convolutional Neural Networks were trained to receive raw spectrograms and then classify them, the proposed CNN also receives only raw spectrograms as input. This proved to be a challenging task, as the input data is very complex. Several attempts to design a network were made. Shallow networks did not achieve good results, as they did not have enough capacity to learn the complex data. The network that achieved the best results, and that was chosen as the classifier, is presented in Figure 4.3. It has 12 layers: 3 Convolutional layers, 1 Max-pooling layer, 2 Convolutional layers, 1 Max-pooling layer, 1 Convolutional layer, 1 Fully-connected layer, 1 Dropout layer, 1 Fully-connected layer and finally 1 Softmax layer (layers listed from the shallowest to the deepest). All Convolutional and Fully-connected layers were each followed by a ReLU non-linearity.
In Figure 4.3 the feature map sizes can be seen, and in Table 4.1 the sizes of the filters used in the Convolutional and Pooling layers are presented. In an effort to provide a better classification, the filters considered are rectangular-shaped, to maintain a high frequency resolution, as can be seen in Table 4.1.
Training data, Validation data and Test data
Table 4.2: Instruments considered in the classification task.
Index Instrument Playing Style Pitch Activity Selected Range
19 Bass Open [23 64] [29 64]
58 Oboe Normal [63 99] [63 98]
80 Violin Normal [43 89] [43 78]
Since analysing raw spectrograms is a complex task, the classifier was trained to classify among only three instruments. These instruments were chosen for their digital audio quality, their sustain capability and their distinct sound characteristics. In Table 4.2, the selected instruments are presented.
To create the training dataset, 36 distinct notes from each instrument were selected (as can be seen in Table 4.2). These notes have a duration of 1 second and are correctly labelled. The CQT was then computed, creating a spectrogram of 1 second duration containing only the selected note. Through data augmentation tools, using minimal frequency and temporal shifts, each note was multiplied, creating a full data set of 3240 log-spectrograms, each with a duration of 1.2 seconds.
The full data set was then separated into a training data set and a validation data set, ensuring that in both cases the 3 classes were always equally represented. The training data set contained 2/3 of the full data set, and the validation set contained the remainder. Also, to create a test data set, for each instrument 2 notes from outside the selected range were chosen and underwent the same process mentioned above. This generated a test data set of 60 spectrograms.
Learning
The CNN was then submitted to a learning process. Through an extensive fine-tuning process, the batch size and learning rate were set to Bt = 100 and η = 0.002 respectively. The filters were randomly initialized, and the network passed through the training and validation data sets 10 times (10 iterations). Since the network is very deep, and thus capable of modelling very complex data, the number of iterations is kept relatively low to prevent overtraining, and a dropout layer was introduced to prevent overfitting. The learning process took approximately 6 hours to complete. The trained CNN was then used to classify the test data set composed of unseen notes from the three instruments considered. The resulting test error was 23.33%: the classifier correctly classified 46 of the 60 notes contained in the test data set.
4.2.2 Pre-processing
As in Section 3.2, the log-spectrogram of the input signal was obtained using the CQT. The frequency resolution was 60 bins per octave, yielding 545 frequency bins, and the log-spectrogram is again sampled with a 40 ms step. The input signal was then sampled in 1.2-second segments, which were fed to the classifier. For each segment the classifier outputs a classification probability for each of the three classes, acting as a windowed function. The overall output is a 3 × N matrix, where N is the number of segments. Each of the 3 rows of this matrix contains the presence probability over time of the respective instrument.
4.2.3 Post-processing
In a polyphonic input signal, the notes may (and probably will) overlap in a given time frame. This may lead the classifier into error, as it was trained to detect single notes. The output matrix of the classifier's windowed-function-like process is therefore normalized by subtracting the mean classification of each component. This process aims to enhance the occurrence of the highest-probability classifications, ignoring the average classification level, which could be misleading.
The normalized output is then submitted to a classification threshold µ. If the classification output of a given class surpasses this threshold µ, the instrument is considered to be present in the input signal.
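A sketch of this post-processing step (Python/NumPy) is shown below. It assumes one plausible reading of the decision rule — an instrument is considered present if its normalized probability exceeds µ in at least one segment — which may differ in detail from the Matlab module.

```python
import numpy as np

def classify_instruments(probs, mu):
    """Decide which instruments are present from the classifier's windowed output.

    probs : 3 x N matrix of per-segment class probabilities
    mu    : classification threshold
    Returns a binary vector with one entry per instrument (1 = present, 0 = absent).
    """
    normalized = probs - probs.mean(axis=1, keepdims=True)   # subtract each class's mean
    return (normalized.max(axis=1) > mu).astype(int)         # present if any segment exceeds mu
```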
4.2.4 Performance Evaluation
In Figure 4.4 the diagram of the implemented module can be observed. Once again, a test experiment was
conducted to evaluate the module’s performance. In this Section this experiment is described.
Figure 4.4: System diagram of the developed Module 2.
In this performance evaluation, the influence of the classification threshold parameter µ on the overall classification is studied. The module classifies a test data set with different values of µ.
Test Data
The data set for this experiment consists of 30 random sound files. These sound files were created according to the same procedure as in Subsection 3.2.4, using the auxiliary MIDI module. Three levels of polyphony were considered, with 1/3 of the data set corresponding to each level. This time the songs have a duration of 20 seconds, and only three instruments were considered. These instruments correspond to the instruments that the classifier was trained to identify, but no restrictions were placed on their pitch activity range. The note durations were random values in [0.2 s, 4 s].
Metrics
In this experiment the metric considered is strict: the output classification is considered correct only when it identifies all instruments present in the input file, without false positive or false negative classifications.
Results
After classifying all the test data files with different µ values from 0 to 0.05, the accuracy of the classifier was plotted against the µ value — Figure 4.5. In this plot the influence of the µ parameter on the classification task can be visualized: a lower value of µ represents a more permissive classifier, which detects the instruments even if their presence has a small probability value, while a high value of µ represents a more conservative classifier, which only detects instruments with a higher presence probability value. The best result was achieved with µ = 0.02, resulting in 96.67% correct classifications.
Figure 4.5: Module 2 performance evaluation: Graphic of the influence of µ in the classification accuracy.
Despite the input signal's complexity, the classifier achieved a high level of accuracy. One reason for this overall performance is related to the data used for training and for evaluation. All the sound files were generated with a digital instrument library and through MIDI files. In reality, an instrument playing a note will always sound slightly different each time, which does not happen with digital instruments. The digital instruments do not vary, always performing in the same manner. This eliminates performance dynamics and small tuning deviations from the data sets considered, thus decreasing the complexity of the classification task.
The considered instruments' ranges (pitch activity) may also affect the classification process. As can be seen in Table 4.2, the instruments have different ranges and these ranges do not span the same interval. This means that some notes can be played only by two of the considered instruments, or even exclusively by one. Once again, this reduces the complexity of the classification task.
Also, considering only three instruments and providing log-spectrograms of one note at a time in the training phase provided suitable conditions to facilitate the learning stage. In the first learning attempts, log-spectrograms with several notes and longer durations were provided to the classifier. Under these conditions the learning attempts were successively unsuccessful. Thus, to achieve the presented results, only three instruments were considered and the classifier was trained with short-duration log-spectrograms containing only one note at a time.
(a) Representation of the MIDI file, denoting the ground truth. Instrument 1 is displayed in green and instrument 2 in pink. (b) Output of using the classifier as a windowed function.
Figure 4.6: On the left, the MIDI file that originated the input sound file is presented. On the right, the output of using the classifier as a windowed function is presented.
To provide better insight into the classification process, Figure 4.6 displays an example of the intermediate steps of this process. In Figure 4.6a the MIDI file that originated the input sound signal is displayed. The notes performed by instrument 1 (Class 1) are presented in green, and the notes of instrument 2 (Class 2) are presented in pink. In Figure 4.6b the output of using the classifier as a windowed function is presented, after the normalization step. Inspecting this Figure provides insight into the classification performed by this module. Note that even though only 2 instruments are performing in the input music piece, the classifier is prepared to detect the presence of the 3 instruments learnt in the training process. Using only 2 instruments in the input music piece ensures a degree of uncertainty in the classification, forcing the classifier to detect which 2 instruments are playing among the 3 possible ones. As such, in the plot of Figure 4.6b, 3 classification results are presented, with Class 3 (corresponding to instrument 3, which is not performing) achieving a low presence value, as expected.
When only one note is detected (e.g. the note of instrument 1 in the first time steps), the corresponding class has higher normalized probability values. Also, when several notes overlap, it can be seen that the classifier detects not only the correct instrument but also some low-valued "noise" classifications for the remaining instruments (e.g. detecting instrument 3 in this input file). The effect of the normalization is clear in this figure: it enhances the detected classification, even when multiple notes are being played at the same time. Both the normalization and thresholding processes help the classifier ignore these misclassification events, thus considering only the correct instruments.
Chapter 5
Hybrid System
In this Chapter, the integration of the distinct methods resulting in the developed system is addressed. The overall transcription process is detailed, considering both modules developed in this thesis. In Section 5.1 the system is detailed and the transcription process is explained from beginning to end. In Section 5.2 the performance of the system is evaluated, followed by an example of a transcription process.
5.1 System description
The complete system is composed of both Modules addressed earlier. Given an input signal, its CQT is computed
and then analysed by the classifier in the CNN module. This classifier segments the log-spectrogram
produced and identifies which of the three learnt instruments is present in each segment. Acting as a windowed
function, this produces a matrix containing the probability of the presence of each instrument in each of the segments considered.
This matrix is then normalized by removing the mean probability for each instrument. Then, if the obtained value
exceeds the classification threshold µ, the instrument is considered present in the input file. The final output of this Module
is a binary vector of three values, one for each instrument, determining whether an instrument is present (valued 1) or
not (valued 0) in the input file.
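A minimal sketch of this decision rule is given below. It assumes probs is the (instruments × windows) probability matrix produced by the windowed classifier; taking the maximum of the mean-normalized scores as the value compared against µ is an assumption made here for illustration.

import numpy as np

# Sketch of the CNN module's post-processing: remove each instrument's mean
# probability, reduce the result to one score per instrument, and threshold it
# with mu. The choice of the maximum as the per-instrument score is an
# assumption for illustration only.
def presence_vector(probs: np.ndarray, mu: float) -> np.ndarray:
    normalized = probs - probs.mean(axis=1, keepdims=True)  # mean removal per instrument
    scores = normalized.max(axis=1)                         # one score per instrument
    return (scores > mu).astype(int)                        # 1 = present, 0 = absent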
The output of Module 2 and the log-spectrogram of the input file are both received by Module 1. This Module
then uses the classification vector to determine the size of its template library. The library will only contain
templates for the instruments classified as present in the input file. The transcription is then performed as in
Chapter 3, using this dynamically set template library.
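This selection amounts to a simple filtering of the library by the classification vector; the sketch below illustrates the idea, with the dictionary structure and instrument names as hypothetical placeholders rather than the actual library format.

# Sketch of restricting the template library to the instruments flagged by the
# CNN module. The dictionary-based library and the instrument names are
# hypothetical; only the filtering idea reflects the text above.
def select_templates(full_library: dict, instrument_names: list, presence) -> dict:
    return {name: full_library[name]
            for name, present in zip(instrument_names, presence) if present}

# Example: presence vector [1, 0, 1] keeps templates for instruments 1 and 3 only.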
A diagram of the overall system can be observed in Figure 5.1. The transcription performed by this system is
autonomous, and it only depends on the hyperparameters that affect both modules. After tuning these parameters,
no human interaction is needed to perform transcription.
5.2 Performance Evaluation
To evaluate the performance of the overall method another test experiment was conducted, as for the modules
presented in Chapters 3 and 4. The aim of this experiment is to evaluate the transcription performance when the
contribution of Module 2 is considered. Thus, the transcription process considers the same hyperparameters
as in Chapter 3, while the classification threshold is varied to assess its influence on the transcription process.
Figure 5.1: Diagram of the proposed hybrid system.
Test Data
The data set created for this experiment consists of 30 random sound files based on random MIDI files created
by the auxiliary random MIDI file generator module. Again, the same three levels of polyphony were considered:
sound files with 1 instrument, with 2 instruments and with 3 instruments. Each polyphony level represents 1/3 of the
data set. The instruments considered were the 3 instruments that the classifier was trained to identify, presented
in Table 4.2. The sound files had a duration of 20 seconds and each note's duration was a random value in
[0.2 s, 4 s].
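As a rough illustration of how such test files could be produced, the sketch below generates random MIDI files with these durations using the pretty_midi library; the program numbers, pitch range and file names are assumptions, since the thesis relies on its own auxiliary random MIDI file generator module.

import random
import pretty_midi

# Sketch of a random test-file generator: 20-second pieces whose note durations
# are drawn from [0.2 s, 4 s]. The MIDI program numbers, pitch range and file
# naming are illustrative assumptions, not the thesis's auxiliary module.
def random_midi(path: str, num_instruments: int, duration: float = 20.0) -> None:
    pm = pretty_midi.PrettyMIDI()
    for program in random.sample([0, 40, 73], num_instruments):  # e.g. piano, violin, flute
        inst = pretty_midi.Instrument(program=program)
        t = 0.0
        while t < duration:
            note_len = random.uniform(0.2, 4.0)
            pitch = random.randint(48, 84)
            inst.notes.append(pretty_midi.Note(velocity=100, pitch=pitch,
                                               start=t, end=min(t + note_len, duration)))
            t += note_len
        pm.instruments.append(inst)
    pm.write(path)

# One file per polyphony level (1, 2 and 3 instruments).
for level in (1, 2, 3):
    random_midi(f"random_level{level}.mid", num_instruments=level)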
Metrics
To evaluate the system, the calculations made in the performance evaluation of Chapter 3 were again applied.
The transcription's false negative, false positive and correctly transcribed notes were calculated, using the same
parameter δ = 75%. To evaluate the overall performance, the sum of false negatives, ∑FN, and false positives,
∑FP, was divided by the total number of existing notes, N, thus generating an error measure, ε.
ε = (∑FN + ∑FP) / N        (5.1)
This formula for measuring error (Equation 5.1) is a simple arithmetic computation which
takes into account both types of errors, false negatives and false positives, and provides the ratio of the total errors
that occurred to the number of existing notes to be detected in the input music piece.
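A minimal sketch of this computation is given below, assuming the false negative and false positive counts have already been obtained with the δ = 75% duration criterion of Chapter 3; the duration check shown is only one illustrative reading of that criterion.

# Sketch of the error measure of Equation 5.1 and of the delta = 75% duration
# criterion used to decide whether a transcribed note counts as correct. The
# matching of notes (by pitch and onset) is assumed to have been done already.
def note_is_correct(true_duration: float, transcribed_duration: float, delta: float = 0.75) -> bool:
    return transcribed_duration >= delta * true_duration

def transcription_error(false_negatives: int, false_positives: int, total_notes: int) -> float:
    return (false_negatives + false_positives) / total_notes

# Example: 10 false negatives and 0 false positives among 22 existing notes
# give an error of (10 + 0) / 22 ~= 0.455, i.e. 45.5%.
print(transcription_error(10, 0, 22))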
Results
The experiment ran for the 30 files with different values of µ ∈ [0, 0.05] (with µ denoting the classification
threshold parameter introduced in the CNN Module). The overall result can be observed in Figure 5.2. Inspecting
the graph in Figure 5.2b, which plots the mean transcription error against the different µ values, it can be seen that
the best result was obtained for µ = 0.02, corresponding to an error of ε = 0.433. This µ value is similar to the
optimal µ value obtained in the performance evaluation performed in Chapter 4, as expected.
(a) Transcription error for the 3 polyphony levels considered. (b) Mean Transcription error obtained for all polyphony levels.
Figure 5.2: System’s performance evaluation: graph of the transcription error plotted against the parameter µ.
In Figure 5.2a, the mean transcription error is presented for each polyphony level. As mentioned
earlier, a low value of µ provides a permissive classifier. Observing the aforementioned graph, it can be seen that
a permissive classifier highly impacts the transcription process, especially for sound files with only one instrument.
A low value of µ will consider instruments with low probability values, which leads to considering instruments that
have been misclassified and have a residual probability value. Considering these instruments adds falsely detected
notes to the final transcription, as seen in Chapter 3. Due to the chosen metric, these falsely detected notes
are counted as errors (false positives), as well as the remaining errors generated during the transcription of the
instruments that actually are present in the music piece (with high probability values). Thus, the ratio between
all the errors considered and the number of existing notes can exceed 1. For the level 2
sound files, the error is smaller as fewer non-existing instruments are considered. For the level 3 sound files, a low
value of µ provides the lowest transcription error as it considers all three instruments, which are all performing in
the sound file.
Setting a high value of µ provides a conservative classifier that will only consider instruments with high probability
values. Once again, inspecting the graph in Figure 5.2a, the impact of a conservative classifier on the transcription
process can be observed. A high value of µ will also impact the transcription negatively. It will discard instruments
that, although having a probability value lower than µ, are present in the sound file. Discarding existing instruments
increases the number of false negative notes, as all the notes performed by the discarded instrument are not
transcribed, as seen in Chapter 3. This effect can be seen especially in level 3 sound files. The high value of µ
forces the classifier to discard multiple existing instruments, thus disregarding all their notes. These disregarded
notes, plus the transcription errors of the considered instruments, cause an error of over 100%. Again, in level 2
sound files, this impact is not so substantial, as fewer performing instruments are disregarded. For level 1 sound
files, this allows a low error, as it only considers instruments with high probability values. In these sound files,
as no other instrument is performing besides the one considered, its classification process results in a high
probability value. On the other hand, a higher value of µ can even disregard all instruments and consider no
instrument in the transcription process, resulting in the increasing error obtained for the higher values of µ, even
for level 1 sound files.
As seen in Chapter 3, using an implementation of the state-of-the-art MSSIPLCA algorithm shows that transcribing
polyphonic signals, even when the instruments are known a priori, is a very challenging task. In the performance
evaluation of the algorithm, the lowest transcription error obtained for a 1-instrument music piece was
25.76%. For 2-instrument music pieces it was 38.15% and for 3-instrument music
pieces it was 43.80%, averaging a 35.90% transcription error, always considering that the instruments being
played are known. In the proposed hybrid system no prior information regarding which instrument is playing was
used. Instead, the CNN module detects which instruments are playing, choosing the corresponding instruments
from its instrument library. Thus, this improvement, while adding autonomy to the system, also adds uncertainty.
As mentioned above, the best overall result was obtained for µ = 0.02, which can be seen in Figure 5.2b. The
obtained error of ε = 0.433 is close to the mean transcription error obtained in Chapter 3. Although
not as low as the best average error obtained (35.90%), the system can now detect which instruments are
playing among its template library, increasing the average error by 7.4 percentage points. Thus, the hybrid system achieved a
close mean transcription error, but with the added feature of automatically detecting which instruments are
performing in the input file and adapting the template library to them.
As in Chapters 3 and 4, to provide better insight into the overall transcription process of the hybrid system, an
example will be provided. In Figure 5.3, the log-spectrogram obtained by computing the CQT for the
chosen example input file is presented. This input file consists of a performance of two instruments: instrument
1 and instrument 3.
Figure 5.3: Log-spectrogram of the input file considered in the following example.
The notes played by each instrument can be seen in Figure 5.4a, where a representation of the MIDI file used
to create this sound file is displayed. The notes played by instrument 1 are displayed in green, and the notes
played by instrument 3 are displayed in pink. In Figure 5.4b the classification matrix can be observed. It can be
easily inferred that a value of µ = 0.02 corresponds to a correct classification, as only instruments 1 and 3
have probability values above this µ value.
(a) Representation of the MIDI file, denoting the ground truth. (b) Output of using the classifier as a windowed function.
Figure 5.4: On the left, the MIDI file that originated the input sound file is presented. Instrument 1 is displayed in green and instrument 3 in pink. On the right, the output of using the classifier as a windowed function is presented.
(a) On the left the ground truth and on the right the transcription obtained for instrument 1.
(b) On the left the ground truth and on the right the transcription obtained for instrument 2.
(c) On the left the ground truth and on the right the transcription obtained for instrument 3.
Figure 5.5: Transcription results for the three instruments considered with µ = 0.005.
Different transcription results with different µ values will now be presented, as a visual example of the results described
above. In Figure 5.5 the transcription results using a permissive classifier with µ = 0.005 are
displayed. In Figure 5.5a the transcription result for instrument 1 is presented, with the ground truth on the left
and the transcription output on the right. The transcription result for instrument 2 is presented in Figure 5.5b and
for instrument 3 in Figure 5.5c, both with the same layout as the results presented for instrument 1.
With a low value of µ, all three instruments are considered in the transcription process. This can be observed
by inspecting Figure 5.5b, where instrument 2 is wrongly considered, generating multiple false positive notes. These
false positives correspond to a wrong attempt to assign notes played by instrument 3 to instrument 2. As these
notes are wrongly considered, their transcription is not accurate, and instead of one long note, it creates small
segmented notes. Thus, one note wrongly assigned to an instrument can create multiple false positives. This
explains the strong negative impact of a permissive classifier on the transcription process.
In Figure 5.6 the transcription results using µ = 0.02 are displayed. As mentioned above, this value of µ ensures
that only the correct instruments are considered in the transcription process. This ensures that no false positives
are created due to wrongly assigning notes to an instrument, as can be seen in Figure 5.6b. The error in the
transcription corresponds to the false negatives created in the regular transcription process (Figures 5.6a and
5.6c).
Finally, in Figure 5.7 the transcription results using µ = 0.035 are displayed.
(a) On the left the ground truth and on the right the transcription obtained for instrument 1.
(b) On the left the ground truth and on the right the transcription obtained for instrument 2.
(c) On the left the ground truth and on the right the transcription obtained for instrument 3.
Figure 5.6: Transcription results for the three instruments considered with µ = 0.020.
This value of µ represents an excessively conservative classifier. It ignores instrument 3 in the transcription process, as can be seen in
Figure 5.7c. Although instrument 2 is correctly not considered (Figure 5.7b), ignoring instrument 3 adds
multiple false negative notes to the overall transcription.
(a) On the left the ground truth and on the right the transcription obtained for instrument 1.
(b) On the left the ground truth and on the right the transcription obtained for instrument 2.
(c) On the left the ground truth and on the right the transcription obtained for instrument 3.
Figure 5.7: Transcription results for the three instruments considered with µ = 0.035.
In Table 5.1, the numeric results of the examples presented are displayed. It can be observed that the best
transcription result is obtained with µ = 0.02, as expected. This leads to an overall transcription error of
45.5% (by Equation 5.1, (10 + 0)/22 ≈ 0.455), with 12 of the 22 considered notes being correctly transcribed. Again, this result was achieved with
δ = 75%. Thus, notes that are transcribed but do not have a duration of at least 75% of the original note are
considered wrongly transcribed.
Table 5.1: Numeric results of the provided transcription examples
µ value: 0.005 0.020 0.035
Existing Notes 22 22 22
Positive Transcriptions 7 12 8
False Negative Transcriptions 15 10 14
False Positive Transcriptions 22 0 0
Error 168.2% 45.5% 63.6%
Chapter 6
Conclusion
6.1 Achievements
In this master's thesis an Automatic Music Transcription system is proposed. This system consists of a hybrid
implementation of two distinct methods. The first method implemented is a state-of-the-art spectrogram factorization
technique developed by Benetos et al. [6], named Multi Sample Shift Invariant Probabilistic Latent Component
Analysis. This method uses a pre-extracted template library (of instruments and their notes) to perform Multi-
Pitch Detection as well as Note Tracking. The method is successfully implemented in the system's MSSIPLCA
module.
After evaluating the performance of the aforementioned module, it was found that the size of the template library
considered in the transcription process impacts the resulting transcription. Given a sound file, considering
more or fewer instruments than the existing ones (adding or removing templates) results in a worse
transcription. To address this issue and to automatically select the appropriate templates, a classifier was
designed to perform instrument identification.
The designed classifier is a Convolutional Neural Network, a Machine Learning technique. CNNs are a Deep
Learning method that excels in classification tasks and were chosen to address this task due to their shift-
invariance and shared-weights properties. Thus, a CNN was designed with 12 layers and was successfully
trained to identify individual notes of 3 distinct instruments. The proposed system was then assembled, using a
module to perform Multi-Pitch Detection and Note Tracking (MSSIPLCA module), but this time with a template
library defined by the classification output of another module containing the developed CNN (CNN module). The
system's overall result is a transcription error of 43.30%. Using only an implementation of the state-of-the-art
MSSIPLCA algorithm, with prior information regarding which instruments are present in the considered music piece,
a mean transcription error of 35.90% was achieved, showing the difficulty of transcribing polyphonic music signals.
The proposed module removes the need for this prior information regarding which instruments are playing,
while increasing the average transcription error by a small amount (7.40 percentage points).
Thus, the proposed hybrid system successfully performs an automatic transcription of a given input file. It
achieves a transcription error comparable to the transcription error presented by the method of Benetos et al.
[6]. Although it does not particularly improve the transcription error of this method, it additionally performs
Instrument Identification via a CNN. With this new task considered, the hybrid system combines two distinct
methods in order to improve the transcription process. It can now automatically determine the proper size of the
template library, by identifying the performing instruments in the input file. There is no longer the need for the
system's user to define a static template library. The system can now decide on its own which instruments are to be
considered, providing a more automatic transcription process. This shows that a hybrid approach to the AMT
task is able to improve the overall transcription process.
6.2 Future Work
Despite the successful implementation of the two aforementioned Machine Learning methods for Automatic Music
Transcription, there is still a large margin for improvement. Since it was shown that different methods can
be combined to provide better transcriptions, methods other than the ones considered in this thesis can be
combined in order to improve not only the transcription process but also the transcription error. In the author's
opinion, future improvements should focus on the classifier.
The proposed classification module is trained to identify only three instruments, due to the complexity of the data.
With more computational power, more instruments could be considered in this classification. This would allow
the system to be evaluated with increasing levels of polyphony. Another characteristic of the classifier is that it is trained
to identify isolated notes of each instrument. An approach to identifying overlapping notes of the same or distinct
instruments would be an interesting feature to add to the classifier, making its classification process more robust.
This would also remove the necessity of using the classifier as a windowed function. Considering an analog
(real-recording) data set could also be an interesting approach. As mentioned above, the digital data set is composed of digital
instruments, which perform in the exact same way every time. This removes the artist's performance skills from
the scope of the classifier's analysis, as well as small frequency changes due to different tunings. Considering an
analog data set would provide insight into the classifier's capability of dealing with real data.
Appendix A
Musical Notes
Below, the notes, their frequencies and wavelengths are displayed. This corresponds to an equal temperament
with a tuning of A4 = 440 Hz. Also, the corresponding MIDI scale number is indicated alongside the scale
considered in this thesis. The number in the note's name corresponds to the octave to which the note belongs.
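The entries in Table A.1 follow from the equal-temperament relation f(m) = 440 · 2^((m−69)/12), where m is the MIDI number; the sketch below reproduces the table, assuming a speed of sound of 345 m/s (the value consistent with the tabulated wavelengths) and mapping the considered scale to m − 20, so that A0 (MIDI 21) becomes 1.

# Sketch reproducing the values of Table A.1 under equal temperament with
# A4 = 440 Hz. The speed of sound (345 m/s) and the mapping of the
# "considered scale" to the MIDI number minus 20 are inferred from the table.
NOTE_NAMES = ["C", "C#/Db", "D", "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]
SPEED_OF_SOUND_CM_PER_S = 34500.0

def note_row(midi_number: int):
    frequency = 440.0 * 2.0 ** ((midi_number - 69) / 12.0)   # Hz
    wavelength = SPEED_OF_SOUND_CM_PER_S / frequency         # cm
    octave = midi_number // 12 - 1
    name = "/".join(f"{part}{octave}" for part in NOTE_NAMES[midi_number % 12].split("/"))
    return name, round(frequency, 2), round(wavelength, 2), midi_number, midi_number - 20

for m in range(21, 109):  # A0 (MIDI 21) up to C8 (MIDI 108)
    print(note_row(m))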
Table A.1: Notes, frequencies and wavelengths with the correspondent MIDI scale number.
Note name Frequency (Hz) Wavelength (cm) MIDI scale number Considered scale
A0 27.50 1254.55 21 1
A#0/Bb0 29.14 1184.13 22 2
B0 30.87 1117.67 23 3
C1 32.70 1054.94 24 4
C#1/Db1 34.65 995.73 25 5
D1 36.71 939.85 26 6
D#1/Eb1 38.89 887.10 27 7
E1 41.20 837.31 28 8
F1 43.65 790.31 29 9
F#1/Gb1 46.25 745.96 30 10
G1 49.00 704.09 31 11
G#1/Ab1 51.91 664.57 32 12
A1 55.00 627.27 33 13
A#1/Bb1 58.27 592.07 34 14
B1 61.74 558.84 35 15
C2 65.41 527.47 36 16
C#2/Db2 69.30 497.87 37 17
D2 73.42 469.92 38 18
D#2/Eb2 77.78 443.55 39 19
E2 82.41 418.65 40 20
F2 87.31 395.16 41 21
F#2/Gb2 92.50 372.98 42 22
G2 98.00 352.04 43 23
G#2/Ab2 103.83 332.29 44 24
A2 110.00 313.64 45 25
A#2/Bb2 116.54 296.03 46 26
B2 123.47 279.42 47 27
C3 130.81 263.74 48 28
C#3/Db3 138.59 248.93 49 29
D3 146.83 234.96 50 30
D#3/Eb3 155.56 221.77 51 31
E3 164.81 209.33 52 32
F3 174.61 197.58 53 33
F#3/Gb3 185.00 186.49 54 34
G3 196.00 176.02 55 35
G#3/Ab3 207.65 166.14 56 36
A3 220.00 156.82 57 37
A#3/Bb3 233.08 148.02 58 38
B3 246.94 139.71 59 39
C4 261.63 131.87 60 40
C#4/Db4 277.18 124.47 61 41
D4 293.66 117.48 62 42
D#4/Eb4 311.13 110.89 63 43
E4 329.63 104.66 64 44
F4 349.23 98.79 65 45
F#4/Gb4 369.99 93.24 66 46
G4 392.00 88.01 67 47
G#4/Ab4 415.30 83.07 68 48
A4 440.00 78.41 69 49
A#4/Bb4 466.16 74.01 70 50
B4 493.88 69.85 71 51
C5 523.25 65.93 72 52
C#5/Db5 554.37 62.23 73 53
D5 587.33 58.74 74 54
D#5/Eb5 622.25 55.44 75 55
E5 659.25 52.33 76 56
F5 698.46 49.39 77 57
F#5/Gb5 739.99 46.62 78 58
G5 783.99 44.01 79 59
G#5/Ab5 830.61 41.54 80 60
A5 880.00 39.20 81 61
A#5/Bb5 932.33 37.00 82 62
B5 987.77 34.93 83 63
C6 1046.50 32.97 84 64
C#6/Db6 1108.73 31.12 85 65
D6 1174.66 29.37 86 66
D#6/Eb6 1244.51 27.72 87 67
E6 1318.51 26.17 88 68
F6 1396.91 24.70 89 69
F#6/Gb6 1479.98 23.31 90 70
G6 1567.98 22.00 91 71
G#6/Ab6 1661.22 20.77 92 72
A6 1760.00 19.60 93 73
A#6/Bb6 1864.66 18.50 94 74
B6 1975.53 17.46 95 75
C7 2093.00 16.48 96 76
C#7/Db7 2217.46 15.56 97 77
D7 2349.32 14.69 98 78
D#7/Eb7 2489.02 13.86 99 79
E7 2637.02 13.08 100 80
F7 2793.83 12.35 101 81
F#7/Gb7 2959.96 11.66 102 82
G7 3135.96 11.00 103 83
G#7/Ab7 3322.44 10.38 104 84
A7 3520.00 9.80 105 85
A#7/Bb7 3729.31 9.25 106 86
B7 3951.07 8.73 107 87
C8 4186.01 8.24 108 88
Bibliography
[1] M. Piszczalski and B. A. Galler, “Automatic music transcription,” Computer Music Journal, vol. 1, no. 4, pp.
24–31, 1977.
[2] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic music transcription: challenges
and future directions,” Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, 2013.
[3] J. Carbonell, R. Michalski, and T. Mitchell, An Overview of Machine Learning. Springer, 1983.
[4] N. Bertin, R. Badeau, and G. Richard, “Blind signal decompositions for automatic transcription of polyphonic
music: NMF and K-SVD on the benchmark,” ICASSP, IEEE International Conference on Acoustics, Speech
and Signal Processing - Proceedings, vol. 1, 2007.
[5] E. Vincent, N. Bertin, and R. Badeau, “Adaptive harmonic spectral decomposition for multiple pitch estima-
tion,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
[6] E. Benetos, S. Ewert, and T. Weyde, “Automatic transcription of pitched and unpitched sounds from poly-
phonic music,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Pro-
ceedings, no. May, pp. 3107–3111, 2014.
[7] B. Fuentes, R. Badeau, and G. Richard, “Adaptive harmonic time-frequency decomposition of audio using
shift-invariant PLCA,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
- Proceedings, no. 1, pp. 401–404, 2011.
[8] A. Elowsson and A. Friberg, “Polyphonic Transcription with Deep Layered Learning,” MIREX 2014, no. of 52,
pp. 25–26, 2014.
[9] G. E. Poliner and D. P. W. Ellis, “A discriminative model for polyphonic piano transcription,” Eurasip Journal
on Advances in Signal Processing, vol. 2007, pp. 1–16, 2007.
[10] S. A. Raczyński, N. Ono, and S. Sagayama, “Note detection with dynamic bayesian networks as a post-
analysis step for nmf-based multiple pitch estimation techniques,” in 2009 IEEE Workshop on Applications
of Signal Processing to Audio and Acoustics, Oct 2009, pp. 49–52.
[11] A. Dessein, A. Cont, and G. Lemaitre, “Real-time polyphonic music transcription with non-negative matrix
factorization and beta-divergence,” International Conference on Music Information Retrieval, no. 5, pp. 3–5,
2010.
[12] J. Shen, J. Shepherd, and A. H. H. Ngu, “Towards effective content-based music retrieval with multiple
acoustic feature combination,” IEEE Transactions on Multimedia, vol. 8, no. 6, pp. 1179–1189, Dec 2006.
[13] C. N. S. Jr., A. L. Koerich, and C. A. A. Kaestner, “Feature selection in automatic music genre classification,”
in Multimedia, 2008. ISM 2008. Tenth IEEE International Symposium on, Dec 2008, pp. 39–44.
[14] F. Zheng, G. Zhang, and Z. Song, “Comparison of different implementations of MFCC,” Journal of Computer
Science and Technology, vol. 16, no. 6, pp. 582–589, 2001.
[15] S. Essid, G. Richard, and B. David, “Musical instrument recognition by pairwise classification strategies,”
IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1401–1412, 2006.
[16] E. J. Humphrey, J. P. Bello, and Y. LeCun, “Moving Beyond Feature Design: Deep Architectures and Auto-
matic Feature Learning in Music Informatics,” International Society for Music Information Retrieval Confer-
ence (ISMIR), pp. 403–408, 2012.
[17] L. Deng and D. Yu, “Deep learning: Methods and applications,” Foundations and Trends in Signal Process-
ing, vol. 7, no. 3-4, pp. 197–387, 2013.
[18] P. Hamel, S. Wood, and D. Eck, “Automatic Identification of Instrument Classes in Polyphonic and Poly-
Instrument Audio.” International Society for Music Information Retrieval Conference (ISMIR), pp. 399–404,
2009.
[19] P. Hamel and D. Eck, “Learning Features from Music Audio with Deep Belief Networks,” International Society
for Music Information Retrieval Conference (ISMIR), pp. 339–344, 2010.
[20] T. L. H. Li, A. B. Chan, and A. H. W. Chun, “Automatic Musical Pattern Feature Extraction Using Convolutional
Neural Network,” Proceedings of the International MultiConference of Engineers and Computer Scientists,
vol. I, no. November, pp. 546–550, 2010.
[21] R. Burton, “The elements of music: What are they and who cares?” [Online]. Available:
http://asme2015.com.au/the-elements-of-music-what-are-the-and-who-cares/
[22] N. Saint-arnaud and K. Popat, “Analysis and Synthesis of Sound Textures,” Readings in Computational
Auditory Scene Analysis, pp. 125–131, 1995.
[23] B. L. Róisín, “Musical Instrument Identification with Feature Selection Using Evolutionary Methods,” Ph.D.
dissertation, University of Limerick, 2009.
[24] C. J. Plack, R. R. Fay, A. J. Oxenham, and A. N. Popper, Pitch: Neural Coding and Perception. Springer,
2005, vol. 24.
[25] N. Lenssen and D. Needell, “An Introduction to Fourier Analysis with Applications to Music,” Journal of
Humanistic Mathematics, vol. 4, no. 1, pp. 72–91, 2014.
[26] A. N. S. Institute, M. Sonn, and A. S. of America, American National Standard Psychoacoustical Terminol-
ogy. American National Standards Institute, 1973.
[27] J. C. Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of
America, vol. 89, no. January 1991, p. 425, 1991.
[28] S. S. Stevens and J. Volkmann, “The relation of pitch to frequency: A revised scale,” The American Journal of Psychology,
vol. 53, no. 3, pp. 329–353, 1940.
[29] C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox for music processing,” 7th Sound and Music
Computing Conference, no. JANUARY, pp. 3–64, 2010.
[30] J. C. Brown, “An efficient algorithm for the calculation of a constant Q transform,” The Journal of the Acous-
tical Society of America, vol. 92, no. 5, p. 2698, 1992.
[31] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for acoustic modeling,” Ad-
vances in models for acoustic . . . , no. 1, 2006.
[32] T. Hofmann, “Probabilistic latent semantic indexing,” Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in information retrieval, pp. 50–57, 1999.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM
algorithm,” Journal of the Royal Statistical Society Series B Methodological, vol. 39, no. 1, pp. 1–38, 1977.
[34] M. Blume, “Expectation maximization: A gentle introduction,” Technical University of Munich-Institute for
Computer Science Press: Munich, Germany, 2002.
[35] DL4J, “Introduction to deep neural networks.” [Online]. Available: http://deeplearning4j.org/
neuralnet-overview.html#element
[36] M. Nielsen, “Neural networks and deep learning.” [Online]. Available: http://neuralnetworksanddeeplearning.
com/
[37] Y. Bengio, Learning Deep Architectures for AI. Now Publishers Inc., 2009, vol. 2, no. 1.
[38] U. de Montréal, “Introduction to gradient-based learning.” [Online]. Available: http://www.iro.umontreal.ca/
~pift6266/H10/notes/gradient.html
[39] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,”
Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[40] E. Benetos and S. Dixon, “A shift-invariant latent variable model for automatic music transcription,” Computer
Music Journal, vol. 36, no. 4, pp. 81–94, 2012.
[41] E. Benetos, S. Cherla, and T. Weyde, “An Efficient Shift-Invariant Model for Polyphonic Music Transcription,”
Proceedings of the 6th International Workshop on Machine Learning and Music, 2013.
[42] K. Schutte, “Midi toolbox.” [Online]. Available: http://kenschutte.com/midi
[43] M. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic latent variable models as nonnegative factorizations.”
Computational intelligence and neuroscience, vol. 2008, p. 947438, 2008.
[44] G. Grindlay, “Nmflib toolbox.” [Online]. Available: http://www.ee.columbia.edu/~grindlay/code.html#NMFlib
[45] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s
visual cortex,” The Journal of Physiology, vol. 160, no. 1, pp. 106–154, 1962.
[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[47] Stanford CS class, “Cs231n convolutional neural networks for visual recognition.” [Online]. Available:
http://cs231n.github.io/neural-networks-1/
[48] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, High Performance
Convolutional Neural Networks for Image Classification,” Ijcai, pp. 1237–1242, 2011.
[49] D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for
object recognition,” in International Conference on Artificial Neural Networks. Springer, 2010, pp. 92–101.
[50] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” Proceedings of
the 27th International Conference on Machine Learning, no. 3, pp. 807–814, 2010.
[51] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural net-
works,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[52] I. V. Tetko, D. J. Livingstone, and A. I. Luik, “Neural network studies. 1. comparison of overfitting and over-
training,” Journal of Chemical Information and Computer Sciences, vol. 35, no. 5, pp. 826–833, 1995.
[53] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout : A Simple Way
to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research (JMLR), vol. 15, pp.
1929–1958, 2014.
[54] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets.” in AISTATS, vol. 2, no. 3,
2015, p. 6.
[55] J. Schlüter and S. Böck, “Improved Musical Onset Detection with Convolutional Neural Networks,”
Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP 2014), 2014.
[56] A. J. R. Simpson, G. Roma, and M. D. Plumbley, “Deep karaoke: Extracting vocals from musical mixtures
using a convolutional deep neural network,” Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9237, pp. 429–436, 2015.
[57] T. Nakashika, C. Garcia, T. Takiguchi, and I. D. Lyon, “Local-feature-map Integration Using Convolutional
Neural Networks for Music Genre Classification,” Interspeech, pp. 1–4, 2012.
[58] S. Dieleman, P. Brakel, and B. Schrauwen, “Audio-based music classification with a pretrained convolutional
network,” . . . International Society for Music . . . , pp. 669–674, 2011.
[59] T. L. H. Li, A. B. Chan, and A. H. W. Chun, “Automatic Musical Pattern Feature Extraction Using Convolutional
Neural Network,” Proceedings of the International MultiConference of Engineers and Computer Scientists,
vol. I, no. November, pp. 546–550, 2010.
[60] T. Park and T. Lee, “Musical instrument sound classification with deep convolutional neural network using
feature fusion approach,” arXiv:1512.07370 [cs], 2015.
[61] A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd
ACM international conference on Multimedia. ACM, 2015, pp. 689–692.
[62] D. Nouri, “Using deep learning to listen for whales.” [Online]. Available: http://danielnouri.org/notes/2014/01/
10/using-deep-learning-to-listen-for-whales/