Recognition of musical symbols in scores using neural networks
A Degree Thesis
Submitted to the Faculty of the
Escola Tècnica d'Enginyeria de Telecomunicació de Barcelona
Universitat Politècnica de Catalunya
by
Jordi Burgués Miró
In partial fulfilment
of the requirements for the degree in
AUDIOVISUAL SYSTEMS ENGINEERING
Advisors: Jaakko Lehtinen,
Josep Ramon Casas Pla
Barcelona, June 2019
Abstract
Object detection is nowadays present in many aspects of our lives. From security to entertainment, its applications play a key role in the worlds of computer vision and image processing.
This thesis addresses, through the use of an object detector, the creation of an application that allows its user to play a music score. The main goal is to display a digital music score and be able to play it by touching its notes.
In order to achieve the proposed system, deep learning techniques based on neural networks are
used to detect musical symbols from a digitized score and infer their position along the staff lines.
Different models and approaches are considered to tackle the main objective.
Resum
Nowadays, object detection is present in many aspects of our lives. From security-related applications to entertainment tools, object detection plays a key role in the world of computer vision and image processing.
This thesis addresses the creation of an application that allows its user to play a music score, through the use of an object detector. The main goal is to display a digital music score on screen and make it sound by touching its notes.
In order to achieve the proposed system, deep learning techniques based on neural networks are used to detect musical symbols in a score and find their position with respect to the staff lines. Different models and approaches have been considered to tackle the main objective.
Resumen
Nowadays, object detection is present in many aspects of our lives. From security-related applications to entertainment tools, object detection plays a key role in the world of computer vision and image processing.
This thesis addresses the creation of an application that allows its user to play a music score, by means of an object detector. The main goal is to display a digital music score on screen and make it sound by touching its notes.
In order to achieve the proposed system, deep learning techniques based on neural networks have been used to detect musical symbols in a score and find their position along the staff lines. Different models and approaches have been considered to achieve the main objective.
To everyone who made this possible
That you are here—that life exists and identity,
That the powerful play goes on, and you may
contribute a verse.
Walt Whitman
Acknowledgements
I would like to thank my thesis advisors, Jaakko Lehtinen and Josep Ramon Casas Pla, for their guidance during the different stages of this project.
Revision history and approval record
Revision Date Purpose
0 12/05/2019 Document creation
1 3/06/2019 Document revision
2 24/06/2019 Final document revision
DOCUMENT DISTRIBUTION LIST
Name e-mail
Jordi Burgués [email protected]
Jaakko Lehtinen [email protected]
Josep Ramon Casas Pla [email protected]
Written by: Jordi Burgués (Project Author), 24/06/2019
Reviewed and approved by: Josep Ramon Casas (Project Supervisor), 24/06/2019
Table of contents
Abstract ............................................................................................................................................. 1
Resum ................................................................................................................................................ 2
Resumen ............................................................................................................................................ 3
Acknowledgements ........................................................................................................................... 5
Revision history and approval record ............................................................................................... 6
Table of contents ............................................................................................................................... 7
List of Figures .................................................................................................................................... 9
List of Tables ................................................................................................... 11
1. Introduction............................................................................................................................. 12
1.1. Statement of purpose ..................................................................................................... 13
1.2. Requirements and specifications .................................................................................... 13
1.3. Methods and procedures ................................................................................................ 14
1.4. Work plan ........................................................................................................................ 14
1.5. Incidents and modifications .......................................................................... 15
2. State of the art ........................................................................................................................ 16
2.1. Optical Music Recognition ............................................................................................... 16
2.2. State-of-the-art object detection algorithms .................................................................. 16
2.2.1. Convolutional Neural Networks (CNNs) and Region-based CNNs (R-CNNs) ........... 17
2.2.1.1. R-CNN ..................................................................................................................... 18
2.2.1.2. Fast R-CNN .............................................................................................................. 19
2.2.1.3. Faster R-CNN .......................................................................................................... 19
2.3. Datasets for OMR systems .............................................................................................. 19
3. Methodology ........................................................................................................................... 22
3.1. Data preparation ............................................................................................................. 22
3.1.1. Musical symbol dataset ........................................................................................... 22
3.1.2. Synthetic data generation with Python ................................................................... 23
3.2. Object detector ............................................................................................................... 25
3.2.1. Tensorflow Object Detection API ............................................................................ 25
3.2.1.1. Training and validation ........................................................................................... 26
3.2.1.2. Testing .................................................................................................................... 27
3.3. App integration ............................................................................................................... 29
4. Results ..................................................................................................................................... 31
4.1. Evaluation metrics ........................................................................................................... 31
4.2. Training, validation and testing of models ...................................................................... 32
4.2.1. First object detection model ................................................................................... 32
4.2.2. Second object detection model .............................................................................. 34
4.3. Application ...................................................................................................................... 40
5. Budget ..................................................................................................................................... 42
6. Conclusions and future development ..................................................................................... 43
Bibliography .................................................................................................................................... 44
Appendices ...................................................................................................................................... 48
Glossary ........................................................................................................................................... 54
List of Figures
Figure 1. Object detection example ................................................................................................ 12
Figure 2. Work packages breakdown structure .............................................................................. 14
Figure 3. Gantt diagram .................................................................................................................. 15
Figure 4. OMR system structure...................................................................................................... 16
Figure 5. CNN architecture .............................................................................................................. 17
Figure 6. Convolution operation ..................................................................................................... 17
Figure 7. MNIST and CIFAR-10 pictures example ............................................................................ 17
Figure 8. Region proposals example ............................................................................................... 18
Figure 9. R-CNN working stages ...................................................................................................... 18
Figure 10. Fast R-CNN architecture ................................................................................................. 19
Figure 11. OMR datasets example .................................................................................................. 20
Figure 12. Bounding boxes example ............................................................................................... 20
Figure 13. Incipit from PrIMuS dataset ........................................................................................... 21
Figure 14. Object detector input (left) and output (right) .............................................................. 21
Figure 15. Breakdown structure of the main parts of the project .................................................. 22
Figure 16. DeepScores classes (left) and a crop from one of the dataset scores (right) ................ 22
Figure 17. PASCAL-VOC format example......................................................................................... 23
Figure 18. Bounding boxes example ............................................................................................... 23
Figure 19. Fragments from rendered music scores ........................................................................ 23
Figure 20. Mapped positions along the staff lines .......................................................................... 24
Figure 21. Original bounding boxes (left) and flattened bounding boxes (right) ........................... 24
Figure 22. Data augmentation examples ........................................................................................ 25
Figure 23. Object detector scheme ................................................................................................. 25
Figure 24. Random crop example ................................................................................................... 26
Figure 25. Training (a) and validation (b) processes going on ........................................................ 27
Figure 26. Object detector input (left) and output (right) example ............................................... 27
Figure 27. Different predicted class in two random crops .............................................................. 27
Figure 28. Non-overlapping crops problem .................................................................................... 28
Figure 29. Non-random overlapping crops ..................................................................................... 28
Figure 30. XML file fragment created after testing ......................................................................... 29
Figure 31. Color map example ........................................................................................................ 29
Figure 32. Application steps ............................................................................................................ 30
Figure 33. Original crop (left) and during validation (right) ............................................................ 33
Figure 34. mAP@0.5 graph examples ....................................................................................... 33
Figure 35. Original test images (left) and detections made by the model (right) ........................... 33
Figure 36. "Notes over notes" areas ............................................................................................... 34
Figure 37. Validation set examples from training A (left) and B (right) .......................................... 35
Figure 38. Testing difference between trainings A and B ............................................................... 35
Figure 39. Ground truth (top) and test output examples from trainings B (left) and C (right) ....... 36
Figure 40. Ground truth (top) and test output examples from trainings B (left) and C (right) ....... 36
Figure 41. Validation set examples in low (left) and high (right) density areas .............................. 36
Figure 42. Validation set examples from trainings C (left) and D (right) ........................................ 37
Figure 43. Ground truth (top) and test output examples from trainings C (left) and D (right) ...... 37
Figure 44. Remember me (left), Mia and Sebastian (center) and La La Land intro (right) examples
......................................................................................................................................................... 38
Figure 45. Initial and score selection screens ................................................................................. 40
Figure 46. Transition screen with brief description of the selected music score ........................... 40
Figure 47. Playable score screen with key signature and clef setup ............................................... 41
Figure 48. Informative messages from key signature and clef setup ............................................. 41
Figure 49. Real test output from Figure 35 ..................................................................................... 52
Figure 50. Real test output from Figure 39 ..................................................................................... 52
Figure 51. Real test output from Figure 40 ..................................................................................... 52
Figure 52. Real test output from Figure 43 ..................................................................................... 53
List of Tables
Table 1. Color mapping ................................................................................................................... 29
Table 2. First object detection model trainings .............................................................................. 32
Table 3. mAP@0.5 in the validation set .................................................................................... 32
Table 4. Second object detection model trainings .......................................................................... 34
Table 5. mAP@0.5 in the validation set ..................................................................................... 34
Table 6. Evaluation metrics for different real tested music scores................................................. 38
Table 7. Hardware resources costs ................................................................................................. 42
Table 8. Software resources costs ................................................................................................... 42
Table 9. Human resources costs ..................................................................................................... 42
Table 10. Work packages ................................................................................................................ 51
Table 11. Milestones ....................................................................................................................... 51
1. Introduction
Nowadays, object detection plays a key role in a variety of applications and is one of the rising technologies in the fields of computer vision and image processing. The need to detect objects for different purposes is present in many useful systems, such as video surveillance, self-driving cars, handwritten digit recognition, and face detection. Most of these systems provide a service that is noticeably less time-consuming than if the task were carried out by humans.
The primary goal of object detection is to localize a certain object in a given picture and mark it with the appropriate category. In Figure 1 we can see a very common example of one of its applications, in which cars, trucks and people are detected during a traffic jam [1].
Modern object detection is achieved with machine learning approaches known as Region-based Convolutional Neural Networks (R-CNNs, [2]), detailed in the following sections. The newest techniques analyze the input images directly, in contrast to older approaches that relied on segmentation and pre-processing algorithms to carry out these tasks.
This work presents an object detection-based system structured in two different models that, respectively, identify musical symbols and the positions of music figures along the staff lines in digital music scores. Previous work has been carried out in the field, such as that of Pacha et al. [3]. However, our project adopts an approach to note detection different from those of related work.
In order to accomplish the main objectives, parts of the Tensorflow Object Detection API open source framework [4, 5] (Tensorflow creators, 2017) are adapted to the different models considered in this work.
All in all, this thesis presents an end-to-end system that comprises the creation of a user-accessible prototype application showing the achieved object detection results on music scores, as an entertainment and demonstration tool.
Figure 1. Object detection example
1.1. Statement of purpose
The main idea of this project is to build a prototype consisting of an accessible interface that allows its user to play notes from digital music scores. The application acts as a platform with a collection of music scores that can be played by touching the screen of the device on which it runs.
This project tackles one of the stages that make up an Optical Music Recognition (OMR) system, whose main objective is to turn a digitized score into a machine-readable format. An OMR process involves recognizing the notes in a score so that they can later be played, and this is the main task undertaken in this project.
The primary goal of this thesis is the creation of a new deep learning-based model to detect the positions along the staff lines where the different musical figures lie in a score. Once detection has been carried out, a mapping between each detected position and the corresponding note is performed. The whole system is integrated so that the results can be interactively checked offline (detection is not performed live) through the final application.
The whole detection system is based on an object detection model built with the Tensorflow deep learning framework that uses the so-called Region-based Convolutional Neural Networks (R-CNNs). Several R-CNN-based object detection models exist, and one of them has been chosen to accomplish the final objective. The core work is structured in two different models, covering pure symbol detection and position detection. Both models use a neural network pre-trained on another dataset, provided by Tensorflow.
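As a rough illustration of the position-to-note mapping described above, a detected vertical position relative to the staff lines can be translated into a pitch with a simple lookup table. The indices and note names below are illustrative (they assume a treble clef) and are not the actual encoding used in the project:

```python
# Hypothetical mapping from detected staff position to pitch (treble clef).
# Position 0 = bottom staff line, counting lines and spaces upwards.
TREBLE_POSITIONS = ["E4", "F4", "G4", "A4", "B4", "C5", "D5", "E5", "F5"]

def position_to_note(position):
    """Translate a detected staff-position index into a note name."""
    if 0 <= position < len(TREBLE_POSITIONS):
        return TREBLE_POSITIONS[position]
    return None  # ledger lines would need an extended table

print(position_to_note(0))  # E4 (bottom line of the treble staff)
print(position_to_note(4))  # B4 (middle line)
```

Accidentals, clef changes and ledger lines would require a richer mapping, which is why the final application also handles key signature and clef setup.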
1.2. Requirements and specifications
This project considers a series of requirements to be fulfilled during its development. The most important is to adapt a pre-trained model that takes a digital image of a music score as input and outputs the same image with bounding boxes highlighting the detected musical symbols or positions (for the sake of visualization).
Regarding the evaluation of the system, training, validation and testing are performed on a chosen or generated dataset, with satisfactory results defined as at least 75% precision on the validation set. Once this is achieved, a user-accessible interface is designed and implemented that uses the system's capabilities for both detection and recognition of positions in digital music scores. The app should run without delays and deliver instant playing of the desired music scores.
As for the specifications, the Python and Java programming languages are used, as well as the Tensorflow deep learning framework. These resources are used to train the chosen pre-trained model from scratch. In addition, the generation of a suitable dataset for the project is considered, with splits for training, validation and testing along with their annotations (in PASCAL-VOC format).
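For reference, a PASCAL-VOC annotation is an XML file per image with one `object` entry per bounding box. The sketch below builds such a file with Python's standard library; the filename and class name are hypothetical examples, not taken from the actual dataset:

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, objects):
    """Build a minimal PASCAL-VOC style annotation tree.
    `objects` is a list of (class_name, xmin, ymin, xmax, ymax)."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    for name, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin),
                         ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ann

# One notehead annotated on a rendered score fragment (illustrative values)
tree = voc_annotation("score_0001.png", 1200, 300,
                      [("noteheadBlack", 412, 118, 430, 136)])
xml_text = ET.tostring(tree, encoding="unicode")
```

A file like this is produced for every generated score image, so the detector's input pipeline can read boxes and labels in a standard format.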
1.3. Methods and procedures
This work covers the research and creation of suitable datasets to tackle the main problem, using the resources provided by the Tensorflow Object Detection API, which is explained in further detail in the following sections. In the end, the detection results offered by the API are integrated into a user-accessible application that acts as a platform for showing the results.
The different parts of the project have been developed in Python and Java, used for the object detection tasks and for app development respectively. The Tensorflow deep learning framework has been used for the object detection tasks, which rely on pre-trained Faster Region-based Convolutional Neural Networks (Faster R-CNNs).
The pre-trained models have been trained from scratch on an NVIDIA Quadro P500 GPU.
All the work has been developed at the School of Science of Aalto University in Espoo, Finland.
1.4. Work plan
Figure 2 shows the breakdown structure followed in this project, divided into six different work packages. The specification of every WP can be found in Appendix 1 in the Appendices section. In addition, Figure 3 shows the Gantt diagram with the final schedule followed during the project.
Figure 2. Work packages breakdown structure
Figure 3. Gantt diagram
1.5. Incidents and modifications
There have been no major incidents to report. The main modifications have only affected the schedule of the different work packages, which in some cases took longer than planned. However, the original plan was conceived with potential incidents in mind, so, overall, the project has not been affected by these delays.
2. State of the art
This part summarizes the literature behind the topic of Optical Music Recognition (OMR): from a brief description of what an OMR system is, to the neural networks used for modern object detection, as well as examples of related work.
2.1. Optical Music Recognition
An Optical Music Recognition system is a process that aims to turn a digitized score into a machine-readable format [6]. Several approaches have been considered to transform a music score into a playable file [7], [8], and one of the challenges when building this kind of system is how to interpret sheet music and understand the musical context. Figure 4 shows a typical pipeline for an OMR process [8].
Figure 4. OMR system structure
As seen in Figure 4, an OMR system takes a digitized music score as input. In order to improve the quality of the input image, the most common pre-processing tasks rely on enhancement, noise removal or morphological operations. Regarding symbol recognition and classification, one of the main techniques traditionally used consists of isolating the primitive elements through image segmentation or through detection and removal of the staff lines [9]. The last steps combine the recognized musical symbols and, along with musical context rules, turn all the information into a final representation to be played by a machine.
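As a minimal illustration of the pre-processing stage, a grayscale scan can be binarized with a global threshold so that ink pixels are separated from the paper background. This is only a sketch, and the threshold value is an assumption; real systems typically use adaptive methods (e.g. Otsu's) and proper noise removal:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Global-threshold binarization: ink (dark pixels) -> 1, paper -> 0."""
    return (gray < threshold).astype(np.uint8)

# A toy 4x4 "scan": one dark horizontal line on a light background
scan = np.array([[250, 250, 250, 250],
                 [ 30,  40,  35,  25],   # a dark line (e.g. a staff line)
                 [250, 250, 250, 250],
                 [250, 120, 250, 250]])  # one speck of noise
binary = binarize(scan)
print(binary.sum())   # 5 ink pixels (the line plus the speck)
```

Subsequent morphological operations would then remove the isolated speck while preserving the connected line.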
However, current results from OMR systems are far from ideal, and new techniques are arising in the field of object detection with the appearance of powerful algorithms based on Region-based Convolutional Neural Networks (R-CNNs, [2]).
2.2. State-of-the-art object detection algorithms
This section covers the most important aspects of the neural networks currently used for state-of-the-art object detection. The primary difference from traditional object detection is that no segmentation is performed on the input image: the neural network analyzes the whole image directly (i.e. it is fed the actual pixel values).
Figure 6. Convolution operation
2.2.1. Convolutional Neural Networks (CNNs) and Region-based CNNs (R-CNNs)
Object detection mainly relies on Convolutional Neural Networks (CNNs). A CNN is a multi-layer deep neural network [10] consisting of convolutional layers with filters (known as kernels), fully connected (FC) layers and non-linear functions (such as ReLU), which classifies an object with probability values between 0 and 1. Figure 5 shows the architecture of a CNN [11].
Convolutional layers are used to extract features from the input image. The operation performed here is a convolution between a small portion of the input image (I) and a filter (K), producing a feature map [12]. Figure 6 shows the procedure followed in this layer. After this, a non-linear function is applied, because a purely linear model would lead to very poor learning outcomes.
After the non-linearity is applied, a pooling layer is used to reduce dimensionality when input images are too large. Finally, a fully connected (FC) layer flattens the feature map into a vector and the output layer predicts the class, using logistic regression with cost functions to classify the images.
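The steps above can be sketched in a few lines: a "valid" convolution of an input I with a kernel K (implemented, as in CNN practice, as cross-correlation) followed by a ReLU non-linearity. This is a plain numpy illustration, not the network used in this work:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in CNN layers):
    slide the kernel K over the input I and sum elementwise products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # "valid" output size
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A 5x5 input and a 3x3 averaging kernel give a 3x3 feature map
I = np.arange(25, dtype=float).reshape(5, 5)
K = np.ones((3, 3)) / 9.0
feature_map = conv2d(I, K)
relu = np.maximum(feature_map, 0)       # the non-linearity applied afterwards
print(feature_map.shape)                # (3, 3)
```

Stacking many such filtered-and-rectified maps, then pooling and flattening them, yields the vector that the FC layer classifies.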
It is important to highlight the difference between a CNN and an R-CNN. A CNN is usually used for image classification, involving only one object per image. For this reason, CNNs alone cannot address multiple object detection, since the number of appearances of the objects of interest is not fixed. In Figure 7 we can see two of the most common examples related to CNN classifiers: handwritten digits and object classification, on the MNIST [13], [14] and CIFAR-10 [15] datasets respectively. The images contain a single object of interest occupying almost all the space in the picture. Because of this, the main target of a CNN is to assess what kind of object is present in an image, but not where it is.
Figure 5. CNN architecture
Figure 7. MNIST and CIFAR-10 pictures example
In contrast, an R-CNN passes several region proposals through a CNN [2]. These region proposals contain the objects of interest that need to be classified, and since different objects can lie within different region proposals, they can all be detected at the same time. The usual outputs of an R-CNN are the bounding boxes highlighting the locations of the detected objects, along with their labels (predicted classes). In Figure 8 [16] we can see the process, showing the initial region proposals (blue) and the final bounding boxes corresponding to detected objects (green).
As a result, algorithms based on R-CNNs have been developed to speed up both the localization and the recognition of a series of objects. The following sections briefly explain the newest ones and how they work.
2.2.1.1. R-CNN
The first network of its kind gave these models their general name: Region-based Convolutional Neural Network (R-CNN, [17]). It is based on selective search, which consists of extracting a fixed number of region proposals from the original image. The algorithm behind an R-CNN is illustrated in Figure 9: a CNN is fed the extracted region proposals and obtains the most important features. These features are then fed into a classifier that evaluates the presence of the object in the current region proposal.
However, this first type of R-CNN presents several problems: a huge amount of training time (a fixed number of proposals has to be classified for each image), no real-time capability, since it takes more than 45 seconds to test an image, and no learning in the selective search stage, since it is a fixed algorithm (which can produce bad region proposals).
Figure 8. Region proposals example
Figure 9. R-CNN working stages
2.2.1.2. Fast R-CNN
With the objective of speeding up and simplifying the R-CNN architecture, a new type of network was created to improve test time [18].
The main difference with respect to its predecessor is that the whole input image is fed into the CNN and a convolutional feature map is generated. Region proposals are then inferred from this map and converted into fixed-size squares through a Region of Interest (RoI) pooling layer. A RoI pooling layer applies max pooling (down-sampling to reduce dimensionality) to its inputs (convolutional feature maps) and produces a small feature map of a fixed size (e.g. 7x7) that is fed into the fully connected (FC) layer.
This network is faster than a standard R-CNN because region proposals are not fed to the CNN one by one. Instead, the convolution is performed only once per image and a feature map is generated from it. Thanks to this, it takes only 2 seconds to test an image.
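The RoI pooling operation described above can be illustrated with a small numpy sketch: whatever the size of the region cropped from the feature map, the output has a fixed size so it can always be fed to the FC layer. The bin boundaries here are a simplified approximation of the real layer:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """RoI max pooling sketch: crop a region of interest from a
    convolutional feature map and max-pool it to a fixed output size."""
    y0, x0, y1, x1 = roi                      # region in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    oh, ow = output_size
    h, w = region.shape
    out = np.zeros((oh, ow))
    # Split the region into an oh x ow grid of (roughly equal) bins
    ys = np.linspace(0, h, oh + 1).astype(int)
    xs = np.linspace(0, w, ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            bin_ = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = bin_.max()
    return out

fmap = np.random.rand(32, 32)                  # a toy feature map
pooled = roi_max_pool(fmap, (4, 6, 25, 30))    # any RoI -> fixed 7x7 output
print(pooled.shape)                            # (7, 7)
```

Because the output size is fixed, the same FC layers can classify proposals of arbitrary shape.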
2.2.1.3. Faster R-CNN
Even though test time improved significantly from R-CNN to Fast R-CNN, both models use selective search to find region proposals. This algorithm is slow and time-consuming and limits the performance of the network. For this reason, Shaoqing Ren et al. [19] created an object detection algorithm that eliminates selective search and lets the network learn the region proposals.
As in the Fast R-CNN, the whole image is the input of the CNN and a convolutional feature map is created. The main difference is that a new network, called the Region Proposal Network (RPN), is used to predict the region proposals. The predicted regions are resized with a RoI pooling layer, which is then used to classify the image within each proposed region and to predict the values of the bounding boxes. In these networks, test time is around 0.2 seconds per image.
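Matching predicted boxes against ground truth, both when training the RPN and in evaluation metrics based on an IoU threshold, relies on the Intersection over Union of two bounding boxes. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((ax1 - ax0) * (ay1 - ay0)
             + (bx1 - bx0) * (by1 - by0) - inter)
    return inter / union if union else 0.0

# Identical boxes -> 1.0; disjoint boxes -> 0.0
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold (0.5 being the common choice).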
2.3. Datasets for OMR systems
There exist several available datasets containing whole labeled music scores that have been used in projects related to OMR tasks [20]. This section presents the most important ones directly related to the topic of this project.
It is important to highlight that most of the datasets traditionally used for OMR have targeted symbol classification thus far. As stated in previous sections, image segmentation has been one of the techniques used for musical symbol recognition. This is exemplified by the variety of datasets containing only musical symbols, such as Rebelo [21],
Figure 10. Fast R-CNN architecture
Fornes [22], Printed Music Symbols [23] or Music Score [24], for classification purposes. Figure 11 shows examples of the available data in this kind of dataset.
Nevertheless, focusing on state-of-the-art object detection, we are not interested in classifying symbols separately but in considering the whole score and applying object detection to the complete score at once. For this reason, it is important to search for available datasets that comply with the desired requirements and that have been used in related OMR work.
Regarding digital music scores, the largest public annotated dataset is DeepScores, formed by 300,000 high-resolution digital music scores and developed by Lukas Tuggener et al. [25]. The dataset was created with the goal of improving the recognition of small objects, since the noteheads of musical symbols are much smaller than the objects traditionally considered for recognition.
This dataset is intended for tasks such as symbol classification, object detection and semantic segmentation. Most of the scores are created from MusicXML files extracted from MuseScore (a free music notation software) and converted into readable sheet music. The scores are rendered in five different fonts to vary the visual appearance.
The dataset contains 124 different annotated classes, from the most common ones (noteheads, clefs, half notes) to more specific ones (staccato above, augmentation dots). Figure 12 shows an example of the different annotated musical symbols [25] in the form of red bounding boxes surrounding every symbol.
Following digital music scores, we have the aforementioned MuseScore platform [26]. It is a free music notation software hosting more than 340,000 playable digital music scores. The scores can be downloaded in several formats: the software's own format, PDF, MusicXML, MIDI or MP3.
Another, more recent dataset is Printed Images of Music Staves (PrIMuS, [27]), which contains more than 80,000 real-music excerpts. The dataset was created by Calvo-Zaragoza et al. and each score is available in several formats: rendered PNG, MIDI file, MEI file and two types of encoding (semantic and agnostic). The excerpts are taken from the Répertoire International des Sources Musicales (RISM, [28]) dataset, and their different formats contain not only information
Figure 11. OMR datasets example
Figure 12. Bounding boxes example
about the musical symbols present in the excerpt but also about the notes composing it. Figure 13 illustrates two of the formats in which every excerpt is provided.
As for handwritten annotated music scores, some datasets are also available, the best known being CVC-MUSCIMA [29]. It is formed by 1,000 music scores written by 50 different adult musicians; each writer transcribed the same music scores using the same pen and paper with printed staff lines. This dataset was created for writer identification and staff removal tasks involved in different OMR stages.
Previous object detection systems have been developed using this dataset. Figure 14 shows the input and output of a music object detector [8] on an incipit from the CVC-MUSCIMA dataset. The bounding boxes highlighting detected objects contain the class to which they correspond (e.g. stem, duration-dot) and the probability (in %) with which the symbol is classified into the predicted class.
Figure 14. Object detector input (left) and output (right)
Figure 13. Incipit from PrIMuS dataset
3. Methodology
This part illustrates the core work carried out in this thesis. The whole development is broken down into three main blocks forming an end-to-end system that can be described as follows: a selection of music scores is extracted from a chosen dataset in order to train an object detector. Once the training is done, new music scores (different from the training ones) are tested through the detector and the detection results are integrated in an interactive application, to be accessed from a portable device. It is important to highlight that the entire system refers to only one of the two object detection models used, which will be detailed in the next sections.
3.1. Data preparation
This section comprises the handling of the two datasets that feed the music object detector in the next stage: one of them already exists and the other is created upon certain premises. The datasets are used for the first and second object detection models respectively, although only the second model is fully integrated in the final application. The following lines explain the characteristics of every set of data, as well as which pre-processing tasks must be applied to the original music scores and why.
3.1.1. Musical symbol dataset
One of the models considered in this work will be used for pure musical symbol detection; that is to say, the main goal of the neural network is to localize and recognize different types of musical symbols. For this reason, the DeepScores dataset [25] has been chosen. It contains more than 300,000 labeled whole scores formed by a wide variety of elements. There are 124 labeled musical symbols (or classes) in total, some of which can be seen in Figure 16.
Figure 16. DeepScores classes (left) and a crop from one of the dataset scores (right)
Figure 15. Breakdown structure of the main parts of the project
Every music score is given in PNG format with a resolution of 2707x3828 pixels, together with an additional XML file containing the annotations (labels for each class) in PASCAL-VOC format [30]. This format holds different nodes, as illustrated in Figure 17, indicating the name of the labeled musical symbol as well as the absolute coordinates of the bounding box delimiting every object of interest.
We can extract valuable information from the annotations file in the form of bounding boxes showing the ground truth from which the neural network will learn. Figure 18 shows a crop from one of the dataset scores and its corresponding ground-truth bounding boxes.
The R-CNN in the next stage is going to perform supervised learning, so the annotations are essential and need to be used accordingly. Besides, the high resolution of the music scores may produce memory issues if a whole score is fed into the neural network. However, the scores can be randomly cropped to smaller sizes to avoid problems of this kind. This issue will be discussed again in the next stage.
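Reading such an annotation file is straightforward with Python's standard library. The sketch below (the class name `noteheadBlack` and the coordinates are only illustrative) extracts the label and bounding box of every annotated object:

```python
import xml.etree.ElementTree as ET

def parse_voc_annotations(xml_text):
    """Return (class name, (xmin, ymin, xmax, ymax)) pairs from a
    PASCAL-VOC annotation document."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        box = tuple(int(bb.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, box))
    return objects

sample = """<annotation>
  <object><name>noteheadBlack</name>
    <bndbox><xmin>120</xmin><ymin>340</ymin><xmax>138</xmax><ymax>356</ymax></bndbox>
  </object>
</annotation>"""
print(parse_voc_annotations(sample))  # [('noteheadBlack', (120, 340, 138, 356))]
```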
3.1.2. Synthetic data generation with Python
So far, this project has only considered datasets and neural network-based models oriented to pure musical symbol detection. However, as previously stated, we are interested in developing a system that tells us which notes compose a score and where they are located. For this reason, we have created a synthetic dataset using Python, where a variety of musical symbols and staff lines are rendered randomly, creating synthetic sheet music.
The main purpose is to feed the object detector with realistic and non-realistic scores containing a lot of valuable annotated information. Besides, since we want the neural network to learn the positions along the staff lines, the resulting scores have no musical context at all; everything is generated randomly, as can be appreciated in Figure 19.
Figure 17. PASCAL-VOC format example
Figure 18. Bounding boxes example
Figure 19. Fragments from rendered music scores
The strategy followed to create this new dataset consists of conceiving the music scores in such a way that, when a musical symbol is rendered along the staff lines, it has an associated position according to the mapping presented in Figure 20.
Figure 20. Mapped positions along the staff lines
In other words, the central musical note C marked in the Figure (Do/C4), taking the G (Sol) clef as the reference clef, corresponds to the first mapped position. The mapping is formed by twenty musical notes in ascending order starting from central C and seven more musical notes in descending order below central C.
As can be deduced from Figure 20, there are twenty-seven mapped positions in total, so not all possible positions along the staff lines are mapped. Some of the highest and lowest-pitched musical notes have been left out due to their low appearance frequency in the scores used to test the object detector in the next stage.
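Assuming one diatonic step per mapped position (the exact numbering and range are defined by Figure 20, so the layout below is an assumption; note also that in the thesis numbering central C is the first position, whereas this helper simply enumerates the note names from bottom to top), the twenty-seven note names can be generated programmatically:

```python
LETTERS = "CDEFGAB"

def note(steps_from_c4):
    """Note name a given number of diatonic steps away from central C (C4)."""
    octave, degree = divmod(steps_from_c4, 7)  # floor division handles negatives
    return f"{LETTERS[degree]}{4 + octave}"

def mapped_positions():
    """Seven notes below central C plus twenty notes from central C upwards."""
    descending = [note(-i) for i in range(7, 0, -1)]
    ascending = [note(i) for i in range(20)]
    return descending + ascending
```

With this layout, `mapped_positions()` yields 27 names running from C3 up to A6, with central C (C4) in eighth place.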
Regarding implementation, the synthetic scores have been created with the Pillow imaging library [31] for Python, which allows compositing and creating images of different kinds. The algorithm is based on very simple operations: grayscale RGBA musical symbol images are pasted at varying y-coordinates along the x-axis (the direction in which the staff lines are drawn). Each variation of a certain number of pixels in the y-axis corresponds to a new musical note.
As for the annotations, once a new musical symbol has been rendered, the original bounding box (corresponding to the dimensions of the original musical symbol image) is resized around the notehead, half note or whole note, as can be seen in Figure 21. While the score is being rendered, a 2707x3828 PNG file is created together with an XML file following the previously mentioned PASCAL-VOC format (as in the DeepScores dataset).
Other symbols, such as rests and bar lines, are rendered but not annotated, so the neural network in the next stage sees them but is expected to ignore them. The purpose of these filler symbols is to emulate the appearance of real sheet music.
Figure 21. Original bounding boxes (left) and flattened bounding boxes (right)
In order to increase the robustness of the system, data is augmented so that variations of the original rendered scores are also fed into the neural network. Data has been augmented using the imgaug library [32]; Figure 22 shows some of the applied variations, such as shear (an affine transformation), impulse noise and Gaussian blurring, among other transformations.
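As a minimal stand-in for one of those augmenters, the impulse-noise step can be sketched in plain NumPy (imgaug provides this and the other transformations as ready-made, configurable augmenters; this is only to illustrate the idea):

```python
import numpy as np

def impulse_noise(image, fraction=0.02, rng=None):
    """Replace a random fraction of pixels with 0 or 255 (salt-and-pepper
    style), leaving the original image untouched."""
    rng = rng or np.random.default_rng(0)
    noisy = image.copy()
    mask = rng.random(image.shape) < fraction
    noisy[mask] = rng.choice([0, 255], size=int(mask.sum()))
    return noisy
```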
The code that generates synthetic data can be found in [43].
3.2. Object detector
This section describes the object detector used to perform detection and recognition tasks in music scores. It is mainly formed by a pre-trained neural network from the Tensorflow deep learning framework and is trained through one of its APIs.
3.2.1. Tensorflow Object Detection API
As mentioned at the beginning of this document, this work uses the Tensorflow Object Detection API [5] for the detection and recognition tasks in music scores needed for the proposed objective. In order to adapt its resources to the current problem, it is important to define the structure of the API, how data is handled by it, and all the necessary modifications to be made.
The object detector is composed of the blocks depicted in Figure 23. As can be seen, the API is the main element of the whole detection system.
In order to fit our data into the API, the input images and their annotations must be converted into a format readable by the framework, called Tensorflow records (TFRecords, [33]). This format is specific to the library and stores data as a sequence of binary strings. Besides, binary data takes up less space than the original data and can be read more efficiently.
As previously stated, we have used an off-the-shelf pre-trained model offered by Tensorflow [34], detailed in the next section, that needs to be adapted to the different datasets of the two proposed object detection models. After the adaptation, training and validation are run at the same time. Besides, training and validation can be monitored live using TensorBoard [35]. This
Figure 22. Data augmentation examples
Figure 23. Object detector scheme
platform allows visualizing different evaluation metrics for both sets while the network is being trained.
Before testing the detector, what is known as the inference graph is exported and used to perform object detection in the test part. Testing is carried out on new music scores, different from the ones used for training and validation. The output reflects the bounding boxes predicted by the neural network, along with their labels and confidence scores.
3.2.1.1. Training and validation
The chosen model is a Faster R-CNN with ResNet101 architecture (a concept not developed further in this thesis), pre-trained on the Oxford-IIIT Pet Dataset. Since it is very important to make sure the neural network handles the input images correctly, whole music scores cannot be fed into the network directly. Here we introduce one of the most important parameters of convolutional neural networks: the receptive field. This parameter is defined as the region of the input space that a CNN feature is looking at (i.e. affected by) [36]. Without entering into a complex technical explanation, this parameter is important because input images cannot be larger than it. In the case of a Faster R-CNN, the receptive field has dimensions of 1027x1027 [37]. Because of this, and given the dimensions of the original and rendered music scores, the initial crop size used for the training and validation images is 1000x700 pixels. Consequently, training and validation images are obtained by randomly cropping the original scores into new images of the specified size and creating the corresponding new annotation XML files. Figure 24 illustrates an example of a random crop.
Before proceeding, a label map must be created containing the names of all the possible annotated objects in the training data. Besides, setting up the pre-trained models requires a config file, which at this stage needs to be adapted to the current model. The neural network parameters can be changed, or training can be performed with the default configuration. This will be further detailed in the Results section.
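For reference, a label map for this API is a small text file in protobuf text format; a fragment of the following shape would be used (the class names here are hypothetical, since they depend on the annotated dataset):

```
item {
  id: 1
  name: 'position1'
}
item {
  id: 2
  name: 'position2'
}
```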
Once the TFRecords have been created and the configuration is correctly set up, the training and validation phase can start. It is key to emphasize that validation images are never seen by the neural network during training; they are used to measure how well the model is performing while it is being trained. While training, checkpoints are saved so that the model can be validated every few minutes.
Figure 24. Random crop example
3.2.1.2. Testing
In order to test the object detector, different approaches to evaluate its performance have been considered. Every time a test image passes through the network, a new image is created containing the bounding boxes of the detected objects, the class predicted by the network and the probability (or confidence, in %) with which each detected object has been classified into that class. Figure 26 illustrates an input and output example after testing an object detector with images from the CVC-MUSCIMA dataset, from the work by Pacha et al. [38].
For training and validation, images were created as random crops. However, in the testing part we want to assess how well the model performs, so the way input images are created can yield very different results:
• Random crops (overlapping or not): in the case of random overlapping crops, the same object passes through the network more than once, and many crops are needed to cover the whole music score. One advantage of this approach is that objects not detected in one crop may be detected in another.
The fact that an object can be seen more than once by the network leads to different problems. An object may be classified into two different classes depending on the crop, as depicted in Figure 27, where the same part of the image is present in two different crops and the highlighted predicted class of the circled objects differs from one to the other. Besides, when gathering the final detection results, a huge number of bounding boxes may surround every detected object. These bounding boxes cannot be unified into one, since the last detection might be wrong and all previous detections right.
Figure 25. Training (a) and validation (b) processes going on
Figure 26. Object detector input (left) and output (right) example
Figure 27. Different predicted class in two random crops
• Non-overlapping crops: in this case, the network sees the whole music score only once, by tiling it into non-overlapping crops. The main drawback of this technique is that the network sees every individual object only once, so objects going undetected in some crops is likely. Besides, another issue must be considered: creating the crops without overlap can "cut" the original score in the middle of musical notes or symbols of interest. In this situation, the object cannot be detected at all. Figure 28 shows an example illustrating this problem.
Figure 28. Non-overlapping crops problem
• Non-random overlapping crops: this approach uses the same technique as the overlapping crops, but the new images are not created randomly; instead, consecutive images share a certain number of overlapping pixels, as depicted in Figure 29.
Figure 29. Non-random overlapping crops
With this technique, musical notes and symbols may be seen more than once by the network (around two or three times at most). The difference from the random overlapping case is that detecting the same object wrongly in different crops is much less probable, because the location of a musical note or symbol does not change significantly (in terms of pixels) as it does with random overlapping crops.
The music scores used for testing are obtained from the MuseScore platform and do not contain any labels. In the Results section, some evaluation metrics for the second object detection model are computed manually to assess how well it performs on real music scores.
Overall, the detection script provided by the Tensorflow creators has been modified to obtain the desired output: a file with the predicted classes. All the modified files concerning the API can be found in [43].
Table 1. Color mapping
3.3. App integration
The final objective of this thesis is to create a user-accessible application that allows the user to play a score by touching its notes. This section presents the creation of an interface that allows the user to do so.
In the application, only the second object detection model is considered, since it is the one that detects the positions of musical notes along the staff lines. With the positions predicted by the neural network, a mapping is performed between every assigned position and its corresponding musical note, as stated before.
The most important aspect is how detection results are handled by the application. After the testing part, an XML file is created for every test image containing information about the detections, as Figure 30 shows (coordinates are relative to the size of the test crop).
In order to gather all detection results belonging to the same music score into one file, a color is assigned to every detected position according to Table 1. While the XML files are being read, a new image of the same size as the tested music score is created. In this new image, the bounding boxes of the detections are painted with the color associated to the detected position, creating a color map. As an example, Figure 31 shows a real tested music score and its corresponding color map.
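The color-map construction can be sketched as follows. The concrete red values used here are hypothetical, since the real correspondence is the one defined in Table 1:

```python
import numpy as np

# Hypothetical color per mapped position: a unique red value identifies the
# position, and green/blue are left at zero (the real values come from Table 1).
POSITION_COLOR = {pos: (10 * pos, 0, 0) for pos in range(1, 28)}

def paint_color_map(score_shape, detections):
    """Build an RGB image the size of the tested score in which every detected
    bounding box is filled with the color of its predicted position."""
    h, w = score_shape
    color_map = np.zeros((h, w, 3), dtype=np.uint8)
    for position, (xmin, ymin, xmax, ymax) in detections:
        color_map[ymin:ymax, xmin:xmax] = POSITION_COLOR[position]
    return color_map
```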
Figure 30. XML file fragment created after testing
Figure 31. Color map example
After the color map has been created, we can proceed to describe the application itself. The interface is composed of different screens that allow the user to select one of the available tested music scores, set up the clef and key signature, and play it. Figure 32 shows the scheme of the app's operation.
Figure 32. Application steps
When the user touches the screen, the implemented algorithm checks which color is present at that point of the color map corresponding to the selected music score. According to the color and the previous settings, the corresponding musical note is played instantly. The musical notes are piano samples extracted from the Electronic Music Studios [39] database.
In terms of implementation, the application has been programmed in Java using Android Studio [40]. A list of scores is displayed on screen once the app has started. When the pertinent selections have been made, the score to be played is displayed as a bitmap using the Subsampling Scale Image View library created by Dave Morrissey [41]. This library loads a subsampled version of the image every time the user zooms or pans around the screen. It also provides the coordinates, relative to the size of the current image, of any touch event.
The color map is also read as a bitmap with the same dimensions as the displayed music score. Each color in the correspondence between colors and musical notes has a unique red value (within its RGB triplet). For this reason, only the red value of the touched area is checked every time a musical note is to be played. After the user presses the desired note, the corresponding tone is played instantly.
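The lookup performed on every touch event can be sketched like this (in Python rather than the app's Java, and with hypothetical red values standing in for Table 1):

```python
# Hypothetical inverse mapping: a unique red value identifies a mapped
# position; the position plus the user-selected clef and key signature then
# determine which note to play.
RED_TO_POSITION = {10 * pos: pos for pos in range(1, 28)}

def note_for_touch(pixel_rgb, position_to_note):
    """Return the note name for the touched color-map pixel, or None for
    empty (black) areas where no detection was painted."""
    position = RED_TO_POSITION.get(pixel_rgb[0])
    return position_to_note.get(position) if position is not None else None
```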
The implemented code related to the application can be found in [43].
4. Results
This section comprises the outcomes of training, validating and testing the two models presented in previous sections. Besides, it shows what the final application looks like and its main functionalities.
4.1. Evaluation metrics
In order to understand the evaluation metrics used with the models, their definitions are given in the following equations:
Precision = TP / (TP + FP)   (1)
Recall = TP / (TP + FN)   (2)
Miss Rate = FN / (FN + TP)   (3)
IoU = (A1 ∩ A2) / (A1 ∪ A2)   (4)
where TP = True Positive, FP = False Positive, FN = False Negative, and A1, A2 correspond to the areas of the predicted and original bounding boxes of the detected objects.
Precision measures how accurate the predictions are by computing the proportion of positive detections that are correct, whereas recall computes the proportion of actual positives identified correctly. The miss rate, or misclassification rate, computes the proportion of objects not detected or detected wrongly with respect to the sum of these and the positive detections.
IoU [42] measures the overlap of the boundaries of the two bounding boxes: the predicted one and the original one. IoU has been computed on the validation sets of all trainings with a threshold of 0.5: if the IoU is greater than or equal to 0.5, the prediction is considered positive.
Besides, mean Average Precision (mAP, [30]) has been used to assess the performance of the model on the validation set of every training. The mAP is computed as the average of the Average Precision (AP) of each class, defined in equation (5):
AP = ∫₀¹ p(r) dr   (5)
For a specific class, the precision-recall curve is computed, and AP is the area under it. In practice, AP is calculated as the mean precision at a set of eleven equally spaced recall levels (from 0.0 to 1.0), as equation (6) reflects:
AP = (1/11) Σ_{r ∈ {0.0, 0.1, …, 1.0}} AP_r   (6)
The precision at each recall level r (AP_r) is interpolated by taking the maximum precision measured at any recall that equals or exceeds r:
AP = (1/11) Σ_{r ∈ {0.0, 0.1, …, 1.0}} p_interp(r),  where  p_interp(r) = max_{r̃ ≥ r} p(r̃)   (7)
All in all, AP is computed for each validated class of the model, and mAP is the average over all of them. In this section, mAP is computed with IoU ≥ 0.5 on the validation sets of the different trainings performed in this work.
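The eleven-point interpolated AP of equations (6) and (7) can be computed from a list of measured (precision, recall) points as follows:

```python
def average_precision_11pt(precisions, recalls):
    """11-point interpolated AP: at each recall level r in {0.0, 0.1, ..., 1.0}
    take the maximum precision whose recall is at least r, then average."""
    interpolated = []
    for i in range(11):
        r = i / 10
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11
```

For instance, a detector with precision 1.0 up to recall 0.5 and precision 0.5 at recall 1.0 obtains AP = (6·1.0 + 5·0.5)/11 ≈ 0.77.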
4.2. Training, validation and testing of models
This section shows the different trainings performed with the presented models. Regarding validation, examples evaluated in TensorBoard are included to illustrate graphically the different problems faced. Images used for validation share similar characteristics with the training ones but are never seen by the network during training. The main advantage is that TensorBoard allows checking the predictions made on the validation set while the model is being trained. The data partition used for every training consists of 70% of the data for training and 30% for validation.
As for the testing, we have used non-random overlapping crops due to the problems of random overlapping and non-overlapping crops. These crops are extracted from eight real music scores that are not labeled. For this reason, evaluation metrics for the second object detection model are computed manually on a selection of music scores.
4.2.1. First object detection model
In order to train, validate and test both object detection models, several trainings have been carried out. It is important to remark that the first object detection model serves as a guideline for creating the second one. Therefore, it has only been used as a "toy example" to figure out how the Tensorflow Object Detection API works, since it is not integrated in the final application. Table 2 shows some of the main specifications of the different trainings performed with the first object detection model.
Experiment   Scores   Crops/score   Crop size   Training images   Validation images   Configuration
A            30       40            1000x700    840               360                 Default
B            100      40            1000x700    4368              1872                Default
C            548      60            500x350     23016             9864                Default

Table 2. First object detection model trainings
Training   mAP@0.5 in the validation set
A          0.9512
B          0.9110
C          0.7469

Table 3. mAP@0.5 in the validation set
Besides, Table 3 shows the mAP values for the validation set of each training. The mAP values from trainings A and B are high because not all classes are represented in the training and validation sets. Since the model has far fewer classes to predict than the ones that exist, prediction is performed correctly in nearly all of them. However, classes not represented in
these sets are not detected at all, since the model never learns them. This situation is illustrated in Figure 33, which shows a validation image. Some musical symbols (whole note heads, 16th rests) are not detected during validation because they have no ground-truth examples in training.
Due to the problems stated above, the number of music scores used in the last training (C) is considerably higher than in the other trainings, to ensure all classes are present. However, the low appearance frequency of certain classes may cause the mAP value to drop: classes with few representations lead to poor detection capability of the model on them. Figure 34 shows the mAP graphs for two different classes evaluated on the validation set of training C.
The situation illustrated above happens because not all musical symbols appear with the same frequency. Nevertheless, the real music scores we have tested do not contain most of the low-frequency musical symbols.
As an example of how data is delivered by the API after testing, Figure 35 shows test examples from the last two trainings: the original image on the left and the detections performed by the model on the right. The original output can be found in Appendix 2 of the Appendices section; here, the appearance of the results on test images has been modified for visual clarity.
Figure 33. Original crop (left) and during validation (right)
Figure 34. [email protected] graph examples
Figure 35. Original test images (left) and detections made by the model (right)
The different issues will be properly discussed in the following lines regarding the second object detection model, since, as stated before, the first object detection model has been used to understand how the API works and to pave the way for the second one.
4.2.2. Second object detection model
In order to reflect the outcomes of the second object detection model, we have carried out four different trainings. Table 4 shows the main specifications of each training, and the following lines explain and show graphically the modifications made along them (DA stands for data augmentation).
Training   Scores   Crops/score   Crop size   Training images   Validation images   Configuration   DA
A          100      40            1000x700    2800              1200                Default         No
B          156      40            1000x700    4368              1872                Default         Yes
C          156      40            1000x700    4368              1872                Not default     Yes
D          156      40            500x350     21840             9360                Not default     Yes

Table 4. Second object detection model trainings
Additionally, Table 5 shows the mAP values measured in the validation set of every training.
Training   mAP@0.5 in the validation set
A          0.9494
B          0.7569
C          0.7981
D          0.8041

Table 5. mAP@0.5 in the validation set
The dataset used for the initial training (A) does not contain any synthetic scores reflecting "notes over other notes", in other words, musical figures sharing the same x-coordinate. An example of this situation is illustrated in Figure 36. Besides, no data augmentation is used in this training.
Figure 36. "Notes over notes" areas
However, the first experiment was carried out as a check of the viability of the strategy before carrying on with further experiments.
The main difference between trainings A and B is the dataset used. This is also reflected in the mAP, which is considerably lower in training B. On the one hand, since training A contains no synthetic scores with high-density regions such as those depicted in Figure 36, detection is performed correctly in most of the validation set, as the image on the left in Figure 37 clearly shows. On the other hand, when the dataset changes, precision on the validation set drops because detection in high-density areas becomes poor, as the image on the right in Figure 37 shows.
This situation presents two problems: detection is not as good in high-density areas as in other kinds of regions, and the training data may not represent these areas well enough to make good predictions on the validation set.
Regarding the testing, Figure 38 shows part of the same crop tested in trainings A and B. From this comparison we can infer that adding "notes over other notes" makes the model predict positions in these regions.
Before changing the number of training and validation images, we changed the default configuration of the pre-trained model to see whether the model improved, leading to training C. Figure 39 shows an incipit from a real tested music score: the ground truth for the predictions and the predictions made by the model in trainings B and C.
Figure 37. Validation set examples from training A (left) and B (right)
Figure 38. Testing difference between trainings A and B
Figure 39. Ground truth (top) and test output examples from trainings B (left) and C (right)
As we can see, positions wrongly detected in training B, or not detected at all, are generally detected successfully in training C. This is also reflected in Figure 40, where another incipit from a real tested music score shows the difference in detections between trainings B and C.
Nevertheless, this training did not solve the problem of detection in high-density areas. We can easily illustrate this with examples from the validation set: Figure 41 shows that detection in high-density areas in training C is still poor compared to low-density areas.
Figure 40. Ground truth (top) and test output examples from trainings B (left) and C (right)
Figure 41. Validation set examples in low (left) and high (right) density areas
In order to solve the detection problems in high-density areas, we made two changes leading to the final experiment: increasing the number of images so that high-density areas are well represented in the training set, and halving the size of the random crops used for training and validation so that fewer objects have to be detected in each crop. Figure 42 shows examples of the validation set from trainings C and D.
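The crop halving described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name half_size_crop_box and the example dimensions are ours, and the returned box follows Pillow's (left, upper, right, lower) convention so that it could be passed to Image.crop.

```python
import random

def half_size_crop_box(img_w, img_h, full_crop_w, full_crop_h):
    """Return a random crop box at half the original crop size.

    The box uses Pillow's (left, upper, right, lower) convention,
    so it can be passed directly to Image.crop().
    """
    crop_w, crop_h = full_crop_w // 2, full_crop_h // 2
    # Pick a top-left corner so the crop stays inside the image
    x = random.randint(0, img_w - crop_w)
    y = random.randint(0, img_h - crop_h)
    return (x, y, x + crop_w, y + crop_h)
```

Halving both crop dimensions divides the crop area by four, which is why roughly four crops from training D cover the same region as one crop from training C.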
From Figure 42 we can conclude that the size of the input crop determines how capable the model is of performing detections within one crop. To clearly see the improvement between the last two trainings, Figure 43 shows a crop tested in training C and the four equivalent crops from training D containing the same areas, as well as the ground truth image with the correct labels for each detectable position.
From the previous figure we can see that positions not detected in training C are detected correctly in training D. However, due to the reduction of the crop size in the last training, the model mistakes the lyrics of the score for objects of interest. This is not a major problem, since these misdetections do not concern areas of interest.
Figure 42. Validation set examples from trainings C (left) and D (right)
Figure 43. Ground truth (top) and test output examples from trainings C (left) and D (right)
Taking the model from training D, we tested nine real music scores and computed evaluation metrics according to the following criteria:
• True positive (TP): position is correctly detected
• False positive (FP): non-position area is detected as a position
• False negative (FN): position is not detected (object of interest without any predicted bounding box) or is wrongly detected
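Under these criteria, the metrics reported below follow directly from the TP, FP and FN counts. A minimal sketch (the function name evaluation_metrics is ours):

```python
def evaluation_metrics(tp, fp, fn):
    """Compute precision, recall and miss rate (in %) from detection counts."""
    precision = 100 * tp / (tp + fp)   # correct detections among all detections
    recall = 100 * tp / (tp + fn)      # detected positions among all real positions
    miss_rate = 100 * fn / (tp + fn)   # complement of recall
    return precision, recall, miss_rate
```

For example, the In Dreams row of Table 6 (TP = 150, FP = 25, FN = 1) gives a precision of about 85.71%, a recall of about 99.33% and a miss rate of about 0.66%.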
We have chosen a variety of real music scores with different densities of positions to be detected. Figure 44 shows portions of the different music scores, where we can appreciate that the range of musical notes, as well as the density, varies significantly between them.
Figure 44. Remember me (left), Mia and Sebastian (center) and La La Land intro (right) examples
Table 6 shows the precision, recall and miss rate in % for the different real music scores tested on the final training:
Score | Total detections | TP | FP | FN | Precision (%) | Recall (%) | Miss rate (%)
In Dreams | 150 | 150 | 25 | 1 | 85.71 | 99.33 | 0.66
Finding Nemo | 157 | 132 | 23 | 4 | 85.16 | 97.05 | 2.95
Another day of sun | 162 | 146 | 16 | 1 | 90.12 | 99.31 | 0.69
Mia and Sebastian | 218 | 199 | 17 | 1 | 92.12 | 99.50 | 0.50
Game of Thrones | 258 | 246 | 14 | 4 | 94.61 | 98.40 | 1.60
Rey’s theme | 274 | 263 | 11 | 14 | 95.98 | 94.94 | 5.06
Concerning Hobbits | 301 | 287 | 11 | 9 | 96.30 | 96.95 | 3.05
Carl and Ellie | 328 | 294 | 10 | 7 | 96.70 | 97.67 | 2.33
Remember me | 410 | 400 | 6 | 10 | 98.52 | 97.56 | 2.47
Table 6. Evaluation metrics for different real tested music scores
It is important to highlight that the sum of TPs, FPs and FNs is not always equal to the total number of detections, since positions not detected at all are counted as false negatives. As an example, in Rey’s theme the sum of TP, FP and FN equals 288, whereas the total number of detections is 274. This means fourteen positions are not detected at all.
As for the results reflected in Table 6, precision is quite high in almost all scores. Precision improves in scores where fewer non-position areas are detected as positions. Conversely, when false positives increase (mostly in less “populated” scores), precision decreases. Nevertheless, the number of false positives is noticeably lower than the number of true positives in all music scores.
In terms of recall, the more positions there are to detect, the more positions go undetected or are wrongly detected, which is to be expected. However, the proportion of false negatives to true positives is small, which is why recall values for all scores are so high, with 94.94% being the minimum value obtained.
Finally, the miss rates give the percentage of positions not detected or wrongly detected. The obtained rates lie between 0.50% and 5.06%, so misdetection is not very significant. All in all, the results are sufficiently good: we have achieved a miss rate of 2.14% on average, as well as a precision of 92.80% and a recall of 97.85%.
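The averages quoted above can be reproduced from the Table 6 counts. The sketch below recomputes the per-score metrics exactly, so the averages may differ from the rounded table entries by a few hundredths of a percentage point:

```python
# (TP, FP, FN) per tested score, taken from Table 6
ROWS = [
    (150, 25, 1),   # In Dreams
    (132, 23, 4),   # Finding Nemo
    (146, 16, 1),   # Another day of sun
    (199, 17, 1),   # Mia and Sebastian
    (246, 14, 4),   # Game of Thrones
    (263, 11, 14),  # Rey's theme
    (287, 11, 9),   # Concerning Hobbits
    (294, 10, 7),   # Carl and Ellie
    (400, 6, 10),   # Remember me
]

def macro_averages(rows):
    """Average per-score precision, recall and miss rate (in %)."""
    n = len(rows)
    precision = sum(100 * tp / (tp + fp) for tp, fp, _ in rows) / n
    recall = sum(100 * tp / (tp + fn) for tp, _, fn in rows) / n
    miss_rate = sum(100 * fn / (tp + fn) for tp, _, fn in rows) / n
    return precision, recall, miss_rate
```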
4.3. Application
The following figures show that the final application is composed of initial, selection, description and playing screens.
Figure 45. Initial and score selection screens
Figure 46. Transition screen with brief description of the selected music score
Once the score has been selected, the user can modify the key signature and clef according to their preferences. Figure 47 shows the screen where the user gets to play the selected score with the different options mentioned.
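Internally, playing a note from a touch amounts to a hit test over the bounding boxes predicted by the detector. The application itself is written in Java; the following Python sketch only illustrates the idea, and the box format and function name are assumptions of ours:

```python
def note_at_touch(x, y, detections):
    """Return the label of the detected position whose box contains (x, y).

    detections: list of (label, (left, top, right, bottom)) tuples
    in the same coordinate space as the displayed score.
    """
    for label, (left, top, right, bottom) in detections:
        if left <= x <= right and top <= y <= bottom:
            return label
    return None  # touch outside every detected position
```

The returned label, combined with the clef and key signature chosen by the user, determines which note sound is played.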
When the user selects the key signature and clef, informative messages appear, as Figure 48 illustrates.
Figure 47. Playable score screen with key signature and clef setup
Figure 48. Informative messages from key signature and clef setup
5. Budget
This project has been carried out at the Department of Computer Science from Aalto University.
The following tables show the different costs related to hardware, software and human resources. The total cost of each hardware and software resource has been computed by dividing the cost of the product by its useful life (working at highest performance). Since the project has had a total duration of approximately 6 months, the resulting cost per product corresponds to half a year.
Some of the hardware resources used do not imply a direct cost, but Table 7 provides the real cost as if they had not been provided by the university.
Product | Cost | Units | Useful life (years) | Total cost
Personal Computer I | 1100€ | 1 | 3 | 183,33€
Personal Computer II | 900€ | 1 | 3 | 150€
NVIDIA Quadro P500 GPU | 2000€ | 1 | 2 | 500€
Table 7. Hardware resources costs
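The half-year proration described above amounts to the following computation (a sketch; the function name prorated_cost is ours):

```python
def prorated_cost(price_eur, useful_life_years, months_used=6):
    """Prorate a product's price over its useful life for the months used."""
    return round(price_eur * (months_used / 12) / useful_life_years, 2)
```

For instance, Personal Computer I (1100€, 3-year useful life) costs 183,33€ for the six months of the project.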
Since TensorFlow, Linux and Python are open source, the main software costs are related to the usage of Windows 10 and Microsoft Office for the project documentation.
Product | Cost | Units | Useful life (years) | Total cost
Windows 10 | 125€ | 1 | 2 | 31,25€
Microsoft Office 2019 | 107€ | 1 | 2 | 26,75€
Table 8. Software resources costs
Costs related to human resources are based on the minimum wage and are detailed in Table 9.
Worker | Total dedication (hours) | Cost/hour | Total
Junior engineer | 600 | 8€ | 4800€
Table 9. Human resources costs
All in all, the total cost has been 5691,33€.
6. Conclusions and future development
This project has shown that predicting the positions of musical notes along the staff lines using synthetic data is possible. Thanks to the TensorFlow Object Detection API and Android Studio, this work has reached its main objective through the creation of an end-to-end system with the desired outcome: the whole system is comprised in an application that allows its user to choose a digitized music score and play it by touching its notes.
It is important to highlight that pre-trained models, originally trained on datasets very different from the ones used in this thesis, have turned out to work successfully with no modifications to their original setup, or only a very slight alteration. This clearly shows the remarkable results the API can achieve when training a new model.
The different strategies followed have generally worked as expected, although there are some problems and improvements to be considered. Regarding the second object detection model, the synthetic dataset has been created with a single size for both the musical symbols and the staff lines. If a high-resolution digitized music score is not resized before testing it on the model, there is no detection at all. This directly affects the model’s robustness, since the neural network used for detection is not taught different sizes of the same objects of interest. Despite this, the fact that digital music scores look relatively similar to one another has helped the main strategy to work properly.
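A simple mitigation would be to normalize incoming scores to the single scale the model was trained on, for instance by rescaling so that the staff-line spacing matches the one used in the synthetic data. The sketch below is hypothetical: the function name and the idea of measuring staff spacing are ours, and the training spacing value would have to come from the synthetic renderer.

```python
def rescale_dimensions(width, height, measured_staff_spacing, training_staff_spacing):
    """New image dimensions that map the score's staff-line spacing
    onto the spacing the detector was trained with."""
    scale = training_staff_spacing / measured_staff_spacing
    return round(width * scale), round(height * scale)
```

A score scanned at twice the training resolution would thus be halved before inference.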
A potential improvement for the model would be to create a synthetic dataset with varying sizes of the musical scores. Besides, different fonts should be considered for rendering the musical figures, since several typographies are widely used in music scores. Another improvement could focus on how musical notes are detected with the proposed strategy. Neural networks tend to be better at predicting what an object is than where it is. Predicting the location of the musical notes, as the first object detection model does, and then using another technique to obtain their positions along the staff lines could be considered.
Regarding the application, detection could be performed live through a server so that the system is not limited to offline tasks. However, the current results are suitable for a demonstration tool, which has been the intention since the idea first came up.
In terms of work development, the project has been successfully carried out in English, and progress has relied mainly on autonomous learning, both in making a pre-trained model work and in building an application that integrates the different tasks.
All in all, the obtained results after testing real music scores are sufficiently good considering the limitations of the used dataset.
Bibliography
[1] Sharma P., A practical Guide to Object Detection using the Popular YOLO Framework – Part III, Analytics Vidhya, 2018. [Online]
Available: https://www.analyticsvidhya.com/blog/2018/12/practical-guide-object-detection-yolo-framewor-python/ [Accessed: March 2019]
[2] Gandhi R., R-CNN, Fast R-CNN and YOLO object detection algorithms, Towards Data Science, 2018. [Online]
Available: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e [Accessed: March 2019]
[3] Pacha, A.; Hajič, J., Jr.; Calvo-Zaragoza, J. A Baseline for General Music Object Detection with Deep Learning. Appl. Sci. 2018, 8, 1488
[4] Tensorflow, Tensorflow creators, 2015. [Online]
Available: https://www.tensorflow.org/ [Accessed: January 2019]
[5] Tensorflow Object Detection API, Tensorflow creators, 2018. [Online]
Available: https://github.com/tensorflow/models/tree/master/research/object_detection [Accessed: 12 February 2019]
[6] Calvo-Zaragoza, J.; Rizo, D. End-to-End Neural Optical Music Recognition of Monophonic Scores. Appl. Sci. 2018, 8, 606.
[7] Pacha A. and Calvo-Zaragoza J., Optical Music Recognition in mensural notation with Region-based Convolutional Neural Networks, TU Wien, 2018
[8] Pacha A., Choi K.-Y., Couasnon B., Ricquebourg Y. and Zanibbi R., Handwritten music object detection: Open issues and baseline results. In International Workshop on Document Analysis Systems, 2018.
[9] Rebelo, A., Fujinaga, I., Paszkiewicz, F. et al. Int J Multimed Info Retr (2012), Optical music recognition: state-of-the-art and open issues Vol.1 Issue 3: 173-190. https://doi.org/10.1007/s13735-012-0004-6
[10] Krizhevsky A., Sutskever I., Hinton G., Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[11] Prabhu R., Understanding of Convolutional Neural Network (CNN) – Deep Learning, Medium, 2018. [Online]
Available: https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148 [Accessed: February 2019]
[12] S. Mohamed, Ihab. (2017). Detection and Tracking of Pallets using a Laser Rangefinder and Machine Learning Techniques. 10.13140/RG.2.2.30795.69926.
[13] LeCun Y., Cortes C., Burges C.J.C., The MNIST database, 1998. [Online]
Available: http://yann.lecun.com/exdb/mnist/ [Accessed: January 2019]
[14] Zhu W., Classification of MNIST Handwritten Digit Database using Neural Network, Australian National University, 2012
[15] Krizhevsky A., Nair V. and Hinton G., The CIFAR-10 dataset, 2009. [Online]
Available: https://www.cs.toronto.edu/~kriz/cifar.html [Accessed: January 2019]
[16] Hui J., Fast R-CNN and Faster R-CNN, Github, 2017. [Online]
Available: https://jhui.github.io/2017/03/15/Fast-R-CNN-and-Faster-R-CNN/ [Accessed: May 2019]
[17] Girshick R., Donahue J., Darrell T., Malik J., Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE CVPR2014
[18] Girshick R., Fast R-CNN, IEEE international conference on computer vision, 2015, pp. 1440–1448
[19] Ren S., He K., Girshick R., and Sun J., Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015
[20] Pacha A., Collection of datasets used for Optical Music Recognition, Github, 2017. [Online]
Available: https://github.com/apacha/OMR-Datasets [Accessed: March 2019]
[21] Rebelo, A., Capela, G. & Cardoso, J.S. IJDAR (2010) 13: 19. https://doi.org/10.1007/s10032-009-0100-1
[22] Fornés A., Lladós J., Sánchez G. (2008) Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method. In: Liu W., Lladós J., Ogier JM. (eds) Graphics Recognition. Recent Advances and New Opportunities. GREC 2007. Lecture Notes in Computer Science, vol 5046. Springer, Berlin, Heidelberg
[24] Pacha A. and Eidenberger H., Towards self-learning optical music recognition, in Proceedings of the 16th IEEE International Conference On Machine Learning and Applications, 2017
[25] Tuggener L., Elezi I., Schmidhuber J., Pelillo M. and Stadelmann T., DeepScores – A Dataset for Segmentation Detection and Classification of Tiny Objects. arXiv preprint arXiv:1804.00525, 2018
[26] Schweer W., MuseScore: Free music composition and notation software, 2002. [Online]
Available: https://musescore.com/ [Accessed: March 2019]
[27] Calvo-Zaragoza J. and Rizo D., Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores. In 19th International Society for Music Information Retrieval Conference, 2018.
[28] Répertoire International des Sources Musicales (RISM), 1952. [Online]
Available: http://www.rism.info/en/home.html [Accessed: March 2019]
[29] Fornés, A., Dutta, A., Gordo, A. et al. IJDAR (2012) 15: 243. https://doi.org/10.1007/s10032-011-0168-2
[30] Everingham, M., Van Gool, L., Williams, C.K.I. et al. Int J Comput Vis (2010) 88: 303. https://doi.org/10.1007/s11263-009-0275-4
[31] Lundh F., Pillow Image Module, 2009. [Online]
Available: https://pillow.readthedocs.io/en/stable/reference/Image.html [Accessed: March 2019]
[32] Jung A., Image augmentation for machine learning experiments, Github, 2018. [Online]
Available: https://github.com/aleju/imgaug [Accessed: April 2019]
[33] Gamauf T., Tensorflow Records? What they are and how to use them, Medium, 2018. [Online]
Available: https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564 [Accessed: February 2019]
[34] Tensorflow creators, Pre-trained models, Github, 2017. [Online]
Available: https://github.com/tensorflow/models/tree/master/research/slim [Accessed: February 2019]
[35] Tensorflow creators, Tensorboard, 2017. [Online]
Available: https://www.tensorflow.org/guide/summaries_and_tensorboard [Accessed: February 2019]
[36] Ha The Hien D., A guide to receptive field arithmetic for Convolutional Neural Networks, Medium, 2017. [Online]
Available: https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807 [Accessed: April 2019]
[37] Tensorflow creators, Receptive field computation for convnets, Github, 2017. [Online]
Available: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/receptive_field [Accessed: April 2019]
[38] Pacha A., Music Object Detector with Tensorflow, Github, 2017. [Online]
Available: https://github.com/apacha/MusicObjectDetector-TF [Accessed: March 2019]
[39] Cash M., Electronic Music Studios, University of Iowa, 2001. [Online]
Available: http://theremin.music.uiowa.edu/MISpiano.html [Accessed: April 2019]
[40] Android creators, Android Studio, 2013. [Online]
Available: https://developer.android.com/studio [Accessed: March 2019]
[41] Morrissey D., Subsampling Scale Image View, Github, 2017. [Online]
Available: https://github.com/davemorrissey/subsampling-scale-image-view [Accessed: April 2019]
[42] Rosebrock A., Intersection over Union (IoU) for object detection, pyimagesearch, 2016. [Online]
Available: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ [Accessed: April 2019]
[43] Burgués J., Recognition of musical symbols and notes, Github, 2019. [Online]
Available: https://github.com/jordiburgues/Recognition-of-musical-symbols-and-notes
Appendices
Appendix 1: Work Packages and Milestones
Project: Project documentation | WP ref: WP1
Major constituent: Document writing | Sheet 1 of 1
Short description: this part extends along the whole project and comprises the key deliverables to be produced during the work.
Planned start date: 14/2/2019 | Planned end date: 25/6/2019
Start event: 14/2/2019 | End event: 25/6/2019
Internal task T1: Project proposal and work plan
Internal task T2: Critical revision
Internal task T3: Final report
Deliverables: PP and WP (10/03/2019), CR (12/04/2019), FR (25/06/2019)
Project: Initial work and research | WP ref: WP2
Major constituent: Documentation | Sheet 1 of 1
Short description: this part focuses on introductory research, making an introduction to the TensorFlow framework and gathering papers regarding Optical Music Recognition.
Planned start date: 24/1/2019 | Planned end date: 21/2/2019
Start event: 24/1/2019 | End event: 21/2/2019
Internal task T1: Learn the basics of Python and the TensorFlow framework
Internal task T2: Train, validate and test a “toy example” using Convolutional Neural Networks
Internal task T3: Define main topic after OMR research
Deliverables: Fully convolutional classifier (30/1/2019), research information (7/2/2019)
Project: Datasets | WP ref: WP3
Major constituent: Research/Programming | Sheet 1 of 1
Short description: research on available datasets of printed music scores to be used as the input of the system.
Planned start date: 7/2/2019 | Planned end date: 7/3/2019
Start event: 7/2/2019 | End event: 7/3/2019
Internal task T1: Check availability of data, choose an initial dataset for the system
Internal task T2: Explore chosen data (parse dataset files)
Internal task T3: Implement pre-processing tasks if needed (re-scaling, cropping…)
Internal task T4: Prepare input CNN images and their annotations
Deliverables: Pre-processed dataset (21/2/2019)
Project: First object detection model | WP ref: WP4
Major constituent: Programming | Sheet 1 of 1
Short description: in this part, a pre-trained model will be used to train, validate and test a Convolutional Neural Network using the TensorFlow Object Detection API for musical symbol localization and recognition.
Planned start date: 7/2/2019 | Planned end date: 25/4/2019
Start event: 21/2/2019 | End event: 25/4/2019
Internal task T1: Object detection neural networks study (Fast/Faster Region-based CNN)
Internal task T2: TensorFlow Object Detection API study
Internal task T3: Current CNN implementation adaptation
Internal task T4: Training, validation and testing (performance, evaluation metrics, model tuning…)
Deliverables: Model evaluation (25/4/2019)
Project: Second object detection model | WP ref: WP5
Major constituent: Programming | Sheet 1 of 1
Short description: this part will consist of generating synthetic data to train a new model that predicts the location of music symbols along the staff lines, so that with the clef and key signature information the actual note can be guessed.
Planned start date: 7/3/2019 | Planned end date: 31/5/2019
Start event: 7/3/2019 | End event: 31/5/2019
Internal task T1: Generate synthetic data and its annotations
Internal task T2: Data augmentation on synthetic data
Internal task T3: Current CNN implementation adaptation
Internal task T4: Training and validation
Internal task T5: Testing (evaluation metrics)
Deliverables: Synthetic data, first version (14/3/2019); synthetic data, final version (25/4/2019); model evaluation (31/5/2019)
Project: User interface development and integration | WP ref: WP6
Major constituent: Programming | Sheet 1 of 1
Short description: this part will consist of building an app on top of the previous object detection model. The main idea is to construct a user interface that can play notes from a score after the user selects the clef and key signature. The detection results for a score will be loaded into the app beforehand, so it will work offline.
Planned start date: 7/3/2019 | Planned end date: 15/6/2019
Start event: 7/3/2019 | End event: 14/6/2019
Internal task T1: Research on app development with Android Studio
Internal task T2: Implementation (coding, interface…)
Internal task T3: Integration of each task (note detection, user selections) into the app
Deliverables: First version (2/5/2019), final version (14/6/2019)
Table 10. Work packages
Milestones:
WP# | Task# | Short title | Milestone / deliverable | Date (week)
1 | 1 | Project documentation | Proposal and work plan (document) | 2
3 | 4 | Datasets (existing and generated) | Images and annotations (Python files) | 3-10
1 | 1 | Project documentation | Critical revision (document) | 7
4 | 4 | First object detection model | Model evaluation (Python files) | 9
5 | 4 | Second object detection model | Model evaluation (Python files) | 10
6 | 2,3 | User interface (app) | Implementation (Java files) | 12-15
1 | 3 | Project documentation | Final report (document) | 18
Table 11. Milestones
Appendix 2: Test (real output examples)
Figure 49. Real test output from Figure 35
Figure 50. Real test output from Figure 39
Figure 51. Real test output from Figure 40
Figure 52. Real test output from Figure 43
Glossary
R-CNN. Region-based Convolutional Neural Network
CNN. Convolutional Neural Network
API. Application Programming Interface
OMR. Optical Music Recognition
VOC. Visual Object Class
WP. Work Package
FC. Fully connected
ReLU. Rectified Linear Unit
RoI. Region of Interest
XML. Extensible Markup Language
PDF. Portable Document Format
MIDI. Musical Instrument Digital Interface
PrIMuS. Printed Images of Music Staves
MEI. Music Encoding Initiative
PNG. Portable Network Graphics
RISM. Répertoire International des Sources Musicales
IoU. Intersection over Union
mAP. Mean Average Precision
DA. Data augmentation
TP. True positive
FP. False positive
FN. False negative
GPU. Graphics Processing Unit