Recognition of musical symbols in scores using neural networks
A Degree Thesis
Submitted to the Faculty of the
Escola Tècnica d'Enginyeria de Telecomunicació de Barcelona
Universitat Politècnica de Catalunya
by
Jordi Burgués Miró
In partial fulfilment
of the requirements for the degree in
AUDIOVISUAL SYSTEMS ENGINEERING
Advisors: Jaakko Lehtinen,
Josep Ramon Casas Pla
Barcelona, June 2019
Abstract
Object detection is nowadays present in many aspects of our lives. From security to entertainment, its applications play a key role in the worlds of computer vision and image processing.
This thesis addresses, through the use of an object detector, the creation of an application that allows its user to play a music score. The main goal is to display a digital music score and be able to play it by touching its notes.
In order to achieve the proposed system, deep learning techniques based on neural networks are
used to detect musical symbols from a digitized score and infer their position along the staff lines.
Different models and approaches are considered to tackle the main objective.
Resum
Nowadays, object detection is present in many aspects of our lives. From security-related applications to entertainment tools, object detection plays a key role in the world of computer vision and image processing.
This thesis addresses the creation of an application that allows its user to play a music score, through the use of an object detector. The main goal is to display a digital music score on screen and make it sound by touching its notes.
In order to achieve the proposed system, deep learning techniques based on neural networks are used to detect musical symbols in a score and find their position with respect to the staff lines. Different models and approaches have been considered to tackle the main objective.
Resumen
Nowadays, object detection is present in many aspects of our lives. From security-related applications to entertainment tools, object detection plays a key role in the world of computer vision and image processing.
This thesis addresses the creation of an application that allows its user to play a music score, by means of an object detector. The main goal is to display a digital music score on screen and make it sound by touching its notes.
In order to achieve the proposed system, deep learning techniques based on neural networks have been used to detect musical symbols in a score and find their position along the staff lines. Different models and approaches have been considered to achieve the main objective.
To everyone who made this possible
That you are here—that life exists and identity,
That the powerful play goes on, and you may
contribute a verse.
Walt Whitman
Acknowledgements
I would like to thank my thesis advisors, Jaakko Lehtinen and Josep Ramon Casas Pla, for their guidance during the different stages of this project.
Revision history and approval record
Revision Date Purpose
0 12/05/2019 Document creation
1 3/06/2019 Document revision
2 24/06/2019 Final document revision
DOCUMENT DISTRIBUTION LIST
Name e-mail
Jordi Burgués [email protected]
Jaakko Lehtinen [email protected]
Josep Ramon Casas Pla [email protected]
Written by: Jordi Burgués (Project Author), 24/06/2019
Reviewed and approved by: Josep Ramon Casas (Project Supervisor), 24/06/2019
Table of contents
Abstract ............................................................................................................................................. 1
Resum ................................................................................................................................................ 2
Resumen ............................................................................................................................................ 3
Acknowledgements ........................................................................................................................... 5
Revision history and approval record ............................................................................................... 6
Table of contents ............................................................................................................................... 7
List of Figures .................................................................................................................................... 9
List of Tables ................................................................................................... 11
1. Introduction............................................................................................................................. 12
1.1. Statement of purpose ..................................................................................................... 13
1.2. Requirements and specifications .................................................................................... 13
1.3. Methods and procedures ................................................................................................ 14
1.4. Work plan ........................................................................................................................ 14
1.5. Incidents and modifications .......................................................................... 15
2. State of the art ........................................................................................................................ 16
2.1. Optical Music Recognition ............................................................................................... 16
2.2. State-of-the-art object detection algorithms .................................................................. 16
2.2.1. Convolutional Neural Networks (CNNs) and Region-based CNNs (R-CNNs) ........... 17
2.2.1.1. R-CNN ..................................................................................................................... 18
2.2.1.2. Fast R-CNN .............................................................................................................. 19
2.2.1.3. Faster R-CNN .......................................................................................................... 19
2.3. Datasets for OMR systems .............................................................................................. 19
3. Methodology ........................................................................................................................... 22
3.1. Data preparation ............................................................................................................. 22
3.1.1. Musical symbol dataset ........................................................................................... 22
3.1.2. Synthetic data generation with Python ................................................................... 23
3.2. Object detector ............................................................................................................... 25
3.2.1. Tensorflow Object Detection API ............................................................................ 25
3.2.1.1. Training and validation ........................................................................................... 26
3.2.1.2. Testing .................................................................................................................... 27
3.3. App integration ............................................................................................................... 29
4. Results ..................................................................................................................................... 31
4.1. Evaluation metrics ........................................................................................................... 31
4.2. Training, validation and testing of models ...................................................................... 32
4.2.1. First object detection model ................................................................................... 32
4.2.2. Second object detection model .............................................................................. 34
4.3. Application ...................................................................................................................... 40
5. Budget ..................................................................................................................................... 42
6. Conclusions and future development ..................................................................................... 43
Bibliography .................................................................................................................................... 44
Appendices ...................................................................................................................................... 48
Glossary ........................................................................................................................................... 54
List of Figures
Figure 1. Object detection example ................................................................................................ 12
Figure 2. Work packages breakdown structure .............................................................................. 14
Figure 3. Gantt diagram .................................................................................................................. 15
Figure 4. OMR system structure...................................................................................................... 16
Figure 5. CNN architecture .............................................................................................................. 17
Figure 6. Convolution operation ..................................................................................................... 17
Figure 7. MNIST and CIFAR-10 pictures example ............................................................................ 17
Figure 8. Region proposals example ............................................................................................... 18
Figure 9. R-CNN working stages ...................................................................................................... 18
Figure 10. Fast R-CNN architecture ................................................................................................. 19
Figure 11. OMR datasets example .................................................................................................. 20
Figure 12. Bounding boxes example ............................................................................................... 20
Figure 13. Incipit from PrIMuS dataset ........................................................................................... 21
Figure 14. Object detector input (left) and output (right) .............................................................. 21
Figure 15. Breakdown structure of the main parts of the project .................................................. 22
Figure 16. DeepScores classes (left) and a crop from one of the dataset scores (right) ................ 22
Figure 17. PASCAL-VOC format example......................................................................................... 23
Figure 18. Bounding boxes example ............................................................................................... 23
Figure 19. Fragments from rendered music scores ........................................................................ 23
Figure 20. Mapped positions along the staff lines .......................................................................... 24
Figure 21. Original bounding boxes (left) and flattened bounding boxes (right) ........................... 24
Figure 22. Data augmentation examples ........................................................................................ 25
Figure 23. Object detector scheme ................................................................................................. 25
Figure 24. Random crop example ................................................................................................... 26
Figure 25. Training (a) and validation (b) processes going on ........................................................ 27
Figure 26. Object detector input (left) and output (right) example ............................................... 27
Figure 27. Different predicted class in two random crops .............................................................. 27
Figure 28. Non-overlapping crops problem .................................................................................... 28
Figure 29. Non-random overlapping crops ..................................................................................... 28
Figure 30. XML file fragment created after testing ......................................................................... 29
Figure 31. Color map example ........................................................................................................ 29
Figure 32. Application steps ............................................................................................................ 30
Figure 33. Original crop (left) and during validation (right) ............................................................ 33
Figure 34. mAP@0.5 graph examples ....................................................................................... 33
Figure 35. Original test images (left) and detections made by the model (right) ........................... 33
Figure 36. "Notes over notes" areas ............................................................................................... 34
Figure 37. Validation set examples from training A (left) and B (right) .......................................... 35
Figure 38. Testing difference between trainings A and B ............................................................... 35
Figure 39. Ground truth (top) and test output examples from trainings B (left) and C (right) ....... 36
Figure 40. Ground truth (top) and test output examples from trainings B (left) and C (right) ....... 36
Figure 41. Validation set examples in low (left) and high (right) density areas .............................. 36
Figure 42. Validation set examples from trainings C (left) and D (right) ........................................ 37
Figure 43. Ground truth (top) and test output examples from trainings C (left) and D (right) ...... 37
Figure 44. Remember me (left), Mia and Sebastian (center) and La La Land intro (right) examples
......................................................................................................................................................... 38
Figure 45. Initial and score selection screens ................................................................................. 40
Figure 46. Transition screen with brief description of the selected music score ........................... 40
Figure 47. Playable score screen with key signature and clef setup ............................................... 41
Figure 48. Informative messages from key signature and clef setup ............................................. 41
Figure 49. Real test output from Figure 35 ..................................................................................... 52
Figure 50. Real test output from Figure 39 ..................................................................................... 52
Figure 51. Real test output from Figure 40 ..................................................................................... 52
Figure 52. Real test output from Figure 43 ..................................................................................... 53
List of Tables
Table 1. Color mapping ................................................................................................................... 29
Table 2. First object detection model trainings .............................................................................. 32
Table 3. mAP@0.5 in the validation set .................................................................................... 32
Table 4. Second object detection model trainings .......................................................................... 34
Table 5. mAP@0.5 in the validation set ..................................................................................... 34
Table 6. Evaluation metrics for different real tested music scores................................................. 38
Table 7. Hardware resources costs ................................................................................................. 42
Table 8. Software resources costs ................................................................................................... 42
Table 9. Human resources costs ..................................................................................................... 42
Table 10. Work packages ................................................................................................................ 51
Table 11. Milestones ....................................................................................................................... 51
1. Introduction
Nowadays, object detection plays a key role in a variety of applications and is one of the rising technologies in the fields of computer vision and image processing. The need to detect objects for different purposes is present in many useful systems, such as video surveillance, self-driving cars, handwritten digit recognition, and face detection. Most of these systems provide a service that is noticeably less time-consuming than if the task were carried out by humans.
The primary goal of object detection is to localize a certain object in a given picture and mark it with the appropriate category. In Figure 1 we can see a very common example of one of its applications, in which cars, trucks and people are detected during a traffic jam [1].
Modern object detection is achieved with machine learning approaches known as Region-based Convolutional Neural Networks (R-CNNs, [2]), detailed in the following sections. The newest techniques analyze the input images directly, in contrast to older approaches that relied on segmentation and pre-processing algorithms to carry out these tasks.
This work presents an object detection-based system structured in two different models that, respectively, identify musical symbols and the positions of music figures along the staff lines in digital music scores. Previous work has been carried out in the field, such as that of Pacha et al. [3]. However, our project adopts an approach to note detection different from those of related work.
In order to accomplish the main objectives, parts of the Tensorflow Object Detection API open source framework [4, 5] (Tensorflow creators, 2017) are adapted to the different models considered in this work.
All in all, this thesis presents an end-to-end system that comprises the creation of a user-accessible prototype application showing the achieved object detection results on music scores, as an entertainment and demonstration tool.
Figure 1. Object detection example
1.1. Statement of purpose
The main idea of this project is to build a prototype consisting of an accessible interface that allows its user to play notes from digital music scores. The application acts as a platform with a collection of music scores that can be played by touching the screen of the device on which it runs.
This project tackles one of the stages that make up an Optical Music Recognition (OMR) system, whose main objective is to turn a digitized score into a machine-readable format. An OMR process involves recognizing the notes in a score so that they can later be played, and this is the main task undertaken in this project.
The primary goal of this thesis is the creation of a new deep learning-based model to detect the positions along the staff lines where the different musical figures lie in a score. Once detection has been carried out, a mapping between each detected position and the corresponding note is performed. The whole system is integrated so that the results can be interactively checked offline (detection is not performed live) through the final application.
The whole detection system is based on an object detection model built with the Tensorflow deep learning framework that uses the so-called Region-based Convolutional Neural Networks (R-CNNs). Several R-CNN-based object detection models exist, and one of them has been chosen to accomplish the final objective. The core work is structured in two different models, covering pure symbol detection and position detection. Both models use a neural network pre-trained on another dataset, provided by Tensorflow.
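As a rough illustration of the position-to-note mapping described above, a detected vertical position relative to the staff lines can be translated into a pitch with a simple lookup table. The indices and note names below are illustrative (they assume a treble clef) and are not the actual encoding used in the project:

```python
# Hypothetical mapping from detected staff position to pitch (treble clef).
# Position 0 = bottom staff line, counting lines and spaces upwards.
TREBLE_POSITIONS = ["E4", "F4", "G4", "A4", "B4", "C5", "D5", "E5", "F5"]

def position_to_note(position):
    """Translate a detected staff-position index into a note name."""
    if 0 <= position < len(TREBLE_POSITIONS):
        return TREBLE_POSITIONS[position]
    return None  # ledger lines would need an extended table

print(position_to_note(0))  # E4 (bottom line of the treble staff)
print(position_to_note(4))  # B4 (middle line)
```

Accidentals, clef changes and ledger lines would require a richer mapping, which is why the final application also handles key signature and clef setup.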
1.2. Requirements and specifications
This project considers a series of requirements to be fulfilled during its development. The most important is to adapt a pre-trained model that takes a digital image of a music score as input and outputs the same image with bounding boxes highlighting the detected musical symbols or positions (for the sake of visualization).
Regarding the evaluation of the system, training, validation and testing are performed on a chosen or generated dataset, with satisfactory results defined as at least 75% precision on the validation set. Once this is achieved, a user-accessible interface is designed and implemented that uses the system's capabilities for both detection and recognition of positions in digital music scores. The app should run without delays and deliver instant playing of the desired music scores.
As for the specifications, the Python and Java programming languages are used, as well as the Tensorflow deep learning framework. These resources are used to train the chosen pre-trained model from scratch. In addition, the generation of a suitable dataset for the project is considered, with splits for training, validation and testing along with their annotations (in PASCAL-VOC format).
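For reference, a PASCAL-VOC annotation is an XML file per image with one `object` entry per bounding box. The sketch below builds such a file with Python's standard library; the filename and class name are hypothetical examples, not taken from the actual dataset:

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, objects):
    """Build a minimal PASCAL-VOC style annotation tree.
    `objects` is a list of (class_name, xmin, ymin, xmax, ymax)."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(val)
    for name, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin),
                         ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ann

# One notehead annotated on a rendered score fragment (illustrative values)
tree = voc_annotation("score_0001.png", 1200, 300,
                      [("noteheadBlack", 412, 118, 430, 136)])
xml_text = ET.tostring(tree, encoding="unicode")
```

A file like this is produced for every generated score image, so the detector's input pipeline can read boxes and labels in a standard format.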
1.3. Methods and procedures
This work covers the research and creation of suitable datasets to tackle the main problem, using the resources provided by the Tensorflow Object Detection API, which is explained in further detail in the following sections. In the end, the detection results offered by the API are integrated into a user-accessible application that acts as a platform for showing the results.
The different parts of the project have been developed in Python and Java, used for the object detection tasks and for app development respectively. The Tensorflow deep learning framework has been used for the object detection tasks, which rely on pre-trained Faster Region-based Convolutional Neural Networks (Faster R-CNNs).
The pre-trained models have been trained from scratch on an NVIDIA Quadro P500 GPU.
All the work has been developed at the School of Science of Aalto University in Espoo, Finland.
1.4. Work plan
Figure 2 shows the breakdown structure followed in this project, divided into six different work packages. The specification of every WP can be found in Appendix 1 in the Appendices section. In addition, Figure 3 shows the Gantt diagram with the final schedule followed during the project.
Figure 2. Work packages breakdown structure
Figure 3. Gantt diagram
1.5. Incidents and modifications
There have been no major incidents to report. The main modifications have only affected the schedule of the different work packages, which in some cases took longer than planned. However, the original plan was conceived with potential incidents in mind, so, overall, the project has not been affected by these delays.
2. State of the art
This part summarizes the literature behind the topic of Optical Music Recognition (OMR): from a brief description of what an OMR system is, to the neural networks used for modern object detection, as well as examples of related work.
2.1. Optical Music Recognition
An Optical Music Recognition system is a process that aims to turn a digitized score into a machine-readable format [6]. Several approaches have been considered to transform a music score into a playable file [7], [8], and one of the challenges when building this kind of system is how to interpret sheet music and understand the musical context. Figure 4 shows a typical pipeline for an OMR process [8].
Figure 4. OMR system structure
As seen in Figure 4, an OMR system takes a digitized music score as input. In order to improve the quality of the input image, the most common pre-processing tasks rely on enhancement, noise removal or morphological operations. Regarding symbol recognition and classification, one of the main techniques traditionally used consists of isolating the primitive elements through image segmentation or through detection and removal of the staff lines [9]. The last steps combine the recognized musical symbols and, along with musical context rules, turn all the information into a final representation to be played by a machine.
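As a minimal illustration of the pre-processing stage, a grayscale scan can be binarized with a global threshold so that ink pixels are separated from the paper background. This is only a sketch, and the threshold value is an assumption; real systems typically use adaptive methods (e.g. Otsu's) and proper noise removal:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Global-threshold binarization: ink (dark pixels) -> 1, paper -> 0."""
    return (gray < threshold).astype(np.uint8)

# A toy 4x4 "scan": one dark horizontal line on a light background
scan = np.array([[250, 250, 250, 250],
                 [ 30,  40,  35,  25],   # a dark line (e.g. a staff line)
                 [250, 250, 250, 250],
                 [250, 120, 250, 250]])  # one speck of noise
binary = binarize(scan)
print(binary.sum())   # 5 ink pixels (the line plus the speck)
```

Subsequent morphological operations would then remove the isolated speck while preserving the connected line.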
However, current results from OMR systems are far from ideal, and new techniques are arising in the field of object detection with the appearance of powerful algorithms based on Region-based Convolutional Neural Networks (R-CNNs, [2]).
2.2. State-of-the-art object detection algorithms
This section covers the most important aspects of the neural networks currently used for state-of-the-art object detection. The primary difference from traditional object detection is that no segmentation is performed on the input image: the neural network analyzes the whole image directly (i.e. it is fed the actual pixel values).
Figure 6. Convolution operation
2.2.1. Convolutional Neural Networks (CNNs) and Region-based CNNs (R-CNNs)
Object detection mainly relies on Convolutional Neural Networks (CNNs). A CNN is a multi-layer deep neural network [10] consisting of convolutional layers with filters (known as kernels), fully connected (FC) layers and non-linear functions (such as ReLU), which classifies an object with probability values between 0 and 1. Figure 5 shows the architecture of a CNN [11].
Convolutional layers are used to extract features from the input image. The operation performed here is a convolution between a small portion of the input image (I) and a filter (K), producing a feature map [12]. Figure 6 shows the procedure followed in this layer. After this, a non-linear function is applied, because a purely linear model would lead to very poor learning outcomes.
After the non-linearity is applied, a pooling layer is used to reduce dimensionality when input images are too large. Finally, a fully connected (FC) layer flattens the feature map into a vector and the output layer predicts the class, using logistic regression with cost functions to classify the images.
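The steps above can be sketched in a few lines: a "valid" convolution of an input I with a kernel K (implemented, as in CNN practice, as cross-correlation) followed by a ReLU non-linearity. This is a plain numpy illustration, not the network used in this work:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in CNN layers):
    slide the kernel K over the input I and sum elementwise products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # "valid" output size
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A 5x5 input and a 3x3 averaging kernel give a 3x3 feature map
I = np.arange(25, dtype=float).reshape(5, 5)
K = np.ones((3, 3)) / 9.0
feature_map = conv2d(I, K)
relu = np.maximum(feature_map, 0)       # the non-linearity applied afterwards
print(feature_map.shape)                # (3, 3)
```

Stacking many such filtered-and-rectified maps, then pooling and flattening them, yields the vector that the FC layer classifies.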
It is important to highlight the difference between a CNN and an R-CNN. A CNN is usually used for image classification, involving only one object per image. For this reason, CNNs alone cannot address multiple object detection, since the number of appearances of the objects of interest is not fixed. In Figure 7 we can see two of the most common examples related to CNN classifiers: handwritten digits and object classification, on the MNIST [13], [14] and CIFAR-10 [15] datasets respectively. The images contain a single object of interest occupying almost all the space in the picture. Because of this, the main target of a CNN is to assess what kind of object is present in an image, but not where it is.
Figure 5. CNN architecture
Figure 7. MNIST and CIFAR-10 pictures example
In contrast, an R-CNN passes several region proposals through a CNN [2]. These region proposals contain the objects of interest that need to be classified, and since different objects can lie within different region proposals, they can all be detected at the same time. The usual outputs of an R-CNN are the bounding boxes highlighting the locations of the detected objects, along with their labels (predicted classes). In Figure 8 [16] we can see the process, showing the initial region proposals (blue) and the final bounding boxes corresponding to detected objects (green).
As a result, algorithms based on R-CNNs have been developed to speed up both the localization and the recognition of a series of objects. The following sections briefly explain the newest ones and how they work.
2.2.1.1. R-CNN
The first network of its kind gave these models their general name: Region-based Convolutional Neural Network (R-CNN, [17]). It is based on selective search, which consists of extracting a fixed number of region proposals from the original image. The algorithm behind an R-CNN is illustrated in Figure 9: a CNN is fed the extracted region proposals and obtains the most important features. These features are then fed into a classifier that evaluates the presence of the object in the current region proposal.
However, this first type of R-CNN presents several problems: a huge amount of training time (a fixed number of proposals has to be classified for each image), no real-time capability, since it takes more than 45 seconds to test an image, and no learning in the selective search stage, since it is a fixed algorithm (which can produce bad region proposals).
Figure 8. Region proposals example
Figure 9. R-CNN working stages
2.2.1.2. Fast R-CNN
With the objective of speeding up and simplifying the R-CNN architecture, a new type of network was created to improve test time [18].
The main difference with respect to its predecessor is that the whole input image is fed into the CNN and a convolutional feature map is generated. Region proposals are then inferred from this map and converted into fixed-size squares through a Region of Interest (RoI) pooling layer. A RoI pooling layer applies max pooling (down-sampling to reduce dimensionality) to its inputs (convolutional feature maps) and produces a small feature map of a fixed size (e.g. 7x7) that is fed into the fully connected (FC) layer.
This network is faster than a standard R-CNN because region proposals are not fed to the CNN one by one. Instead, the convolution is performed only once per image and a feature map is generated from it. Thanks to this, it takes only 2 seconds to test an image.
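The RoI pooling operation described above can be illustrated with a small numpy sketch: whatever the size of the region cropped from the feature map, the output has a fixed size so it can always be fed to the FC layer. The bin boundaries here are a simplified approximation of the real layer:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """RoI max pooling sketch: crop a region of interest from a
    convolutional feature map and max-pool it to a fixed output size."""
    y0, x0, y1, x1 = roi                      # region in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    oh, ow = output_size
    h, w = region.shape
    out = np.zeros((oh, ow))
    # Split the region into an oh x ow grid of (roughly equal) bins
    ys = np.linspace(0, h, oh + 1).astype(int)
    xs = np.linspace(0, w, ow + 1).astype(int)
    for i in range(oh):
        for j in range(ow):
            bin_ = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = bin_.max()
    return out

fmap = np.random.rand(32, 32)                  # a toy feature map
pooled = roi_max_pool(fmap, (4, 6, 25, 30))    # any RoI -> fixed 7x7 output
print(pooled.shape)                            # (7, 7)
```

Because the output size is fixed, the same FC layers can classify proposals of arbitrary shape.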
2.2.1.3. Faster R-CNN
Even though test time improved significantly from R-CNN to Fast R-CNN, both models use selective search to find region proposals. This algorithm is slow and time-consuming and limits the performance of the network. For this reason, Shaoqing Ren et al. [19] created an object detection algorithm that eliminates selective search and lets the network learn the region proposals.
As in the Fast R-CNN, the whole image is the input of the CNN and a convolutional feature map is created. The main difference is that a new network, called the Region Proposal Network (RPN), is used to predict the region proposals. The predicted regions are resized with a RoI pooling layer, which is then used to classify the image within each proposed region and to predict the values of the bounding boxes. In these networks, test time is around 0.2 seconds per image.
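Matching predicted boxes against ground truth, both when training the RPN and in evaluation metrics based on an IoU threshold, relies on the Intersection over Union of two bounding boxes. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((ax1 - ax0) * (ay1 - ay0)
             + (bx1 - bx0) * (by1 - by0) - inter)
    return inter / union if union else 0.0

# Identical boxes -> 1.0; disjoint boxes -> 0.0
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold (0.5 being the common choice).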
2.3. Datasets for OMR systems
There exist several available datasets containing whole labeled music scores that have been used in projects related to OMR tasks [20]. This section presents the most important ones directly related to the topic of this project.
It is important to highlight that most of the datasets traditionally used for OMR have targeted symbol classification thus far. As stated in previous sections, image segmentation has been one of the techniques used for musical symbol recognition. This is exemplified by the variety of datasets containing only musical symbols, such as Rebelo [21],
Figure 10. Fast R-CNN architecture
Fornes [22], Printed Music Symbols [23] or Music Score [24], for classification purposes. Figure 11 shows examples of the available data in this kind of dataset.
Nevertheless, focusing on state-of-the-art object detection, we are not interested in classifying symbols separately but in considering the whole score and applying object detection to the complete score at once. For this reason, it is important to search for available datasets that comply with the desired requirements and that have been used in related OMR work.
Regarding digital music scores, the largest public annotated dataset is DeepScores, formed by 300,000 high-resolution digital music scores and developed by Lukas Tuggener et al. [25]. The dataset was created with the goal of improving the recognition of small objects, since the noteheads of musical symbols are much smaller than the objects traditionally considered for recognition.
This dataset is intended for tasks such as symbol classification, object detection and semantic segmentation. Most of the scores are created from MusicXML files extracted from MuseScore (a free music notation software) and converted into readable sheet music. The scores are rendered in five different fonts to vary the visual appearance.
The dataset contains 124 different annotated classes, from the most common ones (noteheads, clefs, half notes) to more specific ones (staccato above, augmentation dots). Figure 12 shows an example of the different annotated musical symbols [25] in the form of red bounding boxes surrounding every symbol.
Following digital music scores, we have the aforementioned MuseScore platform [26]. It is a free music notation software hosting more than 340,000 playable digital music scores. The scores can be downloaded in several formats: the software's own format, PDF, MusicXML, MIDI or MP3.
Another, more recent dataset is Printed Images of Music Staves (PrIMuS, [27]), which contains more than 80,000 real-music excerpts. The dataset was created by Calvo-Zaragoza et al. and each score is available in several formats: rendered PNG, MIDI file, MEI file and two types of encoding (semantic and agnostic). The excerpts are taken from the Répertoire International des Sources Musicales (RISM, [28]) dataset, and their different formats contain not only information
Figure 11. OMR datasets example
Figure 12. Bounding boxes example
about the musical symbols present in the excerpt but also about the notes composing it. Figure 13 illustrates two of the formats in which every excerpt is provided.
As for handwritten annotated music scores, some datasets are also available, the best known being CVC-MUSCIMA [29]. It is formed by 1,000 music scores written by 50 different adult musicians; each writer transcribed the same music scores using the same pen and paper with printed staff lines. This dataset was created for writer identification and staff removal tasks involved in different OMR stages.
Previous object detection systems have been developed using this dataset. Figure 14 shows the input and output of a music object detector [8] on an incipit from the CVC-MUSCIMA dataset. The bounding boxes highlighting detected objects contain the class to which they correspond (e.g. stem, duration-dot) and the probability (in %) with which the symbol is classified into the predicted class.
Figure 14. Object detector input (left) and output (right)
Figure 13. Incipit from PrIMuS dataset
3. Methodology
This part illustrates the core work carried out in this thesis. The whole development is broken down into three main blocks forming an end-to-end system that can be described as follows: a selection of music scores is extracted from a chosen dataset in order to train an object detector. Once the training is done, new music scores (different from the training ones) are tested through the detector and the detection results are integrated in an interactive application, to be accessed from a portable device. It is important to highlight that the entire system refers to only one of the two object detection models used, which will be detailed in the next sections.
3.1. Data preparation
This section comprises the handling of the two datasets that feed the music object detector in the next stage: one of them already exists and the other is created upon certain premises. The datasets are used for the first and second object detection models respectively, although only the second model is fully integrated in the final application. The following lines explain the characteristics of every set of data, as well as which pre-processing tasks must be applied to the original music scores and why.
3.1.1. Musical symbol dataset
One of the models considered in this work will be used for pure musical symbol detection; that is to say, the main goal of the neural network is to localize and recognize different types of musical symbols. For this reason, the DeepScores dataset [25] has been chosen. It contains more than 300,000 labeled whole scores formed by a wide variety of elements. There are 124 labeled musical symbols (or classes) in total, some of which can be seen in Figure 16.
Figure 16. DeepScores classes (left) and a crop from one of the dataset scores (right)
Figure 15. Breakdown structure of the main parts of the project
Every music score is given in PNG format with a resolution of 2707x3828 pixels, together with an additional XML file containing the annotations (labels for each class) in PASCAL-VOC format [30]. This format holds different nodes, as illustrated in Figure 17, indicating the name of the labeled musical symbol as well as the absolute coordinates of the bounding box delimiting every object of interest.
We can extract valuable information from the annotations file in the form of bounding boxes showing the ground truth from which the neural network will learn. Figure 18 shows a crop from one of the dataset scores and its corresponding ground-truth bounding boxes.
The R-CNN in the next stage is going to perform supervised learning, so the annotations are essential and need to be used accordingly. Besides, the high resolution of the music scores may produce memory issues if a whole score is fed into the neural network. However, the scores can be randomly cropped to smaller sizes to avoid problems of this kind. This issue will be discussed again in the next stage.
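Reading such an annotation file is straightforward with Python's standard library. The sketch below (the class name `noteheadBlack` and the coordinates are only illustrative) extracts the label and bounding box of every annotated object:

```python
import xml.etree.ElementTree as ET

def parse_voc_annotations(xml_text):
    """Return (class name, (xmin, ymin, xmax, ymax)) pairs from a
    PASCAL-VOC annotation document."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        box = tuple(int(bb.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((name, box))
    return objects

sample = """<annotation>
  <object><name>noteheadBlack</name>
    <bndbox><xmin>120</xmin><ymin>340</ymin><xmax>138</xmax><ymax>356</ymax></bndbox>
  </object>
</annotation>"""
print(parse_voc_annotations(sample))  # [('noteheadBlack', (120, 340, 138, 356))]
```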
3.1.2. Synthetic data generation with Python
So far, this project has only considered datasets and neural network-based models oriented to pure musical symbol detection. However, as previously stated, we are interested in developing a system that tells us which notes compose a score and where they are located. For this reason, we have created a synthetic dataset using Python, where a variety of musical symbols and staff lines are rendered randomly, creating synthetic sheet music.
The main purpose is to feed the object detector with realistic and non-realistic scores containing a lot of valuable annotated information. Besides, since we want the neural network to learn the positions along the staff lines, the resulting scores have no musical context at all; everything is generated randomly, as can be appreciated in Figure 19.
Figure 17. PASCAL-VOC format example
Figure 18. Bounding boxes example
Figure 19. Fragments from rendered music scores
The strategy followed to create this new dataset consists of conceiving the music scores in such a way that, when a musical symbol is rendered along the staff lines, it has an associated position according to the mapping presented in Figure 20.
Figure 20. Mapped positions along the staff lines
In other words, the central musical note C marked in the Figure (Do/C4), taking the G (Sol) clef as the reference clef, corresponds to the first mapped position. The mapping is formed by twenty musical notes in ascending order starting from central C and seven more musical notes in descending order below central C.
As can be deduced from Figure 20, there are twenty-seven mapped positions in total, so not all possible positions along the staff lines are mapped. Some of the highest and lowest-pitched musical notes have been left out due to their low appearance frequency in the scores used to test the object detector in the next stage.
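Assuming one diatonic step per mapped position (the exact numbering and range are defined by Figure 20, so the layout below is an assumption; note also that in the thesis numbering central C is the first position, whereas this helper simply enumerates the note names from bottom to top), the twenty-seven note names can be generated programmatically:

```python
LETTERS = "CDEFGAB"

def note(steps_from_c4):
    """Note name a given number of diatonic steps away from central C (C4)."""
    octave, degree = divmod(steps_from_c4, 7)  # floor division handles negatives
    return f"{LETTERS[degree]}{4 + octave}"

def mapped_positions():
    """Seven notes below central C plus twenty notes from central C upwards."""
    descending = [note(-i) for i in range(7, 0, -1)]
    ascending = [note(i) for i in range(20)]
    return descending + ascending
```

With this layout, `mapped_positions()` yields 27 names running from C3 up to A6, with central C (C4) in eighth place.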
Regarding implementation, the synthetic scores have been created with the Pillow imaging library [31] for Python, which allows compositing and creating images of different kinds. The algorithm is based on very simple operations: grayscale RGBA musical symbol images are pasted at varying y-coordinates along the x-axis (the direction in which the staff lines are drawn). Each variation of a certain number of pixels in the y-axis corresponds to a new musical note.
As for the annotations, once a new musical symbol has been rendered, the original bounding box (corresponding to the dimensions of the original musical symbol image) is resized around the notehead, half note or whole note, as can be seen in Figure 21. While the score is being rendered, a 2707x3828 PNG file is created together with an XML file following the previously mentioned PASCAL-VOC format (as in the DeepScores dataset).
Other symbols, such as rests and bar lines, are rendered but not annotated, so the neural network in the next stage sees them but is expected to ignore them. The purpose of these filler symbols is to emulate the appearance of real sheet music.
Figure 21. Original bounding boxes (left) and flattened bounding boxes (right)
In order to increase the robustness of the system, data is augmented so that variations of the original rendered scores are also fed into the neural network. Data has been augmented using the imgaug library [32]; Figure 22 shows some of the applied variations, such as shear (an affine transformation), impulse noise and Gaussian blurring, among other transformations.
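As a minimal stand-in for one of those augmenters, the impulse-noise step can be sketched in plain NumPy (imgaug provides this and the other transformations as ready-made, configurable augmenters; this is only to illustrate the idea):

```python
import numpy as np

def impulse_noise(image, fraction=0.02, rng=None):
    """Replace a random fraction of pixels with 0 or 255 (salt-and-pepper
    style), leaving the original image untouched."""
    rng = rng or np.random.default_rng(0)
    noisy = image.copy()
    mask = rng.random(image.shape) < fraction
    noisy[mask] = rng.choice([0, 255], size=int(mask.sum()))
    return noisy
```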
The code that generates synthetic data can be found in [43].
3.2. Object detector
This section describes the object detector used to perform detection and recognition tasks in music scores. It is mainly formed by a pre-trained neural network from the Tensorflow deep learning framework and is trained through one of its APIs.
3.2.1. Tensorflow Object Detection API
As mentioned at the beginning of this document, this work uses the Tensorflow Object Detection API [5] for the detection and recognition tasks in music scores needed for the proposed objective. In order to adapt its resources to the current problem, it is important to define the structure of the API, how data is handled by it, and all the necessary modifications to be made.
The object detector is composed of the blocks depicted in Figure 23. As can be seen, the API is the main element of the whole detection system.
In order to fit our data into the API, the input images and their annotations must be converted into a format readable by the framework, called Tensorflow records (TFRecords, [33]). This format is specific to the library and stores data as a sequence of binary strings. Besides, binary data takes up less space than the original data and can be read more efficiently.
As previously stated, we have used an off-the-shelf pre-trained model offered by Tensorflow [34], detailed in the next section, that needs to be adapted to the different datasets of the two proposed object detection models. After the adaptation, training and validation are run at the same time. Besides, training and validation can be monitored live using TensorBoard [35]. This
Figure 22. Data augmentation examples
Figure 23. Object detector scheme
platform allows visualizing different evaluation metrics for both sets while the network is being trained.
Before testing the detector, what is known as the inference graph is exported and used to perform object detection in the test part. Testing is carried out on new music scores, different from the ones used for training and validation. The output reflects the bounding boxes predicted by the neural network, along with their labels and confidence scores.
3.2.1.1. Training and validation
The chosen model is a Faster R-CNN with ResNet101 architecture (a concept not developed further in this thesis), pre-trained on the Oxford-IIIT Pet Dataset. Since it is very important to make sure the neural network handles the input images correctly, whole music scores cannot be fed into the network directly. Here we introduce one of the most important parameters of convolutional neural networks: the receptive field. This parameter is defined as the region of the input space that a CNN feature is looking at (i.e. affected by) [36]. Without entering into a complex technical explanation, this parameter is important because input images cannot be larger than it. In the case of a Faster R-CNN, the receptive field has dimensions of 1027x1027 [37]. Because of this, and given the dimensions of the original and rendered music scores, the initial crop size used for the training and validation images is 1000x700 pixels. Consequently, training and validation images are obtained by randomly cropping the original scores into new images of the specified size and creating the corresponding new annotation XML files. Figure 24 illustrates an example of a random crop.
Before proceeding, a label map must be created containing the names of all the possible annotated objects in the training data. Besides, setting up the pre-trained models requires a config file, which at this stage needs to be adapted to the current model. The neural network parameters can be changed, or training can be performed with the default configuration. This will be further detailed in the Results section.
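For reference, a label map for this API is a small text file in protobuf text format; a fragment of the following shape would be used (the class names here are hypothetical, since they depend on the annotated dataset):

```
item {
  id: 1
  name: 'position1'
}
item {
  id: 2
  name: 'position2'
}
```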
Once the TFRecords have been created and the configuration is correctly set up, the training and validation phase can start. It is key to emphasize that validation images are never seen by the neural network during training; they are used to measure how well the model is performing while it is being trained. While training, checkpoints are saved so that the model can be validated every few minutes.
Figure 24. Random crop example
3.2.1.2. Testing
In order to test the object detector, different approaches to evaluate its performance have been considered. Every time a test image passes through the network, a new image is created containing the bounding boxes of the detected objects, the class predicted by the network and the probability (or confidence, in %) with which each detected object has been classified into that class. Figure 26 illustrates an input and output example after testing an object detector with images from the CVC-MUSCIMA dataset, from the work by Pacha et al. [38].
For training and validation, images were created as random crops. However, in the testing part we want to assess how well the model performs, so the way input images are created can yield very different results:
• Random crops (overlapping or not): in the case of random overlapping crops, the same object passes through the network more than once, and many crops are needed to cover the whole music score. One advantage of this approach is that objects not detected in one crop may be detected in another.
The fact that an object can be seen more than once by the network leads to different problems. An object may be classified into two different classes depending on the crop, as depicted in Figure 27, where the same part of the image is present in two different crops and the highlighted predicted class of the circled objects differs from one to the other. Besides, when gathering the final detection results, a huge number of bounding boxes may surround every detected object. These bounding boxes cannot be unified into one, since the last detection might be wrong and all previous detections right.
Figure 25. Training (a) and validation (b) processes going on
Figure 26. Object detector input (left) and output (right) example
Figure 27. Different predicted class in two random crops
• Non-overlapping crops: in this case, the network sees the whole music score only once, by tiling it into non-overlapping crops. The main drawback of this technique is that the network sees every individual object only once, so objects going undetected in some crops is likely. Besides, another issue must be considered: creating the crops without overlap can "cut" the original score in the middle of musical notes or symbols of interest. In this situation, the object cannot be detected at all. Figure 28 shows an example illustrating this problem.
Figure 28. Non-overlapping crops problem
• Non-random overlapping crops: this approach uses the same technique as the overlapping crops, but the new images are not created randomly; instead, consecutive images share a certain number of overlapping pixels, as depicted in Figure 29.
Figure 29. Non-random overlapping crops
With this technique, musical notes and symbols may be seen more than once by the network (around two or three times at most). The difference from the random overlapping case is that detecting the same object wrongly in different crops is much less probable, because the location of a musical note or symbol does not change significantly (in terms of pixels) as it does with random overlapping crops.
The music scores used for testing are obtained from the MuseScore platform and do not contain any labels. In the Results section, some evaluation metrics for the second object detection model are computed manually to assess how well it performs on real music scores.
Overall, the detection script provided by the Tensorflow creators has been modified to obtain the desired output: a file with the predicted classes. All the modified files concerning the API can be found in [43].
Table 1. Color mapping
3.3. App integration
The final objective of this thesis is to create a user-accessible application that allows the user to play a score by touching its notes. This section presents the creation of an interface that allows the user to do so.
In the application, only the second object detection model is considered, since it is the one that detects the positions of musical notes along the staff lines. With the positions predicted by the neural network, a mapping is performed between every assigned position and its corresponding musical note, as stated before.
The most important aspect is how detection results are handled by the application. After the testing part, an XML file is created for every test image containing information about the detections, as Figure 30 shows (coordinates are relative to the size of the test crop).
In order to gather all detection results belonging to the same music score into one file, a color is assigned to every detected position according to Table 1. While the XML files are being read, a new image of the same size as the tested music score is created. In this new image, the bounding boxes of the detections are painted with the color associated to the detected position, creating a color map. As an example, Figure 31 shows a real tested music score and its corresponding color map.
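The color-map construction can be sketched as follows. The concrete red values used here are hypothetical, since the real correspondence is the one defined in Table 1:

```python
import numpy as np

# Hypothetical color per mapped position: a unique red value identifies the
# position, and green/blue are left at zero (the real values come from Table 1).
POSITION_COLOR = {pos: (10 * pos, 0, 0) for pos in range(1, 28)}

def paint_color_map(score_shape, detections):
    """Build an RGB image the size of the tested score in which every detected
    bounding box is filled with the color of its predicted position."""
    h, w = score_shape
    color_map = np.zeros((h, w, 3), dtype=np.uint8)
    for position, (xmin, ymin, xmax, ymax) in detections:
        color_map[ymin:ymax, xmin:xmax] = POSITION_COLOR[position]
    return color_map
```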
Figure 30. XML file fragment created after testing
Figure 31. Color map example
After the color map has been created, we can proceed to describe the application itself. The interface is composed of different screens that allow the user to select one of the available tested music scores, set up the clef and key signature, and play it. Figure 32 shows the scheme of the app's operation.
Figure 32. Application steps
When the user touches the screen, the implemented algorithm checks which color is present at that point of the color map corresponding to the selected music score. According to the color and the previous settings, the corresponding musical note is played instantly. The musical notes are piano samples extracted from the Electronic Music Studios [39] database.
In terms of implementation, the application has been programmed in Java using Android Studio [40]. A list of scores is displayed on screen once the app has started. When the pertinent selections have been made, the score to be played is displayed as a bitmap using the Subsampling Scale Image View library created by Dave Morrissey [41]. This library loads a subsampled version of the image every time the user zooms or pans around the screen. It also provides the coordinates, relative to the size of the current image, of any touch event.
The color map is also read as a bitmap with the same dimensions as the displayed music score. Each color in the correspondence between colors and musical notes has a unique red value (within its RGB triplet). For this reason, only the red value of the touched area is checked every time a musical note is to be played. After the user presses the desired note, the corresponding tone is played instantly.
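The lookup performed on every touch event can be sketched like this (in Python rather than the app's Java, and with hypothetical red values standing in for Table 1):

```python
# Hypothetical inverse mapping: a unique red value identifies a mapped
# position; the position plus the user-selected clef and key signature then
# determine which note to play.
RED_TO_POSITION = {10 * pos: pos for pos in range(1, 28)}

def note_for_touch(pixel_rgb, position_to_note):
    """Return the note name for the touched color-map pixel, or None for
    empty (black) areas where no detection was painted."""
    position = RED_TO_POSITION.get(pixel_rgb[0])
    return position_to_note.get(position) if position is not None else None
```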
The implemented code related to the application can be found in [43].
4. Results
This section comprises the outcomes of training, validating and testing the two models presented in previous sections. Besides, it shows what the final application looks like and its main functionalities.
4.1. Evaluation metrics
In order to understand the evaluation metrics used with the models, their definitions are given in the following equations:
Precision = TP / (TP + FP)   (1)
Recall = TP / (TP + FN)   (2)
Miss Rate = FN / (FN + TP)   (3)
IoU = (A1 ∩ A2) / (A1 ∪ A2)   (4)
where TP = True Positive, FP = False Positive, FN = False Negative, and A1, A2 correspond to the areas of the predicted and original bounding boxes of the detected objects.
Precision measures how accurate the predictions are by computing the proportion of positive detections that are correct, whereas recall computes the proportion of actual positives identified correctly. The miss rate, or misclassification rate, computes the proportion of objects not detected or detected wrongly with respect to the sum of these and the positive detections.
IoU [42] measures the overlap of the boundaries of the two bounding boxes: the predicted one and the original one. IoU has been computed on the validation sets of all trainings with a threshold of 0.5: if the IoU is greater than or equal to 0.5, the prediction is considered positive.
Besides, mean Average Precision (mAP, [30]) has been used to assess the performance of the model on the validation set of every training. The mAP is computed as the average of the Average Precision (AP) of each class, defined in equation (5):
AP = ∫₀¹ p(r) dr   (5)
For a specific class, the precision-recall curve is computed, and AP is the area under it. In practice, AP is calculated as the mean precision at a set of eleven equally spaced recall levels (from 0.0 to 1.0), as equation (6) reflects:
AP = (1/11) Σ_{r ∈ {0.0, 0.1, …, 1.0}} AP_r   (6)
The precision at each recall level r (AP_r) is interpolated by taking the maximum precision measured at any recall that equals or exceeds r:
AP = (1/11) Σ_{r ∈ {0.0, 0.1, …, 1.0}} p_interp(r),  where  p_interp(r) = max_{r̃ ≥ r} p(r̃)   (7)
All in all, AP is computed for each validated class of the model, and mAP is the average over all of them. In this section, mAP is computed with IoU ≥ 0.5 on the validation sets of the different trainings performed in this work.
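The eleven-point interpolated AP of equations (6) and (7) can be computed from a list of measured (precision, recall) points as follows:

```python
def average_precision_11pt(precisions, recalls):
    """11-point interpolated AP: at each recall level r in {0.0, 0.1, ..., 1.0}
    take the maximum precision whose recall is at least r, then average."""
    interpolated = []
    for i in range(11):
        r = i / 10
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11
```

For instance, a detector with precision 1.0 up to recall 0.5 and precision 0.5 at recall 1.0 obtains AP = (6·1.0 + 5·0.5)/11 ≈ 0.77.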
4.2. Training, validation and testing of models
This section shows the different trainings performed with the presented models. Regarding validation, examples evaluated in TensorBoard are included to illustrate graphically the different problems faced. Images used for validation share similar characteristics with the training ones but are never seen by the network during training. The main advantage is that TensorBoard allows checking the predictions made on the validation set while the model is being trained. The data partition used for every training consists of 70% of the data for training and 30% for validation.
As for the testing, we have used non-random overlapping crops due to the problems of random overlapping and non-overlapping crops. These crops are extracted from eight real music scores that are not labeled. For this reason, evaluation metrics for the second object detection model are computed manually on a selection of music scores.
4.2.1. First object detection model
In order to train, validate and test both object detection models, several trainings have been carried out. It is important to remark that the first object detection model serves as a guideline for creating the second one. Therefore, it has only been used as a "toy example" to figure out how the Tensorflow Object Detection API works, since it is not integrated in the final application. Table 2 shows some of the main specifications of the different trainings performed with the first object detection model.
Experiment   Scores   Crops/score   Crop size   Training images   Validation images   Configuration
A            30       40            1000x700    840               360                 Default
B            100      40            1000x700    4368              1872                Default
C            548      60            500x350     23016             9864                Default

Table 2. First object detection model trainings
Training   mAP@0.5 in the validation set
A          0.9512
B          0.9110
C          0.7469

Table 3. mAP@0.5 in the validation set
Besides, Table 3 shows the mAP values for the validation set of each training. The mAP values from trainings A and B are high because not all classes are represented in the training and validation sets. Since the model has far fewer classes to predict than the ones that exist, prediction is performed correctly in nearly all of them. However, classes not represented in
these sets are not detected at all, since the model never learns them. This situation is illustrated in Figure 33, which shows a validation image. Some musical symbols (whole note heads, 16th rests) are not detected during validation because they have no ground-truth examples in training.
Due to the problems stated above, the number of music scores used in the last training (C) is considerably higher than in the other trainings, to ensure all classes are present. However, the low appearance frequency of certain classes may cause the mAP value to drop: classes with few representations lead to poor detection capability of the model on them. Figure 34 shows the mAP graphs for two different classes evaluated on the validation set of training C.
The situation illustrated above happens because not all musical symbols appear with the same frequency. Nevertheless, the real music scores we have tested do not contain most of the low-frequency musical symbols.
As an example of how data is delivered by the API after testing, Figure 35 shows test examples from the last two trainings: the original image on the left and the detections performed by the model on the right. The original output can be found in Appendix 2 of the Appendices section; here, the appearance of the results on test images has been modified for visual clarity.
Figure 33. Original crop (left) and during validation (right)
Figure 34. [email protected] graph examples
Figure 35. Original test images (left) and detections made by the model (right)
The different issues will be properly discussed in the following lines regarding the second object detection model, since, as stated before, the first object detection model has been used to understand how the API works and to pave the way for the second one.
4.2.2. Second object detection model
In order to reflect the outcomes of the second object detection model, we have carried out four different trainings. Table 4 shows the main specifications of each training, and the following lines explain and show graphically the modifications made along them (DA stands for data augmentation).
Training   Scores   Crops/score   Crop size   Training images   Validation images   Configuration   DA
A          100      40            1000x700    2800              1200                Default         No
B          156      40            1000x700    4368              1872                Default         Yes
C          156      40            1000x700    4368              1872                Not default     Yes
D          156      40            500x350     21840             9360                Not default     Yes

Table 4. Second object detection model trainings
Additionally, Table 5 shows the mAP values measured in the validation set of every training.
Training   mAP@0.5 in the validation set
A          0.9494
B          0.7569
C          0.7981
D          0.8041

Table 5. mAP@0.5 in the validation set
The dataset used for the initial training (A) does not contain any synthetic scores reflecting "notes over other notes", in other words, musical figures sharing the same x-coordinate. An example of this situation is illustrated in Figure 36. Besides, no data augmentation is used in this training.
Figure 36. "Notes over notes" areas
However, the first experiment was carried out as a check of the viability of the strategy before carrying on with further experiments.
The main difference between trainings A and B is the dataset used. This is also reflected in the mAP, which is considerably lower in training B. On the one hand, since training A contains no synthetic scores with high-density regions such as those depicted in Figure 36, detection is performed correctly in most of the validation set, as the image on the left in Figure 37 clearly shows. On the other hand, when the dataset changes, precision on the validation set drops because detection in high-density areas becomes poor, as the image on the right in Figure 37 shows.
This situation presents two problems: detection is not as good in high-density areas as in other kinds of regions, and the training data may not represent these areas well enough to make good predictions on the validation set.
Regarding the testing, Figure 38 shows part of the same crop tested in trainings A and B. From this comparison we can infer that adding "notes over other notes" makes the model predict positions in these regions.
Before changing the number of training and validation images, we changed the default configuration of the pre-trained model to see whether the model improved, leading to training C. Figure 39 shows an incipit from a real tested music score: the ground truth for the predictions and the predictions made by the model in trainings B and C.
Figure 37. Validation set examples from training A (left) and B (right)
Figure 38. Testing difference between trainings A and B
Figure 39. Ground truth (top) and test output examples from trainings B (left) and C (right)
As we can see, positions wrongly detected in training B, or not detected at all, are generally detected successfully in training C. This is also reflected in Figure 40, where another incipit from a real tested music score shows the difference in detections between trainings B and C.
Nevertheless, this training did not solve the problem of detection in high-density areas. We can easily illustrate this with examples from the validation set: Figure 41 shows that detection in high-density areas in training C is still poor compared to low-density areas.
Figure 40. Ground truth (top) and test output examples from trainings B (left) and C (right)
Figure 41. Validation set examples in low (left) and high (right) density areas
In order to solve the detection problems in high-density areas, we made two changes leading to the final experiment: increasing the number of images so that high-density areas are well represented in the training set, and halving the size of the random crops used for training and validation so that fewer objects have to be detected in each crop. Figure 42 shows examples of the validation set from trainings C and D.
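The crop halving described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name half_size_crop_box and the example dimensions are ours, and the returned box follows Pillow's (left, upper, right, lower) convention so that it could be passed to Image.crop.

```python
import random

def half_size_crop_box(img_w, img_h, full_crop_w, full_crop_h):
    """Return a random crop box at half the original crop size.

    The box uses Pillow's (left, upper, right, lower) convention,
    so it can be passed directly to Image.crop().
    """
    crop_w, crop_h = full_crop_w // 2, full_crop_h // 2
    # Pick a top-left corner so the crop stays inside the image
    x = random.randint(0, img_w - crop_w)
    y = random.randint(0, img_h - crop_h)
    return (x, y, x + crop_w, y + crop_h)
```

Halving both crop dimensions divides the crop area by four, which is why roughly four crops from training D cover the same region as one crop from training C.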
From Figure 42 we can conclude that the size of the input crop determines how capable the model is of performing detections within one crop. To clearly see the improvement between the last two trainings, Figure 43 shows a crop tested in training C and the four equivalent crops from training D containing the same areas, as well as the ground truth image with the correct labels for each detectable position.
From the previous figure we can see that positions not detected in training C are detected correctly in training D. However, due to the reduction of the crop size in the last training, the model mistakes the lyrics of the score for objects of interest. This is not a major problem, since these misdetections do not concern areas of interest.
Figure 42. Validation set examples from trainings C (left) and D (right)
Figure 43. Ground truth (top) and test output examples from trainings C (left) and D (right)
Taking the model from training D, we tested nine real music scores and computed evaluation metrics according to the following criteria:
• True positive (TP): position is correctly detected
• False positive (FP): non-position area is detected as a position
• False negative (FN): position is not detected (object of interest without any predicted bounding box) or is wrongly detected
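Under these criteria, the metrics reported below follow directly from the TP, FP and FN counts. A minimal sketch (the function name evaluation_metrics is ours):

```python
def evaluation_metrics(tp, fp, fn):
    """Compute precision, recall and miss rate (in %) from detection counts."""
    precision = 100 * tp / (tp + fp)   # correct detections among all detections
    recall = 100 * tp / (tp + fn)      # detected positions among all real positions
    miss_rate = 100 * fn / (tp + fn)   # complement of recall
    return precision, recall, miss_rate
```

For example, the In Dreams row of Table 6 (TP = 150, FP = 25, FN = 1) gives a precision of about 85.71%, a recall of about 99.33% and a miss rate of about 0.66%.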
We have chosen a variety of real music scores with different densities of positions to be detected. Figure 44 shows portions of the different music scores, where we can appreciate that the range of musical notes, as well as the density, varies significantly between them.
Figure 44. Remember me (left), Mia and Sebastian (center) and La La Land intro (right) examples
Table 6 shows the precision, recall and miss rate in % for the different real music scores tested on the final training:
Score | Total detections | TP | FP | FN | Precision (%) | Recall (%) | Miss rate (%)
In Dreams | 150 | 150 | 25 | 1 | 85.71 | 99.33 | 0.66
Finding Nemo | 157 | 132 | 23 | 4 | 85.16 | 97.05 | 2.95
Another day of sun | 162 | 146 | 16 | 1 | 90.12 | 99.31 | 0.69
Mia and Sebastian | 218 | 199 | 17 | 1 | 92.12 | 99.50 | 0.50
Game of Thrones | 258 | 246 | 14 | 4 | 94.61 | 98.40 | 1.60
Rey’s theme | 274 | 263 | 11 | 14 | 95.98 | 94.94 | 5.06
Concerning Hobbits | 301 | 287 | 11 | 9 | 96.30 | 96.95 | 3.05
Carl and Ellie | 328 | 294 | 10 | 7 | 96.70 | 97.67 | 2.33
Remember me | 410 | 400 | 6 | 10 | 98.52 | 97.56 | 2.47
Table 6. Evaluation metrics for different real tested music scores
It is important to highlight that the sum of TPs, FPs and FNs is not always equal to the total number of detections, since positions not detected at all are counted as false negatives. As an example, in Rey’s theme the sum of TP, FP and FN equals 288, whereas the total number of detections is 274. This means fourteen positions are not detected at all.
As for the results reflected in Table 6, precision is quite high in almost all scores. Precision improves in scores where fewer non-position areas are detected as positions. Conversely, when false positives increase (mostly in less “populated” scores), precision decreases. Nevertheless, the number of false positives is noticeably lower than the number of true positives in all music scores.
In terms of recall, the more positions there are to detect, the more positions go undetected or are wrongly detected, which is to be expected. However, the proportion of false negatives to true positives is small, which is why recall values for all scores are so high, with 94.94% being the minimum value obtained.
Finally, the miss rates give the percentage of positions not detected or wrongly detected. The obtained rates lie between 0.50% and 5.06%, so misdetection is not very significant. All in all, the results are sufficiently good: we have achieved a miss rate of 2.14% on average, as well as a precision of 92.80% and a recall of 97.85%.
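The averages quoted above can be reproduced from the Table 6 counts. The sketch below recomputes the per-score metrics exactly, so the averages may differ from the rounded table entries by a few hundredths of a percentage point:

```python
# (TP, FP, FN) per tested score, taken from Table 6
ROWS = [
    (150, 25, 1),   # In Dreams
    (132, 23, 4),   # Finding Nemo
    (146, 16, 1),   # Another day of sun
    (199, 17, 1),   # Mia and Sebastian
    (246, 14, 4),   # Game of Thrones
    (263, 11, 14),  # Rey's theme
    (287, 11, 9),   # Concerning Hobbits
    (294, 10, 7),   # Carl and Ellie
    (400, 6, 10),   # Remember me
]

def macro_averages(rows):
    """Average per-score precision, recall and miss rate (in %)."""
    n = len(rows)
    precision = sum(100 * tp / (tp + fp) for tp, fp, _ in rows) / n
    recall = sum(100 * tp / (tp + fn) for tp, _, fn in rows) / n
    miss_rate = sum(100 * fn / (tp + fn) for tp, _, fn in rows) / n
    return precision, recall, miss_rate
```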
4.3. Application
The following figures show that the final application is composed of initial, selection, description and playing screens.
Figure 45. Initial and score selection screens
Figure 46. Transition screen with brief description of the selected music score
Once the score has been selected, the user can modify the key signature and clef according to their preferences. Figure 47 shows the screen where the user gets to play the selected score with the different options mentioned.
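Internally, playing a note from a touch amounts to a hit test over the bounding boxes predicted by the detector. The application itself is written in Java; the following Python sketch only illustrates the idea, and the box format and function name are assumptions of ours:

```python
def note_at_touch(x, y, detections):
    """Return the label of the detected position whose box contains (x, y).

    detections: list of (label, (left, top, right, bottom)) tuples
    in the same coordinate space as the displayed score.
    """
    for label, (left, top, right, bottom) in detections:
        if left <= x <= right and top <= y <= bottom:
            return label
    return None  # touch outside every detected position
```

The returned label, combined with the clef and key signature chosen by the user, determines which note sound is played.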
When the user selects the key signature and clef, informative messages appear, as Figure 48 illustrates.
Figure 47. Playable score screen with key signature and clef setup
Figure 48. Informative messages from key signature and clef setup
5. Budget
This project has been carried out at the Department of Computer Science from Aalto University.
The following tables show the different costs related to hardware, software and human resources. The total cost of each hardware and software resource has been computed by dividing the cost of the product by its useful life (working at highest performance). Since the project has had a total duration of approximately 6 months, the resulting cost per product corresponds to half a year.
Some of the hardware resources used do not imply a direct cost, but Table 7 provides the real cost as if they had not been provided by the university.
Product | Cost | Units | Useful life (years) | Total cost
Personal Computer I | 1100€ | 1 | 3 | 183,33€
Personal Computer II | 900€ | 1 | 3 | 150€
NVIDIA Quadro P500 GPU | 2000€ | 1 | 2 | 500€
Table 7. Hardware resources costs
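The half-year proration described above amounts to the following computation (a sketch; the function name prorated_cost is ours):

```python
def prorated_cost(price_eur, useful_life_years, months_used=6):
    """Prorate a product's price over its useful life for the months used."""
    return round(price_eur * (months_used / 12) / useful_life_years, 2)
```

For instance, Personal Computer I (1100€, 3-year useful life) costs 183,33€ for the six months of the project.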
Since TensorFlow, Linux and Python are open source, the main software costs are related to the usage of Windows 10 and Microsoft Office for the project documentation.
Product | Cost | Units | Useful life (years) | Total cost
Windows 10 | 125€ | 1 | 2 | 31,25€
Microsoft Office 2019 | 107€ | 1 | 2 | 26,75€
Table 8. Software resources costs
Costs related to human resources are based on the minimum wage and are detailed in Table 9.
Worker | Total dedication (hours) | Cost/hour | Total
Junior engineer | 600 | 8€ | 4800€
Table 9. Human resources costs
All in all, the total cost has been 5691,33€.
6. Conclusions and future development
This project has shown that predicting the positions of musical notes along the staff lines using synthetic data is possible. Thanks to the TensorFlow Object Detection API and Android Studio, this work has reached its main objective through the creation of an end-to-end system with the desired outcome: the whole system is comprised in an application that allows its user to choose a digitized music score and play it by touching its notes.
It is important to highlight that pre-trained models, originally trained on datasets very different from the ones used in this thesis, have turned out to work successfully with no modifications to their original setup, or only a very slight alteration. This clearly shows the remarkable results the API can achieve when training a new model.
The different strategies followed have generally worked as expected, although there are some problems and improvements to be considered. Regarding the second object detection model, the synthetic dataset has been created with a single size for both the musical symbols and the staff lines. If a high-resolution digitized music score is not resized before testing it on the model, there is no detection at all. This directly affects the model’s robustness, since the neural network used for detection is not taught different sizes of the same objects of interest. Despite this, the fact that digital music scores look relatively similar to one another has helped the main strategy to work properly.
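A simple mitigation would be to normalize incoming scores to the single scale the model was trained on, for instance by rescaling so that the staff-line spacing matches the one used in the synthetic data. The sketch below is hypothetical: the function name and the idea of measuring staff spacing are ours, and the training spacing value would have to come from the synthetic renderer.

```python
def rescale_dimensions(width, height, measured_staff_spacing, training_staff_spacing):
    """New image dimensions that map the score's staff-line spacing
    onto the spacing the detector was trained with."""
    scale = training_staff_spacing / measured_staff_spacing
    return round(width * scale), round(height * scale)
```

A score scanned at twice the training resolution would thus be halved before inference.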
A potential improvement for the model would be to create a synthetic dataset with varying sizes of the musical scores. Besides, different fonts should be considered for rendering the musical figures, since several typographies are widely used in music scores. Another improvement could focus on how musical notes are detected with the proposed strategy. Neural networks tend to be better at predicting what an object is than where it is. Predicting the location of the musical notes, as the first object detection model does, and then using another technique to obtain their positions along the staff lines could be considered.
Regarding the application, detection could be performed live through a server so that the system is not limited to offline tasks. However, the current results are suitable for a demonstration tool, which has been the intention since the idea first came up.
In terms of work development, the project has been successfully carried out in English, and progress has relied mainly on autonomous learning, both in making a pre-trained model work and in building an application that integrates the different tasks.
All in all, the obtained results after testing real music scores are sufficiently good considering the limitations of the used dataset.
Bibliography
[1] Sharma P., A practical Guide to Object Detection using the Popular YOLO Framework – Part III, Analytics Vidhya, 2018. [Online]
Available: https://www.analyticsvidhya.com/blog/2018/12/practical-guide-object-detection-yolo-framewor-python/ [Accessed: March 2019]
[2] Gandhi R., R-CNN, Fast R-CNN and YOLO object detection algorithms, Towards Data Science, 2018. [Online]
Available: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e [Accessed: March 2019]
[3] Pacha, A.; Hajič, J., Jr.; Calvo-Zaragoza, J. A Baseline for General Music Object Detection with Deep Learning. Appl. Sci. 2018, 8, 1488
[4] Tensorflow, Tensorflow creators, 2015. [Online]
Available: https://www.tensorflow.org/ [Accessed: January 2019]
[5] Tensorflow Object Detection API, Tensorflow creators, 2018. [Online]
Available: https://github.com/tensorflow/models/tree/master/research/object_detection [Accessed: 12 February 2019]
[6] Calvo-Zaragoza, J.; Rizo, D. End-to-End Neural Optical Music Recognition of Monophonic Scores. Appl. Sci. 2018, 8, 606.
[7] Pacha A. and Calvo-Zaragoza J., Optical Music Recognition in mensural notation with Region-based Convolutional Neural Networks, TU Wien, 2018
[8] Pacha A., Choi K.-Y., Couasnon B., Ricquebourg Y. and Zanibbi R., Handwritten music object detection: Open issues and baseline results. In International Workshop on Document Analysis Systems, 2018.
[9] Rebelo, A., Fujinaga, I., Paszkiewicz, F. et al. Int J Multimed Info Retr (2012), Optical music recognition: state-of-the-art and open issues Vol.1 Issue 3: 173-190. https://doi.org/10.1007/s13735-012-0004-6
[10] Krizhevsky A., Sutskever I., Hinton G., Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[11] Prabhu R., Understanding of Convolutional Neural Network (CNN) – Deep Learning, Medium, 2018. [Online]
Available: https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148 [Accessed: February 2019]
[12] S. Mohamed, Ihab. (2017). Detection and Tracking of Pallets using a Laser Rangefinder and Machine Learning Techniques. 10.13140/RG.2.2.30795.69926.
[13] LeCun Y., Cortes C., Burges C.J.C., The MNIST database, 1998. [Online]
Available: http://yann.lecun.com/exdb/mnist/ [Accessed: January 2019]
[14] Zhu W., Classification of MNIST Handwritten Digit Database using Neural Network, Australian National University, 2012
[15] Krizhevsky A., Nair V. and Hinton G., The CIFAR-10 dataset, 2009. [Online]
Available: https://www.cs.toronto.edu/~kriz/cifar.html [Accessed: January 2019]
[16] Hui J., Fast R-CNN and Faster R-CNN, Github, 2017. [Online]
Available: https://jhui.github.io/2017/03/15/Fast-R-CNN-and-Faster-R-CNN/ [Accessed: May 2019]
[17] Girshick R., Donahue J., Darrell T., Malik J., Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE CVPR2014
[18] Girshick R., Fast R-CNN, IEEE international conference on computer vision, 2015, pp. 1440–1448
[19] Ren S., He K., Girshick R., and Sun J., Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015
[20] Pacha A., Collection of datasets used for Optical Music Recognition, Github, 2017. [Online]
Available: https://github.com/apacha/OMR-Datasets [Accessed: March 2019]
[21] Rebelo, A., Capela, G. & Cardoso, J.S. IJDAR (2010) 13: 19. https://doi.org/10.1007/s10032-009-0100-1
[22] Fornés A., Lladós J., Sánchez G. (2008) Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method. In: Liu W., Lladós J., Ogier JM. (eds) Graphics Recognition. Recent Advances and New Opportunities. GREC 2007. Lecture Notes in Computer Science, vol 5046. Springer, Berlin, Heidelberg
[24] Pacha A. and Eidenberger H., Towards self-learning optical music recognition, in Proceedings of the 16th IEEE International Conference On Machine Learning and Applications, 2017
[25] Tuggener L., Elezi I., Schmidhuber J., Pelillo M. and Stadelmann T., DeepScores – A Dataset for Segmentation Detection and Classification of Tiny Objects. arXiv preprint arXiv:1804.00525, 2018
[26] Schweer W., MuseScore: Free music composition and notation software, 2002. [Online]
Available: https://musescore.com/ [Accessed: March 2019]
[27] Calvo-Zaragoza J. and Rizo D., Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores. In 19th International Society for Music Information Retrieval Conference, 2018.
[28] Répertoire International des Sources Musicales (RISM), 1952. [Online]
Available: http://www.rism.info/en/home.html [Accessed: March 2019]
[29] Fornés, A., Dutta, A., Gordo, A. et al. IJDAR (2012) 15: 243. https://doi.org/10.1007/s10032-011-0168-2
[30] Everingham, M., Van Gool, L., Williams, C.K.I. et al. Int J Comput Vis (2010) 88: 303. https://doi.org/10.1007/s11263-009-0275-4
[31] Lundh F., Pillow Image Module, 2009. [Online]
Available: https://pillow.readthedocs.io/en/stable/reference/Image.html [Accessed: March 2019]
[32] Jung A., Image augmentation for machine learning experiments, Github, 2018. [Online]
Available: https://github.com/aleju/imgaug [Accessed: April 2019]
[33] Gamauf T., Tensorflow Records? What they are and how to use them, Medium, 2018. [Online]
Available: https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564 [Accessed: February 2019]
[34] Tensorflow creators, Pre-trained models, Github, 2017. [Online]
Available: https://github.com/tensorflow/models/tree/master/research/slim [Accessed: February 2019]
[35] Tensorflow creators, Tensorboard, 2017. [Online]
Available: https://www.tensorflow.org/guide/summaries_and_tensorboard [Accessed: February 2019]
[36] Ha The Hien D., A guide to receptive field arithmetic for Convolutional Neural Networks, Medium, 2017. [Online]
Available: https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807 [Accessed: April 2019]
[37] Tensorflow creators, Receptive field computation for convnets, Github, 2017. [Online]
Available: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/receptive_field [Accessed: April 2019]
[38] Pacha A., Music Object Detector with Tensorflow, Github, 2017. [Online]
Available: https://github.com/apacha/MusicObjectDetector-TF [Accessed: March 2019]
[39] Cash M., Electronic Music Studios, University of Iowa, 2001. [Online]
Available: http://theremin.music.uiowa.edu/MISpiano.html [Accessed: April 2019]
[40] Android creators, Android Studio, 2013. [Online]
Available: https://developer.android.com/studio [Accessed: March 2019]
[41] Morrissey D., Subsampling Scale Image View, Github, 2017. [Online]
Available: https://github.com/davemorrissey/subsampling-scale-image-view [Accessed: April 2019]
[42] Rosebrock A., Intersection over Union (IoU) for object detection, pyimagesearch, 2016. [Online]
Available: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ [Accessed: April 2019]
[43] Burgués J., Recognition of musical symbols and notes, Github, 2019. [Online]
Available: https://github.com/jordiburgues/Recognition-of-musical-symbols-and-notes
Appendices
Appendix 1: Work Packages and Milestones
Project: Project documentation | WP ref: WP1
Major constituent: Document writing | Sheet 1 of 1
Short description: this part extends along the whole project and comprises the key deliverables to be produced during the work.
Planned start date: 14/2/2019 | Planned end date: 25/6/2019
Start event: 14/2/2019 | End event: 25/6/2019
Internal task T1: Project proposal and work plan
Internal task T2: Critical revision
Internal task T3: Final report
Deliverables: PP and WP (10/03/2019), CR (12/04/2019), FR (25/06/2019)
Project: Initial work and research | WP ref: WP2
Major constituent: Documentation | Sheet 1 of 1
Short description: this part focuses on introductory research, making an introduction to the TensorFlow framework and gathering papers regarding Optical Music Recognition.
Planned start date: 24/1/2019 | Planned end date: 21/2/2019
Start event: 24/1/2019 | End event: 21/2/2019
Internal task T1: Learn the basics of Python and the TensorFlow framework
Internal task T2: Train, validate and test a “toy example” using Convolutional Neural Networks
Internal task T3: Define main topic after OMR research
Deliverables: Fully convolutional classifier (30/1/2019), research information (7/2/2019)
Project: Datasets | WP ref: WP3
Major constituent: Research/Programming | Sheet 1 of 1
Short description: research on available datasets of printed music scores to be used as the input of the system.
Planned start date: 7/2/2019 | Planned end date: 7/3/2019
Start event: 7/2/2019 | End event: 7/3/2019
Internal task T1: Check availability of data, choose an initial dataset for the system
Internal task T2: Explore chosen data (parse dataset files)
Internal task T3: Implement pre-processing tasks if needed (re-scaling, cropping…)
Internal task T4: Prepare input CNN images and their annotations
Deliverables: Pre-processed dataset (21/2/2019)
Project: First object detection model | WP ref: WP4
Major constituent: Programming | Sheet 1 of 1
Short description: in this part, a pre-trained model will be used to train, validate and test a Convolutional Neural Network using the TensorFlow Object Detection API for musical symbol localization and recognition.
Planned start date: 7/2/2019 | Planned end date: 25/4/2019
Start event: 21/2/2019 | End event: 25/4/2019
Internal task T1: Object detection neural networks study (Fast/Faster Region-based CNN)
Internal task T2: TensorFlow Object Detection API study
Internal task T3: Current CNN implementation adaptation
Internal task T4: Training, validation and testing (performance, evaluation metrics, model tuning…)
Deliverables: Model evaluation (25/4/2019)
Project: Second object detection model | WP ref: WP5
Major constituent: Programming | Sheet 1 of 1
Short description: this part will consist of generating synthetic data to train a new model that predicts the location of music symbols along the staff lines, so that with the clef and key signature information the actual note can be guessed.
Planned start date: 7/3/2019 | Planned end date: 31/5/2019
Start event: 7/3/2019 | End event: 31/5/2019
Internal task T1: Generate synthetic data and its annotations
Internal task T2: Data augmentation on synthetic data
Internal task T3: Current CNN implementation adaptation
Internal task T4: Training and validation
Internal task T5: Testing (evaluation metrics)
Deliverables: Synthetic data, first version (14/3/2019); synthetic data, final version (25/4/2019); model evaluation (31/5/2019)
Project: User interface development and integration | WP ref: WP6
Major constituent: Programming | Sheet 1 of 1
Short description: this part will consist of building an app on top of the previous object detection model. The main idea is to construct a user interface that can play notes from a score after the user selects the clef and key signature. The detection results for a score will be loaded into the app beforehand, so it will work offline.
Planned start date: 7/3/2019 | Planned end date: 15/6/2019
Start event: 7/3/2019 | End event: 14/6/2019
Internal task T1: Research on app development with Android Studio
Internal task T2: Implementation (coding, interface…)
Internal task T3: Integration of each task (note detection, user selections) into the app
Deliverables: First version (2/5/2019), final version (14/6/2019)
Table 10. Work packages
Milestones:
WP# | Task# | Short title | Milestone / deliverable | Date (week)
1 | 1 | Project documentation | Proposal and work plan (document) | 2
3 | 4 | Datasets (existing and generated) | Images and annotations (Python files) | 3-10
1 | 1 | Project documentation | Critical revision (document) | 7
4 | 4 | First object detection model | Model evaluation (Python files) | 9
5 | 4 | Second object detection model | Model evaluation (Python files) | 10
6 | 2,3 | User interface (app) | Implementation (Java files) | 12-15
1 | 3 | Project documentation | Final report (document) | 18
Table 11. Milestones
Appendix 2: Test (real output examples)
Figure 49. Real test output from Figure 35
Figure 50. Real test output from Figure 39
Figure 51. Real test output from Figure 40
Figure 52. Real test output from Figure 43
Glossary
R-CNN. Region-based Convolutional Neural Network
CNN. Convolutional Neural Network
API. Application Programming Interface
OMR. Optical Music Recognition
VOC. Visual Object Class
WP. Work Package
FC. Fully connected
ReLU. Rectified Linear Unit
RoI. Region of Interest
XML. Extensible Markup Language
PDF. Portable Document Format
MIDI. Musical Instrument Digital Interface
PrIMuS. Printed Images of Music Staves
MEI. Music Encoding Initiative
PNG. Portable Network Graphics
RISM. Répertoire International des Sources Musicales
IoU. Intersection over Union
mAP. Mean Average Precision
DA. Data augmentation
TP. True positive
FP. False positive
FN. False negative
GPU. Graphics Processing Unit