
Pitch Estimation for Musical Note Recognition Using Artificial Neural Networks

Jose de Jesus Guerrero-Turrubiates*, Sheila Esmeralda Gonzalez-Reyna, Sergio Eduardo Ledesma-Orozco, Juan Gabriel Avina-Cervantes

Universidad de Guanajuato, Division de Ingenierias Campus Irapuato-Salamanca. Carretera Salamanca-Valle de Santiago km 3.5 + 1.8 km. Comunidad de Palo Blanco, C.P. 36885. Salamanca, Gto., Mexico.

email: {jdj.guerreroturrubiates, se.gonzalezreyna, selo, avina}@ugto.mx

Abstract—Pitch estimation has grown in importance due to its wide variety of applications in different fields, e.g. speech and voice recognition and music transcription, to name a few. Musical signals may contain noise and distortion, so pitch detection results can be erroneous. In this paper, a musical note recognition system based on harmonic modification and an Artificial Neural Network (ANN) is proposed. First, downsampling converts the signal from a 44,100 Hz sampling rate to 2,100 Hz. The Fast Fourier Transform (FFT) is used to obtain the signal spectrum, and the Harmonic Product Spectrum (HPS) algorithm is applied to enhance the amplitude of the fundamental frequency. A dimensionality reduction method based on variances then extracts the relevant information from the input signal. In the present work, audio signals were taken from a proprietary database constructed using an electric guitar as the audio source. Classification is performed by a feed-forward neural network, or Multi-Layer Perceptron (MLP). Experimental results show accurate classification with little processing of the input signal. Moreover, the proposed approach is robust enough to classify musical notes coming from different musical instruments.

Keywords—Pitch Estimation, Feature Extraction, Harmonic Product Spectrum, Multi-Layer Perceptron.

I. INTRODUCTION

Pitch estimation is necessary for a variety of applications. In the musical field, pitch extraction from audio files can be used for music transcription (converting an audio recording into a symbolic representation) [1], [2], instrument recognition [3]–[5], or speech recognition [6].

Pitch estimation is commonly treated as a two-stage problem: filtering and classification. Filtering is the stage where the signal is processed so that only relevant information passes to the classification stage, where the note is identified.

To extract relevant information from a sound frame, different approaches have been proposed. For instance, pitch can be obtained by methods such as Principal Component Analysis (PCA) [7], the Constant-Q Transform [8], the Fast Fourier Transform (FFT) [9], [10], and the Harmonic Product Spectrum (HPS) [11], to name a few.

Several important factors make pitch estimation a hard task. One of the main factors is noise, which can be even "louder" than the original signal. Another problem is estimating the timing of the individual notes, i.e., it cannot be known a priori where in the audio file the pitch is going to change. The amplitude also varies with time.

In this paper, a musical pitch estimation approach is proposed, based on the Fast Fourier Transform (FFT) and the Harmonic Product Spectrum (HPS) for feature simplification and harmonic reduction. A Multi-Layer Perceptron (MLP), trained and programmed in Neural Lab [12], is used as the classification method. The preprocessing steps take the original audio signal and apply the FFT; once the signal is in the frequency domain, the HPS algorithm and a method based on the maximum variances of the data set column vectors are applied to perform dimensionality reduction and feature extraction. Finally, the signal passes to the classifier.

This document is organized as follows. In Section II, the HPS method is explained, the creation of the training and test sets is described, and a dimensionality reduction method based on variances is presented. Experiments and results are described in detail in Section III and then discussed in Section IV.


Figure 1. Downsampling and element-wise multiplication process.

II. METHODOLOGY

In this section, the words "sample" and "observation" are used as follows. For the purposes of this work, "sample" refers to one datum of a data set, e.g., 16,384 data points out of a 44,100-point set. "Observation" refers to a single training or validation case.

A. HPS Algorithm

When a musical note is the input signal, its spectrum should consist of a series of peaks corresponding to its fundamental frequency and its harmonic components at integer multiples of the fundamental. Hence, when the original spectrum is element-wise multiplied by the n-times downsampled spectrum, the strongest harmonic peak rises. Figure 1 shows this process.

In order to compute the spectrum of a signal using the FFT, the signal should have 2^n samples in the time domain. In this work, 16,384 samples, corresponding to 2^14, were taken.
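
As a concrete illustration (not the paper's own code), the following Python sketch takes a 2^14-sample frame from a recording and computes its power spectrum; the file name is hypothetical and a mono WAV recording at 44,100 Hz is assumed.

```python
import numpy as np
from scipy.io import wavfile

N_FRAME = 2**14  # 16,384 samples, as used in the paper

# Hypothetical file name; a mono recording at 44,100 Hz is assumed.
rate, y = wavfile.read("c_note.wav")
frame = y[:N_FRAME].astype(np.float64)

# One-sided power spectrum Y[n] of the frame (y(t) is real-valued).
Y = np.abs(np.fft.rfft(frame)) ** 2 / N_FRAME
freqs = np.fft.rfftfreq(N_FRAME, d=1.0 / rate)  # bin centers in Hz
```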

Let Y[n] be the power spectrum of the input signal y(t). The HPS algorithm follows the equation

Z[n] = \prod_{m=1}^{N} Y[mn], \quad n = 0, 1, 2, \ldots, \qquad (1)

where Z[n] enhances the information of the fundamental frequency. Since the database is made of audio files coming from an electric guitar, an instrument whose string frequencies tend to drift due to physical factors such as temperature or humidity, each factor Y[mn] takes the sum of its two neighboring harmonics to make the system more robust:

Y[mn] = Y[n-1] \times Y[mn-1] + Y[n] \times Y[mn] + Y[n+1] \times Y[mn+1]. \qquad (2)

For the purpose of musical note recognition, N = 2 in (1) appears to be enough to raise the fundamental frequency. Figure 2 a) shows a real audio signal, and Figure 2 b) shows the same signal after the HPS algorithm has been applied.
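
A minimal sketch of Eqs. (1) and (2) follows, under one reading of the neighbor sum in Eq. (2) (applied to every factor of the product); the function name and edge handling are illustrative choices, not taken from the paper's implementation. It reuses Y and freqs from the previous sketch, and with a 16,384-sample frame it yields roughly 4,096 usable bins, broadly consistent with the 4,092 features mentioned in Section II-C.

```python
import numpy as np

def hps(Y, N=2):
    """Harmonic Product Spectrum per Eq. (1), with the neighbor sum of
    Eq. (2) applied to every factor to tolerate slightly detuned strings.
    Edge bins are left at zero so that n-1 and m*n+1 stay in range."""
    L = (len(Y) - 1) // N          # bins for which Y[N*n + 1] exists
    n = np.arange(1, L - 1)
    Z = np.zeros(L)
    acc = np.ones(len(n))
    for m in range(1, N + 1):
        acc *= (Y[n - 1] * Y[m * n - 1]
                + Y[n] * Y[m * n]
                + Y[n + 1] * Y[m * n + 1])
    Z[1:L - 1] = acc
    return Z

# Usage with the power spectrum Y and bin frequencies freqs from above:
Z = hps(Y, N=2)
f0 = freqs[np.argmax(Z)]           # estimated fundamental frequency (Hz)
```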

B. Musical Notes Dataset

For the experiments presented in the next section, it was necessary to create a database of musical notes. For this project, an electric guitar was used to record, in WAV format, all twelve musical notes from C to B. The database consists of over 4,800 WAV files, used for both the training and testing processes, divided into 12 classes of 400 observations each. The files present strong variations in amplitude and tone color, as well as small variations in pitch (less than a semitone).
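
The database itself is not public, and the paper does not describe its file organization. As a hedged sketch only, a feature matrix like the one used below could be assembled as follows, assuming a hypothetical one-folder-per-note layout and reusing the hps helper from the previous sketch.

```python
import glob
import numpy as np
from scipy.io import wavfile

# Hypothetical directory layout: notes/C/*.wav ... notes/B/*.wav,
# one folder per class (the paper's actual organization is not given).
NOTES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

features, labels = [], []
for k, note in enumerate(NOTES):
    for path in glob.glob(f"notes/{note}/*.wav"):
        rate, y = wavfile.read(path)
        frame = y[:2**14].astype(np.float64)
        Y = np.abs(np.fft.rfft(frame)) ** 2 / len(frame)
        features.append(hps(Y, N=2))   # HPS features, one row per file
        labels.append(k)               # class index 0..11

X = np.array(features)
labels = np.array(labels)
```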

C. Dimensionality Reduction Based on Variances

Once the HPS algorithm has been applied, the resulting signal retains 4,092 features of the original file. This is a huge number of features to feed to the neural network as inputs, so another process was needed to reduce it as much as possible.

Let M be the data set formed by 50% of the observations of each class,

M = [\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_n], \quad n = 4{,}092, \qquad (3)

where \vec{f}_i represents the i-th frequency component across all observations. If the variance of each \vec{f}_i is taken as

\sigma_i^2 = \frac{1}{N_o} \sum_{j=1}^{N_o} \big( \vec{f}_i(j) - \bar{f}_i \big)^2, \qquad (4)

where N_o is the number of observations in M,

and

\bar{f}_i = E\{\vec{f}_i\}, \qquad (5)


Figure 2. a) Original signal in the frequency domain and b) the signal after downsampling with N = 2; both panels plot magnitude against frequency (Hz).

then a threshold can be chosen to form a new matrix that becomes the training set. To set this threshold, several tests were performed. Knowing that the minimum and maximum variances over the whole data set were 8.4305 × 10^{-14} and 14.1385, respectively, the threshold value was swept over this range. The best value found was thr = 1 × 10^{-4}, and the training set was composed of

\mathrm{TrainSet} = [\vec{\eta}_1, \vec{\eta}_2, \ldots, \vec{\eta}_m], \qquad (6)

where the \vec{\eta}_k are those vectors \vec{f}_i for which \sigma_i^2 \geq thr, and m = 181.
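
In code, the reduction of Eqs. (3)-(6) amounts to keeping only the columns whose variance exceeds thr. A sketch follows, assuming the X and labels arrays from the dataset sketch above and using scikit-learn's train_test_split for the 50/50 division described in Section III.

```python
import numpy as np
from sklearn.model_selection import train_test_split

THR = 1e-4   # best threshold reported in the paper

# 50/50 split into training and evaluation halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

# Eq. (4): per-column variance (np.var uses the 1/n form by default).
variances = X_train.var(axis=0)
keep = variances >= THR              # Eq. (6): retained columns eta_k

X_train_red = X_train[:, keep]       # m = 181 columns in the paper
X_test_red = X_test[:, keep]         # same columns applied to the test set
```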

III. EXPERIMENTAL RESULTS

In this study, three octaves (an octave being the interval between sounds whose fundamental frequencies are in a 2:1 ratio) and twelve notes per octave were recognized. The database was split into 50% of the samples for training and the remainder for evaluation. Every audio file was over one second long and was recorded at a sampling rate of 44,100 Hz, which means that one second contains at least as many samples as the sampling rate.

In order to apply the FFT to the data, 2^14 samples of each observation were taken, so for a new input signal the pitch is assumed to remain constant for at least

t = \frac{16{,}384 \ \text{samples}}{44{,}100 \ \text{samples/s}} \approx 0.371 \ \text{s}. \qquad (7)

The HPS algorithm was applied to each WAV file to create both the training and test sets.

A. Multi-Layer Perceptron Optimal Configuration

During this research, the optimal number of hidden units was found by running a performance test in which a new MLP was created, trained, and tested with a varying number of neurons in the hidden layer. With one hidden layer the results were good; to test whether the accuracy of the neural network could be improved, a new performance test was run using a second layer. The best performance was obtained using 20 neurons in the first hidden layer and 10 neurons in the second. The results for both the training and validation datasets for the first layer are depicted in Fig. 3.
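
The paper trained its MLP in Neural Lab [12]; as an illustrative stand-in only, an equivalent two-hidden-layer network (20 and 10 units, matching the reported configuration) can be set up in scikit-learn, using the variance-reduced features from the previous sketch. All other hyperparameters here are assumptions.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Two hidden layers with 20 and 10 neurons, as reported in the paper;
# scaler, iteration cap, and seed are this sketch's own choices.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=2000,
                  random_state=0),
)
clf.fit(X_train_red, y_train)        # 181 variance-selected features
print("test accuracy:", clf.score(X_test_red, y_test))
```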

The best results obtained for the final configuration are shown in Table I.

Once the neural network was trained, it was tested with other instruments to evaluate its response. It was observed that the neural network was robust enough to classify notes coming from different instruments, such as piano, saxophone, violin, acoustic guitar, and electro-acoustic guitar. For this test, five observations of each class per instrument were recorded. Table I shows the system's performance per class.

The neural network was also tested with chords (more than one note played simultaneously in the input signal) and was able to classify the fundamental frequency in these tests. The spectrum of a single C note is shown in Fig. 4 a). Figure 5 a) shows the spectrum of a C Major chord,


Figure 3. Performance (MSE versus number of hidden neurons) for both the training and test sets.

Table I. PROPORTIONS OF TRUE CLASS ACCURACIES.

Class number | Training Set (%) | Test Set, Electric Guitar (%) | Test Set, Other Instruments (%)
1            | 100              | 93.6                          | 96
2            | 100              | 95.8                          | 92
3            | 98.9             | 97.9                          | 92
4            | 97.9             | 100                           | 92
5            | 100              | 97.9                          | 96
6            | 98.4             | 100                           | 96
7            | 98.9             | 100                           | 96
8            | 98.4             | 100                           | 100
9            | 98.4             | 91.3                          | 92
10           | 97.4             | 97.9                          | 84
11           | 100              | 97.9                          | 96
12           | 98.4             | 97.9                          | 92
average      | 98.9             | 97.5                          | 93.6

which is composed of three notes: C (the fundamental note), E, and G. Comparing the two, the C note spectrum is mostly composed of integer multiples of the fundamental, while in the C Major chord a mixture of harmonics, not necessarily integer multiples of the fundamental, describes the input signal. Once the HPS algorithm was applied, it can be seen that for both the C note (Fig. 4 b)) and the C Major chord (Fig. 5 b)) the maximum value lies at the same frequency, i.e., the fundamental. Even though there are more harmonics in the C Major chord after HPS, the neural network is able to recognize the one with the highest amplitude, and therefore a correct classification is performed.

Taking into account that the training set contained only notes recorded from an electric guitar, these experiments demonstrate the robustness of the proposed approach.

Figure 4. a) C note spectrum in the frequency domain and b) the C note after downsampling with N = 2 (maximum marked at X = 48).

IV. CONCLUSIONS

In this study, musical notes were accurately classified. Taking into account that real-life audio signals occur in both structured and unstructured environments, the problem involves factors that seriously affect the performance of intelligent systems. Issues such as noise, variable amplitude, and tone color are present in the audio files that make up the data set; nevertheless, the neural network was robust enough to cope with these problems.

The HPS algorithm as a preprocessing stage gives efficient results due to its feature extraction abilities. Moreover, this algorithm made possible the classification of notes coming from other real instruments such as piano, saxophone, violin, acoustic guitar, and electro-acoustic guitar, even when the ANN was trained only with audio files recorded from an electric guitar.

Figure 5. a) C Major chord spectrum in the frequency domain and b) the chord after downsampling with N = 2 (maximum marked at X = 48).

The HPS algorithm leads to both feature extraction and dimensionality reduction; however, the 4,092 samples remaining after this process (24.97% of the original signal) still represent a huge number of attributes for a classifier. Considering that samples with lower variances carry little or no information, those samples were discarded to reduce the dimensionality even further. The final training and test sets contained only 181 of the 4,092 samples, which corresponds to 1.10% of the original characteristics.

An MLP was used for classification, configured with 20 hidden neurons in the first layer and 10 hidden units in the second. The final system achieved a recognition accuracy of 97.5%.

Additional experiments were conducted using chords instead of single notes. The system was able to recognize the fundamental note of the input chord, demonstrating the validity of the proposed approach and the robustness of the neural network to noisy signals.

As future work, we intend to perform a blind source separation analysis for single polyphonic-instrument audio signals, in order to obtain independent sources that can then be classified using the method proposed in this article.

ACKNOWLEDGEMENTS

This work has been supported by the National Council of Science and Technology of Mexico (CONACYT) under Grant number 429450/265881, and by the Universidad de Guanajuato through PIFI-2012.

REFERENCES

[1] J. Nicholas and H. T. Ahmed, "Audio coding for representation in MIDI via pitch detection using harmonic dictionaries," Journal of VLSI Signal Processing, vol. 20, pp. 45–59, 1998.

[2] A. Barbancho, A. Klapuri, L. Tardon, and I. Barbancho, "Automatic transcription of guitar chords and fingering from audio," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 915–921, 2012.

[3] I. Kaminskyj and T. Czaszejko, "Automatic recognition of isolated monophonic musical instrument sounds using kNNC," Journal of Intelligent Information Systems, vol. 24, pp. 199–221, 2005.

[4] J. Eggink and G. J. Brown, "Instrument recognition in accompanied sonatas and concertos," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), 2004, pp. iv-217–iv-220.

[5] A. Azarloo and F. Farokhi, "Automatic musical instrument recognition using K-NN and MLP neural networks," in Proceedings of the Fourth International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), 2012, pp. 289–294.

[6] B. Resch, M. Nilsson, A. Ekman, and W. Kleijn, "Estimation of the instantaneous pitch of speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 813–822, 2007.

[7] R. Nickel and S. Oswal, "Optimal pitch bases expansions in speech signal processing," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, pp. 1885–1889.


[8] J. C. Brown and M. S. Puckette, "An efficient algorithm for the calculation of a constant-Q transform," The Journal of the Acoustical Society of America, vol. 92, no. 3, p. 2698, 1992.

[9] C. Shahnaz, W. P. Zhu, and M. Ahmad, "A method for pitch estimation from noisy speech signals based on a pitch-harmonic extraction," in Proceedings of the 2008 International Conference on Neural Networks and Signal Processing, 2008, pp. 120–123.

[10] P. Taweewat, "Musical visualization and f0 estimation using neural network," in Proceedings of the 2010 International Conference on Audio, Language and Image Processing (ICALIP), 2010, pp. 346–352.

[11] A. Nielsen, L. Hansen, and U. Kjems, "Pitch based sound classification," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), 2006, pp. III–III.

[12] S. Ledesma-Orozco, "Neural Lab - Wikipedia, the free encyclopedia," http://en.wikipedia.org/wiki/Neural_Lab, Aug. 2012.
