detecting music in a noisy, everyday environment · 2018. 2. 15. · detecting music in a noisy,...

DETECTING MUSIC IN A NOISY, EVERYDAY ENVIRONMENT

Bachelor’s ThesisIvo Bril, s2001179, [email protected]

Abstract: This paper is written to describe how music can be broken down to featuresunderstandable for a computer using Continuity Preserving Signal Processing. The music usedin this project was mixed with ambient noise consisting of seventeen everyday scenarios. Byanalyzing the sound samples and extracting general features that may indicate music it hasbeen made possible to design and implement a detector which can detect music in seventeeneveryday scenarios using these features. This paper shows how the features were found, thedesign process of the detector, its early results and mentions possible improvements.

1. IntroductionThe everyday feat of recognizing sounds, theirsources and their meaning seems like a triviallysimple task. Learning new songs or hummingalong with a song one has never even heard ofbefore is entirely possible with human hearing.Only when one starts to work with soundrecognition or analysis and encounters complexproblems does one realize the strength androbustness of the auditory systems that livingcreatures possess in this world. A typicalquestion scientists working in soundclassification and evaluation could ask is ‘whatwould it take to let a computer algorithmrecognize music?’. Considerable efforts havebeen made in this field of research, all trying toexplain music in ways a computer couldunderstand it. Formulas have been designed forrecognizing the onset of music (Zhou & Reiss,2011) and there has been philosophized aboutinteresting ways of solving problems the fieldstill has (Bregman, 1990). Yet there has not beendone a great deal of research in recognizingmusic in real-life situations. For this thesis, real-world situations refer to real-world scenarios,like shopping, being at home or some place else.Certainly, applets have been developed that tryto determine what kind of song is playing. Butthese techniques approach a different problem.Instead of using a small sample of recordedsound and trying to find a match in a hugedatabase of samples (Wang, 2006), the challengetackled in this paper demands generalizablefeatures that are indicative of music. The systemdescribed in this paper will report most musicdetected out of the signal it receives and says

when the music was playing. One of the biggerproblems with everyday scenarios, universal formost sound analysis, are the levels of ambientsounds. How to separate music from theseambient sounds will be dealt with in the nextsection.

1.1 Dealing with noiseOne problem with real-world sound detection isbeing able to ignore all the interfering ambientsounds that will be present in the signal. Sincehumans can do this, why not try to imitate it?Andringa (2002) has developed a system whichuses a representation that closely resembles thecochlea and its property for frequencysegmentation. This system, named ContinuityPreserving Signal Processing (henceforth namedCPSP), mimics the way the basilar membrane ofthe human hearing systems relays informationabout incoming signals.The rigid, curled-up membrane is sensitive todifferent frequencies at different positions (e.g.high frequencies at the base of the membranebecause of its rigidity). But what makes it aninteresting way of signal processing is the factthat it preserves continuity in time, place, andfrequency. This allows one to follow thedevelopment of signal components through timeand frequency. The latter property is what makesit stand out the most from more standard signalprocessing approaches, which often timesdiscards knowledge about the actual structure ofthe music in favour of making the signal moreanalysable for the system.

1

1.2 CochleogramsThe data generated by CPSP comes in the formof a cochleogram. This is a spectrogram-like graph, containing frequency over time, withcolour codes for the amount of decibels. Thefrequency is represented by channels because itmimics the cochlea. The greatest advantage ofusing this technique is the fact that it keeps thephysical properties of the signal as intact aspossible, which in result leads to greatsmoothness in the cochleogram. This smoothnesskeeps each individual signal component morephysically coherent, making the data lesssusceptible for artifacts. An example of acochleogram can be seen in Figure 1.1.

1.3 ClassificationIf cochleograms contain valuable information

about the signals, what do they tell us aboutmusic? Do cochleograms show anythingindicative of music when the signal containsmusic? This thesis has tried to find features inmusic that are present even in mixdowns of(non-studio-quality) music with an everydayscenario. How these features were found andhow they've been used will be described in themethod section. By finding these features a pathis paved for a detector to use these features indetecting where music is present.

1.4 False positivesBy definition, the detector searches for featuresthat are highly indicative for music. Not onlydoes this make it easier to make a robust system,but it also tries to pinpoint the essentials ofmusic and what makes it special in comparisonto other sounds. This does, however, make it ofutmost importance that the features found arehighly indicative. If these features cause thesystem to recognize anything from crying tosirens as music, then it is possible that thefeatures that were chosen were not specificenough. It could be debated that these kinds ofsounds, if they contain these kinds of features,are music and should be recognized as such. Forthe usability of this system however, I wouldargue otherwise. This system tries to recognizemusic, the kind of music humans have composedfor the past millennia. Therefore, crying andsiren-like sounds should not be classified assuch. The questions that this paper tries toanswer are as follows:

“Does a signal of music and ambient sounds,represented in a cochleogram, show any visualproperties that are highly indicative for the presenceand location of said music?”

”Can a detector, using features that are highlyindicative to music, accurately detect the presence ofmusic in noisy, everyday environments?”

1.5 Pre-existing ideasMusic has existed for centuries, has had all kindsof influences affect it, and is one of our culturalways of self expression. The music we listen to in

2

Figure 1.1: Two cochleograms, the upper being thepark scenario and the bottom being the samescenario mixed with a pop song. The pop song ispresent from 0 till 4000 and 8000 till 12000.

this era has basic elements anyone can describe.They boil down to a few things:

- Rhythm: the sequential and consistentoccurrences of sounds that indicate the tempo ofa song. This is one of the possible features thatthe detector could use, as it certainly is indicativeof music.- Tonality: a highly changing or prolongedpresence of sound that is sinusoid. Most music istonal, be it by singing or the use of instruments,so it is likely to be a distinct structure in thecochleograms.

The expectation is that these pre-existing ideaswill show themselves in the cochleograms aswell. Another expectation is that, given the rightapproach, these features (and other features ifthey are found) allow a detector to find musicfairly well.

1.6 Thesis formatDue to the nature of this research, which is of an exploratory nature, the thesis will consist of two method and result sections. The first method andresult sections will explain what features were found to be indicative of music and how they were found. The second will explain how these features were then used to create a detectorfor the presence of music.

2. Finding features

MethodIn this section will be described what pre-existing ideas there are about possible features,what cochleograms tell about music, and whatfunctions are needed to find and extract thefeatures that are present in the signal. 2.1 An informal analysis of a pure signalBefore looking at the mixed signal, it is good tocheck whether there is anything interesting inthe unmixed signals. As can be seen in Figures1.1 and 2.1 there are differences between thestructure of music and that of a signal existing ofambient sounds. Whereas the music signal has high amounts ofdecibels and strong horizontal linear structures.

The ambient sounds show hardly any of thesefeatures. This kind of analysis was done with allscenarios and music signals and the difference was present in every comparison. This showsthat this music selection has some indicativeproperties. The next step is seeing whether thesefeatures are present when the signals are mixed.The next section will explain how the signals aremixed and why.

2.2 Mixed Signals (test data) So music shows some interesting structureswhen visualized, but do they remain when theyare mixed with an everyday scenario? The soundsamples from everyday scenarios available athttp://www.daresounds.org/ were used(Andringa, Van der Linde & Krijnders, 2009).They include a wide variety of real-worldscenarios. From low level of ambient noise (e.g.bedroom), to high level (e.g. shopping district).These samples are each one minute long, with nospecific beginning or end. The reason this set wasused is because of the variety in scenarios andthe good quality of the samples. The music-styleschosen to mix with these scenarios vary from rapto modern classical music. The test data has fivestyles in total (rap, hard rock, electronic, piano-pop and modern classical music). The five music-styles were chosen as such because theyencompass enough variety of styles for thisthesis to test its detector. It is in no way an allencompassing set of styles to say that it can

3

Figure 2.1: A Figure of the same pop song as seen inFigure 1.1, but without the pause in the middle.

http://www.daresounds.org/

handle all music-styles. The music is mixed withthe scenarios using Adobe Soundbooth, witheach music sample being mixed in four wayswith the scenarios:

- The music is only present during the first 20seconds of the sample.- The music is only present during the last 20seconds of the sample.- The music is present during the first and thelast 20 seconds of the sample.- The music is present during most of the sample(50 sec.), with five seconds of silence before andafter the music sample.

The four different placements were chosen tomake sure that there is an adequate number ofdiffering mixdowns. Ideally all mixdowns wouldhave more randomly placed music samplesinstead of the fixed positions they have now, butthis was not possible to do with AdobeSoundbooth and would increase the workloadtremendously as it had to be done by hand. Still,Soundbooth was chosen due to its intuitivedesign and its usability. Another possibility wasAudacity, but its functionality is less and it is notas intuitive as Soundbooth. The music was mademore realistic (noised up) using sound-alteringeffects from Adobe Soundbooth. These effectsmade the music sound as if it was played in a

distant room, mimicking the way music wouldsound in real life (see the appendix for specifics).This makes the music more fit for the scenariosas opposed to the clean sound of studio-qualitymusic. With each scenario the volume wasequalized to ensure that a more realisticmixdown was created. The appropriate effectsand volume equalization was determined bylistening to it. The actual test data was thencreated by subtracting a matrix of the dB valuesof the scenario from the matrix with dB values ofthe mixdown. This results in a matrix with onlythe music present in the signal. For everyscenario there were five music styles, per musicstyle there were four placements. As each test fileis one minute long, this leads to a total of 340minutes of test data. An example of a test file canbe seen in Figure 2.2.

2.3 CPSP CPSP has many possibilities for examining thedata. Some of which are very useful for findinginformation about rhythm, structures inside thesignal and being able to distinguish between sound and noise. When CPSP processesa signal, it creates a struct 'D' which containsseveral matrices with information about thesignal. In the following subsections will beexplained what functions or matrices from Dwere used for the analysis and why.

4

Figure 2.3: A cochleogram showing the structureswhich have EPAS_S values greater than one. Thesignal consists of a study room mixed with theclassical piano music sample.

Figure 2.2: A cochleogram showing the ground truth of thestudy room scenario mixed with the classical piano music sample.

2.3.1 Volume and tonalityD.EdB is the cochleogram represented in matrixform. This matrix shows the signal in thetime/frequency domain, with the dB levels(volume) made visible through a colour range (asone could see in Figure 1.1). This matrix wasused to visualize the results of the otherfunctions used and to make the Figures used inthis thesis. D.EPAS_S is a matrix that allows the user toidentify tonal components in the signal. CPSPtakes the entire signal and evaluates allstructures from the signal in their adherence tothe expectation for noise. If a structure in thesignal diverges statistically from noise, the datapoints in that structure get a high absolute value.

The values in the D.EPAS_S matrix consist ofstandard deviations of the mean of noise (for amore elaborate explanation, see the appendix).One of the pre-existing ideas about musicmentioned at the beginning of this section wastonality, so naturally D.EPAS_S is an extremelyuseful representation. As can be seen in Figure2.3, putting a threshold on all EPAS_S valuesgives a great insight in structures that areindicative of tonal sounds.

2.3.2 xcorr & findRidgesInPeakMaskxcorr is a function that looks at a given channel ofthe signal and determines whether there arestructures in the signal that repeat themselves. Ifthese repetitions are present in a consequentmatter, one can say that there is a rhythmiccharacter to it. Using this function makes itpossible to find rhythm in a signal, though thecorrect channel has to be chosen. Finding thecorrect channel is a problem which will beaddressed in the method section of the detector.As rhythm is also one of the pre-existing ideasabout music, this function is used to try and see if it really is a suitable feature. Ascan be seen in Figure 2.4, the noise-injectedmusic signal itself shows a strong rhythm. findRidgesInPeakMask finds sequences of peaks inenergy. It connects peaks that it finds within agiven range and makes plot-able lines which can

5

Figure 2.4: The results of the xcorr function showsthat there is a clear rhythm. Peaks indicate areoccurence of a structure in the given channel.The distance between the peaks tells how muchtime has passed between the reoccurences.

Figure 2.5: A cochleogram showing that a musicsample lends itself well to thefindRidgesInPeakMask function. The black linesrepresent the ridges that were found.

be plotted over the D.EdB cochleogram, thusshowing coherent ridges found in that signal.This function has some usability for making thecomputer find the longer lasting tones itself. Theridges can switch from channel 40 to 45 in a fewtime frames due to the function allowing a jumpbetween channels (an example can be seen inFigure 2.5). This makes it harder to findprolonged horizontal structures of sound. One ofthe things that describes music is a rhythmicallypresent sound which can change in frequencyover time. The changes between frequencies issomething that this function can detect, whilestill being able to keep track of the ridge that ithas found.

Results All of the above mentioned functions have beenused in the pursuit of finding features for music.Some worked, some were quickly abandonedand some have not been used to their full extentyet. In this section will be explained whatfeatures were found.

3.1 dB differencesOne of the more obvious features, as one can seein Figure 1.1, is the amount of energy in dBpresent in the more prominent notes of the music(found in the D.EdB matrix). A high dB valuecould therefore indicate the presence of music. If

it had been so easy, this thesis would not haveexisted. Take the following cochleogram (Figure3.1) for example, it shows huge amounts ofenergy, but it is not all from the music. Thisshows that using dB in finding music istreacherous, because a high amount of dB doesnot necessarily mean that the signal containsmusic.

3.2 Tonality and frequenciesThe EPAS_S tell a lot about the tonality of thesignal. As can be seen in Figure 3.3, there is acertain range of frequencies where a lot of thevalues are likely to be tonal sounds. Thischaracteristic is present in all samples, whichmeans it is not necessarily true for all music. Thisrange is interesting as it tells something aboutthe sounds found outside this range. They couldcertainly still be music, but the possibility of itbeing noise becomes much more likely.Especially the lower frequencies seem to be more

6

Figure 3.2: The results of the xcorr shows that there are no clear matches, so no rhythm can be found. Channel 150 was examined here.

Figure 3.1: A cochleogram consisting of one of theeveryday scenarios, combined with hard rock musicin the last 20 seconds of the signal. The greatamounts of dB in the beginning are helicopternoise.

predominantly occupied by noise than sounds.The strongest tonal sounds are usually found inthe range of 260 to 4200 Hz. This range is one ofthe features that shows promise if moreknowledge of frequencies had been used. Toneladders are a frequent sight in music and the factthat certain notes have fairly specific frequenciesmakes them strong possibilities for musicdetection and analysis. This thesis has not usedthis knowledge for the detector due to a lack oftime and in-depth knowledge about the subjectto implement it in the detector. 3.3 RhythmOne of the more prominent features in a signal ofpure music is the fact that it contains a certainrhythm. Most western music can be described interms of beats-per-minute (bpm). Using thefunction xcorr, it is possible to find correlations

between the presence of pulse-like sounds andthe frequency with which they are present in the signal. If there is a correlation between time andthe presence of pulses, it has a high probabilityof telling the rhythm of the music that is presentin the signal. This feature was not as useful as itseems. Most pulses from rhythm are present at the lower frequencies (e.g. lower than 260 Hz),which is also the range where ambient soundsreside. This leads to a conflict between thesignals and finally in the loss of consistency inthe rhythmic pulses, which makes it difficult tofind the rhythm using the xcorr function. Anexample of this can be seen in Figure 3.2.

3.4 RidgesOne of the more useful structures found in thesignals were ridges. Ridges are horizontal linesof consistently high dB or a high amount of

7

Figure 3.3: Five cochleograms, each of the forest-scenario mixed with a different music sample. The music ispresent during the entire signal, except for the first and last five seconds (the last two stop around 50 seconds inthe signal).

standard-deviations (when using EPAS_S). Theyindicate a prolonged presence of sound, which indicates a high chance of it being music orsomething else non-noise related. This featurewas easily found by looking at the structures ofthe sound in the cochleograms (Figure 2.3 showsthese ridges to great extent). It has proven its selfas a valid feature for detection and it was centralin the design of the detector

These were the features found to be mostprominent in the music-styles examined with thefunctions mentioned earlier. They are not acomplete set of all possible features found to behighly indicative of music. They are merely theresults from analysing a few music-stylescombined with seventeen different scenarios.

4 . Detector

MethodFor this section, the way the detector is built andtested will be explained. This includesexplanations for choices made in the design ofthe detector. Figure 4.1 shows a diagram ofthe steps taken by by the detector and what datait needs.

4.1 DetectorThe detector is built with a simple idea in mind:take the information from a cochleogram (likethe ones in Figure 3.2) and use the features foundto create criteria. All the data that meets thecriteria sufficiently is considered music. Each ofthe criteria used for this detector will bediscussed shortly in the order in which it isapplied, what it does and why it was used.

4.2.1 Data has to be tonalThe very first criterion is an obvious butimportant one: the data has to be (remotely)tonal. With EPAS_S it is known what structurescould be considered to differ from noise,structures which could possibly be music.However, a lot of the matrix values only have asmall standard-deviation from noise, so there hasto be a threshold. The key here is to keep enoughinformation about the signal in the matrix sothere is enough data to work with, but gettingrid of the unnecessary information. A standard-

deviation of one to one-and-a-half is effectivemost of the time. To make this threshold a bitflexible the detector takes the average of all cellsof the EPAS_S matrix and adds this to the 1.5threshold. This average holds information aboutthe overall noisiness of the signal, lowering thethreshold if there is a negative average (e.g. alow overall deviation from noise). And itincreases the threshold if the signal differsgreatly from noise. This threshold isadministered after a movAv has been applied tothe EPAS_S values. MovAv makes each matrixcell the average of it's surrounding cells, wherethe reach of its surrounding points is determinedby the user (the range and used values can befound in the appendix). This makes the strongerdiffering values more prominent in the signal.

4.2.2 Filter for RidgesNow that the ridges are more prominentlypresent in the signal, it is time to find them in thesignal and discard any other excess data. For this

8

Figure 4.1: A diagram showing the process withwhich the detector determines whether music ispresent in the signal.

the function lengthAreas is very useful. Explainedbriefly, this function takes ridges and makes allvalues in that ridge equal to the total length ofthat ridge. So lengthAreas will make the longestridges have the highest value. Some of theinteresting ridges aren't fully connected to eachother. With that in mind there has to be a smallgap allowed between the ridges to keep thelonger ridges connected. This way, the longerridges, which are more probable of being music,remain present in the data. Now there are many ridges, each with their ownvalues. The data still contains many smallsnippets of sound that passed the first criterion.These snippets are often very short, so a newcriterion is necessary. This case is the same as thefirst criterion as there is need for anotherthreshold that keeps the valuable data, but kicksout the remnants of other ambient tonal sounds.The threshold that was settled for in this thesis isa minimal length of 150, which is equivalent tothree quarters of a second. By trying out severalvalues it turned out that ¾ of a second strikes abalance between keeping enough interestingdata and removing enough ambient tonalsounds.This latter threshold is purely practical inthe sense that it relieves the matrix from possibleambient sounds while maintaining enoughinformation about the signal to work with.

4.2.3 Keeping out monotone soundsUp until now everything is going fine, thecriteria do a great job of filtering out everythingthat isn't a ridge. Yet one of the stronger ambientsounds is still in the data, sounds that differ fromnoise in the same way that the music ridges do.These sounds consist of long tonal componentsproduced by, for example, electronic devices (e.g.refrigerator, air conditioning, etc.). Figure 2.3shows such a source, the light-blue structurearound channel 100. They are present during theentire signal and have become one long ridgedue to the computations that have been done sofar. The latter part is what makes this problemeasier to solve, because finding an extremelylong ridge is fairly easy. The way the detectortries to deal with these ridges is by looking ateach row and counting the fraction of cells whichhas an EdB level higher than the lowest EdBpossible in the signal. The detector tries to find

channels which have sound for at least 80percent of the entire signal. Channels that meetthis criterion have either a lot of small ridges orone big ridge. Due to the fact that the remainingdata has been severely filtered, the first option israre. The latter, on the other hand, is found inmost scenarios with a clear monotonous ambientsound. The channels for which the 80 percentmark stands are thrown away, along with threechannels on each side to make sure that the ridgeis removed completely. The 80 percent mark waschosen due to the fact that the filters up untilnow could have diminished the monotonesounds slightly. Making the percentage higherwould therefore create a risk for leaving toomany lengthy ridges in the data. Lowering thethreshold would also bring about problems asthe chance of losing the precious ridges that havebeen found with the filters becomes higher. 80percent has been found to strike the right balancebetween these to problems. As this thesis istrying to make a detector and show what isnecessary to create such a system, there has notbeen done any further research in finding theoptimal percentage for this filter.

4.2.4 The length of the ridges The ridges that remain are very close to beingclassified as music. The data consists of tonalridges that have a length of at least ¾ of a secondand do not span over more than 80 percent of thesignal. The final step now is to see where theridges are and take note of this in a vectorconsisting of the time. For every column wheresome energy can be found, the detector keepstrack of where it starts and ends. All columnsfrom the start until the end are registered in thearray as ones (The array starts out with onlyzeros). This new array has now been filled withthe knowledge of where one or more ridges werepresent. This array will be tested by comparing itto a similar array made with the test data. Beforeit compares them there is one final criterion thathas to be met. This criterion demands that thesegments of ones in the array has to be longer than five seconds. Music is something that islistened to in longer segments, so this should bereflected upon in the criteria. Because the ridgesvary in their channel it is not possible to do thiswhen it is still two dimensional data, as can be

9

seen in Figure 4.2. The lines give an indication ofwhere the music is. Especially the right segment shows that demanding the ridgesto be longer than five seconds would result inthrowing away too much the data. The data yousee outside of the segments are all relativelyshort and don't seem to be from music. By demanding music to be longer than five seconds,there is a fail-safe for the small snippets of tonalambient sounds that made it all the way throughall criteria. Again, due to the aim of this thesis,no research has been done in trying to find theoptimal length. With all these criteria, thedetector ran through all the data, the results ofthis run will be revealed in the next section.

Results The performance was measured in two ways:how much music (i.e. columns) the detector had found that was actually present and how muchmusic it had found too much. The total scoreover all seventeen scenarios, each with the fivemusic styles and four placements was asfollows:

- Percentage of supposed music found too little (false negative): 12,49%

- Percentage of supposed music found too much (false positive): 24,11%

So the criteria of the detector seem to be tooweak, some tonal, non-music structures getthrough. The detector, thankfully, does a steadyjob of finding the actual music. The overallperformance, however, doesn't provide all thepeculiar differences between each scenario ormusic style. Table 4.1 shows the results dividedover each scenario. It shows that scenarios withstronger and/or more ambient sounds, such as abusy street or a bus station, are harder toanalyse. When ambient sounds are moreprominently present in the signal, they differmore from the structures that noise would have.This, in return, leads to more non-music tonalstructures present in the EPAS_S values. It seemsthat the criteria are not flexible enough to handlea more cluttered signal. In some of the morequiet scenarios the detector finds a lot moremusic than there is actually present. However,not all quiet scenarios suffer from this, so why dosome do? These scenarios do because there is amonotonous sound source present, a strongambient noise that is present throughout almostthe entire signal. It seems that these kind ofsignals are not as easily dealt with, because thedetector has a criterion to deal with this kind ofstructure. The difference per music style seenover each scenario is shown in table 4.2including the averages for each music style. Thistable shows the performance of the detector foreach music style used in the test set. The resultsreveal that most scenarios lead to the same

10

Table 4.1: A table with the results divided perscenario. The values consist of the averages of eachmusic style per position.

Figure 4.2: A Cochleogram where the final stepwould be to convert it to a 1 by 12000 matrix, onlykeeping the information where there was at leastsome energy. The music is present between theblack lines (which are purely an indication).

results, so the music style does not matter much.However, there is one music style which is hardto detect using the criteria from this thesis,namely rap music. This music style only hassimilar results to the other styles when it is aquiet scenario, and even then it differsoccasionally. One reason for this difference inperformance is due to the structure of rap musicas a whole. Rap music consists mostly out of abeat and rhythmic rhymes. So instead of singingand prolonging notes, rap is more staccato-likein its structure. So rap only has ridges if the beatand tune of the song have prolonged notes,something which is not all that frequent in thegenre. This lack of ridges makes it hard for thedetector to find the music in signals where other,more prominent ridges are present. Althoughthis detector can not detect rap well, it could bedetected better using a detector more focussedon rhythm.

5. ConclusionHopefully, this thesis gives an interesting insightin the world of music and signal analysis. As itturns out, music has some properties that makeit unique in a sense, but the ones on which thisthesis capitalized differ between music genres.However, it can be said that prolonged tones doencompass one of the more usable featuresfound to be indicative of music. Rhythm seemsto be a feature that is not strong enough forsignal analysis. The fact that it is one of the moreprominent properties of music makes it hard tobelieve that rhythm can be ignored as a possibledetectable feature in mixed signals completely.This thesis failed to use the function xcorr to itsfullest, as it works best if the right channel can befound to find the pulses in. As this can be adifferent channel for each new signal that thedetector gets to analyse, a dynamic way ofdetermining the right channel would be asolution to the problem encountered in thisthesis. Another possible gem for this kind of signalanalysis is the information residing in thefrequencies of music. As this thesis has shown,there are certain areas within the frequencychannels of CPSP that include most of the tonalenergy and its structures. Capitalizing more onthis using tone ladders and other knowledge

11

Table 4.2: The results of each music style per scenarioseparated into precentage missed and percentagefound too much.

from music theory about frequencies (e.g.instruments produce multiple harmonics), it iscertainly possible to make a more robust andflexible detector. The problem of filteringmonotone sounds is an essential one for this kindof detector to work. The solution used in thisthesis has proven itself as being too lenient. Asfor the performance of the detector presented inthis thesis, it does a decent job. Although it is toorigid in the way it applies its criteria, it gives agood impression of what can be done with CPSPin this field of research.

6. References[1] Zhou, R., & Reiss, J. D. (2011). Music OnsetDetection. In W. Wang (Ed.), Machine Audition:Principles, Algorithms and Systems (pp. 297-316).Hershey, PA: Information Science Reference.doi:10.4018/978-1-61520-919-4.ch012

[2] Bregman, A. S. (1990). Auditory scene analysis:The perceptual organization of sound. The Journal ofthe Acoustical Society of America, 95(2)

[3] Wang, A. (2006), Communications of the ACM -music information retrieval. Volume 49 issue 8, 44-48. doi:10.1145/1145287.1145312

[4] Andringa, T. C. (2002). Continuity preservingsignal processing. PhD thesis, University ofGroningen, Groningen.

[5] Andringa, T., van der Linden, R., Krijnders, D.(2009). Dares_g1 database. In DARES: Database ofAnnotated Real Environmental Sounds. RetrievedSeptember 10, 2013, fromhttp://www.daresounds.org/database.php.

7. AppendixThe details of what the used CPSP functions doand how the database was created will beexplained here.

7.1 CPSP Functions

Startup & functionsTestTo explain all the matrices and commandsavailable properly, these two commands must beexplained first. The first one is used to initializeall variables and directory paths needed to startreading a sound file. The second commandopens up a UI which asks us to choose certainrequirements we want for our analysis. Theserequirements serve to create a struct, filled withvariables that depend on the requirementschosen in the UI. This paper used calcRS,calcEdB, initPAS, calcPAS_S, calcPAS_T,basicTexture and calcBG. After selectingrequirements and a sound file, the struct called‘D’ is created, which contains the informationused by the commands below. What the 'D'struct contains depends on what requirementsare selected Before any other commands areexplained, let’s see what information from thestruct was used by this thesis.

The CochleogramD.EdB is a matrix of the entire sound file, wherethe rows represent the channels of the cochlea(instead of frequencies) and the columnsrepresent the length of the signal, 1 cell being 1/5of a millisecond. Each cell has an energy levelwhich stands for the strength of the signal at thatparticular channel at that specific time. Plottingthis matrix creates the cochleograms like theones in this thesis.

The tonal matrixD.EPAS_S is a matrix of standard-deviations.This matrix says something about how the localshapes deviate from the expectations of noise.Each cell is described in terms of standarddeviations, where the value given to the cellmeans how strongly it deviates from possiblybeing noise (e.g. a high value means a strongpossibility of it not being noise). So for example,a value of two indicates that the associatedproperty of that part of the cochleogram

12

http://www.daresounds.org/database.php

corresponds to two standard deviations from themean of noise. By using this matrix it is possibleto weed most of the obvious noise out. Thismakes it a strong tool for finding the outlines ofthe information that is interesting, the outlines ofpossible music.

Xcorrxcorr is a function that, given a matrix as input,returns a matrix with an estimation whethersequences of energy found on the x-axiscorrelate. It keeps track of structures found in thegiven channel and notes a high value if thestructure at a certain time in the signal resemblesan earlier noted structure. For detecting musicthis can be a valid tool for finding rhythmicstructures in the signal, as the time between twopeaks can be used to determine the amount ofbeats-per-minute. Finding the right channel tosearch for the rhythmic structures is crucial. Thechannels mainly chosen in this thesis rangedfrom 140 to 150. As mentioned in the conclusion,a dynamic way of selecting the right channel is adirection that is likely to be more fruitful.

MovAv movAv is a helpful function. It requires a matrixwith values and two uneven numbers to know how far it needs to look horizontally and vertically (e.g. making both variables five wouldmake it look two cells in the direction of eachadjacent cell). For each cell in a matrix it sums allthe values found in the range of the variablesand takes the mean of the sum. This leads to thestrengthening of cells surrounded by high valuesand the weakening of cells surrounded by lowvalues. This in return leads to stronger ridgesand the weakening of small spots of noise in thesignal, leading to an overall boost in smoothness.Doing this too often would almost always lead toridges, so to prevent artefacts from occurring inthe data this function should be used inmoderation. The range used for this detector was5.

LengthAreaslengthAreas has an interesting functionality. Itreceives a matrix of values (possibly bound to acriterion), a dimension in which it must operateand a gapsize. It looks through the matrix using

the direction given by the dimension (e.g. a twowould make it evaluate each column per row)Each nonzero value found is replaced by 'a', thenumber of consecutive nonzero values. Thegapsize tells lengthAreas how many zero values itmay ignore per sequence before it finalizes thesequences and starts to search for another one.LengthAreas makes it possible to filter the signalfor information relevant for music. For example, long strokes of energy in one frequency in the signal could point to prolonged musical tones. Ifprolonged sequences of tonal sounds areindicative of music, then this function will helpin finding them. The gapsize used in this thesiswas 50.

7.2 Database of mixed samples

Adobe SoundboothThe exact effect used to create more realisticmusic samples is a reverb effect called 'smallroom'. The music samples were created byselecting a minute in each song. For all songsexcept the classical piece this was the firstminute. The exact songs used per music style are:

Modern Classical Piano: Lucovico Einaudi – Nightbook

Electronic:Daft Punk – Digital Love

Hard Rock:Queens of the Ages – Go with the Flow

Piano Pop:Coldplay – Clocks

Rap:Kanye West (Feat. Syleena Johnson) – All FallsDown

The signal-to-noise ratio was not determined bya parameter sweep. In hindsight, that was abetter option than used in this thesis. For thisdatabase it was done by listening to each sampleand equalizing the volume of both the scenarioand the music sample using Adobe Soundbooth.To make it sound realistic, the volume of the

13

music sample had to be decreased most of thetimes, depending on the volume of the scenario.

14

detecting music in a noisy, everyday environment · 2018. 2. 15. · detecting music in a noisy,...

Documents