Automatic speech/music
discrimination in audio files
Lars Ericsson
Master’s thesis in Music Acoustics (30 ECTS credits)
at the School of Media Technology Royal Institute of Technology year 2009
Supervisor at SR was Björn Carlsson. Supervisor at CSC was Anders Friberg.
Examiner was Sten Ternström
Automatic speech/music discrimination in audio files
Abstract

This master's thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Swedish Radio and is therefore suited to their needs and optimized for their material. A series of tests on acoustic features extracted from Swedish Radio broadcasts was performed to distinguish the best method for the discrimination. These tests were evaluated to find the method most suited for the task. Methods based on the RMS amplitude of the signal were chosen for both classification and segmentation. A feature for the proportion of low energy, which exploits the small pauses that can tell speech from music, was used for the classification, and a similarity measure based on the mean and variance of the RMS was used to find the transition points for the segments. This resulted in an algorithm that can discriminate speech from music in Swedish Radio broadcasts with an accuracy of 97.3%.
Automatic discrimination of speech and music in audio files

Summary

This master's thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Sveriges Radio and is therefore designed for their needs and optimized for their material. To arrive at the best method for the discrimination, tests were performed on acoustic features extracted from material in SR's digital archive. These tests were then evaluated to find the method best suited for the purpose. For both classification and segmentation, methods based on the RMS of the amplitude of the signal were chosen. A feature for low-energy proportions, which exploits the short pauses that distinguish speech from music, was chosen to classify segments, and a similarity measure based on the mean and variance of the RMS was chosen to find the segment boundaries. This resulted in an algorithm that can discriminate speech from music in SR's programming with an accuracy of 97.3%.
Contents

1 Introduction
  1.1 Speech/Music discrimination
  1.2 Swedish Radio
  1.3 Goal
  1.4 Method
  1.5 Limitations
  1.6 Overview of the paper
2 Background
  2.1 System structure
    2.1.1 Online systems
    2.1.2 Offline systems
  2.2 Features and feature extraction
    2.2.1 Speech and music
    2.2.2 Standard Low-Level (SLL) features
      2.2.2.1 RMS
      2.2.2.2 Zero Crossing Rate
      2.2.2.3 Spectral Centroid
      2.2.2.4 Spectral Rolloff
      2.2.2.5 Flux (Delta Spectrum Magnitude)
    2.2.3 Frequency Cepstrum Coefficients
    2.2.4 Psychoacoustic features
    2.2.5 Special features
    2.2.6 Psychoacoustic pitch scales
    2.2.7 Extracting features
  2.3 Segmentation
  2.4 Classification methods
    2.4.1 Hidden Markov Models
    2.4.2 System learning
    2.4.3 Refined classification
  2.5 Evaluation methods
  2.6 Earlier results
3 Evaluation and tests
  3.1 Test database
  3.2 Tools
  3.3 Feature tests
    3.3.1 Feature test results
    3.3.2 Low-level features
    3.3.3 Mel Frequency Cepstrum Coefficients
    3.3.4 Modified Low Energy Ratio
  3.4 Segmentation tests
  3.5 Test evaluations
4 Algorithm
  4.1 Signal preprocessing
  4.2 Feature extraction
  4.3 Segmentation
  4.4 Classification
  4.5 Refinement
  4.6 Output
  4.7 Results
5 Conclusions
6 Future work
7 Acknowledgements
8 References
List of Abbreviations
DCT   Discrete Cosine Transform
ERB   Equivalent Rectangular Bandwidth
FFT   Fast Fourier Transform
LFCC  Linear Frequency scaled Cepstrum Coefficients
LPC   Linear Predictive Coding
MFCC  Mel Frequency Cepstrum Coefficients
MLER  Modified Low Energy Ratio
RMS   Root Mean Square
SC    Spectral Centroid
SLL   Standard Low Level (features)
SMD   Speech/Music Discrimination
SR    Swedish Radio
ZCR   Zero Crossing Rate
1 Introduction

This chapter includes an overview of the task, the purpose, method and limitations of the work. It also gives a brief view of the rest of the report.

1.1 Speech/Music discrimination

The purpose of speech/music discrimination (SMD) systems is to divide audio into music and speech segments and classify them, whether the discrimination is done in real time or on recorded audio files.
The Speech/Music Discrimination task is an important part of Automatic Speech Recognition (ASR) systems, where it is used to disable the ASR when music or other classes of audio are present in automatic transcription of speech.
SMD systems are also useful for bit-rate coders. Speech coders achieve better results for speech coding than music coders do, and vice versa. It is therefore important to discriminate between the two audio classes in order to select the right type of bit-rate coding.
An SMD system can also be of great importance when indexing sound and even video. Apart from detecting speech and music in audio files, the same algorithm can be applied to the audio of TV shows or movies. This can later be used for indexing the video material, making it possible to jump straight to a desired part in e.g. on-demand video solutions for the web.
1.2 Swedish Radio

SR is a non-commercial, public service radio broadcaster with over 40 radio channels, including four national FM channels (P1, P2, P3 and P4) and 28 local channels. P4 is the biggest radio channel in Sweden. SR also offers more than 10 channels online on their website, together with an archive of all broadcast programs available on demand. Broadcasts are also available via short wave, medium wave and satellite. [12]
1.3 Goal

A system that effectively discriminates between speech and music will be useful for Swedish Radio in many of the above-mentioned applications. The goal is therefore to create an algorithm that can do the task accurately for the material produced by Swedish Radio. To meet the requirements of a cost-efficient, accurate algorithm, an offline system structure was chosen.
1.4 Method

The project started with a review of the relevant literature and articles on previous work, which made it possible to enter the test phase with knowledge of the area. In this phase many features and discrimination methods were tested and evaluated in order to find the method best suited for this specific task. The chosen method was then coded in a way that permitted integration into existing systems at Swedish Radio.
1.5 Limitations

This work focused on the discrimination between speech and music, and did not attempt a finer discrimination within the classes. Thus, the algorithm will not be able to discriminate between different speakers such as women, men or children, and will not be able to tell different voices apart. Music will only consist of one class and will not be further divided into different classes for different genres.
1.6 Overview of the paper

The paper starts with a background chapter including an introduction to the area of Music Information Retrieval and an overview of earlier work done in the speech/music discrimination area. The chapter also presents features that are commonly used for these kinds of systems. Chapter 3 contains all the tests done within this work; it also presents the test results and ends with an evaluation of them. Chapter 4 gives a detailed description of the finished algorithm, and the paper then ends with a chapter of final conclusions drawn from the project.
2 Background

This chapter gives an introduction to the part of the Music Information Retrieval research area that has been used during this project. It also gives a brief look at earlier work.
2.1 System structure

Earlier systems developed for SMD tasks have different structures. They can be divided into two main groups: online systems, where the discrimination is made in real time, and offline systems, where the discrimination is made on audio files. Both groups have their advantages; online systems are better suited for live purposes, while offline systems can be made more accurate and faster. Different tasks require different systems, suited for specific purposes.
2.1.1 Online systems

In online systems, the segmentation and classification tasks need to be done at the same time. They can even be regarded as one task, where the output of the classifier is used for the segmentation. This is all done in real time. "Real time" means that the system outputs results continuously as the input audio stream comes in, but with a delay of at least one analysis frame.
Online systems often concentrate on finding large changes in the audio in order to find borders of speech or music segments. Saunders does this in [9] by dividing the audio into non-overlapping 16 ms frames (256 samples at 16 kHz) from which simple features are extracted, and by using a longer 2.4 s analysis frame (150 × 16 ms frames) for statistical features used in the classification (see the next chapter for more on features). A multivariate Gaussian classifier then does the classification.
Another real-time system is described in [7]. A typical frame size in earlier online systems is around 20 ms, and the analysis frame varies from half a second up to three seconds. The use of statistical features demands longer analysis frames to get good results. The short frame is used to extract features, and the longer analysis frame is used to extract statistics of these features, such as variance and mean values.
2.1.2 Offline systems

The structure of offline systems varies considerably. The most common system has three steps plus pre- and post-processing, as seen in Figure 1. First the audio is divided into frames, and from each frame a set of features is extracted and stored in a feature vector. In the second step the system segments the sound, and in the third step the segments are classified using some kind of classification method to decide whether a segment consists of speech or music. The SMD task is often done in more steps in an offline system than in an online system. The segmentation can be refined, as in [8], by using the fact that neighboring segments have a high probability of containing the same class.
Figure 1. Block diagram of an offline system structure.
In [8] some classification is done during the segmentation, and the more difficult segments are left for another, more complex classification method. This saves computation time, since a simpler classifier makes the first classification. Other systems classify large segments and then divide the segments where a border is found into smaller segments, which is also a way to save computation time.
Offline systems often use a frame length of around 20 ms and an analysis window length of around 1 second. Shorter frames make the features more sensitive to noise, while longer frames might include too many phonemes.
2.2 Features and feature extraction

Sound has many features, some that our ears can pick up and some that we cannot even hear. Speech has been closely studied and is relatively well defined, whilst music is a much wider class of sound. The French composer Edgard Varèse defined it as "Music is organized sound". This might seem abstract, but can be used even in these kinds of technical solutions. It is a common approach to look for repetition in sound in order to classify it as music.
2.2.1 Speech and music

To be able to distinguish between speech and music, features that differ between the two classes need to be used. A simple look at the waveforms of one-minute excerpts of speech, pop music, classical music and opera (all examples taken from Swedish Radio broadcasts) already indicates large differences between the classes. The speech waveform in Figure 2 shows rapid changes in energy and amplitude that none of the music waveforms do. The heavily compressed waveform of The Killers' song in Figure 3 seems to lack dynamics entirely, while the classical piece in Figure 4 and the opera piece in Figure 5 have some short amplitude peaks and show large dynamic variations.
Figure 2. 1 minute of speech.

Figure 3. 1 minute excerpt of The Killers – All These Things.

Figure 4. 1 minute of classical music.

Figure 5. 1 minute of opera.
Even if the classes are easy to identify in the waveform, the exact position of the transitions can be difficult to detect. Figure 6 shows an excerpt from P3 Pop where a pop song abruptly stops and the host of the show starts speaking. The transition is marked with a vertical line.
Figure 6. Waveform representation of an excerpt of P3 Pop with the transition from music to speech marked.
The changes in energy are even clearer when the RMS of the sound wave is plotted, as in Figure 7. The scales are not the same in the four examples, but the most interesting thing is the changes. In all the examples containing music the RMS never goes down to zero and does not diverge largely from the mean RMS. In the speech example, however, there are relatively many frames with zero or close-to-zero RMS, and the changes are rapid. What looks like large variations in the RMS values of the pop music in Figure 7 can be explained by the scale on the y-axis, ranging from 0 to 0.1, while speech ranges from 0 to 0.25.
Figure 7. RMS graphs. The y-axis shows the RMS value calculated from 20 ms frames and the x-axis shows temporal location in seconds. Top left: speech. Top right: pop music. Bottom left: classical piece. Bottom right: opera piece.
By looking at the spectra of the four examples in Figure 8, it is clear that all the music examples have a higher peak in the low frequencies, although the peak occurs at different frequencies. This peak most likely corresponds to the fundamental frequency of the vocal components. The classical piece, which lacks vocals, does not have the same sharp peak as the other three examples. The speech example has more energy in the frequency range around 10 000 Hz than the music examples.
Figure 8. Spectrum plots analysed using a Hanning window with a window size of 1024 samples. The y-axis shows sound level (dB) and the x-axis shows frequency (Hz). From the top: speech, pop music, classical piece, opera piece.
The features described below are divided into three groups according to [5]: the simpler Standard Low-Level (SLL) features, the Frequency Cepstrum Coefficients and the more advanced psychoacoustic features.
Online systems often use simpler features to avoid having to compute the FFT, which is relatively costly compared to the SLL features. The most commonly used features in these systems are the zero crossing rate and RMS.
2.2.2 Standard Low-Level (SLL) features

These include RMS, Zero Crossing Rate, Spectral Centroid, Spectral Rolloff, Band Energy Ratio, Flux (also called Delta Spectrum Magnitude), bandwidth, pitch and pitch strength.
Some features, like bandwidth and pitch, are self-explanatory. The other tested features are explained in the following sections.
2.2.2.1 RMS

RMS, or Root Mean Square, is a measure of the amplitude of a sound wave in one analysis window. It is defined as

$$\mathrm{RMS} = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}} \qquad (1)$$

where $n$ is the number of samples within an analysis window and $x_i$ is the value of sample $i$.
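For concreteness, Eq. (1) can be computed directly from a list of samples. A minimal Python sketch (the function name is my own):

```python
import math

def rms(frame):
    """Root mean square amplitude of one analysis window, as in Eq. (1)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# A frame of constant absolute amplitude 0.5 has an RMS of exactly 0.5.
print(rms([0.5, -0.5, 0.5, -0.5]))  # -> 0.5
```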
2.2.2.2 Zero Crossing Rate

This is a measure that counts the number of times the amplitude of the signal changes sign, i.e. crosses the x-axis, within one analysis window. The feature is defined as

$$\mathrm{ZCR} = \frac{1}{T-1}\sum_{t=1}^{T-1}\mathrm{func}\{s_t s_{t-1} < 0\} \qquad (2)$$

where $s$ is the sound signal of length $T$ measured in time and $\mathrm{func}\{A\}$ equals 1 if $A$ is true and 0 otherwise.
The Zero Crossing Rate feature is sometimes used as a primitive pitch detector for mono signals. It is also a rough estimate of the spectral content.
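Eq. (2) translates almost directly into code; this sketch counts sign changes between consecutive samples:

```python
def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose product is negative, Eq. (2)."""
    T = len(signal)
    crossings = sum(1 for t in range(1, T) if signal[t] * signal[t - 1] < 0)
    return crossings / (T - 1)

# An alternating signal crosses zero between every pair of samples.
print(zero_crossing_rate([1, -1, 1, -1]))  # -> 1.0
```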
2.2.2.3 Spectral Centroid

This feature is effective in describing the spectral shape of the audio. It is correlated with the psychoacoustic features sharpness and brightness. There are several definitions of the Spectral Centroid feature in previous work. In this study it is calculated as a weighted mean of the frequencies in the FFT of the signal:

$$\mathrm{SC} = \frac{\sum_{n=0}^{N-1} f(n)\,x(n)}{\sum_{n=0}^{N-1} x(n)} \qquad (3)$$

where $x(n)$ represents the magnitude of bin number $n$, and $f(n)$ represents the center frequency of that bin.
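Given the bin magnitudes and their center frequencies, Eq. (3) is a weighted mean; a minimal sketch:

```python
def spectral_centroid(magnitudes, center_freqs):
    """Magnitude-weighted mean frequency of one FFT frame, Eq. (3)."""
    return (sum(f * x for f, x in zip(center_freqs, magnitudes))
            / sum(magnitudes))

# Two equally strong bins at 100 Hz and 300 Hz average to 200 Hz.
print(spectral_centroid([1.0, 1.0], [100.0, 300.0]))  # -> 200.0
```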
2.2.2.4 Spectral Rolloff

Like the Spectral Centroid, the Spectral Rolloff is a representation of the spectral shape of a sound, and the two are strongly correlated. It is defined as the frequency below which 85% of the energy in the spectrum lies. If $K$ is the bin that fulfils

$$\sum_{n=0}^{K} x(n) = 0.85 \sum_{n=0}^{N-1} x(n) \qquad (4)$$

then the Spectral Rolloff frequency is $f(K)$, where $x(n)$ represents the magnitude of bin number $n$, and $f(n)$ represents the center frequency of that bin.
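In code, Eq. (4) amounts to finding the first bin where the cumulative magnitude reaches 85% of the total; a sketch:

```python
def spectral_rolloff(magnitudes, center_freqs, fraction=0.85):
    """Center frequency of the first bin K at which the cumulative
    magnitude reaches `fraction` of the total, following Eq. (4)."""
    threshold = fraction * sum(magnitudes)
    cumulative = 0.0
    for x, f in zip(magnitudes, center_freqs):
        cumulative += x
        if cumulative >= threshold:
            return f
    return center_freqs[-1]
```

For a flat spectrum of ten equal bins the rolloff lands on the ninth bin, since 9/10 is the first cumulative fraction at or above 0.85.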
2.2.2.5 Flux (Delta Spectrum Magnitude)

The Flux, or Delta Spectrum Magnitude, feature is a measure of the rate at which the spectral shape changes, or fluctuates. It is calculated by summing the squared differences of the magnitude spectra of two neighboring frames. This feature has shown good results in the SMD task in [14]. It is defined as

$$F = \sum_{k=1}^{N/2} \left( \left| X_r[k] \right| - \left| X_{r-1}[k] \right| \right)^2 \qquad (5)$$

where $N$ is the number of FFT points and $X_r[k]$ is the STFT of frame $r$ at bin $k$.
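Eq. (5), computed over two already-extracted magnitude spectra, is a one-liner in practice; a sketch:

```python
def spectral_flux(prev_mags, curr_mags):
    """Sum of squared differences between the magnitude spectra of two
    neighboring frames, Eq. (5)."""
    return sum((a - b) ** 2 for a, b in zip(curr_mags, prev_mags))

# Identical frames give zero flux; a change of 2 in one bin contributes 4.
print(spectral_flux([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # -> 4.0
```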
2.2.3 Frequency Cepstrum Coefficients

The second group is the Frequency Cepstrum Coefficients (FCC), which include the Mel Frequency Cepstrum Coefficients (MFCC) and the Linear Frequency scaled Cepstrum Coefficients (LFCC). These are all power spectrum representation features calculated with different frequency scales.
The most frequently used coefficients for these systems are the MFCC. These are computed by taking the FFT of every analysis window, mapping the spectrum to the Mel scale, taking the base-10 logarithms of the powers and then applying a Discrete Cosine Transform (DCT) to decorrelate the coefficients. [14]
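The last two steps of that chain (log of the mel band powers, then a DCT to decorrelate) can be sketched as follows. The FFT and mel filterbank stages are assumed to have been applied upstream, and the unnormalized DCT-II form used here is one common but unspecified choice:

```python
import math

def cepstrum_from_mel_powers(mel_powers, n_coeffs=13):
    """Base-10 log of mel band powers followed by a DCT-II: a sketch of
    the final MFCC steps (filterbank output is assumed as input)."""
    logs = [math.log10(p) for p in mel_powers]
    M = len(logs)
    return [sum(logs[m] * math.cos(math.pi * k * (m + 0.5) / M)
                for m in range(M))
            for k in range(n_coeffs)]
```

A flat filterbank output of all ones has zero log power in every band, so every cepstral coefficient is zero.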
The overall performance of the MFCC features was shown in [5] to be slightly better than that of the SLL features. This relates to the fact that MFCC performs better on pop and rock music, but somewhat worse on classical music, which contains very little vocal information.
2.2.4 Psychoacoustic features

These features are more closely based on our perception of sound, and are therefore called psychoacoustic.
Loudness is the sensation of signal strength, and is primarily a subjective measure for us to rank sounds from weak to strong. Loudness can be calculated (Calculated Loudness) and is then measured in Sone. One Sone is defined as the loudness of a pure 1000 Hz tone at 40 dB re 20 µPa [21].
Roughness is described in [5] as "the perception of temporal envelope modulations in the range of about 20–150 Hz, maximal at 70 Hz" and is also said to be a primary component of musical dissonance.
Sharpness is a measure of the high-frequency energy in relation to the low-frequency energy. Sounds with a lot of energy in the higher frequencies and low energy levels in the lower frequencies are considered sharp.
2.2.5 Special features

In [1] the authors use a feature called Chromatic Entropy, which is a version of Spectral Entropy. The spectrum is first mapped to the Mel scale and then divided into twelve sub-bands with center frequencies that coincide with the frequencies of the chromatic scale. The energy in each sub-band is then normalized by the total energy of all the sub-bands. Lastly, the entropy of the normalized spectral energy is calculated as

$$E = -\sum_{i=0}^{L-1} n_i \log_2(n_i) \qquad (6)$$

where $n_i$ is the normalized energy of sub-band $i$ and $L$ is the number of sub-bands.
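The normalization and entropy of Eq. (6) can be sketched directly; sub-bands with zero energy are skipped, since their contribution to the sum is zero:

```python
import math

def chromatic_entropy(subband_energies):
    """Entropy of sub-band energies normalized by their total, Eq. (6)."""
    total = sum(subband_energies)
    normalized = [e / total for e in subband_energies]
    return -sum(n * math.log2(n) for n in normalized if n > 0)

# Energy spread evenly over 12 sub-bands gives the maximum entropy log2(12).
print(chromatic_entropy([1.0] * 12))  # ~ 3.585
```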
The feature Modified Low Energy Ratio (MLER) is introduced in [8]. The feature exploits the fact that music shows little variation in the energy contour of the waveform, whilst speech shows large variations between voicing and frication. MLER is defined as the proportion of frames, within one second, with RMS power less than a variable threshold. The authors suggest that the threshold should be in the interval [0.05%, 0.12%] for best performance.
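A minimal sketch of the MLER computation, under the assumption (not spelled out above) that the variable threshold is a small fraction q of the window's mean power:

```python
def mler(frame_powers, q=0.0008):
    """Modified Low Energy Ratio: proportion of frames in a one-second
    window whose power lies below a low threshold. Scaling the threshold
    as q times the mean power is an assumption; q = 0.0008 (0.08%) lies
    inside the [0.05%, 0.12%] interval suggested in [8]."""
    mean_power = sum(frame_powers) / len(frame_powers)
    threshold = q * mean_power
    low = sum(1 for p in frame_powers if p < threshold)
    return low / len(frame_powers)
```

Speech, with its pauses, yields a higher MLER than continuously energetic music.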
In [4] a feature called Warped LPC-based Spectral Centroid (WLPC-SC) is introduced. The frequency analysis is mapped to the Bark scale, and the centroid frequency is then computed by a one-pole LPC filter. This feature exploits the fact that speech has a low centroid frequency that varies with voiced and unvoiced speech, whilst music has a changing behavior.
2.2.6 Psychoacoustic pitch scales

Psychoacoustic scales are commonly used in Music Information Retrieval (MIR) systems. Speech and music are often well adjusted to our ears and therefore have most of their information in the frequencies where our ears have the best resolution. The most used scale is Mel, but Bark and Equivalent Rectangular Bandwidth (ERB) are also sometimes used. The one used in this paper, Mel frequency, is defined as

$$f_{\mathrm{mel}} = 1127.01048 \times \log\left(\frac{f}{700} + 1\right) \qquad (7)$$

where $f$ is the original frequency [1].
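Eq. (7) as a small conversion function (the natural logarithm is assumed, which is the convention that makes the scale's constant come out as 1127.01048):

```python
import math

def hz_to_mel(f):
    """Mel value of a frequency f in Hz, following Eq. (7)."""
    return 1127.01048 * math.log(f / 700.0 + 1.0)

# The scale is anchored so that 1000 Hz maps to roughly 1000 mel.
print(round(hz_to_mel(1000.0)))  # -> 1000
```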
2.2.7 Extracting features

Usually a chosen set of features is extracted from each frame of the audio. The features are then often normalized by the computed mean value and the standard deviation over a larger time unit, and then stored in a feature vector.
Features are used in two ways: either by using the extracted value directly or by using its changes over time. When using the changes over time it is possible to calculate statistical features like variance and standard deviation. In [5] it is shown that using changes over time is more accurate than only using the absolute values of the features.
Only one feature is used in [1], [4] and [7], although they all use the advanced special features described earlier. Others have chosen to use a set of standard features. In [3] five different features are used: energy, ZCR, Spectral Entropy and the first two MFCCs. RMS and ZCR are used in [2].
2.3 Segmentation

In [1] and [4] a region growing technique is used for the segmentation step. This technique is widely used in image segmentation, but can also be used for audio. A number of frames are selected as seeds. The feature vectors of the seeds are then compared to those of the neighboring frames. The segment then grows with the neighboring frames as long as the difference in the features does not exceed a predefined threshold.
Other systems, like the one in [2], look for big changes between two neighboring one-second frames. The feature vectors of the two neighboring frames are compared, and if they are sufficiently different, a segment border is detected. When a border is detected in a frame, the transition is marked within the frame with an accuracy of 20 ms.
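As a sketch of this kind of border detector (the Euclidean distance and the threshold value are my own assumptions; [2] does not necessarily use this dissimilarity measure):

```python
def is_border(prev_features, curr_features, threshold):
    """Flag a segment border when two neighboring one-second frames have
    sufficiently different feature vectors. Euclidean distance is an
    assumed choice of dissimilarity measure."""
    dist = sum((a - b) ** 2
               for a, b in zip(prev_features, curr_features)) ** 0.5
    return dist > threshold
```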
2.4 Classification methods

When the segmentation process is done, each segment should be classified as either speech or music. In a more complex system, as in [3], more classes can be defined, such as silence or speech over music. The latter is often classed as speech in systems with only the two basic classes.
The extracted feature vector is used to classify each segment. A mean vector is calculated for the whole segment and is then compared either to results from training data or to predefined thresholds.
A method where the classification is based on the output of many frames together is proposed in [10]. Each second consists of 50 frames, and each frame is assigned a class by a quadratic Gaussian classifier. Then, a global decision is made based on the most frequently appearing class within that second.
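The global decision step reduces to a majority vote over the frame-level labels; a minimal sketch:

```python
from collections import Counter

def second_level_class(frame_classes):
    """Global decision for one second of audio: the class most frequently
    assigned to its frames (50 frames per second in [10])."""
    return Counter(frame_classes).most_common(1)[0][0]

print(second_level_class(["speech"] * 30 + ["music"] * 20))  # -> speech
```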
2.4.1 Hidden Markov Models

HMMs are commonly used for classification. In [3] they are used together with a Bayesian Network Classifier. The feature vector sequence is used as input to the model, which has one state for each class. Only two classes are used in [3]: one for speech and one for music. The probabilities of transitions between states are computed on learning data and stored in the model.
In [6] an HMM with 24 states is used for speech, and another model is used for music. The use of three or four states could correspond to some phoneme classes, instead of having only one class for speech.
2.4.2 System learning

Different methods are used to train classification models. The Viterbi algorithm is used for HMM training in [11], and the Baum-Welch algorithm is another alternative for the system's learning process.
2.4.3 Refined classification

A method to refine the results of the classification is described in [8]. Four states are used: one for speech, one for music, one for transitions to music and one for transitions to speech. If the classifier outputs a music segment while the system is in the speech state, the segment is stored on a stack. If the classifier keeps outputting music segments for a set time, the state changes to music and all the segments on the stack are classed as music. But if the classifier outputs a speech segment within that time, the system goes back to the speech state and all the segments on the stack are classified as speech. The accuracy of the classification is reported to increase by 6.5% when this method is used.
Refinement can be done in both online and offline systems, but can be made more efficient in the latter, where there are no demands on real-time output. The refinement technique described above needs a certain number of segments to perform well; when there are fewer segments, the performance decreases drastically.
2.5 Evaluation methods

The results are often evaluated, as in [3], with the measures recall, precision and overall accuracy. In [3], recall is defined as the proportion of the frames of a specific class that were correctly classified, and precision is defined as the proportion of the frames classified as a specific class that actually belonged to that class. A total accuracy is then calculated as the overall percentage of correctly classified data.
Systems can be optimized for either music or speech to raise the precision of that specific class, although this often decreases the overall accuracy.
2.6 Earlier results
Comparing earlier systems for SMD tasks is not easy. There is no standard database for evaluating them, as there is for speaker and speech recognition systems. This makes it hard to actually rank the existing systems. Most articles report systems with accuracies above 90%, and in [9] an accuracy as high as 98% is reported.
3 Evaluation and tests
This chapter presents the tests done within this work. The results of the tests were evaluated in order to find the best method for the specific task. The test phase was done in three steps: feature and classification tests, segmentation tests, and tests of the complete algorithm in which both classification and segmentation were evaluated. The algorithm tests are presented at the end of chapter 4.
3.1 Test database
A test database was needed to run all the tests. Unfortunately, there is no standardized test database for SMD tasks, which makes the test results harder to compare. However, in this specific case the test database was constructed from material from Swedish Radio broadcasts, to match the actual material that will be used as input when the algorithm is implemented in a production environment.
Material was selected from Swedish Radio's digital archive Digas to cover all kinds of genres. The material is made up of whole programs that were aired on radio and was selected in order to get a wide spread of included material.
Three sets of test audio were selected and extracted from the material. The first set consisted of 30-second audio files containing only speech or only music. The speech examples included female and male voices, interviews, phone interviews, sport commentary and more. The music examples ranged over many genres from different programs. These were used for the feature tests, to be able to value each feature and calculate the correlation between the features. The second set consisted of one-minute audio files containing both speech and music, with at least one transition between classes. The files were selected so that the transition could be anywhere within the file. These were used for the segmentation tests. The third and last set consisted of whole programs containing both music and speech segments. The length of these files varied from a couple of minutes up to 90 minutes. These were used to test the complete algorithm.
3.2 Tools
Audacity [15] was used to edit and convert audio files when constructing the test database.
Sonic Visualiser [16], together with various Vamp plugins [17], was used for early analysis of the test database. Sonic Visualiser is developed by the Centre for Digital Music, Queen Mary University of London [18], and is an easy way to get a quick look at how feature values change in audio.
MATLAB was used for testing and evaluating during the whole process. Many of the features were extracted using the MIR Toolbox [19], developed by the University of Jyväskylä [20] in Finland, and the rest of the features were extracted using custom-written code.

The final algorithm is written in C because of its efficiency and speed, using the libsndfile library [13] for reading wave files.
3.3 Feature tests
The first test was to check whether the selected features have any significance for the classification task. The tested features were chosen from earlier algorithms that performed the SMD task with good results, and a few were added based on personal hypotheses.
Some initial tests were also done with other features, like flux, other rhythm features and different uses of MFCC; however, since these features did not show any potential for the SMD task, they were not included in these tests.
In these tests the features were evaluated for the classification purpose, and the correlation between features was measured. The chosen features were extracted from the test material in the first set, containing only one class per file, to get a distribution of values for each class. Most features were tested in five different ways: the extracted value, the variance, the standard deviation, the derivative and the standard deviation of the derivative. Both the standard deviation and the variance derive from the same data, since the variance is the square of the standard deviation; because of this, only the standard deviation results are presented. Histograms of the results were then generated to visualize the distributions. Features that showed interesting results are discussed further later in this chapter.
The tested features were RMS amplitude, Zero Crossing Rate (ZCR), Mel Frequency Cepstrum Coefficients (MFCC), Spectral Centroid (SC), Pulse Clarity (PC) and Modified Low Energy Ratio (MLER). Their abbreviations will be used during the rest of this chapter, together with an abbreviation for the way the feature is used, using the following naming conventions.
SD   Standard Deviation
D    Derivative
SDD  Standard Deviation of the Derivative
This means that the standard deviation of the Zero Crossing Rate is abbreviated ZCR|SD. A total of 29 different feature variations were tested in these tests.
3.3.1 Feature test results
Feature      Frame length   Accuracy
RMS          20 ms          0.639
RMS|SD       1 s            0.829
RMS|D        1 s            0.548
RMS|SDD      1 s            0.764
ZCR          20 ms          0.588
ZCR|SD       1 s            0.837
ZCR|D        1 s            0.550
ZCR|SDD      1 s            0.835
SC           1 s            0.581
SC|SD        30 s           0.792
SC|D         30 s           0.589
SC|SDD       30 s           0.958
MFCC1        20 ms          0.517
MFCC1|SD     1 s            0.548
MFCC1|D      1 s            0.525
MFCC1|SDD    1 s            0.553
MFCC2        20 ms          0.521
MFCC2|SD     1 s            0.548
MFCC2|D      1 s            0.525
MFCC2|SDD    1 s            0.553
MFCC3        20 ms          0.521
MFCC3|SD     1 s            0.548
MFCC3|D      1 s            0.525
MFCC3|SDD    1 s            0.553
MFCC4        20 ms          0.519
MFCC4|SD     1 s            0.548
MFCC4|D      1 s            0.525
MFCC4|SDD    1 s            0.553
MLER         1 s            0.969
PC           5 s            0.807

Table 1. Feature test results. Frame length is the analysis frame duration; accuracy is given as a fraction (1.00 = 100%).
Results of the feature tests are shown in Table 1 above. The accuracy of each feature is a measure of how far apart the speech and music distributions are. It was calculated by finding the threshold value with the fewest misclassifications, testing all threshold values in an interval specified for each feature. The number of misclassified frames for the best threshold was counted, divided by the total number of frames, and the result subtracted from 1:

A = 1 − min(misclass) / nFrames    (8)
Features with accuracy results close to 50% can be considered random, containing no useful information for the classification. The threshold optimization described above is the reason that all features performed above 50%.
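The threshold optimization behind equation (8) can be sketched as a simple sweep. The interval bounds, the step count and the assumption that speech values lie above the threshold are all illustrative choices of this sketch, not the thesis's actual parameters:

```c
#include <stddef.h>

/* Sweep candidate thresholds over [lo, hi] and keep the one with the
 * fewest misclassifications; return A = 1 - min(misclass)/nFrames.
 * Assumes speech values lie above the threshold and music values below. */
static double best_accuracy(const double *speech, size_t ns,
                            const double *music, size_t nm,
                            double lo, double hi, int steps)
{
    size_t best_err = ns + nm;
    for (int s = 0; s <= steps; s++) {
        double thr = lo + (hi - lo) * s / steps;
        size_t err = 0;
        for (size_t i = 0; i < ns; i++) if (speech[i] <= thr) err++;
        for (size_t i = 0; i < nm; i++) if (music[i] > thr) err++;
        if (err < best_err) best_err = err;
    }
    return 1.0 - (double)best_err / (ns + nm);
}
```

Because the best threshold is always taken, even a feature with no discriminative power scores somewhat above 50%, which is the effect described in the text.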
The five features with the highest accuracies were the Modified Low Energy Ratio (97%), the standard deviation of the derivative of the Spectral Centroid (96%), the standard deviation of the Zero Crossing Rate (84%), the standard deviation of the Root Mean Square (83%) and the Pulse Clarity (81%). Surprisingly, none of the MFCC features showed any useful results in these tests; they got the worst results of all features. The extracted values for the MFCC parameters gave accuracies just above 50% even with the optimal threshold. Since small variations can depend on the rather small test database, the MFCC features themselves can be regarded as useless for this kind of classification task; the reason they scored over 50% at all is, again, that the best threshold value is sought, which inflates the accuracy. However, MFCC features can still be used in other ways, discussed later in this report.
The correlation between the top features cannot be calculated directly, since they use different lengths of analysis frames and therefore generate different numbers of data points. Instead, the correlation has been calculated by comparing the mean values of each feature for every file in the first set of the test database. The MLER feature uses the pauses in speech to discriminate between the classes, and so does the ZCR, so they can be expected to show high correlation. Unfortunately, all of these features have problems with the same kind of audio, namely speech with background noise of some kind, which is often classified as music.
         MLER   SC|SDD   ZCR|SD   RMS|SD   PC
MLER     –
SC|SDD   0.93   –
ZCR|SD   0.97   0.97     –
RMS|SD   0.97   0.93     0.97     –
PC       0.87   0.77     0.77     0.83     –

Table 2. Cross-correlation of the top five features from the accuracy tests.
As seen in Table 2, the top features are all strongly correlated. The Pulse Clarity feature shows the least correlation with the others, while the standard deviation of the Zero Crossing Rate showed correlations as high as 97% with both the Modified Low Energy Ratio and the standard deviation of the derivative of the Spectral Centroid.
3.3.2 Analysis of Low-level features
The low-level features are interesting because they incur relatively low computation costs. However, both RMS and ZCR showed weak results in the feature tests. RMS performed somewhat better, which can be seen in the histograms below, where large differences between the red and blue distributions give good discrimination possibilities. Both RMS and ZCR were extracted for every frame. Speech typically contains many frames with RMS values close to zero, while music has almost no zero-energy frames and a peak at about 0.4. The close-to-zero-energy frames in speech represent the pauses between syllables that always exist in speech recordings without background sound. The ZCR values are centered round the same mean value for both classes, but music has a higher peak, i.e. a lower standard deviation. This is also what gives the good results for ZCR|SD and ZCR|V.
Figure 9. Histograms of feature values. Left: RMS. Right: ZCR. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered around the X-values.
3.3.3 Analysis of MFCC
The MFCC features have been used frequently in MIR algorithms. However, as seen in the figure below, all the MFCC values are centered round the same values for both classes, with only some differences in the standard deviations. As seen in the test results, these variations of the features also scored higher than the extracted values, but still with very low top scores.
Figure 10. Histograms of feature values. Top left: MFCC1. Top right: MFCC2. Bottom left: MFCC3. Bottom right: MFCC4. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.
3.3.4 Analysis of Modified Low Energy Ratio
As seen in Table 1, the absolute values of the features are often not a good way to discriminate between the two classes. A better way might be to use how the feature changes over time, as done by the MLER feature.
Figure 11. MLER. Left: threshold optimization; the Y-axis shows the percentage of correctly classified frames and the X-axis shows the threshold. Right: with threshold 0.26. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.
Figure 11 shows the Modified Low Energy Ratio as described earlier in the report. When using a low-energy threshold of 0.26 times the mean RMS value, the classification achieved 97% accuracy. If the MLER equals zero, the frame is classified as music; otherwise it is classified as speech. Almost all the speech frames with zero MLER come from sport commentary with an audience in the background. The left graph shows the optimization of the total error rate. In some applications it might be better to optimize for either speech or music.
3.4 Segmentation tests
The segmentation tests were done on the audio files containing at least one transition between classes. The exact positions of the transitions within the files were noted as a ground truth for the tests. These positions are not absolute, and a small test showed that different people would place them differently. Sometimes a transition can be a segment of silence up to half a second long, and it would be acceptable for the segmentation to place the transition anywhere in this silence, since silence was not defined as either speech or music. For these two reasons, a deviation of up to 100 ms was allowed and counted as a clean hit. A deviation of 0.1-1 seconds counted as a hit, and the distance from the border of the clean-hit segment was calculated. If the segmentation marked a transition more than 1 second away from the ground truth, this counted as a miss.
Three measures were then used to evaluate the segmentation techniques:
Hit efficiency: A measure of how many hits were found. The misses are subtracted from the hits, and the difference is divided by the total number of transitions.

Hit_eff = (Hits − Misses) / Transitions    (9)
Hit accuracy: An accuracy measure where only the hits are considered. All the distances are added, and a mean value is calculated by dividing by the total number of hits. A clean hit counts as zero distance.

Hit_acc = ( Σ_{n=1..N} |Hitpos_n − Transpos_n| ) / N    (10)

where N is the number of hits.
Hit rate: A simple measure where, once again, only the hits are considered. The total number of hits is divided by the total number of transitions in the test files.

Hit_rate = Hits / Transitions    (11)
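A sketch of how detections could be scored against the ground truth under the rules above (clean hit within 100 ms, hit within 1 s scored by its distance to the clean-hit border, miss beyond 1 s). The struct and function names are this sketch's own; times are in seconds:

```c
#include <math.h>
#include <stddef.h>

struct seg_score { int hits, misses; double mean_dist; };

static struct seg_score score(const double *truth, size_t nt,
                              const double *det, size_t nd)
{
    struct seg_score s = {0, 0, 0.0};
    double total = 0.0;
    for (size_t i = 0; i < nd; i++) {
        /* distance to the nearest ground-truth transition */
        double d = INFINITY;
        for (size_t j = 0; j < nt; j++) {
            double dj = fabs(det[i] - truth[j]);
            if (dj < d) d = dj;
        }
        if (d <= 0.1)      { s.hits++; }                 /* clean hit: distance 0 */
        else if (d <= 1.0) { s.hits++; total += d - 0.1; } /* hit: border distance */
        else               { s.misses++; }               /* miss */
    }
    s.mean_dist = s.hits ? total / s.hits : 0.0;         /* Hit accuracy, eq. (10) */
    return s;
}
```

Hit efficiency and Hit rate then follow directly from the counts: `(s.hits - s.misses) / transitions` and `s.hits / transitions` per equations (9) and (11).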
None of these measures considers the computation times; these are discussed in the test evaluations below.
In the first part of the test, the region growing technique was tested against the neighboring difference technique, both described in section 2.3. Both techniques were tested with the same features.
Technique               Hit efficiency   Hit accuracy   Hit rate
Region growing          81%              79 ms          100%
Neighboring difference  94%              44 ms          99%

Table 3. Segmentation technique test results.
As seen in Table 3, the neighboring difference technique achieved better results in both Hit efficiency and Hit accuracy. The Hit rate measure is harder to evaluate, but results as close to 100% as possible are good. The neighboring difference technique failed to detect 1% of the transitions but, as can be seen in the Hit efficiency measure, it also produced far fewer misses. The choice was made to pursue only the neighboring difference technique because of its efficiency and accuracy.
Feature   Hit efficiency   Hit accuracy   Hit rate
MLER      95%              19 ms          99%
SC|SDD    93%              70 ms          99%
ZCR|SD    88%              24 ms          99%
RMS|SD    87%              27 ms          99%
PC        88%              72 ms          99%

Table 4. Segmentation feature test results, showing the top five features.
In the second test, all features from the feature tests were tested with the neighboring difference technique. The same kinds of features performed well in these tests too, although features with short analysis windows and short frames performed better and showed better Hit accuracy results. Table 4 shows that the top four features all detected 99% of the transitions, but the MLER feature achieved the best results in both the Hit efficiency and the Hit accuracy measures. Further tests were then done with the MLER feature, and both the efficiency and the accuracy improved when shorter frame lengths were used together with a combination of the mean RMS and the variance of the RMS. A Hit efficiency of 96% and a Hit accuracy of 17 ms were then achieved.
3.5 Design criteria
There are many aspects to consider when choosing the features to use in the algorithm. The test results need to be thoroughly analyzed with regard to these aspects:
• Accuracy of the feature, i.e. its ability to discriminate between the two classes, needed for both classification and segmentation. Ideal features have similar values within one class and a clear difference from other classes.
• Computational costs. Costly operations should be avoided to make the discrimination task efficient and cheap. Even offline systems need to be fast to improve cost efficiency.

• Correlation between chosen features. The use of two different features needs to be motivated by improved results. If two features are highly correlated, any improvement in accuracy will be relatively costly.

• Insensitivity to noise in the input signal.
The MLER feature was chosen for its excellent accuracy and because it combines well with the neighboring difference segmentation technique. Both the segmentation and the classification are based on one single low-level feature, the RMS amplitude, which makes the computational costs very low. Since the high-performing features were all highly correlated, adding another feature was not motivated: the improvement in accuracy would be costly, and another classification method would be needed. When the RMS amplitude is normalized to the file maximum, it is also insensitive to overall sound levels, but background noise will still be a problem in the classification.
The three calculated measures from the segmentation tests were considered when choosing the segmentation method. The Hit rate was considered the most important, since misses (transition detections where no transition is within one second) can be discarded in an algorithm using segmentation refinement methods. The Hit efficiency was only considered when two tests showed the same Hit rate. The computation times, too, were considered when choosing the final method for the algorithm.
The neighboring difference technique showed better results for all features and was therefore the strongest candidate. It is also a good match when the MLER is used for the classification task, since it achieved good results in Hit accuracy when using RMS. Computation times are largely reduced, since both the classification and the segmentation use the same extracted features.
4 Algorithm
This chapter contains a detailed description of the final algorithm that was coded and delivered to Swedish Radio. The algorithm is written in C and uses the libsndfile library [13] to read and write wave files. The discrimination is done on audio files, and hence this is an offline procedure.
4.1 Signal preprocessing
Before anything else is done with the audio, a few preprocessing steps are performed on the audio signal.
There is no added information in the difference between two channels that can be used for the classification or the segmentation. It is therefore desirable to have a mono signal, to simplify later processing. The algorithm checks the number of channels in the audio; if the signal has more than one channel, it is mixed down to mono.
The amplitude of the signal is then normalized to the maximum amplitude of the whole file, to remove any effect the overall amplitude level might have on the feature extraction.
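The two preprocessing steps might look like this for interleaved float samples. The buffer layout and all names are illustrative assumptions of this sketch, not Swedish Radio's actual code:

```c
#include <math.h>
#include <stddef.h>

/* Downmix interleaved multichannel samples to mono by averaging the
 * channels, then normalize the mono signal to the file's peak amplitude. */
static void preprocess(const float *in, size_t frames, int channels, float *mono)
{
    float peak = 0.0f;
    for (size_t i = 0; i < frames; i++) {
        float sum = 0.0f;
        for (int c = 0; c < channels; c++)
            sum += in[i * channels + c];
        mono[i] = sum / channels;                 /* downmix to mono */
        if (fabsf(mono[i]) > peak)
            peak = fabsf(mono[i]);
    }
    if (peak > 0.0f)                              /* normalize to file maximum */
        for (size_t i = 0; i < frames; i++)
            mono[i] /= peak;
}
```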
4.2 Feature extraction
After the audio signal has gone through the preprocessing part, it is split into 21 ms non-overlapping frames. The RMS amplitude is then calculated for each frame using equation 1.
Once the RMS amplitude has been calculated, the frames are grouped together to form 1 second (48 short frames) analysis frames. These are also non-overlapping. Four features based on the RMS amplitude values are then extracted from each 1 second frame: the mean RMS, the variance of the RMS, a locally normalized variance of the RMS, and the Modified Low Energy Ratio (MLER). The normalized variance is the variance of the RMS divided by the mean RMS.
The Modified Low Energy Ratio is the proportion of low-energy short frames within the 1 second frame. The threshold for low energy depends on the mean RMS amplitude: the mean RMS amplitude in the analysis frame is multiplied by a predefined value determined from the test results. Speech contains many small pauses between syllables and words and therefore has a higher MLER than music. This is used later to classify each segment.
4.3 Segmentation
The task of the segmentation part of the algorithm is to find the exact positions of transitions between the two classes. The segmentation is based, just like the classification, on the features extracted earlier. Every 1 second frame is examined to look for candidate transition frames, and then an exact position for each transition is found.
This is a modified version of the segmentation done in [2]. The advantage of this method is that it uses the same base feature, the RMS amplitude, as the classification. This means that no further feature extraction is needed, which saves computation time and minimizes the reading of the audio files. The RMS amplitude is used in a different way here, since the MLER feature used for classification requires longer analysis frames, while the mean and variance of the RMS change as quickly as the audio classes do.
The first step is done by looking at the frame before and the frame after the examined frame. If the two neighboring frames are different, the examined frame is likely to contain a transition. The transition can be anywhere within that second, and the exact position is determined in the next step. A problem occurs if the class changes twice within the examined frame: the two neighboring frames will then not differ enough for the frame to be chosen as a transition candidate. However, these errors are not corrected in the segmentation, since segments shorter than 2 seconds are removed later.
Figure 12. Waveform with a transition from speech to music. Seconds are marked with vertical lines.
The comparison between neighboring frames is based on the mean and the variance of the RMS values. Since the distribution of the amplitude of the audio signal is Laplacian, as shown in [2], a probability density function of the χ² type is used, defined as

p(x) = x^a e^(−x/b) / ( b^(a+1) Γ(a+1) )    (12)
where x ≥ 0, Γ is the gamma function, and the two parameters a and b are defined as

a = μ²/σ² − 1  and  b = σ²/μ    (13)
where μ is the mean RMS and σ² is the RMS variance. The similarity measure is based on the probability density functions:

p(p1, p2) = ∫ √( p1(x) p2(x) ) dx    (14)
where p1 and p2 refer to the probability functions of the two frames. When the χ² distribution in equation (12) is inserted in equation (14), this gives a similarity measure calculated as

p(p1, p2) = [ Γ((a1+a2)/2 + 1) · 2^((a1+a2)/2 + 1) · b1^((a2+1)/2) · b2^((a1+1)/2) ] / [ √( Γ(a1+1) Γ(a2+1) ) · (b1+b2)^((a1+a2)/2 + 1) ]    (15)
Since the measure is calculated on the two frames surrounding the examined frame i, the dissimilarity measure is defined as

Dissim(i) = 1 − p(p_(i−1), p_(i+1))    (16)

This will give high probabilities of change even for the frames surrounding a transition. As seen in Figure 12, of the five seconds marked by the vertical lines, seconds 2 and 4 will differ most in mean and variance of the RMS; however, seconds 3 and 5 will also give high values in the dissimilarity measure. To dampen the effect of this error, a filter needs to be applied. The dissimilarity value is therefore locally normalized over 5 seconds, with the examined frame in the centre. The normalization is calculated as
Dissim_norm(i) = Dissim(i) × ( D(i) − ( D(i−2) + … + D(i+2) ) / 5 ) / max( D(i−2), …, D(i+2) )    (17)

where D(i) = Dissim(i). The dissimilarity measure is multiplied by the positive part of the difference between D(i) and the mean of its neighborhood; if the difference is negative, it is set to zero. The result is then divided by the maximum value in the neighborhood.
A threshold for the normalized dissimilarity value is then set, according to the results on the test material, to determine which frames are selected as transition candidates. The threshold is variable and depends on the variance of the RMS in the neighboring frames.
When the candidate transition frames have been chosen, an exact position for each transition needs to be found. This is done in a way similar to the previous step: for every 20 ms frame, the preceding and the following one-second windows are compared. Each 20 ms frame is given a value for the probability of change, and the frame with the highest probability is marked as the exact position of the transition.
4.4 Classification
Only the MLER feature is used for the classification part. A predefined threshold is set, and all segments with a higher average MLER than the threshold are classed as speech, while everything below the threshold is classed as music. The use of only one feature reduces the computation time. The algorithm needs no training material to work, since the threshold is set according to the results on the test material.
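The classification rule reduces to a mean comparison. A minimal sketch, with illustrative names and an illustrative threshold value:

```c
#include <stddef.h>

enum audio_class { CL_MUSIC = 0, CL_SPEECH = 1 };

/* Classify a segment from its per-second MLER values: average MLER
 * above the threshold means speech, otherwise music. */
static enum audio_class classify(const double *mler, size_t n, double thr)
{
    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += mler[i];
    mean /= n;
    return mean > thr ? CL_SPEECH : CL_MUSIC;
}
```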
4.5 Refinement
Simple refinements of the segments are done after the classification. If two consecutive segments are given the same class, they are merged together and the transition between them is erased.
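The merge step can be sketched as a single in-place pass over a list of (start time, class) pairs; the segment representation is this sketch's assumption:

```c
#include <stddef.h>

struct segment { double start; int cls; };

/* Merge consecutive segments of the same class in place and return
 * the new number of segments; dropped entries erase the transitions. */
static size_t merge_segments(struct segment *seg, size_t n)
{
    if (n == 0) return 0;
    size_t out = 1;                       /* first segment always kept */
    for (size_t i = 1; i < n; i++)
        if (seg[i].cls != seg[out - 1].cls)
            seg[out++] = seg[i];          /* keep: class changed */
    return out;
}
```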
4.6 Output
When all the processes (segmentation, classification and refinement) are done, the results are ready to be output. The algorithm creates a simple text file with one line for each segment. Each line contains the exact position of the start of the segment and a binary number showing the class of the segment. The output could look like this:
1 – 0.000000
0 – 9.770833
1 – 27.098166
where the 0 stands for speech (1 for music) and 9.770833 is the exact position of the transition measured in seconds. The position of the transition is given in seconds for easier handling by other applications; if the position were marked with an exact frame, a third-party application would also have to know the sample frequency of the audio. The transition frame is easily calculated as

F = t · sr    (18)

where t is the position of the transition measured in seconds and sr is the sample rate of the audio.
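A sketch of the output stage: one text line per segment and the seconds-to-frames conversion of equation (18). The exact line formatting and the truncating conversion are assumptions of this sketch:

```c
#include <stdio.h>

/* Equation (18): convert a position in seconds to a sample frame index.
 * Truncation toward zero is this sketch's choice of rounding policy. */
static long to_frame(double t, long sr)
{
    return (long)(t * sr);
}

/* Write one line per segment: class flag, then start position in seconds. */
static void write_segments(FILE *out, const double *start,
                           const int *cls, int n)
{
    for (int i = 0; i < n; i++)
        fprintf(out, "%d - %f\n", cls[i], start[i]);
}
```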
A Flash player that uses the output data can then read the text file and mark the segments in the navigation bar, as seen in Figure 13. This can be used as a guide when editing the audio, or simply to look for errors in the output.
Figure 13. Flash player with marked segments in the navigation bar. Green shows speech segments and white shows music segments.
The Broadcast Wave Format contains a marker chunk, which could be used to mark the transition points of the segments. This is not yet integrated in the application, but could be done for specific implementations.
4.7 Results
The algorithm tests were performed on the finished algorithm. Audio files containing full-length programs from Swedish Radio were used as input, and the segmentation and classification were evaluated at the same time. The resulting accuracy is the percentage of time where the right class is found at the right time: the length of the correctly classified audio is divided by the total length of the program.
          Speech   Music
Speech    95.4%    4.6%
Music     1.9%     98.1%

Table 5. Algorithm results. The left column shows the input class and the top row shows the output of the algorithm.
The left column of the table shows the class of the input audio, and the top row shows the class output by the algorithm. Speech reaches a lower accuracy because of the sport commentary segments, which are often classed as music, while music is correctly classed as music as often as 98.1% of the time.
        Right    Wrong
Total   97.3%    2.7%

Table 6. Summarized results for all inputs to the algorithm.
This gives a total accuracy of 97.3%, since the test material contained more music than speech.
The computation time of the algorithm varies depending on the format of the input and the number of input channels. The computation time did not exceed 1% of the length of the audio for any of the test files.
5 Conclusions
The final algorithm gives an accuracy of over 97% in the tests performed with material from Swedish Radio. This matches the results reported in earlier work, yet without any advanced features that require long computation times. An accuracy of 97% is also enough for most applications. In some applications the misclassified audio still needs to be considered; since the tests show what kind of audio gives low accuracies, those files can be discriminated manually.
A graphical interface where the segments are marked could be of good use for manual work with the audio. The algorithm does the discrimination job, but the results might still need refinement. This is true for an application for automatic editing of podcast material, where the results of the discrimination can be used as a guide, but some editing, like cross-fades, is still needed to get a good-sounding result.
Offline systems benefit most from faster computation times. Real-time, online systems still have to use the same analysis window lengths to gather the needed statistics, and can benefit only by being able to use more and more advanced features. Offline systems, on the other hand, can both use more advanced features to achieve higher accuracy and at the same time compute faster.
6 Future work
There are still endless features and feature combinations to be tested for this task. The features tested during this work are still quite simple; even the modified special features, like the one used in the algorithm, are based on only one low-level feature. Further tests could also be done with MFCC features, even though they showed poor results in the present tests. Combinations of different MFCC features have been tested in earlier work, and further exploring relations between different MFCCs, such as distances between the second and third coefficients, could give good results.
More complex features, like pitch curve extraction, have been discussed and tentatively tested, but there is still much to explore in this area. Pitch features should be a good complement to the MLER feature, since they work on different aspects of the sound. Features based on rhythm could also be a good complement to the MLER feature.
Adding subclasses to both the music and the speech classes would be useful for many applications. Swedish Radio has already requested the possibility to discriminate between male and female speakers. Music could be split into genre subclasses for further discrimination. There is already some work done on genre classification, but more research is needed to create a useful application.
Constructing a low-cost real-time discriminator that could be inserted in car radio receivers could be another kind of project. Listening in cars often involves noisy environments where speech needs to be amplified more than music to increase audibility, and different listening environments demand different amplification of speech. In such an application, the processing cannot be done in the transmitter and needs to be done in real time.
7 Acknowledgements
This work has been done with much help from both Swedish Radio and the Royal Institute of Technology. Special thanks to:

My supervisor at Swedish Radio, Björn Carlsson, for all the help with technical questions about radio and for all the encouragement during the work.

My supervisor at the Royal Institute of Technology, Anders Friberg, for all the feedback and ideas, and for all the special knowledge in feature extraction and classification.

Hasse Wessman and Lars Jonsson at Swedish Radio, Technical Development, for giving me the opportunity to do this project and an inspiring working environment.

The rest of the staff at Swedish Radio, Technical Development, for feedback and motivation.

My friend John Häggkvist, who helped me out during the coding of the algorithm.