Automatic speech/music
discrimination in audio files
Lars Ericsson
Master’s thesis in Music Acoustics (30 ECTS credits)
at the School of Media Technology Royal Institute of Technology year 2009
Supervisor at SR was Björn Carlsson. Supervisor at CSC was Anders Friberg.
Examiner was Sten Ternström
Automatic speech/music discrimination in audio files
Abstract

This master's thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Swedish Radio and is therefore suited to their needs and optimized for their material. A series of tests on acoustic features extracted from Swedish Radio broadcasts was performed to distinguish the best method for the discrimination. These tests were evaluated to find the method most suited for the task. Methods based on the RMS amplitude of the signal were chosen for both classification and segmentation. A feature for the proportion of low energy, which exploits the small pauses that can tell speech from music, was used for the classification, and a similarity measure based on the mean and variance of the RMS was used to find the transition points for the segments. This resulted in an algorithm that can discriminate speech from music in Swedish Radio broadcasts with an accuracy of 97.3%.
Automatic discrimination of speech and music in audio files

Summary

This master's thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Sveriges Radio and is therefore designed for their needs and optimized for their material. To arrive at the best method for the discrimination, tests were performed on acoustic features extracted from material in SR's digital archive. These tests were then evaluated to find the method best suited for the purpose. For both classification and segmentation, methods based on the RMS of the amplitude of the signal were chosen. A feature for low-energy proportions, which exploits the short pauses that distinguish speech from music, was chosen to classify segments, and a similarity measure based on the mean and variance of the RMS was chosen to find the segment boundaries. This resulted in an algorithm that can discriminate speech from music in SR's programming with an accuracy of 97.3%.
Contents

1 Introduction
  1.1 Speech/Music discrimination
  1.2 Swedish Radio
  1.3 Goal
  1.4 Method
  1.5 Limitations
  1.6 Overview of the paper
2 Background
  2.1 System structure
    2.1.1 Online systems
    2.1.2 Offline systems
  2.2 Features and feature extraction
    2.2.1 Speech and music
    2.2.2 Standard Low-Level (SLL) features
      2.2.2.1 RMS
      2.2.2.2 Zero Crossing Rate
      2.2.2.3 Spectral Centroid
      2.2.2.4 Spectral Rolloff
      2.2.2.5 Flux (Delta Spectrum Magnitude)
    2.2.3 Frequency Cepstrum Coefficients
    2.2.4 Psychoacoustic features
    2.2.5 Special features
    2.2.6 Psychoacoustic pitch scales
    2.2.7 Extracting features
  2.3 Segmentation
  2.4 Classification methods
    2.4.1 Hidden Markov Models
    2.4.2 System learning
    2.4.3 Refined classification
  2.5 Evaluation methods
  2.6 Earlier results
3 Evaluation and tests
  3.1 Test database
  3.2 Tools
  3.3 Feature tests
    3.3.1 Feature test results
    3.3.2 Low-level features
    3.3.3 Mel Frequency Cepstrum Coefficients
    3.3.4 Modified Low Energy Ratio
  3.4 Segmentation tests
  3.5 Test evaluations
4 Algorithm
  4.1 Signal preprocessing
  4.2 Feature extraction
  4.3 Segmentation
  4.4 Classification
  4.5 Refinement
  4.6 Output
  4.7 Results
5 Conclusions
6 Future work
7 Acknowledgements
8 References
List of Abbreviations
DCT   Discrete Cosine Transform
ERB   Equivalent Rectangular Bandwidth
FFT   Fast Fourier Transform
LFCC  Linear Frequency scaled Cepstrum Coefficients
LPC   Linear Predictive Coding
MFCC  Mel Frequency Cepstrum Coefficients
MLER  Modified Low Energy Ratio
RMS   Root Mean Square
SC    Spectral Centroid
SLL   Standard Low Level (features)
SMD   Speech/Music Discrimination
SR    Swedish Radio
ZCR   Zero Crossing Rate
1 Introduction

This chapter includes an overview of the task, the purpose, method and limitations of the work. It also gives a brief view of the rest of the report.

1.1 Speech/Music discrimination

The purpose of speech/music discrimination (SMD) systems is to divide audio into music and speech segments and classify them, whether the discrimination is done in real time or on recorded audio files.
The Speech/Music Discrimination task is an important part of Automatic Speech Recognition (ASR) systems, where it is used to disable the ASR when music or other classes of audio are present in automatic transcription of speech.
SMD systems are also useful for bit-rate coders. Speech coders achieve better results for speech coding than music coders do, and vice versa. It is therefore important to discriminate between the two audio classes in order to select the right type of bit-rate coding.
An SMD system can also be of great importance when indexing sound and even video. Apart from detecting speech and music in audio files, the same algorithm can be applied to the audio of TV shows or movies. This can later be used for indexing the video material, making it possible to jump straight to a desired part in e.g. on-demand video solutions for the web.
1.2 Swedish Radio

SR is a non-commercial, public service radio broadcaster with over 40 radio channels, including four national FM channels (P1, P2, P3 and P4) and 28 local channels. P4 is the biggest radio channel in Sweden. SR also offers more than 10 channels online on their website, together with an archive of all broadcast programs available on demand. Broadcasts are also available via short wave, medium wave and satellite. [12]
1.3 Goal

A system that effectively discriminates between speech and music will be useful for Swedish Radio in many of the above-mentioned applications. The goal is therefore to create an algorithm that can do the task accurately for the material produced by Swedish Radio. To meet the requirements of a cost-efficient, accurate algorithm, an offline system structure was chosen.
1.4 Method

The project started with a review of the relevant literature and articles on previous work, which made it possible to enter the test phase with knowledge of the area. In this phase many features and discrimination methods were tested and evaluated in order to find the method best suited for this specific task. The chosen method was then coded in a way that permitted integration into existing systems at Swedish Radio.
1.5 Limitations

This work focused on the discrimination between speech and music, and did not attempt a finer discrimination within the classes. Thus, the algorithm will not be able to discriminate between different speakers such as women, men or children, and will not be able to tell different voices apart. Music will only consist of one class and will not be further divided into different classes for different genres.
1.6 Overview of the paper

The paper starts with a background chapter including an introduction to the area of Music Information Retrieval and an overview of earlier work done in the speech/music discrimination area. The chapter also presents features that are commonly used for these kinds of systems. Chapter 3 contains all the tests done within this work; it also presents the test results and ends with an evaluation of them. Chapter 4 gives a detailed description of the finished algorithm, and the paper then ends with a chapter of final conclusions drawn from the project.
2 Background

This chapter gives an introduction to the part of the Music Information Retrieval research area that has been used during this project. It also gives a brief look at earlier work.
2.1 System structure

Earlier systems developed for SMD tasks have different structures. They can be divided into two main groups: online systems, where the discrimination is made in real time, and offline systems, where the discrimination is made on audio files. Both groups have their advantages; online systems are better suited for live purposes, while offline systems can be made more accurate and faster. Different tasks require different systems, suited for specific purposes.
2.1.1 Online systems

In online systems, the segmentation and classification tasks need to be done at the same time. They can even be regarded as one task, where the output of the classifier is used for the segmentation. This is all done in real time. "Real time" means that the system outputs results continuously as the input audio stream comes in, but with a delay of at least one analysis frame.
Online systems often concentrate on finding large changes in the audio in order to find borders of speech or music segments. Saunders does this in [9] by dividing the audio into non-overlapping 16 ms frames (256 samples at 16 kHz) from which simple features are extracted, and by using a longer 2.4 s analysis frame (150 × 16 ms frames) for statistical features used in the classification (see the next chapter for more on features). A multivariate Gaussian classifier then does the classification.
Another real-time system is described in [7]. A typical frame size in earlier online systems is around 20 ms, and the analysis frame varies from half a second up to three seconds. The use of statistical features demands longer analysis frames to get good results. The short frame is used to extract features, and the longer analysis frame is used to extract statistics of these features, such as variance and mean values.
2.1.2 Offline systems

The structure of offline systems varies considerably. The most common system has three steps plus pre- and post-processing, as seen in Figure 1. First the audio is divided into frames, and from each frame a set of features is extracted and stored in a feature vector. In the second step the system segments the sound, and in the third step the segments are classified using some kind of classification method to decide whether a segment consists of speech or music. The SMD task is often done in more steps in an offline system than in an online system. The segmentation can be refined, as in [8], by using the fact that neighboring segments have a high probability of containing the same class.
Figure 1. Block diagram of an offline system structure.
In [8] some classification is done during the segmentation, and the more difficult segments are left for another, more complex classification method. This saves computation time, since a simpler classifier makes the first classification. Other systems classify large segments and then divide the segments where a border is found into smaller segments, which is also a way to save computation time.
Offline systems often use a frame length of around 20 ms and an analysis window length of around 1 second. Shorter frames make the features more sensitive to noise, while longer frames might include too many phonemes.
2.2 Features and feature extraction

Sound has many features, some that our ears can pick up and some that we cannot even hear. Speech has been closely studied and is relatively well defined, whilst music is a much wider class of sound. The French composer Edgard Varèse defined it as "Music is organized sound". This might seem abstract, but can be used even in these kinds of technical solutions. It is a common approach to look for repetition in sound in order to classify it as music.
2.2.1 Speech and music

To be able to distinguish between speech and music, features that differ between the two classes need to be used. A simple look at the waveforms of one-minute excerpts of speech, pop music, classical music and opera (all examples taken from Swedish Radio broadcasts) already indicates large differences between the classes. The speech waveform in Figure 2 shows rapid changes in energy and amplitude that none of the music waveforms do. The heavily compressed waveform of The Killers' song in Figure 3 seems to lack dynamics entirely, while the classical piece in Figure 4 and the opera piece in Figure 5 have some short amplitude peaks and show large dynamic variations.
Figure 2. 1 minute of speech.

Figure 3. 1 minute excerpt of The Killers – All These Things.

Figure 4. 1 minute of classical music.

Figure 5. 1 minute of opera.
Even if the classes are easy to identify in the waveform, the exact position of the transitions can be difficult to detect. Figure 6 shows an excerpt from P3 Pop where a pop song abruptly stops and the host of the show starts speaking. The transition is marked with a vertical line.
Figure 6. Waveform representation of an excerpt of P3 Pop with the transition from music to speech marked.
The changes in energy are even clearer when the RMS of the sound wave is plotted, as in Figure 7. The scales are not the same in the four examples, but the most interesting thing is the changes. In all the examples containing music the RMS never goes down to zero and does not diverge largely from the mean RMS. In the speech example, however, there are relatively many frames with zero or close-to-zero RMS, and the changes are rapid. What looks like large variations in the RMS values of the pop music in Figure 7 can be explained by the scale on the y-axis, ranging from 0 to 0.1, while speech ranges from 0 to 0.25.
Figure 7. RMS graphs. The y-axis shows the RMS value calculated from 20 ms frames and the x-axis shows temporal location in seconds. Top left: speech. Top right: pop music. Bottom left: classical piece. Bottom right: opera piece.
By looking at the spectra of the four examples in Figure 8, it is clear that all the music examples have a higher peak in the low frequencies, although the peak occurs at different frequencies. This peak most likely corresponds to the fundamental frequency of the vocal components. The classical piece, which lacks vocals, does not have the same sharp peak as the other three examples. The speech example has more energy in the frequency range around 10 000 Hz than the music examples.
Figure 8. Spectrum plots analysed using a Hanning window with a window size of 1024 samples. The y-axis shows sound level (dB) and the x-axis shows frequency (Hz). From the top: speech, pop music, classical piece, opera piece.
The features described below are divided into three groups according to [5]: the simpler Standard Low-Level (SLL) features, the Frequency Cepstrum Coefficients and the more advanced psychoacoustic features.
Online systems often use simpler features to avoid having to compute the FFT, which is relatively costly compared to the SLL features. The most commonly used features in these systems are the zero crossing rate and RMS.
2.2.2 Standard Low-Level (SLL) features

These include RMS, Zero Crossing Rate, Spectral Centroid, Spectral Rolloff, Band Energy Ratio, Flux (also called Delta Spectrum Magnitude), bandwidth, pitch and pitch strength.
Some features, like bandwidth and pitch, are self-explanatory. The other tested features are explained in the following sections.
2.2.2.1 RMS

RMS, or Root Mean Square, is a measure of the amplitude of a sound wave in one analysis window. It is defined as

$$\mathrm{RMS} = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}} \qquad (1)$$

where $n$ is the number of samples within an analysis window and $x_i$ is the value of sample $i$.
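For concreteness, Eq. (1) can be computed directly from a list of samples. A minimal Python sketch (the function name is my own):

```python
import math

def rms(frame):
    """Root mean square amplitude of one analysis window, as in Eq. (1)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

# A frame of constant absolute amplitude 0.5 has an RMS of exactly 0.5.
print(rms([0.5, -0.5, 0.5, -0.5]))  # -> 0.5
```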
2.2.2.2 Zero Crossing Rate

This is a measure that counts the number of times the amplitude of the signal changes sign, i.e. crosses the x-axis, within one analysis window. The feature is defined as

$$\mathrm{ZCR} = \frac{1}{T-1}\sum_{t=1}^{T-1}\mathrm{func}\{s_t s_{t-1} < 0\} \qquad (2)$$

where $s$ is the sound signal of length $T$ measured in time and $\mathrm{func}\{A\}$ equals 1 if $A$ is true and 0 otherwise.
The Zero Crossing Rate feature is sometimes used as a primitive pitch detector for mono signals. It is also a rough estimate of the spectral content.
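Eq. (2) translates almost directly into code; this sketch counts sign changes between consecutive samples:

```python
def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose product is negative, Eq. (2)."""
    T = len(signal)
    crossings = sum(1 for t in range(1, T) if signal[t] * signal[t - 1] < 0)
    return crossings / (T - 1)

# An alternating signal crosses zero between every pair of samples.
print(zero_crossing_rate([1, -1, 1, -1]))  # -> 1.0
```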
2.2.2.3 Spectral Centroid

This feature is effective in describing the spectral shape of the audio. It is correlated with the psychoacoustic features sharpness and brightness. There are several definitions of the Spectral Centroid feature in previous work. In this study it is calculated as a weighted mean of the frequencies in the FFT of the signal:

$$\mathrm{SC} = \frac{\sum_{n=0}^{N-1} f(n)\,x(n)}{\sum_{n=0}^{N-1} x(n)} \qquad (3)$$

where $x(n)$ represents the magnitude of bin number $n$, and $f(n)$ represents the center frequency of that bin.
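Given the bin magnitudes and their center frequencies, Eq. (3) is a weighted mean; a minimal sketch:

```python
def spectral_centroid(magnitudes, center_freqs):
    """Magnitude-weighted mean frequency of one FFT frame, Eq. (3)."""
    return (sum(f * x for f, x in zip(center_freqs, magnitudes))
            / sum(magnitudes))

# Two equally strong bins at 100 Hz and 300 Hz average to 200 Hz.
print(spectral_centroid([1.0, 1.0], [100.0, 300.0]))  # -> 200.0
```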
2.2.2.4 Spectral Rolloff

Like the Spectral Centroid, the Spectral Rolloff is a representation of the spectral shape of a sound, and the two are strongly correlated. It is defined as the frequency below which 85% of the energy in the spectrum lies. If $K$ is the bin that fulfils

$$\sum_{n=0}^{K} x(n) = 0.85 \sum_{n=0}^{N-1} x(n) \qquad (4)$$

then the Spectral Rolloff frequency is $f(K)$, where $x(n)$ represents the magnitude of bin number $n$, and $f(n)$ represents the center frequency of that bin.
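In code, Eq. (4) amounts to finding the first bin where the cumulative magnitude reaches 85% of the total; a sketch:

```python
def spectral_rolloff(magnitudes, center_freqs, fraction=0.85):
    """Center frequency of the first bin K at which the cumulative
    magnitude reaches `fraction` of the total, following Eq. (4)."""
    threshold = fraction * sum(magnitudes)
    cumulative = 0.0
    for x, f in zip(magnitudes, center_freqs):
        cumulative += x
        if cumulative >= threshold:
            return f
    return center_freqs[-1]
```

For a flat spectrum of ten equal bins the rolloff lands on the ninth bin, since 9/10 is the first cumulative fraction at or above 0.85.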
2.2.2.5 Flux (Delta Spectrum Magnitude)

The Flux, or Delta Spectrum Magnitude, feature is a measure of the rate at which the spectral shape changes, or fluctuates. It is calculated by summing the squared differences of the magnitude spectra of two neighboring frames. This feature has shown good results in the SMD task in [14]. It is defined as

$$F = \sum_{k=1}^{N/2} \left( \left| X_r[k] \right| - \left| X_{r-1}[k] \right| \right)^2 \qquad (5)$$

where $N$ is the number of FFT points and $X_r[k]$ is the STFT of frame $r$ at bin $k$.
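Eq. (5), computed over two already-extracted magnitude spectra, is a one-liner in practice; a sketch:

```python
def spectral_flux(prev_mags, curr_mags):
    """Sum of squared differences between the magnitude spectra of two
    neighboring frames, Eq. (5)."""
    return sum((a - b) ** 2 for a, b in zip(curr_mags, prev_mags))

# Identical frames give zero flux; a change of 2 in one bin contributes 4.
print(spectral_flux([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # -> 4.0
```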
2.2.3 Frequency Cepstrum Coefficients

The second group is the Frequency Cepstrum Coefficients (FCC), which include the Mel Frequency Cepstrum Coefficients (MFCC) and the Linear Frequency scaled Cepstrum Coefficients (LFCC). These are all power spectrum representation features calculated with different frequency scales.
The most frequently used coefficients for these systems are the MFCC. These are computed by taking the FFT of every analysis window, mapping the spectrum to the Mel scale, taking the base-10 logarithms of the powers and then applying a Discrete Cosine Transform (DCT) to decorrelate the coefficients. [14]
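The last two steps of that chain (log of the mel band powers, then a DCT to decorrelate) can be sketched as follows. The FFT and mel filterbank stages are assumed to have been applied upstream, and the unnormalized DCT-II form used here is one common but unspecified choice:

```python
import math

def cepstrum_from_mel_powers(mel_powers, n_coeffs=13):
    """Base-10 log of mel band powers followed by a DCT-II: a sketch of
    the final MFCC steps (filterbank output is assumed as input)."""
    logs = [math.log10(p) for p in mel_powers]
    M = len(logs)
    return [sum(logs[m] * math.cos(math.pi * k * (m + 0.5) / M)
                for m in range(M))
            for k in range(n_coeffs)]
```

A flat filterbank output of all ones has zero log power in every band, so every cepstral coefficient is zero.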
The overall performance of the MFCC features was shown in [5] to be slightly better than that of the SLL features. This relates to the fact that MFCC performs better on pop and rock music, but somewhat worse on classical music, which contains very little vocal information.
2.2.4 Psychoacoustic features

These features are more closely based on our perception of sound, and are therefore called psychoacoustic.
Loudness is the sensation of signal strength, and is primarily a subjective measure for us to rank sounds from weak to strong. Loudness can be calculated (Calculated Loudness) and is then measured in Sone. One Sone is defined as the loudness of a pure 1000 Hz tone at 40 dB re 20 µPa [21].
Roughness is described in [5] as "the perception of temporal envelope modulations in the range of about 20–150 Hz, maximal at 70 Hz" and is also said to be a primary component of musical dissonance.
Sharpness is a measure of the high-frequency energy in relation to the low-frequency energy. Sounds with a lot of energy in the higher frequencies and low energy levels in the lower frequencies are considered sharp.
2.2.5 Special features

In [1] the authors use a feature called Chromatic Entropy, which is a version of Spectral Entropy. The spectrum is first mapped to the Mel scale and then divided into twelve sub-bands with center frequencies that coincide with the frequencies of the chromatic scale. The energy in each sub-band is then normalized by the total energy of all the sub-bands. Lastly, the entropy of the normalized spectral energy is calculated as

$$E = -\sum_{i=0}^{L-1} n_i \log_2(n_i) \qquad (6)$$

where $n_i$ is the normalized energy of sub-band $i$ and $L$ is the number of sub-bands.
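The normalization and entropy of Eq. (6) can be sketched directly; sub-bands with zero energy are skipped, since their contribution to the sum is zero:

```python
import math

def chromatic_entropy(subband_energies):
    """Entropy of sub-band energies normalized by their total, Eq. (6)."""
    total = sum(subband_energies)
    normalized = [e / total for e in subband_energies]
    return -sum(n * math.log2(n) for n in normalized if n > 0)

# Energy spread evenly over 12 sub-bands gives the maximum entropy log2(12).
print(chromatic_entropy([1.0] * 12))  # ~ 3.585
```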
The feature Modified Low Energy Ratio (MLER) is introduced in [8]. The feature exploits the fact that music shows little variation in the energy contour of the waveform, whilst speech shows large variations between voicing and frication. MLER is defined as the proportion of frames, within one second, with RMS power less than a variable threshold. The authors suggest that the threshold should be in the interval [0.05%, 0.12%] for best performance.
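A minimal sketch of the MLER computation, under the assumption (not spelled out above) that the variable threshold is a small fraction q of the window's mean power:

```python
def mler(frame_powers, q=0.0008):
    """Modified Low Energy Ratio: proportion of frames in a one-second
    window whose power lies below a low threshold. Scaling the threshold
    as q times the mean power is an assumption; q = 0.0008 (0.08%) lies
    inside the [0.05%, 0.12%] interval suggested in [8]."""
    mean_power = sum(frame_powers) / len(frame_powers)
    threshold = q * mean_power
    low = sum(1 for p in frame_powers if p < threshold)
    return low / len(frame_powers)
```

Speech, with its pauses, yields a higher MLER than continuously energetic music.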
In [4] a feature called Warped LPC-based Spectral Centroid (WLPC-SC) is introduced. The frequency analysis is mapped to the Bark scale, and the centroid frequency is then computed by a one-pole LPC filter. This feature exploits the fact that speech has a low centroid frequency that varies with voiced and unvoiced speech, whilst music has a changing behavior.
2.2.6 Psychoacoustic pitch scales

Psychoacoustic scales are commonly used in Music Information Retrieval (MIR) systems. Speech and music are often well adjusted to our ears and therefore have most of their information in the frequencies where our ears have the best resolution. The most used scale is Mel, but Bark and Equivalent Rectangular Bandwidth (ERB) are also sometimes used. The one used in this paper, Mel frequency, is defined as

$$f_{\mathrm{mel}} = 1127.01048 \times \log\left(\frac{f}{700} + 1\right) \qquad (7)$$

where $f$ is the original frequency [1].
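Eq. (7) as a small conversion function (the natural logarithm is assumed, which is the convention that makes the scale's constant come out as 1127.01048):

```python
import math

def hz_to_mel(f):
    """Mel value of a frequency f in Hz, following Eq. (7)."""
    return 1127.01048 * math.log(f / 700.0 + 1.0)

# The scale is anchored so that 1000 Hz maps to roughly 1000 mel.
print(round(hz_to_mel(1000.0)))  # -> 1000
```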
2.2.7 Extracting features

Usually a chosen set of features is extracted from each frame of the audio. The features are then often normalized by the computed mean value and the standard deviation over a larger time unit, and then stored in a feature vector.
Features are used in two ways: either by using the extracted value directly or by using its changes over time. When using the changes over time it is possible to calculate statistical features like variance and standard deviation. In [5] it is shown that using changes over time is more accurate than only using the absolute values of the features.
Only one feature is used in [1], [4] and [7], although they all use the advanced special features described earlier. Others have chosen to use a set of standard features. In [3] five different features are used: energy, ZCR, Spectral Entropy and the first two MFCCs. RMS and ZCR are used in [2].
2.3 Segmentation

In [1] and [4] a region growing technique is used for the segmentation step. This technique is widely used in image segmentation, but can also be used for audio. A number of frames are selected as seeds. The feature vectors of the seeds are then compared to those of the neighboring frames. The segment then grows with the neighboring frames as long as the difference in the features does not exceed a predefined threshold.
Other systems, like the one in [2], look for big changes between two neighboring one-second frames. The feature vectors of the two neighboring frames are compared, and if they are sufficiently different, a segment border is detected. When a border is detected in a frame, the transition is marked within the frame with an accuracy of 20 ms.
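As a sketch of this kind of border detector (the Euclidean distance and the threshold value are my own assumptions; [2] does not necessarily use this dissimilarity measure):

```python
def is_border(prev_features, curr_features, threshold):
    """Flag a segment border when two neighboring one-second frames have
    sufficiently different feature vectors. Euclidean distance is an
    assumed choice of dissimilarity measure."""
    dist = sum((a - b) ** 2
               for a, b in zip(prev_features, curr_features)) ** 0.5
    return dist > threshold
```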
2.4 Classification methods

When the segmentation process is done, each segment should be classified as either speech or music. In a more complex system, as in [3], more classes can be defined, such as silence or speech over music. The latter is often classed as speech in systems with only the two basic classes.
The extracted feature vector is used to classify each segment. A mean vector is calculated for the whole segment and is then compared either to results from training data or to predefined thresholds.
A method where the classification is based on the output of many frames together is proposed in [10]. Each second consists of 50 frames, and each frame is assigned a class by a quadratic Gaussian classifier. Then, a global decision is made based on the most frequently appearing class within that second.
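The global decision step reduces to a majority vote over the frame-level labels; a minimal sketch:

```python
from collections import Counter

def second_level_class(frame_classes):
    """Global decision for one second of audio: the class most frequently
    assigned to its frames (50 frames per second in [10])."""
    return Counter(frame_classes).most_common(1)[0][0]

print(second_level_class(["speech"] * 30 + ["music"] * 20))  # -> speech
```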
2.4.1 Hidden Markov Models

HMMs are commonly used for classification. In [3] they are used together with a Bayesian Network Classifier. The feature vector sequence is used as input to the model, which has one state for each class. Only two classes are used in [3]: one for speech and one for music. The probabilities of transitions between states are computed on learning data and stored in the model.
In [6] an HMM with 24 states is used for speech, and another model is used for music. The use of three or four states could correspond to some phoneme classes, instead of having only one class for speech.
2.4.2 System learning

Different methods are used to train classification models. The Viterbi algorithm is used for HMM training in [11], and the Baum-Welch algorithm is another alternative for the system's learning process.
2.4.3 Refined classification

A method to refine the results of the classification is described in [8]. Four states are used: one for speech, one for music, one for transitions to music and one for transitions to speech. If the classifier outputs a music segment while the system is in the speech state, the segment is stored on a stack. If the classifier keeps outputting music segments for a set time, the state changes to music and all the segments on the stack are classed as music. But if the classifier outputs a speech segment within that time, the system goes back to the speech state and all the segments on the stack are classified as speech. The accuracy of the classification is reported to increase by 6.5% when this method is used.
Refinement can be done in both online and offline systems, but can be made more efficient in the latter, where there are no demands on real-time output. The refinement technique described above needs a certain number of segments to perform well; when there are fewer segments, the performance decreases drastically.
2.5 Evaluation methods

The results are often evaluated, as in [3], with the measures recall, precision and overall accuracy. In [3], recall is defined as the proportion of the frames of a specific class that were correctly classified, and precision is defined as the proportion of the frames classified as a specific class that actually belonged to that class. A total accuracy is then calculated as the overall percentage of correctly classified data.
Systems can be optimized for either music or speech to raise the precision of that specific class, although this often decreases the overall accuracy.
2.6 Earlier results
Comparing earlier systems for SMD tasks is not easy. There is no standard database for evaluating them, as there is for speaker and speech recognition systems. This makes it hard to actually rank the existing systems. Most articles report systems with accuracies above 90%, and in [9] an accuracy as high as 98% is reported.
3 Evaluation and tests
This chapter presents the tests done within this work. The results of the tests were evaluated in order to find the best method for the specific task. The test phase was done in three steps: feature and classification tests, segmentation tests, and tests of the complete algorithm in which both classification and segmentation were evaluated. The algorithm tests are presented at the end of chapter 4.
3.1 Test database
A test database was needed to run all the tests. Unfortunately, there is no standardized test database for SMD tasks, which makes the test results harder to compare. However, in this specific case the test database was constructed from material from Swedish Radio broadcasts, to match the actual material that will be used as input when the algorithm is implemented in a production environment.
Material was selected from Swedish Radio's digital archive Digas to cover all kinds of genres. The material is made up of whole programs that were aired on radio and was selected in order to get a wide spread of included material.
Three sets of test audio were selected and extracted from the material. The first set consisted of 30-second audio files containing only speech or only music. The speech examples included female and male voices, interviews, phone interviews, sport commentary and more. The music examples ranged over many genres from different programs. These were used for the feature tests, to be able to value each feature and calculate the correlation between the features. The second set consisted of one-minute audio files containing both speech and music, with at least one transition between classes. The files were selected so that the transition could be anywhere within the file. These were used for the segmentation tests. The third and last set consisted of whole programs containing both music and speech segments. The length of these files varied from a couple of minutes up to 90 minutes. These were used to test the complete algorithm.
3.2 Tools
Audacity [15] was used to edit and convert audio files when constructing the test database.
Sonic Visualiser [16], together with various Vamp plugins [17], was used for early analysis of the test database. Sonic Visualiser is developed by the Centre for Digital Music, Queen Mary University of London [18], and is an easy way to get a quick look at how feature values change in audio.
MATLAB was used for testing and evaluating during the whole process. Many of the features were extracted using the MIR Toolbox [19], developed by the University of Jyväskylä [20] in Finland, and the rest of the features were extracted using custom-written code.

The final algorithm is written in C because of its efficiency and speed, using the libsndfile library [13] for reading wave files.
3.3 Feature tests
The first test was to check whether the selected features have any significance for the classification task. The tested features were chosen from earlier algorithms that performed the SMD task with good results, and a few were added based on personal hypotheses.
Some initial tests were also done with other features, like flux, other rhythm features and different uses of MFCC; however, since these features did not show any potential for the SMD task, they were not included in these tests.
In these tests the features were evaluated for the classification purpose, and the correlation between features was measured. The chosen features were extracted from the test material in the first set, containing only one class per file, to get a distribution of values for each class. Most features were tested in five different ways: the extracted value, the variance, the standard deviation, the derivative and the standard deviation of the derivative. Both the standard deviation and the variance derive from the same data, since the variance is the square of the standard deviation; because of this, only the standard deviation results are presented. Histograms of the results were then generated to visualize the distributions. Features that showed interesting results are discussed further later in this chapter.
The tested features were RMS amplitude, Zero Crossing Rate (ZCR), Mel Frequency Cepstrum Coefficients (MFCC), Spectral Centroid (SC), Pulse Clarity (PC) and Modified Low Energy Ratio (MLER). Their abbreviations will be used during the rest of this chapter, together with an abbreviation for the way the feature is used, using the following naming conventions.
SD   Standard Deviation
D    Derivative
SDD  Standard Deviation of the Derivative
This means that the standard deviation of the Zero Crossing Rate is abbreviated ZCR|SD. A total of 29 different feature variations were tested in these tests.
3.3.1 Feature test results
Feature      Frame length   Accuracy
RMS          20 ms          0.639
RMS|SD       1 s            0.829
RMS|D        1 s            0.548
RMS|SDD      1 s            0.764
ZCR          20 ms          0.588
ZCR|SD       1 s            0.837
ZCR|D        1 s            0.550
ZCR|SDD      1 s            0.835
SC           1 s            0.581
SC|SD        30 s           0.792
SC|D         30 s           0.589
SC|SDD       30 s           0.958
MFCC1        20 ms          0.517
MFCC1|SD     1 s            0.548
MFCC1|D      1 s            0.525
MFCC1|SDD    1 s            0.553
MFCC2        20 ms          0.521
MFCC2|SD     1 s            0.548
MFCC2|D      1 s            0.525
MFCC2|SDD    1 s            0.553
MFCC3        20 ms          0.521
MFCC3|SD     1 s            0.548
MFCC3|D      1 s            0.525
MFCC3|SDD    1 s            0.553
MFCC4        20 ms          0.519
MFCC4|SD     1 s            0.548
MFCC4|D      1 s            0.525
MFCC4|SDD    1 s            0.553
MLER         1 s            0.969
PC           5 s            0.807

Table 1. Feature test results. Frame length is the analysis frame duration; accuracy is given as a fraction (1.00 = 100%).
Results of the feature tests are shown in Table 1 above. The accuracy of each feature is a measure of how far apart the speech and music distributions are. It was calculated by finding the threshold value with the fewest misclassifications, testing all threshold values in an interval specified for each feature. The number of misclassified frames for the best threshold was counted, divided by the total number of frames, and the result subtracted from 1:

A = 1 − min(misclass) / nFrames    (8)
Features with accuracy results close to 50% can be considered random, containing no useful information for the classification. The threshold optimization described above is the reason that all features performed above 50%.
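The threshold optimization behind equation (8) can be sketched as a simple sweep. The interval bounds, the step count and the assumption that speech values lie above the threshold are all illustrative choices of this sketch, not the thesis's actual parameters:

```c
#include <stddef.h>

/* Sweep candidate thresholds over [lo, hi] and keep the one with the
 * fewest misclassifications; return A = 1 - min(misclass)/nFrames.
 * Assumes speech values lie above the threshold and music values below. */
static double best_accuracy(const double *speech, size_t ns,
                            const double *music, size_t nm,
                            double lo, double hi, int steps)
{
    size_t best_err = ns + nm;
    for (int s = 0; s <= steps; s++) {
        double thr = lo + (hi - lo) * s / steps;
        size_t err = 0;
        for (size_t i = 0; i < ns; i++) if (speech[i] <= thr) err++;
        for (size_t i = 0; i < nm; i++) if (music[i] > thr) err++;
        if (err < best_err) best_err = err;
    }
    return 1.0 - (double)best_err / (ns + nm);
}
```

Because the best threshold is always taken, even a feature with no discriminative power scores somewhat above 50%, which is the effect described in the text.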
The five features with the highest accuracies were the Modified Low Energy Ratio (97%), the standard deviation of the derivative of the Spectral Centroid (96%), the standard deviation of the Zero Crossing Rate (84%), the standard deviation of the Root Mean Square (83%) and the Pulse Clarity (81%). Surprisingly, none of the MFCC features showed any useful results in these tests; they got the worst results of all features. The extracted values for the MFCC parameters gave accuracies just above 50% even with the optimal threshold. Since small variations can depend on the rather small test database, the MFCC features themselves can be regarded as useless for this kind of classification task; the reason they scored over 50% at all is, again, that the best threshold value is sought, which inflates the accuracy. However, MFCC features can still be used in other ways, discussed later in this report.
The correlation between the top features cannot be calculated directly, since they use different lengths of analysis frames and therefore generate different numbers of data points. Instead, the correlation has been calculated by comparing the mean values of each feature for every file in the first set of the test database. The MLER feature uses the pauses in speech to discriminate between the classes, and so does the ZCR, so they can be expected to show high correlation. Unfortunately, all of these features have problems with the same kind of audio, namely speech with background noise of some kind, which is often classified as music.
         MLER   SC|SDD   ZCR|SD   RMS|SD   PC
MLER     –
SC|SDD   0.93   –
ZCR|SD   0.97   0.97     –
RMS|SD   0.97   0.93     0.97     –
PC       0.87   0.77     0.77     0.83     –

Table 2. Cross-correlation of the top five features from the accuracy tests.
As seen in Table 2, the top features are all strongly correlated. The Pulse Clarity feature shows the least correlation with the others, while the standard deviation of the Zero Crossing Rate showed correlations as high as 97% with both the Modified Low Energy Ratio and the standard deviation of the derivative of the Spectral Centroid.
3.3.2 Analysis of Low-level features
The low-level features are interesting because they incur relatively low computation costs. However, both RMS and ZCR showed weak results in the feature tests. RMS performed somewhat better, which can be seen in the histograms below, where large differences between the red and blue distributions give good discrimination possibilities. Both RMS and ZCR were extracted for every frame. Speech typically contains many frames with RMS values close to zero, while music has almost no zero-energy frames and a peak at about 0.4. The close-to-zero-energy frames in speech represent the pauses between syllables that always exist in speech recordings without background sound. The ZCR values are centered round the same mean value for both classes, but music has a higher peak, i.e. a lower standard deviation. This is also what gives the good results for ZCR|SD and ZCR|V.
Figure 9. Histograms of feature values. Left: RMS. Right: ZCR. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered around the X-values.
3.3.3 Analysis of MFCC
The MFCC features have been used frequently in MIR algorithms. However, as seen in the figure below, all the MFCC values are centered round the same values for both classes, with only some differences in the standard deviations. As seen in the test results, these variations of the features also scored higher than the extracted values, but still with very low top scores.
Figure 10. Histograms of feature values. Top left: MFCC1. Top right: MFCC2. Bottom left: MFCC3. Bottom right: MFCC4. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.
3.3.4 Analysis of Modified Low Energy Ratio
As seen in Table 1, the absolute values of the features are often not a good way to discriminate between the two classes. A better way might be to use how the feature changes over time, as done by the MLER feature.
Figure 11. MLER. Left: threshold optimization; the Y-axis shows the percentage of correctly classified frames and the X-axis shows the threshold. Right: with threshold 0.26. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.
Figure 11 shows the Modified Low Energy Ratio as described earlier in the report. When using a low-energy threshold of 0.26 times the mean RMS value, the classification achieved 97% accuracy. If the MLER equals zero, the frame is classified as music; otherwise it is classified as speech. Almost all the speech frames with zero MLER come from sport commentary with an audience in the background. The left graph shows the optimization of the total error rate. In some applications it might be better to optimize for either speech or music.
3.4 Segmentation tests
The segmentation tests were done on the audio files containing at least one transition between classes. The exact positions of the transitions within the files were noted as a ground truth for the tests. These positions are not absolute, and a small test showed that different people would place them differently. Sometimes a transition can be a segment of silence up to half a second long, and it would be acceptable for the segmentation to place the transition anywhere in this silence, since silence was not defined as either speech or music. For these two reasons, a deviation of up to 100 ms was allowed and counted as a clean hit. A deviation of 0.1-1 seconds counted as a hit, and the distance from the border of the clean-hit segment was calculated. If the segmentation marked a transition more than 1 second away from the ground truth, this counted as a miss.
Three measures were then used to evaluate the segmentation techniques:
Hit efficiency: A measure of how many hits were found. The misses are subtracted from the hits, and the difference is divided by the total number of transitions.

Hit_eff = (Hits − Misses) / Transitions    (9)
Hit accuracy: An accuracy measure where only the hits are considered. All the distances are added, and a mean value is calculated by dividing by the total number of hits. A clean hit counts as zero distance.

Hit_acc = ( Σ_{n=1..N} |Hitpos_n − Transpos_n| ) / N    (10)

where N is the number of hits.
Hit rate: A simple measure where, once again, only the hits are considered. The total number of hits is divided by the total number of transitions in the test files.

Hit_rate = Hits / Transitions    (11)
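A sketch of how detections could be scored against the ground truth under the rules above (clean hit within 100 ms, hit within 1 s scored by its distance to the clean-hit border, miss beyond 1 s). The struct and function names are this sketch's own; times are in seconds:

```c
#include <math.h>
#include <stddef.h>

struct seg_score { int hits, misses; double mean_dist; };

static struct seg_score score(const double *truth, size_t nt,
                              const double *det, size_t nd)
{
    struct seg_score s = {0, 0, 0.0};
    double total = 0.0;
    for (size_t i = 0; i < nd; i++) {
        /* distance to the nearest ground-truth transition */
        double d = INFINITY;
        for (size_t j = 0; j < nt; j++) {
            double dj = fabs(det[i] - truth[j]);
            if (dj < d) d = dj;
        }
        if (d <= 0.1)      { s.hits++; }                 /* clean hit: distance 0 */
        else if (d <= 1.0) { s.hits++; total += d - 0.1; } /* hit: border distance */
        else               { s.misses++; }               /* miss */
    }
    s.mean_dist = s.hits ? total / s.hits : 0.0;         /* Hit accuracy, eq. (10) */
    return s;
}
```

Hit efficiency and Hit rate then follow directly from the counts: `(s.hits - s.misses) / transitions` and `s.hits / transitions` per equations (9) and (11).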
None of these measures considers the computation times; these are discussed in the test evaluations below.
In the first part of the test, the region growing technique was tested against the neighboring difference technique, both described in section 2.3. Both techniques were tested with the same features.
Technique               Hit efficiency   Hit accuracy   Hit rate
Region growing          81%              79 ms          100%
Neighboring difference  94%              44 ms          99%

Table 3. Segmentation technique test results.
As seen in Table 3, the neighboring difference technique achieved better results in both Hit efficiency and Hit accuracy. The Hit rate measure is harder to evaluate, but results as close to 100% as possible are good. The neighboring difference technique failed to detect 1% of the transitions but, as can be seen in the Hit efficiency measure, it also produced far fewer misses. The choice was made to pursue only the neighboring difference technique because of its efficiency and accuracy.
Feature   Hit efficiency   Hit accuracy   Hit rate
MLER      95%              19 ms          99%
SC|SDD    93%              70 ms          99%
ZCR|SD    88%              24 ms          99%
RMS|SD    87%              27 ms          99%
PC        88%              72 ms          99%

Table 4. Segmentation feature test results, showing the top five features.
In the second test, all features from the feature tests were tested with the neighboring difference technique. The same kinds of features performed well in these tests too, although features with short analysis windows and short frames performed better and showed better Hit accuracy results. Table 4 shows that the top four features all detected 99% of the transitions, but the MLER feature achieved the best results in both the Hit efficiency and the Hit accuracy measures. Further tests were then done with the MLER feature, and both the efficiency and the accuracy improved when shorter frame lengths were used together with a combination of the mean RMS and the variance of the RMS. A Hit efficiency of 96% and a Hit accuracy of 17 ms were then achieved.
3.5 Design criteria
There are many aspects to consider when choosing the features to use in the algorithm. The test results need to be thoroughly analyzed with regard to these aspects:
• Accuracy of the feature, i.e. its ability to discriminate between the two classes, needed for both classification and segmentation. Ideal features have similar values within one class and a clear difference from other classes.
• Computational costs. Costly operations should be avoided to make the discrimination task efficient and cheap. Even offline systems need to be fast to improve cost efficiency.

• Correlation between chosen features. The use of two different features needs to be motivated by improved results. If two features are highly correlated, any improvement in accuracy will be relatively costly.

• Insensitivity to noise in the input signal.
The MLER feature was chosen for its excellent accuracy and because it combines well with the neighboring difference segmentation technique. Both the segmentation and the classification are based on one single low-level feature, the RMS amplitude, which makes the computational costs very low. Since the high-performing features were all highly correlated, adding another feature was not motivated: the improvement in accuracy would be costly, and another classification method would be needed. When the RMS amplitude is normalized to the file maximum, it is also insensitive to overall sound levels, but background noise will still be a problem in the classification.
The three calculated measures from the segmentation tests were considered when choosing the segmentation method. The Hit rate was considered the most important, since misses (transition detections where no transition is within one second) can be discarded in an algorithm using segmentation refinement methods. The Hit efficiency was only considered when two tests showed the same Hit rate. The computation times, too, were considered when choosing the final method for the algorithm.
The neighboring difference technique showed better results for all features and was therefore the strongest candidate. It is also a good match when the MLER is used for the classification task, since it achieved good results in Hit accuracy when using RMS. Computation times are largely reduced, since both the classification and the segmentation use the same extracted features.
4 Algorithm
This chapter contains a detailed description of the final algorithm that was coded and delivered to Swedish Radio. The algorithm is written in C and uses the libsndfile library [13] to read and write wave files. The discrimination is done on audio files, and hence this is an offline procedure.
4.1 Signal preprocessing
Before anything else is done with the audio, a few preprocessing steps are performed on the audio signal.
There is no added information in the difference between two channels that can be used for the classification or the segmentation. It is therefore desirable to have a mono signal, to simplify later processing. The algorithm checks the number of channels in the audio; if the signal has more than one channel, it is mixed down to mono.
The amplitude of the signal is then normalized to the maximum amplitude of the whole file, to remove any effect the overall amplitude level might have on the feature extraction.
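The two preprocessing steps might look like this for interleaved float samples. The buffer layout and all names are illustrative assumptions of this sketch, not Swedish Radio's actual code:

```c
#include <math.h>
#include <stddef.h>

/* Downmix interleaved multichannel samples to mono by averaging the
 * channels, then normalize the mono signal to the file's peak amplitude. */
static void preprocess(const float *in, size_t frames, int channels, float *mono)
{
    float peak = 0.0f;
    for (size_t i = 0; i < frames; i++) {
        float sum = 0.0f;
        for (int c = 0; c < channels; c++)
            sum += in[i * channels + c];
        mono[i] = sum / channels;                 /* downmix to mono */
        if (fabsf(mono[i]) > peak)
            peak = fabsf(mono[i]);
    }
    if (peak > 0.0f)                              /* normalize to file maximum */
        for (size_t i = 0; i < frames; i++)
            mono[i] /= peak;
}
```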
4.2 Feature extraction
After the audio signal has gone through the preprocessing part, it is split into 21 ms non-overlapping frames. The RMS amplitude is then calculated for each frame using equation 1.
Once the RMS amplitude has been calculated, the frames are grouped together to form 1 second (48 short frames) analysis frames. These are also non-overlapping. Four features based on the RMS amplitude values are then extracted from each 1 second frame: the mean RMS, the variance of the RMS, a locally normalized variance of the RMS, and the Modified Low Energy Ratio (MLER). The normalized variance is the variance of the RMS divided by the mean RMS.
The Modified Low Energy Ratio is the proportion of low-energy short frames within the 1 second frame. The threshold for low energy depends on the mean RMS amplitude: the mean RMS amplitude in the analysis frame is multiplied by a predefined value determined from the test results. Speech contains many small pauses between syllables and words and therefore has a higher MLER than music. This is used later to classify each segment.
4.3 Segmentation
The task of the segmentation part of the algorithm is to find the exact positions of transitions between the two classes. The segmentation is based, just like the classification, on the features extracted earlier. Every 1 second frame is examined to look for candidate transition frames, and then an exact position for each transition is found.
This is a modified version of the segmentation done in [2]. The advantage of this method is that it uses the same base feature, the RMS amplitude, as the classification. This means that no further feature extraction is needed, which saves computation time and minimizes the reading of the audio files. The RMS amplitude is used in a different way here, since the MLER feature used for classification requires longer analysis frames, while the mean and variance of the RMS change as quickly as the audio classes do.
The first step is done by looking at the frame before and the frame after the examined frame. If the two neighboring frames are different, the examined frame is likely to contain a transition. The transition can be anywhere within that second, and the exact position is determined in the next step. A problem occurs if the class changes twice within the examined frame: the two neighboring frames will then not differ enough for the frame to be chosen as a transition candidate. However, these errors are not corrected in the segmentation, since segments shorter than 2 seconds are removed later.
Figure 12. Waveform with a transition from speech to music. Seconds are marked with vertical lines.
The comparison between neighboring frames is based on the mean and the variance of the RMS values. Since the distribution of the amplitude of the audio signal is Laplacian, as shown in [2], a probability density function of the χ² type is used, defined as

p(x) = x^a e^(−x/b) / ( b^(a+1) Γ(a+1) )    (12)
where x ≥ 0, Γ is the gamma function, and the two parameters a and b are defined as

a = μ²/σ² − 1  and  b = σ²/μ    (13)
where μ is the mean RMS and σ² is the RMS variance. The similarity measure is based on the probability density functions:

p(p1, p2) = ∫ √( p1(x) p2(x) ) dx    (14)
where p1 and p2 refer to the probability functions of the two frames. When the χ² distribution in equation (12) is inserted in equation (14), this gives a similarity measure calculated as

p(p1, p2) = [ Γ((a1+a2)/2 + 1) · 2^((a1+a2)/2 + 1) · b1^((a2+1)/2) · b2^((a1+1)/2) ] / [ √( Γ(a1+1) Γ(a2+1) ) · (b1+b2)^((a1+a2)/2 + 1) ]    (15)
Since the measure is calculated on the two frames surrounding the examined frame i, the dissimilarity measure is defined as

Dissim(i) = 1 − p(p_(i−1), p_(i+1))    (16)

This will give high probabilities of change even for the frames surrounding a transition. As seen in Figure 12, of the five seconds marked by the vertical lines, seconds 2 and 4 will differ most in mean and variance of the RMS; however, seconds 3 and 5 will also give high values in the dissimilarity measure. To dampen the effect of this error, a filter needs to be applied. The dissimilarity value is therefore locally normalized over 5 seconds, with the examined frame in the centre. The normalization is calculated as
Dissim_norm(i) = Dissim(i) × ( D(i) − ( D(i−2) + … + D(i+2) ) / 5 ) / max( D(i−2), …, D(i+2) )    (17)

where D(i) = Dissim(i). The dissimilarity measure is multiplied by the positive part of the difference between D(i) and the mean of its neighborhood; if the difference is negative, it is set to zero. The result is then divided by the maximum value in the neighborhood.
A threshold for the normalized dissimilarity value is then set, according to the results on the test material, to determine which frames are selected as transition candidates. The threshold is variable and depends on the variance of the RMS in the neighboring frames.
When the candidate transition frames have been chosen, an exact position for each transition needs to be found. This is done in a way similar to the previous step: for every 20 ms frame, the preceding and the following one-second windows are compared. Each 20 ms frame is given a value for the probability of change, and the frame with the highest probability is marked as the exact position of the transition.
4.4 Classification
Only the MLER feature is used for the classification part. A predefined threshold is set, and all segments with a higher average MLER than the threshold are classed as speech, while everything below the threshold is classed as music. The use of only one feature reduces the computation time. The algorithm needs no training material to work, since the threshold is set according to the results on the test material.
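The classification rule reduces to a mean comparison. A minimal sketch, with illustrative names and an illustrative threshold value:

```c
#include <stddef.h>

enum audio_class { CL_MUSIC = 0, CL_SPEECH = 1 };

/* Classify a segment from its per-second MLER values: average MLER
 * above the threshold means speech, otherwise music. */
static enum audio_class classify(const double *mler, size_t n, double thr)
{
    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += mler[i];
    mean /= n;
    return mean > thr ? CL_SPEECH : CL_MUSIC;
}
```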
4.5 Refinement
Simple refinements of the segments are done after the classification. If two consecutive segments are given the same class, they are merged together and the transition between them is erased.
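The merge step can be sketched as a single in-place pass over a list of (start time, class) pairs; the segment representation is this sketch's assumption:

```c
#include <stddef.h>

struct segment { double start; int cls; };

/* Merge consecutive segments of the same class in place and return
 * the new number of segments; dropped entries erase the transitions. */
static size_t merge_segments(struct segment *seg, size_t n)
{
    if (n == 0) return 0;
    size_t out = 1;                       /* first segment always kept */
    for (size_t i = 1; i < n; i++)
        if (seg[i].cls != seg[out - 1].cls)
            seg[out++] = seg[i];          /* keep: class changed */
    return out;
}
```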
4.6 Output
When all the processes (segmentation, classification and refinement) are done, the results are ready to be output. The algorithm creates a simple text file with one line for each segment. Each line contains the exact position of the start of the segment and a binary number showing the class of the segment. The output could look like this:
1 – 0.000000
0 – 9.770833
1 – 27.098166
where the 0 stands for speech (1 for music) and 9.770833 is the exact position of the transition measured in seconds. The position of the transition is given in seconds for easier handling by other applications; if the position were marked with an exact frame, a third-party application would also have to know the sample frequency of the audio. The transition frame is easily calculated as

F = t · sr    (18)

where t is the position of the transition measured in seconds and sr is the sample rate of the audio.
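A sketch of the output stage: one text line per segment and the seconds-to-frames conversion of equation (18). The exact line formatting and the truncating conversion are assumptions of this sketch:

```c
#include <stdio.h>

/* Equation (18): convert a position in seconds to a sample frame index.
 * Truncation toward zero is this sketch's choice of rounding policy. */
static long to_frame(double t, long sr)
{
    return (long)(t * sr);
}

/* Write one line per segment: class flag, then start position in seconds. */
static void write_segments(FILE *out, const double *start,
                           const int *cls, int n)
{
    for (int i = 0; i < n; i++)
        fprintf(out, "%d - %f\n", cls[i], start[i]);
}
```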
A Flash player that uses the output data can then read the text file and mark the segments in the navigation bar, as seen in Figure 13. This can be used as a guide when editing the audio, or simply to look for errors in the output.
Figure 13. Flash player with marked segments in the navigation bar. Green shows speech segments and white shows music segments.
The Broadcast Wave Format contains a marker chunk, which could be used to mark the transition points of the segments. This is not yet integrated in the application, but could be done for specific implementations.
4.7 Results
The algorithm tests were performed on the finished algorithm. Audio files containing full-length programs from Swedish Radio were used as input, and the segmentation and classification were evaluated at the same time. The resulting accuracy is the percentage of time where the right class is found at the right time: the length of the correctly classified audio is divided by the total length of the program.
          Speech   Music
Speech    95.4%    4.6%
Music     1.9%     98.1%

Table 5. Algorithm results. The left column shows the input class and the top row shows the output of the algorithm.
The left column of the table shows the class of the input audio, and the top row shows the class output by the algorithm. Speech reaches a lower accuracy because of the sport commentary segments, which are often classed as music, while music is correctly classed as music as often as 98.1% of the time.
        Right    Wrong
Total   97.3%    2.7%

Table 6. Summarized results for all inputs to the algorithm.
This gives a total accuracy of 97.3%, since the test material contained more music than speech.
The computation time of the algorithm varies depending on the format of the input and the number of input channels. The computation time did not exceed 1% of the length of the audio for any of the test files.
5 Conclusions
The final algorithm gives an accuracy of over 97% in the tests performed with material from Swedish Radio. This matches the results reported in earlier work, yet without any advanced features that require long computation times. An accuracy of 97% is also enough for most applications. In some applications the misclassified audio still needs to be considered; since the tests show what kind of audio gives low accuracies, those files can be discriminated manually.
A graphical interface where the segments are marked could be of good use for manual work with the audio. The algorithm does the discrimination job, but the results might still need refinement. This is true for an application for automatic editing of podcast material, where the results of the discrimination can be used as a guide, but some editing, like cross-fades, is still needed to get a good-sounding result.
Offline systems benefit most from faster computation times. Real-time, online systems still have to use the same analysis window lengths to gather the needed statistics, and can benefit only by being able to use more and more advanced features. Offline systems, on the other hand, can both use more advanced features to achieve higher accuracy and at the same time compute faster.
6 Future work
There are still endless features and feature combinations to be tested for this task. The features tested during this work are still quite simple; even the modified special features, like the one used in the algorithm, are based on only one low-level feature. Further tests could also be done with MFCC features, even though they showed poor results in the present tests. Combinations of different MFCC features have been tested in earlier work, and further exploring relations between different MFCCs, such as distances between the second and third coefficients, could give good results.
More complex features, like pitch curve extraction, have been discussed and tentatively tested, but there is still much to explore in this area. Pitch features should be a good complement to the MLER feature, since they work on different aspects of the sound. Features based on rhythm could also be a good complement to the MLER feature.
Adding subclasses to both the music and the speech classes would be useful for many applications. Swedish Radio has already requested the possibility to discriminate between male and female speakers. Music could be split into genre subclasses for further discrimination. There is already some work done on genre classification, but more research is needed to create a useful application.
Constructing a low-cost real-time discriminator that could be inserted in car radio receivers could be another kind of project. Listening in cars often involves noisy environments where speech needs to be amplified more than music to increase audibility, and different listening environments demand different amplification of speech. In such an application, the processing cannot be done in the transmitter and needs to be done in real time.
7 Acknowledgements
This work has been done with much help from both Swedish Radio and the Royal Institute of Technology. Special thanks to:

My supervisor at Swedish Radio, Björn Carlsson, for all the help with technical questions about radio and for all the encouragement during the work.

My supervisor at the Royal Institute of Technology, Anders Friberg, for all the feedback and ideas, and for all the special knowledge in feature extraction and classification.

Hasse Wessman and Lars Jonsson at Swedish Radio, Technical Development, for giving me the opportunity to do this project and an inspiring working environment.

The rest of the staff at Swedish Radio, Technical Development, for feedback and motivation.

My friend John Häggkvist, who helped me out during the coding of the algorithm.