

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 10, DECEMBER 2012

Structural Segmentation of Multitrack Audio

Steven Hargreaves, Student Member, IEEE, Anssi Klapuri, Member, IEEE, and Mark Sandler, Senior Member, IEEE

Abstract—Structural segmentation of musical audio signals is one of many active areas of Music Information Retrieval (MIR) research. One aspect of this important topic which has so far received little attention, though, is the potential advantage to be gained by utilizing multitrack audio. This paper gives an overview of current segmentation techniques, and demonstrates that by applying a particular segmentation algorithm to multitrack data, rather than the usual case of fully mixed audio, we achieve a significant and quantifiable increase in accuracy when locating segment boundaries. Additionally, we provide details of a structurally annotated multitrack test set available for use by other researchers.

Index Terms—Audio, multitrack, music information retrieval (MIR), structural segmentation.

I. INTRODUCTION

The manner in which humans listen to, interpret and describe music implies that it must contain an identifiable structure. The terms used to describe that structure will vary according to musical genre, but commonly it is easy for humans to agree upon musical concepts such as chorus, verse, melody, beat, bass, movement, solo, noise and so forth. The fact that humans are able to distinguish between these features implies that the same might also be achieved via signal processing; indeed, over the last few years increases in computing power and advances in music information retrieval have resulted in algorithms which can extract features such as timbre [4], [24], [37], tempo and beats [27], note pitches [23] and chords [26] from polyphonic, mixed-source digital music files (e.g. mp3 files, as well as other formats).

A significant problem when attempting to extract features from mixed-source signals is that some complex time- or frequency-domain signal decomposition must usually be performed in order that the salient parts of a signal may be analyzed in isolation; for example, a beat tracker will probably need to disregard long-term orchestral swells, whilst an algorithm designed to extract melodic information must ignore percussive transients. One related area of research which has begun to be investigated by Fazekas et al. [15]–[17], and which avoids this issue, is that of applying the techniques outlined above to the collection of individual source audio tracks available during the recording and/or production stage in the studio.

Manuscript received December 16, 2011; revised May 09, 2012; accepted June 28, 2012. Date of publication July 18, 2012; date of current version October 01, 2012. This work was supported by EPSRC studentship no. EP/505054/1. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. James Johnston.

The authors are with the Centre for Digital Music, Department of Electronic Engineering, Queen Mary University of London, London, E1 4NS, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2012.2209419

Sophisticated Digital Audio Workstations (DAWs) are now commonplace, not only in professional recording studios but also in the amateur musician's home studio, enabling consumers to exploit the kind of music recording and production techniques previously only available to a minority who were fortunate enough either to have the budget or opportunity to gain access to expensive studio time. A facility still lacking, though, is the ability to quickly navigate around the structure of recorded audio. We now take it for granted that we are able to navigate around a word-processed document by character, word, sentence, paragraph, section or chapter, whilst being limited within a DAW to either a fast-forward (or backward) search, a jump to a manually entered temporal label, or a manual scroll through the audio. When describing an audio browser for annotation purposes, Tzanetakis and Cook [36] point out that “The typical ‘tape-recorder’ paradigm for audio user interfaces is time-consuming and inflexible.” The user must rely on audio or visual cues (in the case of examining a waveform display) and his or her own ability to interpret those cues in order to locate a section of interest.

Having access to the original multitrack source audio files theoretically enables us to obtain both a more accurate segmentation, and a richer set of metadata in general, since salient audio features which might otherwise have been occluded to some extent in the mixed version of a song are now able to exert greater influence in our analysis.

It is already possible, using various methods (described in Section II), to structurally segment mixed polyphonic music to a certain extent. The aim of this paper is to demonstrate an improvement in segmentation accuracy when multitrack rather than mixed audio data is analyzed. We achieve this by applying one particular segmentation technique to multitrack audio and comparing the results against (a) results obtained from the same technique applied to mixed versions of the same songs, and (b) results obtained using a state-of-the-art segmentation algorithm (Mauch et al. [26]), again applied to the mixed versions. The potential applications of such a technique are manifold, and include improved synchronization of audio clips across multiple tracks, segment-specific application of audio effects, improved comparison of recording takes, and general editing and navigation tools.

The particular way in which one would structurally segment music is closely tied to the musical genre under consideration. Intuitively, rock and pop music seems like one of the more unambiguous types of music for humans to segment (compared to classical or improvisational jazz, for example) due to the common repetition of melodic phrases, chord progressions and beats. These musical entities are typically repeated every few bars, and sequences of these bars themselves form verses or choruses.



In reality though, whatever the genre, it is in the very nature of music that rules exist to be broken, and so we should never rely too rigidly upon assumptions about metrical structure, chord progressions, time signatures or anything else.

Consequently this research will focus mainly on rock and pop music (we include in this definition genres such as soul, R 'n' B, blues, dance, latin pop, easy listening, folk and electronica). Classical music and jazz will not be considered.

The rest of this paper is set out as follows: in Section II we present a brief overview of some of the techniques commonly used in music segmentation. Our particular method of segmenting multitrack audio is described in detail in Section III. Results are stated and discussed in Section IV, and finally we present our conclusions in Section V.

II. SEGMENTATION BACKGROUND

Several researchers have described methods of carrying out structural segmentation of music. Abdallah et al. [1] employ an unsupervised Bayesian clustering model to classify signal frames according to their audio properties. In their case, audio properties are obtained by calculating a constant-Q log-power spectrum, the dimensionality of which is then reduced using principal component analysis. Aucouturier et al. [4] use MFCCs as the audio feature and a Gaussian mixture model to estimate the distribution of these features, Mauch et al. [26] search for repetition of chroma sequences, whilst Barrington et al. [5] describe a dynamic texture model based upon both timbral and rhythmical features. We concentrate here on the common themes of audio features, self-distance (alternatively known as self-similarity) matrices, homogeneity detection and repetition detection, and the reader is referred to [29] for a comprehensive overview of music structure analysis techniques.

A. Audio Features

In order to start making meaningful inferences about the music represented by an audio signal, it is common to first transform it into quantifiable measures which are more closely aligned with human perception of music than simple amplitude variations (although amplitude does of course play an important role in musical perception). This is the process of converting the audio signal into a sequence of audio feature vectors $x_1, x_2, \ldots$, and there are numerous types of audio features which may be of interest to us.

Perhaps the easiest feature to extract is root mean square (RMS) energy. Tzanetakis and Cook [36] identify the relevance of the RMS energy audio feature to segmentation by noting that it is a measure of loudness, and changes in loudness are important cues for new sound events.

The chroma representation of pitch, proposed in 1964 by Shepard [35], indicates the relative levels of each of the 12 notes of the equal-tempered chromatic scale present in an audio sample, without indicating the octave to which each note belongs (an alternative explanation is also given by Bartsch and Wakefield [6]). Clearly this has direct relevance to the analysis of western music; the ability to determine the relative strengths of each note as time varies offers us the potential to identify both melody and harmony, as well as the repetition and variation of sequences of notes, phrases and chord progressions. Chroma audio features have been successfully utilised in applications such as chorus identification [6], [19], music thumbnailing [6], [11], and cover song identification [14], [34].

Mel Frequency Cepstral Coefficients (MFCCs) are audio features which are commonly used to quantify the "timbre" of a sample of audio. Timbre itself is not clearly defined, but is typically taken to be an indication of the "quality" of a sound. The American National Standards Institute provides this definition: "Timbre is that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar" [2]. In the context of solo instruments we use the term to describe the unique sound of a particular instrument (for example the sound of a clarinet compared to that of a saxophone), and beyond that, we would also talk about the difference in timbre of different instruments of the same class, in order to distinguish (for example) one clarinet from another, or the playing styles of different musicians. In a more general sense we use the term 'polyphonic timbre' to describe the overall sound or texture of mixed, polyphonic audio [4]; for example we might say that each bar of a verse in a pop song has similar timbre, whilst the timbres of the verse and the chorus differ. From a technical point of view, MFCCs are calculated by converting the linear frequency scale into a logarithmic scale (known as the mel scale) which is more closely related to the critical bands of the human ear, and the log-power in each of these bands is calculated. The MFCC coefficients are then calculated by taking the discrete cosine transform of the log-power spectrum

$$c_i = \sum_{n=1}^{N} X_n \cos\left[\frac{\pi i}{N}\left(n - \frac{1}{2}\right)\right] \qquad (1)$$

where $i$ is the coefficient index, $n$ is the subband number, $N$ is the total number of subbands, and $X_n$ is the log energy of subband $n$. Timbre modeling is the key feature used by Aucouturier et al. [4] to perform segmentation via long-term similarity and pattern identification.
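As a purely illustrative sketch of (1), the Python fragment below computes cepstral coefficients from log mel-band energies via a type-II DCT; the function name and the scipy-based call are assumptions for illustration, not the MATLAB implementation used in the experiments of Section III-B.

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_log_mel(log_mel_energies, n_coeffs=13):
    """Sketch of (1): DCT of log mel-band energies.

    log_mel_energies: array of shape (n_bands, n_frames), already log-scaled.
    Returns the first n_coeffs cepstral coefficients for each frame.
    """
    # A type-II DCT along the band axis implements the cosine sum in (1);
    # the 'ortho' norm only changes an overall scaling of the coefficients.
    cepstra = dct(log_mel_energies, type=2, norm='ortho', axis=0)
    return cepstra[:n_coeffs, :]
```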

Jensen [22] describes an audio feature, the rhythmogram, which quantifies the degree of rhythmic change within a piece of audio. First, we calculate the perceptual spectral flux as

$$\mathrm{PSF}(n) = \sum_{k=1}^{N/2} W(f_k)\left( a_k(n)^{1/3} - a_k(n-1)^{1/3} \right) \qquad (2)$$

where $n$ is the frame index, $a_k$ and $f_k$ are the magnitude and frequency of the $k$-th bin of the short-time Fourier transform (STFT) obtained using a Hanning window, and $N$ is the FFT length. $W$ is the frequency weighting used to represent an equal loudness contour. The rhythmogram itself is then calculated using autocorrelation over a short time window (e.g. 2 seconds) from

$$\mathrm{RG}(n, f) = \sum_{m=0}^{M-1} \mathrm{PSF}(n+m)\,\mathrm{PSF}(n+m+f) \qquad (3)$$

where $M$ is the length of the summing window, $n$ is the frame index, and $f$ is the feature index for frame $n$.
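The computation in (2) and (3) can be sketched as follows; the equal-loudness weighting is passed in as an arbitrary per-bin curve, and the loop-based autocorrelation is a deliberately literal, unoptimized reading of (3).

```python
import numpy as np

def perceptual_spectral_flux(mag, weights):
    """Sketch of (2). mag: STFT magnitudes, shape (n_bins, n_frames).
    weights: per-bin equal-loudness weighting W(f_k), shape (n_bins,)
    (the exact curve used in [22] is not reproduced here)."""
    compressed = weights[:, None] * np.cbrt(mag)   # cube root approximates loudness
    flux = np.diff(compressed, axis=1)             # frame-to-frame change per bin
    return np.sum(flux, axis=0)                    # summed over bins, as in (2)

def rhythmogram(psf, M, max_lag):
    """Sketch of (3): windowed autocorrelation of the flux.
    M: summing window length in frames; max_lag: number of lag bins."""
    n_rows = max(len(psf) - M - max_lag + 1, 0)
    rg = np.zeros((n_rows, max_lag))
    for n in range(n_rows):
        block = psf[n:n + M]
        for f in range(max_lag):
            rg[n, f] = np.dot(block, psf[n + f:n + f + M])
    return rg
```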


Other features which may be of interest include those suggested by Tzanetakis and Cook [36] for multi-feature segmentation (namely spectral centroid, spectral roll-off, spectral flux, and zero crossings), and normalized constant-Q spectra subjected to principal component analysis (used by Levy et al. [25] and Abdallah et al. [1] for high-level musical structure analysis).

Careful selection of either a single type of audio feature, or, as suggested by Tzanetakis and Cook [36], a combination of features, allows us to proceed to a study of the higher levels of musical information contained within the audio signal; for example the variations and repetitions of pitch, melody, dynamics, chords, harmony and so forth. Segment boundaries themselves are often indicated by significant changes of multiple features [7]. More evidence of the usefulness of multiple features is given by Bruderer [10], who finds that important cues are harmonic progressions, change in timbre, change in tempo and change in rhythm.

B. Self-Distance Matrices

Audio features alone do not present us with a structural segmentation of musical audio; we must perform further processing of these features in order to deduce the locations of regions of similarity or repetition. One possible step in this process is to employ a widely used technique known as self-similarity (or alternatively self-distance) matrix calculation, as proposed by Foote [18]. Using a suitable distance measure, such as the cosine angle between two audio feature vectors

$$D(i, j) = 1 - \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|} \qquad (4)$$

where $x_i$ is the feature vector associated with frame $i$, and $i$ and $j$ are frame indices, the signal is compared with itself in terms of one or more audio features.

The result of calculating these distance measures across all feature vectors is a two-dimensional representation: the self-distance matrix. By assigning different colors to the values in this matrix we are able to produce an informative visualization of the self-similarity in the audio signal. An example derived from the chroma features of the song "People let's stop the war" by Brad Stanfield (a pop/rock song with clear chorus, verse and bridge sections) is shown in Fig. 1. The temporal locations of the ground truth segment boundaries are shown above the self-distance matrix; a clear correlation can be seen between the ground truth locations and the vertical lines dividing regions of homogeneous colour in the matrix image.

Fig. 1. Ground truth segments (top) and visualisation of a self distance matrix for a musical audio signal (bottom).

C. Beat-Aligned Frames

When we later come to pick out segment boundaries from this self-distance matrix, the temporal accuracy to which we are able to operate will inevitably be limited by the length of the audio frames we use to calculate each audio feature vector. As long as these frames correspond to sufficiently short periods of time we will be able to pinpoint temporal locations to a desirable level of accuracy. Intuitively we might expect that boundaries are more likely to fall on strong beats (as opposed to either weak or no beats), and research into boundary perception by Bruderer [10] supports this hypothesis. Consequently it would be helpful if we were to choose the length of the audio frames such that they correspond to beat intervals present in the audio. By first analyzing the audio using a beat tracking algorithm, we are then able to choose frame lengths such that any frame we select as a boundary is guaranteed to coincide with a beat (assuming the results of the beat analysis are sufficiently accurate). This enables us to cope with changes in tempo by varying the frame lengths in accordance with variations in the distances in time between beats. Furthermore, the number of elements present in the self-distance matrix, and therefore also the computational complexity, is significantly reduced.
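A sketch of this beat-alignment step, assuming frame-level features, frame times and beat times from a beat tracker are already available; the median aggregation is an assumption rather than a documented choice.

```python
import numpy as np

def beat_synchronous(features, frame_times, beat_times, aggregate=np.median):
    """Aggregate frame-level features into beat-aligned frames.

    features: (n_frames, n_dims) frame-level feature matrix.
    frame_times: (n_frames,) time in seconds of each analysis frame.
    beat_times: (n_beats,) beat locations from a beat tracker, in seconds.
    Returns one aggregated feature vector per inter-beat interval.
    """
    out = []
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        mask = (frame_times >= start) & (frame_times < end)
        if np.any(mask):
            out.append(aggregate(features[mask], axis=0))
        else:
            out.append(np.zeros(features.shape[1]))  # no frame fell in this interval
    return np.array(out)
```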

D. Homogeneity Detection

Visualizations of self-distance matrices offer a valuable insight into the structure of a piece of music. We still need, however, to perform further analysis of the data in order to derive a set of segment boundaries. Noting that areas of homogeneity are represented as square or rectangular blocks in the visualizations, Foote [18] proposes a method wherein we determine the variation in the level of correlation (the 'novelty score') between a simple binary checkerboard pattern (a kernel) and the self-distance matrix as we slide the kernel along the main diagonal of the self-distance matrix. Explicitly, we first create an $L \times L$ kernel matrix $C$ (the two-by-two case is shown in (5)).

$$C = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \qquad (5)$$

The time scale upon which variations can be detected is proportional to the size of the kernel, and so if we require larger kernels, they are formed by taking the Kronecker product of $C$ and a matrix of ones, e.g. (again, using the two-by-two example)

$$C \otimes \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ -1 & -1 & 1 & 1 \\ -1 & -1 & 1 & 1 \end{bmatrix} \qquad (6)$$

The novelty score is then calculated as:

$$N(i) = \sum_{m=-L/2}^{L/2} \sum_{n=-L/2}^{L/2} C(m, n)\, D(i+m, i+n) \qquad (7)$$

where $i$ is the frame number, $L$ is the width (lag) of the kernel $C$ centered on (0,0), and $D$ is the self-distance matrix. Peaks in the novelty score correspond to significant position changes in our multidimensional feature space. Consequently, locating segment boundaries becomes a matter of determining which of the peaks represent a sufficiently large change in feature space position as to constitute a segment boundary. The novelty score derived from the same chroma features used for Fig. 1 is shown in Fig. 2, along with the same ground truth segment data. Again, good, although not perfect, correlation between the ground truth segments and the peaks in the novelty score can be seen. Segment boundaries are therefore found by employing some method of selecting the peaks in the novelty score; in our case we try two different methods for comparison purposes (see Section III-B).

Fig. 2. Ground truth segments (top) and (normalised) novelty score for a musical audio signal (bottom).
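The checkerboard kernel of (5)-(6) and the novelty score of (7) can be sketched as follows; whether boundaries show up as maxima or minima of the raw score depends on whether a distance or a similarity matrix is supplied.

```python
import numpy as np

def novelty_score(sdm, kernel_size=32):
    """Foote-style novelty: slide a checkerboard kernel along the SDM diagonal (7).

    sdm: square self-distance matrix. kernel_size: full width L of the kernel.
    """
    half = kernel_size // 2
    base = np.array([[1.0, -1.0], [-1.0, 1.0]])
    kernel = np.kron(base, np.ones((half, half)))   # (6): Kronecker product with a block of ones
    n = sdm.shape[0]
    score = np.zeros(n)
    for i in range(half, n - half):
        window = sdm[i - half:i + half, i - half:i + half]
        score[i] = np.sum(kernel * window)          # correlate kernel with the local SDM block
    return score
```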

E. Repetition Detection

As an alternative to searching self-distance matrices for regions of homogeneity, we may also look for repeated sequences, which manifest themselves as stripes (diagonal lines off the main diagonal). This is the technique employed by Mauch et al. [26] in the algorithm used as a benchmark later in this paper. After constructing a self-distance matrix from beat-synchronous chroma features, candidate segments are identified by searching for stripes. Computation time is reduced by assuming a minimum segment length of 12 beats, and a maximum of 128. Only beats exhibiting a correlation above an empirically derived threshold value are considered as segment beginnings, and further refinement is achieved via the calculation of "likely bar beginnings": local maxima in the convolution of a function representing the likelihood of harmonic change with a kernel of spikes every 2 beats. Finally, a greedy algorithm is used to decide which of the candidate segments are true segments.

F. Hidden Markov Models

Hidden Markov Models (HMMs) enable us to determine the probability that a system is in a particular hidden state at a given time, given that we have observed some other variable and know the transition probabilities between states. They have been successfully applied to pattern recognition applications such as speech recognition (Rabiner [32] provides a good tutorial and a comprehensive list of further references in this field), whilst some authors have also used them as an alternative (or in some cases [30], in addition) to self-distance matrices, to reveal the structure of music. Aucouturier and Sandler [3] show that using HMMs can be beneficial when attempting to segment complex music such as classical; however, in the simpler case of rock/pop their use is unnecessary. Raphael [33] also uses HMMs to segment classical music, but at the lower level of individual notes and rests.

III. MULTITRACK AUDIO SEGMENTATION

This section describes an experiment which applies some of the techniques described in Section II to the task of identifying the temporal locations of structural boundaries in multitrack audio. We extend the audio feature extraction and self-similarity phases such that features are extracted separately from all of the source tracks present in a multitrack project, rather than from a single mono or stereo mixdown audio track as is usually the case.

A. Hypothesis

Commercially recorded music usually starts life in the studio as a multitrack recording before being mixed down to stereo during the production phase. There are exceptions; live performances, especially of classical or jazz music, might be recorded using 'ambient' rather than close miking techniques, for example. We concern ourselves here with the former type of recording (see [20] for a general overview of recording techniques). Typically, multitrack recordings will have somewhere between eight and 24 tracks, although there is no hard upper limit. Each track is usually a recording of a single instrument or voice, although in some cases (for example drum kits, string sections or choirs) there might intentionally be multiple sound sources recorded on to a single track, or unintentionally in other cases due to microphone 'bleed.' For the duration of a recorded piece of music, some of these individual sources might be producing little or no sound. The temporal changes in activity of individual instruments are potentially lost to a certain degree in the final mix, and our hypothesis is that having access to the multitrack version of a recording enables us to avoid this loss of relevant information by calculating features from all of the individual source tracks, rather than just the final mixdown, as is usually the case in this research area.

B. Experimental Method

Our method starts, as is common in segmentation tasks, by calculating audio features for frames of audio which are time-aligned to beats. Beat tracking algorithms are capable of analyzing single-channel (or stereo) audio files and producing lists of predicted temporal beat locations; in our case, though, we have multiple audio channels, so we perform a simple mixdown step first. In the absence of an "official" version of the final mix we simply sum the individual source tracks and normalize. We then use the beat tracking method described by Ellis [13] to find the beat locations within this simple mixdown audio file (Fig. 3). The mixdown file is used purely for finding beats; we now disregard it and return to consider the multitrack audio files.

Fig. 3. Extracting beats from a simple mix.
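A minimal sketch of this mixdown and beat-tracking step; librosa's dynamic-programming beat tracker is used here as a stand-in for Ellis' original implementation [13], and the loading and normalization details are assumptions.

```python
import numpy as np
import librosa

def beats_from_multitrack(track_paths, sr=44100):
    """Sum the source tracks, normalize, and beat-track the simple mixdown."""
    tracks = [librosa.load(p, sr=sr, mono=True)[0] for p in track_paths]
    n = min(len(t) for t in tracks)                 # align lengths defensively
    mix = np.sum([t[:n] for t in tracks], axis=0)
    mix /= np.max(np.abs(mix)) + 1e-12              # peak-normalize the mixdown
    _, beat_times = librosa.beat.beat_track(y=mix, sr=sr, units='time')
    return beat_times                               # beat locations in seconds
```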

As described in Section II, there are several different types of audio features we could choose to extract and analyze. At this stage we do not know which will produce the most effective results, although in the case of final mix audio several authors have carried out investigations in order to determine the relative merits of the different types of audio features with regard to segmentation. Paulus and Klapuri [28] conclude that either MFCC or chroma features alone work well; however, they worked with final mix audio.

Bruderer et al. [8] conclude that changes in timbre, changes in level, repetition, and breaks/pauses provide strong cues for the perception of structural boundaries in music. Timbre has been successfully modelled by Aucouturier and Sandler [4] using MFCCs in conjunction with Gaussian mixture models, whilst a simple measure of RMS energy will provide a measure of signal level, including breaks and pauses. Jensen et al. [21] reported good results when using the rhythmogram to segment popular Chinese music.

It is instructive to examine some example self-distance matrices derived using these features, in order to gain an intuitive grasp of how well each one is able to model the structural changes in some example songs. Fig. 4 shows the self-distance matrix images obtained using MFCC (left) and chroma (right) features, both for the same song, "Sunrise" by Shannon Hurley. Musically, the song is a fairly traditional sounding pop ballad, containing a typical verse/chorus/bridge structure, with strong melodic content and percussion. The MFCC-derived image shows both a very clear block structure at an appropriately large timescale, as well as very definite stripes (e.g. in the region between approximately 140 s and 180 s on both axes) indicating repetition of a sequence of timbral changes. It seems likely in this case that further analysis could provide us with plausible segment boundaries. Conversely, whilst the same block structure is again evident to a certain degree in the chroma-derived image, the detail is less well-defined; the image has an overall fuzziness, possibly implying that further analysis will be less likely to reveal the structure we're seeking (albeit the stripes are still strongly evident). By way of comparison, Fig. 5 shows the RMS energy-derived (left) and chroma-derived (right) self-distance matrix images for the song "Hyperpower" by Nine Inch Nails. This song is overwhelmingly defined by changes in timbre and dynamics, on both small and large timescales, with little to no melodic variation. Accordingly, the RMS energy-derived image exhibits a particularly clear block structure on a large timescale (approximately 20 s), reflecting the obvious level differences between each structural segment, whilst the chroma-derived image is slightly less well defined on the larger timescale, but does show good definition over smaller time periods (approximately 1 s).

We surmise from the similarities (and dissimilarities) evident in Figs. 4 and 5 that all three of these audio features provide relevant, yet different, insights into the structure of a piece of music. For completeness, we also add the rhythmogram feature to this list. We will run multiple experiments using different weightings of these four features in order to determine the most effective combination.

These audio features are calculated for each (beat-aligned) frame of audio from each individual track. In each case the RMS energy feature is a single number, the MFCC feature is a thirteen-element vector (each frame is characterised by 13 cepstral coefficients), the chroma feature is a twelve-element vector (each element indicating a measure of the level of one of the 12 chromatic pitches present in that audio frame), and the rhythmogram feature is a two-hundred-element vector, representing the rhythmic change at 10 ms steps over a 2 s window.

For MFCC feature calculation, we use Ellis' matlab function1 with a 25 ms window length, 10 ms hop time, mel filter band edges at 0 Hz and 20000 Hz, 40 warped spectral bands, a type-2 discrete cosine transform, and no liftering, preemphasis filter, or dither.

For chromagram feature calculation, we use the LabROSA coversongID2 matlab code with a 4096-point FFT length.

1 http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/mfccs.html
2 http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/chromagram_IF.m


Fig. 4. Self-distance matrix images for “Sunrise” by Shannon Hurley.

Fig. 5. Self-distance matrix images for “Hyperpower” by Nine Inch Nails.

The perceptual spectral flux component of the rhythmogram feature is calculated using a step size of 10 ms and a block size of 46 ms. An equal loudness contour with a phon value of 60 was found empirically to work best for the frequency weighting W. The rhythmogram itself is then calculated using a 10 ms step size and a summing window length of 50 frames, before being beat-synchronised.

For each frame, we stack all of these features into a single vector. For an 8-track case this vector would be 1808 elements long: 13 MFCC values, 12 chroma values, 200 rhythmogram values and one RMS energy value for each of the eight tracks. In the case where we are analyzing the final mix only, the feature vector would be 226 elements long. We then apply one of the feature weighting combinations under investigation to these vectors; an example would be to multiply the MFCC features by 100, the rhythmogram features by 10, the RMS feature by 10 and leave the chroma feature untouched. The full set of weightings under consideration is formed by varying the relative weights of each feature by 1, 10, and 100 with respect to every other feature, resulting in 65 different configurations, and these weightings are applied after the individual features have been normalized using the z-score.

Fig. 6. Stacking the feature vectors.

The resulting collections of vectors calculated for all audio frames then form the sets of input data for our self-distance matrix calculations. Fig. 6 illustrates the technique for a hypothetical situation in which there are three audio tracks, each producing a four-element feature vector, and for one particular weighting combination.
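A sketch of the stacking, z-score normalization and weighting described above; the dictionary-based layout and the feature names are illustrative only, not the authors' code.

```python
import numpy as np

def stack_weighted_features(per_track_features, weights):
    """Build one stacked vector per beat-aligned frame from per-track features.

    per_track_features: list (one entry per track) of dicts mapping a feature
    name ('mfcc', 'chroma', 'rhythmogram', 'rms') to an (n_frames, dim) array.
    weights: dict mapping the same names to scalar weights (e.g. 1, 10, 100).
    """
    blocks = []
    for track in per_track_features:
        for name, feat in track.items():
            mu = feat.mean(axis=0)
            sigma = feat.std(axis=0) + 1e-12
            z = (feat - mu) / sigma              # z-score each dimension first
            blocks.append(weights[name] * z)     # then apply the feature weighting
    return np.hstack(blocks)                     # (n_frames, total_dim) stacked matrix
```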

The self-distance matrix is calculated from these feature vectors using (4), and we then calculate the standard novelty score (an example is shown in Fig. 2) using (7), with a kernel size of 32. As stated in Section II-D, we evaluate two different methods of picking peaks from the novelty score. In the first method, our segment boundary locations are simply selected as the locations of all those peaks in the novelty score whose height exceeds some scaled factor of the average peak height for each song; the precise value of this scaling factor must be learnt, and we try factors of 0.5, 0.75, 1, 1.25, 1.5, 1.75 and 2 times the average peak height. Hence, using this method, we obtain seven different candidate sets of segment boundaries for every song. There is certainly scope to improve on this simple method though, and consequently our second method (Brennan's [9], described in Appendix A) is more complex, utilizing low-pass filtering in order to remove the theoretically irrelevant small-scale peaks from the novelty score. This method produces three candidate sets of segment boundaries. We present the results of both methods in order to demonstrate the potential advantage of fine-tuning the peak-picking method; however, we do not claim that either method is optimal. Rather, we concentrate on demonstrating that in either case, there is an advantage to be gained from using multitrack rather than mixed audio.
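The first (simple) peak-picking method can be sketched as follows, with scipy's find_peaks standing in for a generic local-maximum detector.

```python
import numpy as np
from scipy.signal import find_peaks

def simple_boundaries(novelty, beat_times, scale=1.25):
    """Keep novelty peaks above scale x mean peak height.

    novelty: one novelty value per beat-aligned frame.
    beat_times: time in seconds of each beat-aligned frame.
    scale: one of {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2}.
    """
    peaks, _ = find_peaks(novelty)
    if len(peaks) == 0:
        return np.array([])
    threshold = scale * novelty[peaks].mean()
    keep = peaks[novelty[peaks] > threshold]
    return beat_times[keep]          # beat-aligned frames map straight to times
```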

As a final step, given the knowledge that the Structural Analysis of Large Amounts of Music Information (SALAMI) annotation guidelines (see Section III-C) dictate that the very start and end of the song are marked as boundaries (even if they consist of periods of silence), we check whether or not the algorithm has picked out these locations and, if not, we add boundaries there.

C. Evaluation

In order to evaluate any music segmentation technique, a test set of human-annotated musical audio is required. Several already exist3,4 for fully mixed audio, but no ground truth annotations for multitrack audio projects exist. To that end, we have created a publicly accessible5 test set consisting of 104 multitrack pop and rock songs (see Appendix B for details of the source material), annotated with 3119 ground truth structural segmentation boundaries according to the guidelines set out in the SALAMI6 project. In brief, these guidelines describe conventions for annotating high-level musical structures such as intro, chorus and verse, as well as mid-level structures such as melodic phrases or chord progressions spread over a small number of bars. It must be noted, however, that the accuracy of any particular segmentation is subjective to some degree; Peeters and Deruty [31] offer a good discussion regarding the robustness of segmentation evaluation techniques, whilst Bruderer et al. [8] demonstrate that despite the subjective nature of music segmentation, there is a correlation between the number of subjects identifying a particular boundary and the level of salience attached to it.

3 http://www.cs.tut.fi/sgn/arg/paulus/beatles_sections_TUT.zip
4 http://www.ifs.tuwien.ac.at/mir/audiosegmentation/dl/ep_groundtruth_excl_Paulus.zip
5 http://www.eecs.qmul.ac.uk/~stevenh/multi_seg.html
6 SALAMI guidelines: http://salami.music.mcgill.ca/wp-content/uploads/2010/12/SALAMI-Annotator-Guide.pdf

Our experimental results consist of either seven (in the case of the simple novelty score peak-picking method) or three (for Brennan's [9] method) alternative segmentations for every feature weight configuration applied to every song.

We employ n-fold (where $n = 4$ in our case) cross-validation to determine the pairwise F-value, precision and recall of every set of training songs, and for each feature weighting and segmentation level configuration. Pairwise F-values are calculated as described by Levy and Sandler [24], by comparing our experimentally derived segment boundaries against the ground truth data using

$$P = \frac{|E \cap A|}{|E|} \qquad (8)$$

$$R = \frac{|E \cap A|}{|A|} \qquad (9)$$

$$F = \frac{2PR}{P + R} \qquad (10)$$

where $P$ is pairwise precision, $R$ is pairwise recall, $F$ is pairwise F-value, $E$ is the set of segment boundaries identified experimentally and $A$ is the set of segment boundaries identified by a human annotator.
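A sketch of how (8)-(10) can be computed for boundary sets, using the tolerance windows described later in this section and a greedy one-to-one matching (the matching strategy is an assumption).

```python
import numpy as np

def boundary_prf(estimated, annotated, tolerance=0.5):
    """Precision, recall and F-value for boundary sets, in the spirit of (8)-(10).

    A hit is counted when an estimated boundary lies within +/- tolerance seconds
    of an unmatched annotated boundary (0.5 s and 1.5 s correspond to the 1 s and
    3 s windows used for evaluation).
    """
    estimated = np.sort(np.asarray(estimated, dtype=float))
    remaining = list(np.sort(np.asarray(annotated, dtype=float)))
    hits = 0
    for b in estimated:
        match = next((a for a in remaining if abs(a - b) <= tolerance), None)
        if match is not None:
            remaining.remove(match)      # each ground-truth boundary matches once
            hits += 1
    precision = hits / len(estimated) if len(estimated) else 0.0
    recall = hits / (hits + len(remaining)) if (hits + len(remaining)) else 0.0
    f_value = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_value
```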

This allows us to select an optimum feature weight and segmentation level configuration for each training song set. We then take the test song segments derived using the optimum parameter configurations for the corresponding training group set, group them into one collection of segments (i.e. as if all 104 songs formed one long song), and calculate the final F-value, precision and recall values by comparison with the ground truth data. The optimum weight and segmentation level configurations are deduced by taking the average of the optimum training set configurations.

This process is applied to:
1) Segments derived using our method applied to the multitrack data;
2) Segments derived using our method applied to the single-channel mixed data.
F-values are also calculated from the segments derived using Mauch et al.'s state-of-the-art7 method [26] applied to the single-channel mixed data.

7 Ranked first in the 2009 MIREX Music Structure Segmentation Task.

When comparing segment boundaries against ground truth boundaries, tolerance ranges of 1 s (±0.5 s) and 3 s (±1.5 s) were used, the same as in the 2009 MIREX Music Structure Segmentation Task.

IV. RESULTS

TABLE I
SEGMENT BOUNDARY PAIRWISE COMPARISONS WITH GROUND TRUTH DATA

The pairwise F-values for the complete set of segments obtained from the 104-song test set, obtained using the three different methods (Mauch applied to mixdowns, Hargreaves applied to mixdowns, and Hargreaves applied to multitracks), and for both peak-picking methods (simple and Brennan's), are shown in Table I. The F-value figures achieved when analyzing full multitrack data show a significant improvement when compared to the results for mixed data, regardless of whether our own or Mauch's algorithm is used. The greatest improvement is achieved using our method together with Brennan's peak-picking algorithm. Fig. 7 shows the corresponding optimum feature weight configurations used to generate the F-values in Table I (these values are obtained by averaging the four different optimum configurations found during four-fold cross-validation). For clarity, we omit the configurations derived when using the simple peak-picking method from this figure. In all cases, the rhythmogram feature offers very little benefit to the segmentation; this result is consistent with earlier findings by Paulus and Klapuri [28]. When full use of the multitrack data is being made (the two leftmost sets of results), an equal weighting of chroma, RMS energy and MFCCs is found to be optimal, whilst when fully mixed data is used (the two rightmost sets of results) the importance of the RMS energy and chroma features declines to certain extents depending upon which tolerance level is under examination. It is important to note, though, that for reasons of tractability we limited ourselves to relative weightings of 1, 10 and 100; if more combinations had been tested it is less likely that any optimum weightings would have been exactly equal. Additionally, when using Brennan's peak-picking algorithm, the second, or mid, level of segmentation produced was found universally to be optimum. In the cases where we used the simple peak-picking method, the optimum peak height threshold ranged from 1 to 1.5 times the average peak height.

Fig. 7. Optimum feature weights.

A. Discussion

The most important results (shown in bold) in Table I demonstrate the significant improvement in F-value, precision and recall achieved when we make full use of the multitrack data as opposed to just the final mix (the F-value rises from 0.30 for full mix data at a 1 s tolerance, to 0.38 for multitrack data when we use Brennan's peak-picking algorithm, and likewise from 0.53 to 0.6 at a 3 s tolerance). This result strongly supports our hypothesis that having access to multitrack data enables an increase in segmentation accuracy. A slightly surprising result is that Mauch's algorithm only achieves similar or worse F-values to our own, relatively simple, algorithm when it too is applied to the full mixes. It is worth noting, though, that Mauch's algorithm has both the highest precision and the lowest recall values of all the methods, indicating that although a lot of true segment boundaries were missed altogether, those that were produced were very accurate. This, together with the fact that our algorithm achieved the best results when we used the second (mid) level of segmentation, perhaps implies that typically there are higher numbers of segment boundaries present in our ground truth data than in that used for the MIREX 2009 segmentation task. Indeed, closer inspection of both sets of ground truths reveals that, on average, each MIREX ground truth song contains 11.2 segment boundaries whilst our SALAMI-style ground truths contain 30. The MIREX ground truth data was not, as far as we can ascertain, produced according to the SALAMI guidelines. This goes some way to explaining the lower F-values achieved using Mauch's algorithm, whilst not invalidating the improvement observed when applying our own algorithm to multitrack rather than mixed audio.

Our algorithm produced optimum results when analyzing full multitrack data by using equal weightings of chroma, RMS energy and MFCC features (compared to the dominance of MFCCs when analyzing fully mixed songs); this result lends strong support to the idea that the relevance of certain features depends upon the nature of the instrumentation present in the songs. An interesting direction for further work would be to optimize the choice of feature used on each individual audio track (stem) according to the instrument recorded. Additionally, the particular method of segment boundary picking used here is designed to search for regions of homogeneity; however, repetition is also an important cue in structural segmentation. A further refinement of the experiment would be to incorporate a repetition-based method; several possibilities are listed by Paulus et al. [29].

It was not entirely surprising that the rhythmogram feature scored so poorly in the optimum feature weight configurations. This feature is calculated over a relatively large time window (2 s), resulting in poor temporal accuracy, whereas all other features are calculated on a frame-by-frame basis.

As an aside, an interesting observation was made during some early tests when our dataset was much smaller (approximately 20 songs). One song, which happened to be a contemporary, relatively experimental piece based mainly upon the presence or absence of subtle layers of instruments and loops, had to be taken out of the test dataset for copyright reasons. It was replaced by a far more traditional pop/rock song from the late sixties, and the effect was a degradation of the F-values achieved using the multitrack data, and concurrently an improvement in those obtained from the final mixes. Whilst this is certainly not a robust result, it is interesting in that it suggests that certain genres of music which don't follow the traditional verse/chorus pattern are more easily segmented when we have access to the multitrack recordings, in which subtle changes in instrumentation are more amenable to analysis.

V. CONCLUSION

Many methods exist for determining the high-level structure of fully mixed musical audio. Inevitably, all of these methods need to extract relevant musical cues from the ensemble of instruments present in most recordings. We have shown that there is a quantifiable and significant advantage to be gained, when segmenting music, by exploiting the source-separated versions of audio recordings if they are available as multitrack projects. We have used a relatively simple algorithm based upon four audio features, self-distance matrices and homogeneity detection, which pays no attention to the particular type of instrument present in each source track, and we predict that even greater segmentation accuracy and/or reduced computational complexity could be achieved by selecting audio features according to the instrumentation or musical function of each track.

It has been implicitly assumed so far that, at the point of analysis, all the source audio tracks required prior to producing the final mixdown are present. However, at intermediate stages of the recording process, only a subset of these tracks will exist. An interesting direction for future work would therefore be to determine how accurately we are able to segment incomplete subsets of multitrack projects. Answering this question will also enable us to establish whether a single track, or a minimal number of tracks of certain instrument combinations, is sufficient for the derivation of an accurate segmentation, without needing to perform analysis of tracks which offer little or no new information. Given that DAW multitrack projects frequently contain around 8, 16 or 24 tracks, this would have the added benefit of greatly reducing the computational complexity of our segmentation algorithm.

In addition to high-level verse/chorus type segmentations, we should also expect to be able to achieve lower (i.e. bar and beat) level segmentations. Possible ways to achieve this might be by analysis of the sub-structure of self-distance matrices (i.e. recursively analyzing an SDM after first identifying high-level boundaries) or by using existing beat tracking or transcription algorithms, for example.

In this paper we have only discussed methods of locating segment boundaries; however, we do not have to limit ourselves to this narrow goal. To date, and to the authors' knowledge, all MIR research, be that structural segmentation, genre/artist/mood classification, music similarity measurement, onset/key detection, cover song identification, or chord/melody extraction, has been undertaken using either fully mixed music or single instrument recordings. The potential advantages offered by early capture of a more accurate and rich set of metadata from multitrack sources in the studio are vast, and, because the metadata need not stay tightly bound to the commercial audio recording, are not limited to the improvement of studio editing tools. Technologies such as the World Wide Web Consortium (W3C) Resource Description Framework (RDF) metadata model are already being used to enhance on-line artist information websites such as that provided by the BBC8, and the publication of enhanced metadata alongside commercial audio recordings would only increase their versatility. Instead of concentrating on complex signal processing forms of 'reverse engineering' fully mixed audio, the MIR community may instead concentrate on exploiting an already present, easy to query and potentially vast amount of metadata via logical inferencing.

8 http://www.bbc.co.uk/music

APPENDIX A

BRENNAN’S PEAK-PICKING METHOD

An alternative method of picking peaks from a novelty score, proposed by Brennan [9], consists of the following steps (a brief code sketch follows the list):

1) Calculate the standard novelty score using (7), with a kernel size of 32.

2) Generate 6 more novelty scores with increasing degrees of smoothness, using zero-phase versions of 6th-order low-pass Butterworth filters having normalized (with respect to half the sample rate) cutoff frequencies of 0.5, 0.25, 0.125, 0.1, 0.075 and 0.05. We will call the resulting novelty scores $N_0, N_1, \ldots, N_6$, where $N_0$ is the original, unfiltered novelty score, $N_1$ has a normalized cutoff frequency of 0.5, and $N_6$ has a normalized cutoff frequency of 0.05.

3) Despite using zero-phase filters, the low-pass filtered peaks might span several higher-frequency peaks in the unfiltered novelty score. Compensate for this peak-smearing by comparing the peak locations of each novelty score with those of the novelty score two numbers lower (i.e. compare $N_6$ to $N_4$, $N_5$ to $N_3$, etc.; in the case of $N_1$, compare to the original novelty score $N_0$). To make each pairwise comparison, we use Dixon's [12] peak-picking method to locate the peaks of both novelty scores, then search six beats either side of each peak in the smoother of the two novelty scores (i.e. the one filtered with the lower cutoff frequency) for a matching peak in the other novelty score. This gives us seven alternative sets of peaks (including the peaks from the original unfiltered novelty score).

4) Further refine the peak locations by comparing each set with the peaks from the original novelty score (again, by searching six beats either side of each peak for a match in the original set).

5) For each peak in the original novelty score, count how many of the other six sets of peaks also contain the same peak.

6) Peaks appearing in six or more sets constitute the highest-level set of segment boundary temporal locations, those appearing four times or more make up the mid-level segment boundaries, and peaks appearing in at least two of the sets go to make up the lowest level of segment boundaries.
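A condensed sketch of the procedure above; scipy's Butterworth design and find_peaks stand in for the zero-phase filters and for Dixon's [12] peak picker, and the vote-counting convention (the original score counts as one of the seven sets) is an interpretation of steps 5 and 6.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def brennan_levels(novelty, cutoffs=(0.5, 0.25, 0.125, 0.1, 0.075, 0.05), search=6):
    """Return (high, mid, low) boundary index sets from a novelty score."""
    # Steps 1-2: the original score plus six zero-phase low-pass filtered versions.
    scores = [np.asarray(novelty, dtype=float)]
    for wn in cutoffs:
        b, a = butter(6, wn)                       # 6th-order Butterworth, normalized cutoff
        scores.append(filtfilt(b, a, scores[0]))   # filtfilt gives the zero-phase version

    def peak_idx(x):
        p, _ = find_peaks(x)                       # stand-in for Dixon's peak picker
        return p

    # Step 3: compensate for peak smearing by matching each smoothed score's peaks
    # against the score two steps less smoothed (or the original score).
    matched = [peak_idx(scores[0])]
    for k in range(1, len(scores)):
        ref_peaks = peak_idx(scores[k - 2] if k >= 2 else scores[0])
        kept = [min(ref_peaks, key=lambda r: abs(r - p))
                for p in peak_idx(scores[k])
                if len(ref_peaks) and np.min(np.abs(ref_peaks - p)) <= search]
        matched.append(np.unique(kept))

    # Steps 4-5: count, for each original peak, how many of the other six sets
    # contain a peak within the same +/- search-beat window.
    original = matched[0]
    votes = np.zeros(len(original), dtype=int)
    for k in range(1, len(matched)):
        for i, p in enumerate(original):
            if len(matched[k]) and np.min(np.abs(matched[k] - p)) <= search:
                votes[i] += 1

    # Step 6: three segmentation levels; each original peak trivially appears in
    # its own set, so the total set count is votes + 1.
    total = votes + 1
    return original[total >= 6], original[total >= 4], original[total >= 2]
```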

APPENDIX B

MULTITRACK AUDIO TEST SET DETAILS

The multitrack audio projects which form our test set come from a number of sources: the Creative Commons 'ccMixter' website9, artist websites, donations from friends and colleagues, and a commercial karaoke song website10 providing individual audio tracks, including vocals, from cover versions of popular western songs. The ground truth annotations which accompany them took approximately 280 man-hours to create. All of the annotations and audio files, apart from the commercial karaoke material (which instead may be purchased), are available for download from http://www.eecs.qmul.ac.uk/~stevenh/multi_seg.html, and the authors would like to encourage researchers to make use of them. The multitrack test set consists of 104 songs, with a total of 3119 ground truth structural segmentation boundaries, an average of 9 tracks per song (minimum 4, maximum 17), and an average song duration of 3 min 56 s (minimum 1 min 36 s, maximum 10 min 3 s).

ACKNOWLEDGMENT

The authors would like to thank Queen Mary undergraduates Richard Flanagan and Sina Hafezi for their work annotating the multitrack testbed.

9 http://ccmixter.org/
10 http://www.karaoke-version.co.uk/

REFERENCES

[1] S. Abdallah, K. Noland, M. Sandler, M. Casey, and C. Rhodes, "Theory and evaluation of a Bayesian music structure extractor," in Proc. Int. Conf. Music Inf. Retrieval (ISMIR), 2005, Citeseer.

[2] USA Standard Acoustical Terminology, American National Standards Institute, Tech. Rep. S1.1-1960, 1960.

[3] J. Aucouturier and M. Sandler, "Segmentation of musical signals using hidden Markov models," Preprints-Audio Eng. Soc., 2001.

[4] J. Aucouturier, F. Pachet, and M. Sandler, "“The way it sounds”: Timbre models for analysis and retrieval of music signals," IEEE Trans. Multimedia, vol. 7, no. 6, pp. 1028–1035, Dec. 2005.

[5] L. Barrington, A. Chan, and G. Lanckriet, "Modeling music as a dynamic texture," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 602–612, Mar. 2010.

[6] M. Bartsch and G. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Trans. Multimedia, vol. 7, no. 1, pp. 96–104, Feb. 2005.

[7] A. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press, 1994.

[8] M. Bruderer, M. McKinney, and A. Kohlrausch, "Structural boundary perception in popular music," in Proc. 7th Int. Conf. Music Inf. Retrieval, pp. 198–201, Citeseer.

[9] T. Brennan, "Music Structure Analysis," M.S. thesis, Queen Mary Univ. of London, London, U.K., 2010.

[10] M. Bruderer, "Perception and modeling of segment boundaries in popular music," Ph.D. dissertation, School of Electron. Eng. and Comput. Sci., Queen Mary Univ. of London, London, U.K., 2008.

[11] W. Chai and B. Vercoe, "Music thumbnailing via structural analysis," in Proc. Eleventh ACM Int. Conf. Multimedia, New York, 2003, pp. 223–226.

[12] S. Dixon, "Onset detection revisited," in Proc. 9th Int. Conf. Digital Audio Effects, 2006, pp. 133–137.

[13] D. Ellis, "Beat tracking by dynamic programming," J. New Music Res., vol. 36, no. 1, pp. 51–60, 2007.

[14] D. Ellis and G. Poliner, "Identifying 'cover songs' with chroma features and dynamic programming beat tracking," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'07), 2007, vol. 4.

[15] G. Fazekas and M. Sandler, "Structural decomposition of recorded vocal performances and its application to intelligent audio editing," in Proc. 123rd AES Conv., 2007.

[16] G. Fazekas and M. Sandler, "Intelligent editing of studio recordings with the help of automatic music structure extraction," in Proc. 122nd AES Conv., 2007.

[17] G. Fazekas, Y. Raimond, and M. Sandler, "A framework for producing rich musical metadata in creative music production," in Proc. AES 125th Conv., San Francisco, CA, 2008.

[18] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proc. IEEE Int. Conf. Multimedia and Expo, 2000, vol. 1, pp. 452–455.

[19] M. Goto, "A chorus section detection method for musical audio signals and its application to a music listening station," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1783–1794, Sep. 2006.

[20] D. Huber and R. Runstein, Modern Recording Techniques. Waltham, MA: Focal, 2005.

[21] K. Jensen, J. Xu, and M. Zachariasen, "Rhythm-based segmentation of popular Chinese music," in Proc. ISMIR, 2005, Citeseer.

[22] K. Jensen, "Multiple scale music segmentation using rhythm, timbre, and harmony," EURASIP J. Appl. Signal Process., vol. 2007, no. 1, p. 159, 2007.

[23] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. New York: Springer-Verlag, 2006.

[24] M. Levy and M. Sandler, "Structural segmentation of musical audio by constrained clustering," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2, pp. 318–326, Feb. 2008.

[25] M. Levy, M. Sandler, and M. Casey, "Extraction of high-level musical structure from audio data and its application to thumbnail generation," in Proc. ICASSP, 2006, pp. 1433–1436, Citeseer.

[26] M. Mauch, K. Noland, and S. Dixon, "Using musical structure to enhance automatic chord transcription," in Proc. ISMIR, 2009, pp. 231–236.

[27] M. F. McKinney, D. Moelants, M. E. P. Davies, and A. Klapuri, "Evaluation of audio beat tracking and music tempo extraction algorithms," J. New Music Res., vol. 36, no. 1, pp. 1–16, 2007.

[28] J. Paulus and A. Klapuri, "Acoustic features for music piece structure analysis," in Proc. 11th Int. Conf. Digital Audio Effects, pp. 309–312, Citeseer.

[29] J. Paulus, M. Müller, and A. Klapuri, "Audio-based music structure analysis," in Proc. 11th Int. Conf. Music Inf. Retrieval, 2010.

[30] G. Peeters, A. La Burthe, and X. Rodet, "Toward automatic music audio summary generation from signal analysis," in Proc. Int. Conf. Music Inf. Retrieval, 2002, pp. 94–100, Citeseer.

[31] G. Peeters and E. Deruty, "Is music structure annotation multi-dimensional? A proposal for robust local music annotation," in Proc. 3rd Workshop Learn. the Semantics of Audio Signals, pp. 75–90, Citeseer.

[32] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[33] C. Raphael, "Automatic segmentation of acoustic musical signals using hidden Markov models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 360–370, Apr. 1999.

[34] S. Ravuri and D. Ellis, "Cover song detection: From high scores to general classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'10), 2010, pp. 65–68.

[35] R. Shepard, "Circularity in judgments of relative pitch," J. Acoust. Soc. Amer., vol. 36, no. 12, pp. 2346–2353, 1964.

[36] G. Tzanetakis and P. Cook, "Multifeature audio segmentation for browsing and annotation," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., 1999, pp. 103–106, Citeseer.

[37] J. Wellhausen and M. Hoeynck, "Audio thumbnailing using MPEG-7 low level audio descriptors," in Proc. ITCom'03, Citeseer.

Steven Hargreaves (S'11) was born in 1970. He received the B.Sc. degree in Applicable Mathematics from Manchester Metropolitan University, Manchester, U.K., in 1993, the M.Sc. degree in Computer Science from the University of Edinburgh, Edinburgh, U.K., in 1998, and the M.Sc. degree in Audio Acoustics from the University of Salford, Salford, U.K., in 2005. He has spent several years working as both a software and an acoustic engineer, and is currently pursuing the Ph.D. degree in the Centre for Digital Music at Queen Mary University of London, London, U.K. His research interests are music information retrieval and machine learning, with a particular focus on multitrack audio.

Anssi Klapuri (M'06) received the M.Sc. and Ph.D. degrees from the Tampere University of Technology (TUT), Tampere, Finland, in 1998 and 2004, respectively. In 2005, he spent six months at the Ecole Centrale de Lille, Lille, France, working on music signal processing. In 2006, he spent three months visiting the Signal Processing Laboratory, Cambridge University, Cambridge, U.K. He is currently Chief Algorithm Developer at Ovelin Ltd and a Lecturer in Sound and Music Processing at the Centre for Digital Music at Queen Mary University of London, London, U.K. His research interests include audio signal processing, auditory modeling, and machine learning.

Mark Sandler (SM'98) was born in 1955. He received the B.Sc. and Ph.D. degrees from the University of Essex, Essex, U.K., in 1978 and 1984, respectively. He is a Professor of Signal Processing at Queen Mary University of London, London, U.K., and Head of the School of Electronic Engineering and Computer Science. He has published over 350 papers in journals and conferences. Prof. Sandler is a Fellow of the Institute of Electronic Engineers (IEE) and a Fellow of the Audio Engineering Society. He is a two-time recipient of the IEE A. H. Reeves Premium Prize.