
Journal of New Music Research

Event-based Multitrack Alignment using a Probabilistic Framework

A. Robertson and M. D. Plumbley
Centre for Digital Music, School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK. Email address: [email protected]

Draft of January 15, 2015

This paper presents a Bayesian probabilistic framework for real-time alignment of a recording or score with a live performance using an event-based approach. Multitrack audio files are processed using existing onset detection and harmonic analysis algorithms to create a representation of a musical performance as a sequence of time-stamped events. We propose the use of distributions for the position and relative speed which are sequentially updated in real-time according to Bayes' theorem. We develop the methodology for this approach by describing its application in the case of matching a single MIDI track and then extend this to the case of multitrack recordings. An evaluation is presented that contrasts our multitrack alignment method with state-of-the-art alignment techniques.

Introduction

The studio environment offers musicians the ability to use artificial devices such as overdubbing, editing and sequencing in order to create a recording of a musical piece. However, when they then come to perform these pieces live, such methods cannot be used. Musicians then either create an alternative arrangement that is more suited to a live rendition or they make use of backing tracks to play some of the studio parts. At present, when bands make use of this second option, the backing tracks are unresponsive to the timing variations of live performers, thereby forcing the musicians to follow the timing of the backing through use of a click track.

Automatic accompaniment is the problem of real-time scheduling of events within a live musical performance without such constraints as the use of click tracks. Applications include audio synchronisation, such as the case described above where musicians require additional parts that have been overdubbed in a studio recording to play automatically during live performances, and video and lighting synchronisation, where visual aspects of the show might have been programmed relative to a rehearsed version. In both cases, an automatic accompaniment system would be expected to synchronize sufficiently accurately with the performers so that any scheduled accompaniment, either audio or visual, is perceptually 'in time'.

Thanks to the Royal Academy of Engineering and the EPSRC for funding this research. Thanks to Sebastian Ewert for assisting in the evaluation study and to Simon Dixon for advice on the methodology.

In the studio, it is common to record instruments separately using a dedicated microphone on each instrument channel. These individual recordings collectively constitute the multitrack, so that audio tracks for each instrument are available. There has been increasing use of multitracks both in commercial games such as Rock Band, where players attempt to 'play' each part in time with the song, and in album releases allowing others to create their own remix.


Techniques for the automatic mixing of multitracks have been proposed (Reiss, 2011) which choose parameters for equalization and level with the aim of creating a professional quality stereo mix. Intelligent audio editing (Dannenberg, 2007) analyses a set of multitracks using a machine readable score to identify individual notes and help the editing process. In this paper, we examine how multitracks might be used for automatic accompaniment using a probabilistic framework. First we shall look at some of the existing methods for automatic accompaniment before examining how to go about designing a multitrack-based system for rock and pop music.

Score Following Systems

In the classical domain, this task has received considerable attention, where it is often presented in the context of score following (Orio et al., 2003), the problem of aligning a performer's rendition to their location in the score. Score following systems were introduced independently at the 1984 ICMC (Dannenberg, 1984; Vercoe, 1984). These first systems used a symbolic representation of the input and made use of string matching to compare the live stream with the score. Symbolic-based matching required human supervision and commonly experienced difficulties when faced with complex events such as trills, tremolos and repeated notes (Puckette, 1992). Audio transcription and symbolic-based matching using hashing has been used to retrieve the corresponding piece and score position from a database of scores (Arzt et al., 2012).

A probabilistic method for tracking a vocal performance was introduced by Grubb and Dannenberg (1997) in which the performer's location is modeled as a probability distribution over the score. This distribution is then updated on the basis of new observations from a pitch detector. The probability that the performer is between two locations is then given by integrating the function between these two points, making explicit the uncertainty for any given alignment.

An alternative probabilistic approach is the use of graphical models, which have been employed in various forms. The hidden Markov model (HMM), successfully used in many sequential analysis tasks such as speech recognition (Rabiner, 1989), was used by Raphael (1999) and Orio and Dechelle (2001). In both formulations, a two-level HMM is employed. One HMM level models the higher-level sequence of score events such as notes, trills and rests, and the other models lower-level audio features that are observed during each event, such as attack, sustain and rest. The HMM thus gives rise to a probability distribution over all the hidden states which constitute the model of the score. The Antescofo system (Cont, 2008) also makes use of Markovian techniques within its real-time alignment system and augments this with a tempo agent that enables the integration of predictive scheduling of electronic parts within the composition process (Cont, 2011). Joder et al. (2011) propose the use of the Conditional Random Field (CRF), a graphical model structure that generalises Bayesian Networks by removing the assumption of conditional independence between observations and neighbouring hidden states. For labelling tasks, an HMM can be seen as a particular case of a CRF. A probabilistic framework using a score pointer with states identified at the level of the tatum (typically divisions of eighth or sixteenth notes) is used by Peeling et al. (2007).

One difficulty when designing such systems is incorporating a temporal model that accounts for the fact that we expect notes to last for a given duration. Raphael (2006) has investigated the use of hybrid graphical models in which both the score location and tempo are modeled as two random variables. Antescofo has integrated semi-Markov models into its design, in which label durations are explicitly modeled. The system is reactive, allowing a high degree of flexibility to timing changes, but by modeling the current tempo, accompaniment parts can be sequenced to happen in time with anticipated events.

Otsuka et al. (2010) propose a method using a particle filter where each particle has a score position and tempo. At a fixed time step, a prediction stage updates the score positions for all particles, then an update routine ascribes a measure to each particle according to how well it matches recent observations. This iterative process allows many hypotheses to be followed in parallel.


Montecchio and Cont (2011) investigate the ability of a particle filter to adapt to gradual and sudden tempo change. Duan and Pardo (2011) examine the use of particle filtering for score alignment using both pitch and chroma features. The methodology presented in this paper also has similarities with particle filter approaches, as we employ distributions for both position and tempo and make use of prediction and update routines. An important difference is that we represent the probability distributions at a fine level of discretisation (typically 1 msec for the score position) and there is no re-sampling step required.

Cemgil et al. (2001) formulate tempo tracking in a Bayesian framework using the Kalman filter (Kalman, 1960), an efficient recursive filter used for estimating the internal state of a linear dynamic system from a series of noisy measurements. The filtering process uses two stages: prediction, in which the system's model is used to create a prediction from the last state estimate, and an update stage in which the prediction is used in combination with observation to create the new estimated state. Our proposed method also employs prediction and update steps recursively.

Audio Synchronisation

Rather than align the live audio to a representation of the score, an alternative approach to score following is to initially convert the score into audio using a MIDI synthesizer and then align the two audio streams (Dannenberg, 2005; Arzt et al., 2008). Dynamic Time Warping (DTW) is commonly used to find the optimal alignment between two sequences of audio features (Hu et al., 2003; Dixon, 2005; Ewert et al., 2009). The Match Toolbox (Dixon, 2005) is an online algorithm which reduces the computation time by only calculating the similarity matrix for a limited bound around the current best path.

Alignment accuracy is critical for some applications of synchronisation such as automatic accompaniment. Muller (2007) proposes an offline onset-based score-audio synchronisation method in which pitched onset events in the audio are first aligned to a score with a coarse resolution using DTW, and then a subsequent process aligns individual notes.

Similarly, Niedermeyer and Widmer (2010) improve the resolution of the DTW method using a multi-pass approach. Firstly, note onset events are identified using a coarse chroma-based alignment; those with the highest confidence are chosen to act as note anchors and the alignment path is re-estimated. Performance statistics suggest that for solo piano music, approximately 90% of notes are aligned within 50 msec. Arzt and Widmer (2010) introduce the use of simple tempo models to improve accuracy when using synchronisation methods.

In this paper, we introduce the use of multitracks for the purpose of audio synchronisation. This enables reliable traditional onset detection and pitch detection on individual instrument channels to create a list of events, each consisting of the event time and an associated feature such as a pitch or chroma vector. This event list is then used to perform matching to the event list derived from the recorded audio, referred to as the score. We assume that both the reference audio and the performance are available as multitrack audio stems comprising the same number and type of tracks.

We use a probabilistic framework in order to match these higher-level audio events. This is an alternative to utilizing lower-level features and matching via a graphical model formulation. Less computation time is required for higher-level event matching since the distribution is updated less frequently. The method is well-suited to handling polyphony in cases where it is possible to derive an appropriate representation from the performance. When discretizing the temporal space for the relative position distribution, we use a high resolution, typically 1 msec intervals. Whilst this requires accurate onset detection methods, it has the advantage of improving the alignment accuracy.

A System for Multitrack Synchronisation in Rock and Pop Music

In rock and pop music, there tends to be no score in the classical sense. However, such music often retains the same high-level features such as drum patterns, chord progressions, bass lines and melodies.



Figure 1. Multitrack event-based representation for four channels: kick drum (top), bass (second), snare (third) and guitar (fourth). The pitches of the bass notes are indicated in Hertz. The guitar track shows the strength of the chromagram representation in each of the twelve bins that correspond to the chromatic notes.

Gold and Dannenberg (2011) describe this area of music as falling between the extremes of the deterministic, such as classically scored music, on the one hand, and freely improvised performances on the other. Such music has a semi-improvised element, but is strongly sectionalised; the tempo is approximately steady, but there are more complex rhythm patterns. They introduce the term popular music Human-Computer Music Performance Systems (HCMPS) to describe the kinds of application we are looking to design here. Whilst they envisage additional features for such a system, such as the ability to re-arrange structure on the fly, we shall be focussing solely on synchronisation between two performances where the higher-level structure is identical.

For rock and pop music, although there may be variations in the actual patterns and parts played, we can expect that these will happen relative to the same underlying structure as defined by bars, beats and chords. We can expect that bass and drums will constitute the rhythm section which creates the foundation over which guitars and keyboards are typically played. Since drums are percussive events, for the purposes of live synchronisation they might be sufficiently described by an event-based representation consisting of the onset time and drum type (e.g. kick, snare, tom) rather than precise audio features. Similarly, a bass line may be sufficiently represented using the pitch and timing information of the individual notes.

The use of multiple instrument channels for matching requires that the results of different matching procedures can all be integrated within a single framework. Our system does not have an explicit score in terms of expected pitched notes and durations. Instead, we analyse the multitrack data to create a list of musical 'events' which can be considered to function as a 'score'. We define an event as a discrete musical observation which has a start time in milliseconds. Onset detection methods (Bello et al., 2005) offer a way to map an audio signal onto a set of time values at which new musical events begin. The score is created through offline analysis of the multitrack files, using onset detection and thresholding to create a list of events on each channel.
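
To make this event representation concrete, a minimal C++ sketch is given below; the released implementation may organise its data differently, and the names Event and Score are purely illustrative.

```cpp
#include <array>
#include <vector>

// Hypothetical event record: every event carries an onset time in ms;
// pitched events (bass) carry a fundamental frequency, and polyphonic
// events (guitar) carry a 12-bin chromagram. Unused fields stay at zero.
struct Event {
    double timeMs = 0.0;              // onset time from the start of the recording
    double pitchHz = 0.0;             // 0 for unpitched (drum) channels
    std::array<double, 12> chroma{};  // all zeros when no chromagram is attached
};

// The 'score' is simply one event list per instrument channel, produced
// offline by onset detection and thresholding of the multitrack files.
struct Score {
    std::vector<Event> kick, snare, bass, guitar;
};
```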

Figure 1 shows the events resulting from the analysis of four multitrack channels. For drums (kick and snare), these events simply provide the time of each event since the beginning of the recording.


In the case of bass, we make use of the YIN monophonic pitch detection algorithm (Cheveigne & Kawahara, 2002) to provide a list of onset times and associated pitches in Hz. For guitar and other polyphonic instruments, we make use of the chromagram representation, introduced by Wakefield (1999) and based on the work of Shepherd (1964), which provides a representation of the energy found at each of the twelve notes in a chromatic scale. It has been successfully used for audio thumbnailing (Bartsch & Wakefield, 2001) and in chord detection (Pardo & Birmingham, 2002). The chromagram has also been used in DTW alignment approaches (Hu et al., 2003; Ewert et al., 2009). One useful aspect of the chromagram for these applications is that it discards timbral information, such as might be present due to different orchestrations, but preserves information about the harmonic content that can be used to compare the two sets of audio features. For polyphonic instruments, an onset can then be characterized by a chromagram of the audio that follows the onset event. These other attributes of events, such as a pitch or a chromagram representation, are then used in the matching process to provide a measure of the extent to which one observed event matches another.

We approach the problem using a similar formulation to that employed by Grubb and Dannenberg (1997), who proposed modeling the distribution of the performer's location in the score. To achieve a high resolution in the probabilistic framework representing score position, we opt to divide the space into discrete units at small intervals, such as 1 msec. This contrasts with most graphical model approaches, where the discretization of the space is at the level of musical objects, such as a note or chord, with a corresponding location within the score. This probability density function can be understood as quantifying our belief as to the performer's location, and thus peaks in the function correspond to the most likely locations in the score. Figure 2 shows how such a distribution might look in practice, where the probability density function is overlaid upon a MIDI score.

Whereas Grubb and Dannenberg employ the simplifying assumption that the tempo is a single scalar value, here we make use of a separate distribution across all possible tempo values, where the tempo is expressed as the speed of the performance relative to the recorded version.


Figure 2. An example distribution displayed relative to a MIDI score.

Whilst their method will work well when the scalar tempo is correct, the use of a distribution quantifies the uncertainty in the estimate, which is transferred to a corresponding uncertainty in the position distribution that increases in proportion to the elapsed time between observations. We are effectively able to follow multiple tempo estimates whilst also attributing a probability to each. In order to synchronize an accompaniment to a live performance, we need to continually update the two distributions, for position and tempo, after each new observation. The maximum a posteriori (MAP) estimate of the position distribution is the most likely location of the performers within the 'scored' (or recorded) version. When performing these computations, both the score position and the relative speed distributions are discretized. In our implementation, we have used bins of 1 msec width for score position, which allows a high resolution, and intervals of 0.01 for the relative speed distribution.
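
As a minimal sketch of this discretisation (assuming 1 ms position bins and 0.01-wide relative speed bins, as above), the two distributions can be stored as arrays of bin probabilities and the MAP estimate read off as the argmax; the container and function names here are illustrative, not the released implementation.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Discretised distributions: position bin k covers [k, k+1) ms of score time,
// and speed bin k represents a relative speed of k * 0.01.
struct Distributions {
    std::vector<double> position;  // one bin per millisecond of score time
    std::vector<double> speed;     // e.g. 201 bins spanning speeds 0.00..2.00

    // Maximum a posteriori score position, in milliseconds.
    double mapPositionMs() const {
        auto it = std::max_element(position.begin(), position.end());
        return static_cast<double>(std::distance(position.begin(), it));
    }
};

// Renormalise a distribution so that its bins sum to one.
inline void normalise(std::vector<double>& p) {
    double sum = 0.0;
    for (double v : p) sum += v;
    if (sum > 0.0)
        for (double& v : p) v /= sum;
}
```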

An overview of the procedure for updating the position distribution is shown in Figure 3. The system is initialized when the performance begins (at playing time zero), so we can assume there is always a prior distribution which refers to the previously observed playing time. The process can be understood by analogy to the Kalman filter, consisting of recursive estimation using two processes: prediction (time update) and update (measurement update).


In the prediction step, the last state estimate is used to generate a prediction according to the system model, and in the second step this prediction is updated using current measurement observations to generate the next state estimate. Firstly, we require a prediction for the distribution at the current performance time. Secondly, we shall need to specify how to calculate the likelihood function for the observed event by matching the observed event to events in the score for the appropriate instrument. Thirdly, we then need to update the prior distribution using the likelihood function to calculate the new posterior distribution for the performer's location.

The first of these tasks translates the distribution according to the time that has elapsed between the last update and the current event to predict the distribution. However, since there is a range of possible tempi under consideration, this takes the form of a convolution. This procedure is best understood in the context of creating the prior used to update the distributions, and so we present it in the next section once the update procedure has been described. We shall now describe how to execute steps two and three, in which the likelihood function is calculated for each new event and the posterior distribution is updated.

Update of Position Distribution

The distribution for position, $P(t)$, is a probability density function that reflects our belief as to the performer's current location in the score, where $t$ is the time in milliseconds from the beginning of the score. The observed onset event can be characterized as being of one of several types, such as a simple onset, a pitched event defined by MIDI note or fundamental frequency, or a chromagram vector. However, the principle for updating the distribution is the same and requires a distance measure describing the extent to which events are alike or 'match'. Here, we shall use the example where the observed event is a discrete MIDI pitch. The $i$th observed event, $o_i$, can be represented as a 2-tuple, $(\tau_i, \mu_i)$, where $\tau_i$ is the playing time of the event and $\mu_i$ is the MIDI pitch. We can assume that the position distribution has been updated to reflect our belief at the playing time of the currently observed event.

Figure 3. Overview of the procedure for updating the position distribution: start the accompaniment and initialise the distributions; watch for a new event; update the position distribution using the elapsed time (this acts as the new prior); calculate likelihoods from matching events; update the posterior.

We then wish to calculate a likelihood function from the observed data that specifies the probability of observing this data at each time point in the score. The score consists of simple 2-tuple events with an onset time (here relative to the beginning of the score rather than the live performance) and a MIDI pitch. Let the $j$th such recorded event, $r_j$, be denoted by the 2-tuple consisting of the recorded onset time, $t_j$, and the MIDI pitch, $m_j$, so that $r_j = (t_j, m_j)$. The probability of observing the given event is highest at the locations in the score where there are matching events of the same pitch.

In general, for two events of the same instrument type, we define a similarity function that takes a value between 0 and 1 and reflects the degree to which they match. Here, we specify the function to be 1 if and only if $\mu_i$ equals $m_j$. Let us denote the set of events matching the event $o_i$ as $M(o_i)$. This is precisely the set of events in the score which have identical pitch, and can be defined as



Figure 4. (a) The observed performed event is compared with the expected event list, in this case MIDI note events. Matching events are indicated by the white boxes. (b) The likelihood function consists of a constant noise floor, with Gaussians added centred upon the matching note events. (c) The likelihood function is used to update the prior distribution (dotted) to form the new posterior distribution (solid). The resulting peak here reflects a good degree of certainty as to the performer's location.

\[
M(o_i) = \{\, r_j \in R \mid m_j = \mu_i \,\} \qquad (1)
\]

where $r_j$ is the recorded event with 2-tuple $(t_j, m_j)$ and $R$ is the set of all recorded events.

In Figure 4, we can see an example of how matching notes in the score are used to generate a suitable likelihood function, which is then used to update the posterior distribution. The likelihood function, $P(o_i \mid t)$, determines the probability of observing our new data given that the location in the recording is $t$ ms.


Where there are events that strongly match, we expect there to be peaks in the likelihood function, since these are the points in the score which we most expect to correspond to our current location, having observed the matching note data. The observed events are still subject to expressive timing, detection noise and motor noise, and we therefore model each match using a Gaussian of fixed standard deviation $\sigma_P$ around the corresponding location in the score. For every matching event in the set $M(o_i)$, a Gaussian centred on the corresponding score location is added to the likelihood function. We also attribute a fixed quantity of noise, $\nu_P$, to account for the possibility that the new event does not match an expected event in the recording; for example, the event might be a mistake or result from a faulty detection. This gives rise to the equation

\[
P(o_i \mid t) = \nu_P + \frac{1 - \nu_P}{|M(o_i)|} \sum_{r_j \in M(o_i)} g(t, t_j, \sigma_P) \qquad (2)
\]

where $t_j$ is the recorded time of event $r_j$ measured from the beginning in milliseconds, $\sigma_P$ is a constant that determines the width of the Gaussian, and the Gaussian contribution is

\[
g(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \qquad (3)
\]
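
The following sketch evaluates the likelihood of Equations 1 to 3 over the observation window for a MIDI-pitch event. It is an illustration of the equations rather than the released implementation; the type and function names are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct ScoreNote { double timeMs; int midiPitch; };

// Gaussian density g(x, mu, sigma) of Equation 3.
inline double gaussian(double x, double mu, double sigma) {
    const double kPi = 3.14159265358979323846;
    const double d = (x - mu) / sigma;
    return std::exp(-0.5 * d * d) / (sigma * std::sqrt(2.0 * kPi));
}

// Likelihood P(o_i | t) of Equation 2, evaluated at every 1 ms bin of the
// observation window [windowStartMs, windowStartMs + likelihood.size()).
void eventLikelihood(const std::vector<ScoreNote>& score,
                     int observedMidiPitch,
                     double windowStartMs,
                     double noiseFloor,   // nu_P
                     double sigmaMs,      // sigma_P
                     std::vector<double>& likelihood) {
    std::fill(likelihood.begin(), likelihood.end(), noiseFloor);

    // M(o_i) of Equation 1: score events whose pitch matches the observation.
    std::vector<double> matchTimes;
    for (const ScoreNote& n : score)
        if (n.midiPitch == observedMidiPitch) matchTimes.push_back(n.timeMs);
    if (matchTimes.empty()) return;

    const double weight = (1.0 - noiseFloor) / static_cast<double>(matchTimes.size());
    for (std::size_t k = 0; k < likelihood.size(); ++k) {
        const double t = windowStartMs + static_cast<double>(k);
        for (double tj : matchTimes)
            likelihood[k] += weight * gaussian(t, tj, sigmaMs);
    }
}
```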

Then to update the prior distribution, we simply take the product with the likelihood function and normalize:

\[
P(t \mid o_i) \propto P(o_i \mid t)\, P(t). \qquad (4)
\]

Once the prior has been updated, we denote the time at which the position distribution is maximal as $t^*$, our current best estimate. Modeling the distribution over the time spanned by the whole event list would be computationally expensive. The computation of values for the distribution therefore takes place only within an observation window between $t^* - \rho$ and $t^* + \rho$, centred on the current best estimate $t^*$.
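
A corresponding sketch of the posterior update of Equation 4, restricted to the window of width $2\rho$ around $t^*$; the in-place window handling is an illustrative choice rather than the authors' code.

```cpp
#include <cstddef>
#include <vector>

// Pointwise Bayes update P(t | o_i) ∝ P(o_i | t) P(t) over the observation
// window (Equation 4), followed by renormalisation. Both vectors are assumed
// to share the same 1 ms binning over [t* - rho, t* + rho].
void updatePosterior(std::vector<double>& positionWindow,   // prior in, posterior out
                     const std::vector<double>& likelihood) {
    double sum = 0.0;
    for (std::size_t k = 0; k < positionWindow.size(); ++k) {
        positionWindow[k] *= likelihood[k];
        sum += positionWindow[k];
    }
    if (sum > 0.0)
        for (double& p : positionWindow) p /= sum;
}
```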

Prediction of the Distribution

Our update procedure for the distribution, described in the previous section, proceeded on the assumption that we had already updated the prior distribution to the current observation time. However, first a prediction step is required that updates the position distribution obtained at the last observation time, $t_{n-1}$, to an estimate for the position distribution at the current time, $t_n$, thereby providing our prior estimate for the performer's location. If the relative speed of the performance were known exactly, we could simply translate the distribution by the equivalent amount of elapsed time. Here, the relative speed is represented using a distribution, which reflects an inherent amount of uncertainty, so the prediction step takes into account all the possible relative speeds and the degree to which each speed is considered probable.

The elapsed time since the last observation, $t_d$, is $t_n - t_{n-1}$, measured in ms. Let $P_T(\tau)$ be the relative speed distribution, over the range 0 to $\tau_{\max}$. We first transform $P_T$ into a position distribution, $P_D$, corresponding to this elapsed time by calculating the distribution of a delta function centred at time 0 ms according to the current speed distribution after the observed elapsed time, $t_d$:

\[
P_D(t) = P_T\!\left( \frac{t}{t_d} \right). \qquad (5)
\]

Thus a single delta peak at relative speed 1.0 would result in a delta peak at $t_d$ ms, as expected. We denote the position distribution at event time $t_n$ by $P_{L_n}$. In Figure 5, we show how the resulting distribution $P_D$ appears for a Gaussian-shaped speed distribution for different lengths of elapsed time between observations. As the elapsed time increases, the standard deviation of $P_D$ increases proportionally. In this case, even if the position in the score at time $t_{n-1}$ were known exactly, such as being represented by a delta function, uncertainty in the tempo distribution would contribute to uncertainty in the prior when updating the position distribution at the next observation time. $P_{L_n}(t)$ then acts as the prior position distribution, $P(t)$, in the update process described in Equation 4.


Figure 5. Relative speed distribution (top) and the resulting convolutions with a delta function at 0 ms in the position distribution (bottom) after different elapsed intervals (t = 0, 200, 400 and 800 ms).

To obtain the new position distribution, we convolve $P_D$ with the position distribution at the previous observation time, $P_{L_{n-1}}$, to obtain the distribution at the new event time:

\[
P_{L_n}(t) = (P_{L_{n-1}} * P_D)(t). \qquad (6)
\]
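
Under the same discretisation, the prediction step of Equations 5 and 6 can be sketched as follows: the relative speed distribution is first mapped to a distribution over displacement in milliseconds after the elapsed time, and the result is convolved with the previous position distribution. The floor-binning of each speed's probability mass is an assumption about how the resampling onto 1 ms bins is carried out.

```cpp
#include <cstddef>
#include <vector>

// Equation 5: map the relative speed distribution (bin k = speed k * 0.01)
// to a distribution over displacement in ms after elapsedMs. A delta at
// speed 1.0 puts all mass at a displacement of elapsedMs, as expected.
std::vector<double> displacementDistribution(const std::vector<double>& speed,
                                             double elapsedMs) {
    if (speed.empty()) return {1.0};  // degenerate case: no movement
    const double maxSpeed = 0.01 * static_cast<double>(speed.size() - 1);
    std::vector<double> disp(static_cast<std::size_t>(maxSpeed * elapsedMs) + 1, 0.0);
    for (std::size_t k = 0; k < speed.size(); ++k) {
        // Assign this speed's probability mass to the bin of its displacement.
        const std::size_t bin = static_cast<std::size_t>(0.01 * static_cast<double>(k) * elapsedMs);
        disp[bin] += speed[k];
    }
    return disp;
}

// Equation 6: the predicted position distribution is the convolution of the
// previous position distribution with the displacement distribution.
std::vector<double> predictPosition(const std::vector<double>& prevPosition,
                                    const std::vector<double>& disp) {
    std::vector<double> out(prevPosition.size() + disp.size() - 1, 0.0);
    for (std::size_t i = 0; i < prevPosition.size(); ++i)
        for (std::size_t j = 0; j < disp.size(); ++j)
            out[i + j] += prevPosition[i] * disp[j];
    return out;
}
```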

Single Track Polyphonic Matching

Before proceeding to the more complex case of multitrack matching, we will examine how this method works in a test case of aligning a real-time MIDI input to a MIDI score on a single instrument channel. The intention here is to use a simpler test case to check that the method is functioning as intended before proceeding to the case of multitrack audio alignment. However, this method might also be useful in cases where MIDI input is available, such as from a keyboard or Moog piano bar.

To evaluate the algorithm's performance, we require MIDI performances whose timing differs from the score. The RWC dataset's Classical selection (Goto et al., 2002) contains sixty-one excerpts of classical music with both audio performances and the corresponding score. In order to test the algorithm on this dataset, we require a MIDI transcription of the audio recordings. A 'warped' version of the MIDI was made available to us by Meinard Muller, using a technique that first aligns the audio and MIDI files using chroma-onset features (Ewert et al., 2009) and then warps the MIDI file to align it with the score. Excerpts of these recordings and associated data are available on their website.¹

Before we can use the proposed method to carry out these tests, we need to specify the model parameters and define a process for how the relative speed distribution will be updated. In Equation 2, we set the likelihood function noise, $\nu_P$, to 0.8 and the standard deviation of the Gaussians, $\sigma_P$, to 100 ms. Whilst at present these parameters are set by hand, in the future it might be possible to make empirical measurements to determine them. However, each parameter will be song- and performer-specific, so in practice this would involve using the same noise and standard deviation that has been observed over several rehearsals of a given song.

For the tempo process, ideally we would measure the time interval between corresponding notes in both performances, calculate the ratio over a selection of such intervals and use averaging to give an estimate of the relative tempo of the two performances. However, this presupposes that we have already performed accurate score following. Thus we look to exploit the results of the note matching that is used to update the position distribution in order to identify the event in the score that corresponds to each observed event. For each observed note, $o_i$, we find the most likely matching recorded event, $r_i$, which is the event of identical pitch for which the current position probability density function is greatest. Then for each recent observed note, $o_k$, within a suitable timeframe (here 4 seconds), we calculate the ratio of the time interval between the two observed note events to the time interval between the two best matching notes in the score. So for each recent observed event, $o_k$, we create an estimate for the relative tempo, $\xi_k$:

\[
\xi_k = \frac{\tau_i - \tau_k}{t_i - t_k}, \qquad (7)
\]

where $\tau_i$ is the time of the $i$th observed event and $t_i$ is the time of the associated recorded event, $r_i$, that best matches $o_i$.

We make use of a similar Bayesian technique to update the relative tempo distribution. First we create a likelihood function as the sum of a constant offset and a Gaussian around the tempo estimate:

\[
P(\xi_k \mid x) = \nu_T + g(\xi_k, x, \sigma_T). \qquad (8)
\]

Then the relative speed distribution, $P(x)$, is updated by taking the product of the prior with the likelihood function and normalising:

\[
P(x \mid \xi_k) \propto P(\xi_k \mid x)\, P(x). \qquad (9)
\]

This process is carried out iteratively for all new estimates, $\xi_k$. This method allows the tempo estimate to respond to the strong variations in tempo that characterize classical music. One potential weakness is that if the position distribution becomes inaccurate then this will also affect the tempo process. However, the only clear alternative would be a form of tempo pulse estimation akin to beat tracking, which can prove unreliable for classical music.
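
A sketch of one relative speed update following Equations 8 and 9, over a grid of 0.01-wide speed bins; the parameter names mirror $\nu_T$ and $\sigma_T$, and the grid layout is an assumption consistent with the discretisation described earlier.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One Bayesian update of the relative speed distribution: the likelihood of
// Equation 8 (a constant floor plus a Gaussian centred on the estimate xi_k)
// is multiplied into the prior and the result renormalised (Equation 9).
// speed[k] holds the probability of relative speed k * 0.01.
void updateSpeed(std::vector<double>& speed,
                 double xiK,          // tempo ratio estimate from Equation 7
                 double noiseFloor,   // nu_T
                 double sigma) {      // sigma_T
    const double kPi = 3.14159265358979323846;
    double sum = 0.0;
    for (std::size_t k = 0; k < speed.size(); ++k) {
        const double x = 0.01 * static_cast<double>(k);
        const double d = (xiK - x) / sigma;
        const double like = noiseFloor
            + std::exp(-0.5 * d * d) / (sigma * std::sqrt(2.0 * kPi));
        speed[k] *= like;
        sum += speed[k];
    }
    if (sum > 0.0)
        for (double& p : speed) p /= sum;
}
```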

When running the algorithm on the 61 files in the RWC database, we found that 47% of the notes are matched within 40 msec and 65% are matched within 100 msec. When testing an audio synchronization algorithm on audio versions of the same MIDI files, we found it to be very sensitive to variations in timbre. It is therefore difficult to provide fair comparative statistics that meaningfully compare our method with alternative systems. Software and demonstration videos of the MIDI-based matching are available for download.²

The method was observed to work best when the tempo estimate was approximately correct. When the tempo of the performed MIDI varied significantly from the tempo of the MIDI score, there was a potential for the system to become lost, particularly if there was a low density of notes. In contrast, when the pieces consisted of a high density of notes of varying pitch, the resulting distribution peaked around the correct location as there was more information to utilize. This concurs with what we might expect in a human listener, whose expectation will be more accurate when the musical events are close together in time.

¹ http://www.mpi-inf.mpg.de/resources/MIR/SyncRWC60/
² http://code.soundsoftware.ac.uk/projects/bayesian-multitrack-matching


Multitrack Evaluation

To evaluate the algorithm for multitrack input, we used a collection of studio recordings of rock and pop songs for which two alternative takes exist for each song. These takes are from the same recording sessions and were recorded one after the other. Some differences exist, such as drum fills or a change in bass line. The instrumentation included drums, bass and guitar in all cases. All were recorded without the use of a click track, so the tempo was free to fluctuate. Four channels were used for the matching algorithm: bass, kick drum, snare and guitar. The offline processing, as previously described, gives rise to an event-based representation as was shown in Figure 1.

For each channel, we then provide a suitable similarity measure. For the kick and snare drum channels, any two events on the same channel are considered similar, so the measure is 1. For bass events, we set the similarity to 1 if the pitches correspond to the same chromatic note, and 0 otherwise. For the guitar channel, we assign the similarity between two events by normalising each chromagram so that its maximum value is one and taking the cosine distance using the dot product between the two chroma vectors.
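
The per-channel similarity measures might be sketched as follows; converting the bass pitch from Hz to a MIDI pitch class and the exact chromagram normalisation are assumptions about details not spelled out above.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Drums: any two events on the same drum channel are treated as matching.
inline double drumSimilarity() { return 1.0; }

// Bass: similarity is 1 if the two pitches fall on the same chromatic note,
// here compared via rounded MIDI numbers modulo 12 (assumes pitches > 0 Hz).
inline double bassSimilarity(double pitchHzA, double pitchHzB) {
    auto midi = [](double hz) {
        return static_cast<int>(std::lround(69.0 + 12.0 * std::log2(hz / 440.0)));
    };
    return (midi(pitchHzA) % 12 == midi(pitchHzB) % 12) ? 1.0 : 0.0;
}

// Guitar: normalise each chromagram so its maximum bin is 1, then take the
// cosine similarity via the dot product of the two chroma vectors.
inline double guitarSimilarity(std::array<double, 12> a, std::array<double, 12> b) {
    auto peakNormalise = [](std::array<double, 12>& c) {
        double peak = 0.0;
        for (double v : c) peak = std::max(peak, v);
        if (peak > 0.0) for (double& v : c) v /= peak;
    };
    peakNormalise(a);
    peakNormalise(b);
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < 12; ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return (na > 0.0 && nb > 0.0) ? dot / std::sqrt(na * nb) : 0.0;
}
```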

The parameters for the model were set by hand. The ratio of noise added, $\nu_P$ in Equation 2, was set to 0.1, 0.2, 0.6 and 0.5 for kick drum, snare drum, bass and guitar respectively. The standard deviation of the Gaussians, $\sigma_P$, was set to 6, 6, 30 and 50 ms respectively for the same instruments. The underlying motivation behind this choice is the idea that drum events are accurately placed in time and can be used to locate precisely the point we are at in the song. In Figure 6 we can see how the likelihood function appears for a kick drum event. There are several possible matches, and our low values of $\nu_P$ and $\sigma_P$ result in several sharp peaks around the candidate events. The resulting posterior peaks around the most likely event. In contrast, guitar and bass events are matched using a wider Gaussian and a larger noise parameter, as their intended function is to ensure we are in the correct general locality when matching the more precise drum events. In cases where there was an instrumental intro section, we allowed an initialisation procedure for our algorithm, whereby the position distribution could be set on cue to a Gaussian around a chosen point, such as the start of the verse or where the drums enter.

For our tempo process, we assume that the two performances are at approximately the same speed. In view of this, we initialize a Gaussian around the relative speed ratio of 1.0 with a standard deviation of 0.1. This allows a reasonable amount of variation in tempo without requiring any matching of high-level musical features such as bars and beats. In case the two performances are at marginally different speeds, we update our estimate according to the actual synchronisation speed that is sent out as a result of matching the events in the position distribution. For each new event, we look at the inter-onset interval observations occurring on the same instrument channel. Assuming these correspond to an integer multiple of the beat interval, we calculate the possible corresponding tempo observations, and where each of these is close to the current estimate, we add a Gaussian around the observation. When the ratio to the current estimate is outside the range 0.9 to 1.1, we assume it to be erroneous. We then use the method described in Equations 8 and 9, where the likelihood is created by adding Gaussians around these tempo observations and updating, with the standard deviation $\sigma_T$ set by hand to 4 msec and the constant noise $\nu_T$ to 0.02.
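
The gating of tempo observations described above might look like the following sketch, in which candidate speed ratios are formed for a few possible beat multiples and only those within 10% of the current estimate are kept. The search over n = 1 to 4 beat multiples and the use of a single recorded beat interval are assumptions; the text leaves these details open.

```cpp
#include <vector>

// Given an inter-onset interval from a live channel and the beat interval of
// the recording, form candidate speed ratios assuming the interval spans n
// beats for small n, and keep only those within 10% of the current speed
// estimate. Ratios outside that range are treated as erroneous and discarded.
std::vector<double> tempoObservations(double liveIntervalMs,
                                      double recordedBeatMs,
                                      double currentSpeed) {
    std::vector<double> accepted;
    for (int n = 1; n <= 4; ++n) {
        const double ratio = liveIntervalMs / (static_cast<double>(n) * recordedBeatMs);
        if (ratio >= 0.9 * currentSpeed && ratio <= 1.1 * currentSpeed)
            accepted.push_back(ratio);
    }
    return accepted;
}
```

Each accepted ratio would then feed a relative speed update like the one sketched after Equations 8 and 9.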

Our implementation of the algorithm runs in a program created using openFrameworks. We use a MaxMSP patch to perform onset detection on both live input and pre-recorded files used to simulate a live performance environment. Although the program runs at approximately 20 Hz, the onset events are time-stamped in MaxMSP and the alignment takes place using this accurate timing information. For each frame of the program, we update the projected alignment time and store this data. For ground truth annotations, we made use of an offline beat tracker based on the methods of Davies and Plumbley (2007) in the application Sonic Visualiser (Cannam et al., 2006). These were corrected by hand to ensure the beats began at the correct point. There is an inherent ambiguity in specifying ground truth annotations.



Figure 6. The likelihood function (dotted, top) consisting of a combination of noise and narrow Gaussians centred on several matching kick events (solid lines, top) in the matching window. The posterior distribution (solid, bottom) after updating the prior (dotted, bottom) with the likelihood function.

If an offline algorithmic technique is used, as in this case, the algorithm can be subject to performance errors, so there is a limit on how accurate these annotations can be. If humans tap in real-time to annotate various points in the audio, these annotations can be subject to similar errors, since they reflect the predicted time rather than the observed time. One other option is to annotate by hand and verify the timing, in which case specific events must be chosen that we can identify in each recording, such as the kick drum on the first beat of the bar. This would constitute a non-causal descriptive annotation, since these annotations describe where the beat actually occurred rather than where a human or algorithm predicted it to be. Here we have opted for automatic annotations that were then verified manually.

Table 1 shows the results for all songs using both the offline and online techniques. Our method improves upon the Match algorithm and achieves similar errors to Ewert et al.'s (2009) algorithm. The offline methods are provided with the end points of the two files as well as the start points and thus have a considerable advantage over the online methods.

A mixdown of these tracks was used to allow comparison with offline methods.


Song Title              Bayesian Matcher (online)   Match OF (online)   Ewert et al. (offline)   Match OB (offline)
Diamond White                 10.7                       25.0                  9.2                    12.5
Marble Arch                   12.5                       46.6                 17.7                    22.2
Lewes                         16.4                       54.1                 13.8                    20.7
Wanderlust                    34.6                       62.1                 26.2                    41.3
Motorcade                     13.9                       38.7                 13.4                    19.7
Festival                      18.9                       45.8                 12.4                    12.4
Station Gate                  23.0                       70.2                 14.8                    27.2
Penny Arcade                  13.1                      594.8                 16.6                   270.3
Son Of Man                    13.0                       28.6                 13.6                    14.2
New Years Resolution          15.9                       75.6                 13.6                    24.5
Stones                        12.3                       40.5                 12.5                    19.6

Table 1. The median absolute alignment error in ms for each song.

We created alignments of each pair of mixdowns using Match (Dixon, 2005) in both the online (OF) and offline (OB) modes, and using the algorithm of Ewert et al. (2009). Since any discrepancy will be audible, we require the synchronisation to be as accurate as possible. Seeking a bound for this, Lago and Kon (2004) argue that synchronisation within the region of 20 to 30 ms, equivalent to a distance of approximately ten meters, should be sufficiently accurate so as not to be perceptible. The results for all algorithms are shown in Table 1. With our proposed method, we observed that 64% of the events were aligned within 20 ms of the annotated times and 89% within 40 ms. These figures compare well with those achieved by Ewert et al.'s (2009) algorithm for offline audio synchronisation, the current state of the art, which scored 64% and 87% for the same time limits. Our method is reliant on the presence of a significant number of percussive events. Without these, the chromagram events on their own are not sufficient to synchronise two sources and alternative methods should be employed.
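
For reference, the statistics quoted here (the median absolute alignment error and the fraction of events within a tolerance) can be computed as in the sketch below. This is a generic illustration, not the evaluation code used to produce Table 1.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median absolute alignment error (ms) between estimated and annotated times.
double medianAbsoluteError(const std::vector<double>& estimatedMs,
                           const std::vector<double>& annotatedMs) {
    std::vector<double> errors;
    for (std::size_t i = 0; i < estimatedMs.size(); ++i)
        errors.push_back(std::fabs(estimatedMs[i] - annotatedMs[i]));
    if (errors.empty()) return 0.0;
    std::sort(errors.begin(), errors.end());
    const std::size_t n = errors.size();
    return (n % 2 == 1) ? errors[n / 2] : 0.5 * (errors[n / 2 - 1] + errors[n / 2]);
}

// Fraction of events aligned within toleranceMs of the annotated time.
double fractionWithin(const std::vector<double>& estimatedMs,
                      const std::vector<double>& annotatedMs,
                      double toleranceMs) {
    if (estimatedMs.empty()) return 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < estimatedMs.size(); ++i)
        if (std::fabs(estimatedMs[i] - annotatedMs[i]) <= toleranceMs) ++hits;
    return static_cast<double>(hits) / static_cast<double>(estimatedMs.size());
}
```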

Live Testing

In order to verify the results from offline tests and to experience how this interactive system might be used in practice, we also conducted tests with a three-piece rock band (bass, drums and guitar) using a total of four songs. The elastic~ object for MaxMSP³, which implements the zplane time-stretching algorithm⁴, was used to modify the playing speed of the backing audio to match the system's optimal alignment position. We also made use of marker points so that the buttons of a MIDI footpedal could set the position distribution to a Gaussian around set positions in the song, such as the first verse or chorus. This proved to be a relatively unproblematic way to initialize the system successfully after a count-in or introduction section.

In all four cases the system succeeded in synchronizing backing parts in a musically acceptable way. The combination of drum and harmonic instruments allows the system to recover from situations where automatic synchronisation might be difficult, such as when there is not a steady stream of events of different types. One of the difficulties encountered when testing the system in performance is the requirement to have some kind of visual feedback of how it is behaving. Our implementation in openFrameworks allows the user to observe the probability density function and verify that the system is functioning as expected.

³ Purchased from http://www.elasticmax.co.uk/
⁴ www.zplane.de



Conclusion

In this paper, we have presented a Bayesian probabilistic framework for the real-time alignment of a performance with a multitrack recording. Probability distributions for the position and speed of the live performance relative to the multitrack recording are updated in real-time through the sequential use of Bayes' theorem. We have observed performance statistics comparable to those of state-of-the-art offline algorithms and confirmed that the system functions well within a live band scenario. These other algorithms were provided with stereo mixes, whereas our proposed method required the multitrack audio.

The probabilistic framework allows for the integration of data from multiple sources. Provided the information can be expressed as a likelihood function for each source, it is then possible to update a global probability density function for the whole performance. The specification of a tempo distribution as well as a position distribution brings about a real-time dynamic system, in which uncertainty in the position distribution increases with the time between observations. The framework allows the outputs of other algorithmic techniques to be used. For example, one potential development would be to incorporate beat tracking into the model. Where there is a strong beat, both tempo and position distributions might benefit from making use of the resulting tempo and phase estimates. These could be weighted according to the 'confidence' of the beat tracker. Another improvement that could be made is to model how the distribution might respond to the presence of expected events in the score which have not been observed.

Future work includes the incorporation of high-level musical knowledge. At present, the system does not have a model for rhythm, beats or bars. Reliable real-time beat tracking algorithms could improve the tempo process by comparing the observed real-time beat period to the offline beat period in the recording. Tempo induction algorithms could easily be integrated into the tempo process. Structural analysis of music might bring advantages in the alignment process, and such a system would be able to provide a foundation on which generative musical systems could be created. Another potential area for development is the inclusion of training in rehearsal, such as employed by Raphael (2010) and Vercoe (1985). Statistics from rehearsals could provide information such as how probable a given event is to be detected and the standard deviation in its timing. Such information could then be used when determining the likelihood function of an event in the matching procedure.

A repository containing the source code is publicly available on the Sound Software website⁵. This includes the C++ code for the openFrameworks project and the MaxMSP patches which were used to conduct the evaluations and the live performance testing. We envisage that this can enable others to reproduce the results contained in this paper and to build upon the methods described.

⁵ http://code.soundsoftware.ac.uk/projects/bayesian-multitrack-matching

References

Arzt, A., Bock, S., & Widmer, G. (2012). Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).

Arzt, A., & Widmer, G. (2010). Simple tempo models for real-time music tracking. In Proceedings of the 7th Sound and Music Computing Conference.

Arzt, A., Widmer, G., & Dixon, S. (2008). Automatic page turning for musicians via real-time machine listening. In Proceedings of the 18th European Conference on Artificial Intelligence (ECAI).

Bartsch, M. A., & Wakefield, G. A. (2001). To catch a chorus: Using chroma-based representations for audio thumbnailing. In IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, 2001.

Bello, J. P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., & Sandler, M. (2005). A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5, Part 2), 1035-1047.


Cannam, C., Landone, C., Sandler, M. B., & Bello, J. (2006). The Sonic Visualiser: A visualisation platform for semantic descriptors from musical signals. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR-06).

Cemgil, A. T., Kappen, H. J., Desain, P., & Honing, H. (2001). On tempo tracking: Tempogram representation and Kalman filtering. Journal of New Music Research, 28(4), 259-273.

Cheveigne, A. de, & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917-1930.

Cont, A. (2008). Antescofo: Anticipatory synchronization and control of interactive parameters in computer music. In Proceedings of the 2008 International Computer Music Conference.

Cont, A. (2011). On the creative use of score following and its impact on research. In Proceedings of the 8th Sound and Music Computing Conference (SMC), Padova.

Dannenberg, R. B. (1984). An on-line algorithm for real-time accompaniment. In Proceedings of the 1984 International Computer Music Conference (p. 193-198).

Dannenberg, R. B. (2005). Toward automated holistic beat tracking, music analysis and understanding. In Proceedings of the International Conference on Music Information Retrieval (p. 366-373).

Dannenberg, R. B. (2007). An intelligent multi-track audio editor. In Proceedings of the International Computer Music Conference (p. 89-94).

Davies, M. E. P., & Plumbley, M. D. (2007). Context-dependent beat tracking of musical audio. IEEE Transactions on Audio, Speech and Language Processing, 15(3), 1009-1020.

Dixon, S. (2005). Match: A music alignment tool chest. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR-05) (p. 492-497).

Duan, Z., & Pardo, B. (2011). A state space model for online polyphonic audio-score alignment. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (p. 197-200).

Ewert, S., Muller, M., & Grosche, P. (2009). High resolution audio synchronization using chroma onset features. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing.

Gold, N., & Dannenberg, R. B. (2011). A reference architecture and score representation for popular music human-computer music performance systems. In Proceedings of the 2011 International Conference on New Interfaces for Musical Expression (p. 36-39).

Goto, M., Hashiguchi, H., Nishimura, T., & Oka, R. (2002). RWC music database: Popular, classical, and jazz music databases. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002) (p. 287-288).

Grubb, L., & Dannenberg, R. B. (1997). A stochastic method of tracking a performer. In Proceedings of the 1997 International Computer Music Conference (p. 301-308).

Hu, N., Dannenberg, R. B., & Tzanetakis, G. (2003). Polyphonic audio matching and alignment for music retrieval. In Proceedings of the 2003 International Computer Music Conference (p. 185-188).

Joder, C., Essid, S., & Richard, G. (2011). A conditional random field framework for robust and scalable audio-to-score matching. IEEE Transactions on Audio, Speech and Language Processing, 19(8), 2385-2397.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME Journal of Basic Engineering, 35-45.

Lago, N. P., & Kon, F. (2004). The quest for low latency. In Proceedings of the 2004 International Computer Music Conference (p. 33-36).

Montecchio, N., & Cont, A. (2011). A unified approach to real time audio-to-score and audio-to-audio alignment using sequential Monte Carlo techniques. In Proceedings of the 2011 International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011) (p. 193-196).

Muller, M. (2007). Information retrieval for music and motion. Springer.

Niedermeyer, B., & Widmer, G. (2010). A multi-pass algorithm for accurate audio-to-score alignment. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht (p. 417-422).

Orio, N., & Dechelle, F. (2001). Score following using spectral analysis and hidden Markov models. In Proceedings of the 2001 International Computer Music Conference (p. 105-109).

Orio, N., Lemouton, S., & Schwarz, D. (2003). Score following: State of the art and new developments. In Proceedings of the 2003 Conference on New Interfaces for Musical Expression (p. 36-41).

Otsuka, T., Nakadai, K., Takahashi, T., Komatani, K., Ogata, T., & Okuno, H. G. (2010). Design and implementation of two-level synchronization for an interactive music robot. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence, AAAI 2010 (p. 1238-1244).

Pardo, B., & Birmingham, W. (2002). Improved score following for acoustic performers. In Proceedings of the 2002 International Computer Music Conference.

Peeling, P., Cemgil, A. T., & Godsill, S. (2007). A probabilistic framework for matching music representations. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR 2007) (p. 267-272).

Puckette, M. (1992). Score following in practice. In Proceedings of the 1992 International Computer Music Conference (p. 182-185).

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.

Raphael, C. (1999). A probabilistic expert system for automatic accompaniment. Journal of Computational and Graphical Statistics, 10(3), 467-512.

Raphael, C. (2006). Aligning music audio with symbolic scores using a hybrid graphical model. Machine Learning, 21(4), 360-370.

Raphael, C. (2010). Music Plus One and machine learning. In Proceedings of the 27th International Conference on Machine Learning.

Reiss, J. D. (2011). Intelligent systems for mixing multitrack audio. In Proceedings of the International Conference on Digital Signal Processing, DSP2011 (p. 1-6).

Shepherd, R. (1964). Circularity in judgements of relative pitch. Journal of the Acoustical Society of America, 36, 2346-2353.

Vercoe, B. (1984). The Synthetic Performer in the context of live performance. In Proceedings of the 1984 International Computer Music Conference (p. 199-200).

Vercoe, B., & Puckette, M. (1985). Synthetic Rehearsal: Training the Synthetic Performer. In Proceedings of the 1985 International Computer Music Conference (p. 275-278).

Wakefield, G. H. (1999). Mathematical representation of joint time-chroma distributions. In Proceedings of the SPIE Conference on Advanced Signal Processing Algorithms, Architectures and Implementations (pp. 637-645).