
COMPUTATIONAL BEAT TRACKING AND TEMPO INDUCTION MODELS

Beat tracking and tempo induction methods can be categorized by several different strategies

based on features such as causality or input type. Causal beat trackers rely upon past or present

samples for a determination of present period and phase calculations, while non-causal methods

compute period and phase with access to the entire file. Classification by causality is not ideal, as

several techniques are capable of both causal and non-causal operation. A classification by input

type divides methods into those that use symbolic data and those that use continuous audio. A

drawback to this division is that some continuous audio-based methods first split the input into

segments representing notes or timbres, and then determine beat or tempo as do discrete models.

The classification system chosen for this review relies on the notion of transcriptive metaphor

(Scheirer 1996, 1998), a term from music perception meant to imply the reduction of continuous

music into discrete functional units in the listening process. A distinction is made between models

that infer musical structure (e.g., notes, timbres) from inter-onset intervals (IOIs) and use this knowledge to decide beat times or tempo (transcriptive), and those that make no such assumptions about

structure prior to calculation of beat or tempo (non-transcriptive). This categorization is similar to

one by input type; however, a clear distinction is made between methods that derive beat and

tempo directly from continuous audio and those that derive beat and tempo from symbolic data,

or audio that is segmented and treated as symbolic data.

TRANSCRIPTIVE APPROACHES

The models discussed in this section may be defined as transcriptive, as they seek to infer

either beat times or a tempo process (i.e., a continuous tempo curve) from a discrete set of

intervals (Scheirer 1996, 1998). These intervals are either derived from symbolic data, consisting


of onset times and durations, or audio data, which as a first step is segmented by onset detection

techniques (see Bello et al. [2005] for a review).

Rule-based Approaches

Early computational models of rhythmic perception were rule-based approaches comprised of a

series of simple commands that mimicked theoretical rules. The first such method was for

extracting the perceived pulse from symbolic representations of Bach fugues (Steedman 1977).

The most well known rule-based method is by Longuet-Higgins and Lee (1983) who establish

and refine beat length (i.e., metrical duration) hypotheses in a note-by-note process from a

symbolic onset list. This rule-based model is built on generative grammar from traditional music

theories (e.g., GTTM). The initial hypothesis assumes a future beat location at a distance of the

IOI from the most recent note. Upon receipt of successive notes, this hypothesis is updated by the

following set of rules. The CONFLATE rule rewards correct hypotheses (i.e., an onset lands

where expected) by ascending the metrical hierarchy. The STRETCH rule increases the length of

the likely period, allowing the system to adjust its hypothesis if the metrical assumption is not

probable. If a musical sequence does not begin on a downbeat, the UPDATE rule adjusts the

hypothesis to start on a metrically relevant event without changing the selected periodicity. The

LONGNOTE rule is activated when an UPDATE has been performed and the initial hypothesis is

based upon the IOI of an upbeat. This rule operates under the theoretical assumption that longer

notes are indicative of metrical accents (e.g., GTTM preference rules [Lerdahl & Jackendoff

1983]). While achieving moderate success on a variety of musical examples, the system is

incapable of addressing expressive timing and metrical change. Another limitation is that it is

incapable of handling polyphony.

The Serioso system (Temperley and Sleator 1999) is capable of deriving multiple levels of

metrical hierarchy from polyphonic input by maximizing a series of preference rules that follow a


similar generative theoretical model to the GTTM (Lerdahl & Jackendoff 1983). The Event rule

prefers to align beats with onsets. The Length rule is similar to the LONGNOTE rule (Longuet-

Higgins & Lee 1983). The Regularity rule prefers evenly-spaced periodicities. User-definable

settings aid the system in its determination of beat times. Temperley and Sleator demonstrate the

program’s ability to locate the tactus within quantized and unquantized MIDI data; however,

occasional errors occur in the determination of metrical accents at higher levels such as

downbeats.

Oscillators

Adaptive Oscillator methods provide a perceptual mechanism to connect theories of musical

structure to perception. Large and Kolen (1994) present a method for tempo tracking of symbolic

musical events through the frequency- and phase-locking functions of an adaptive oscillator

(derived from biological models [Glass & Mackey 1988]). The oscillator is capable of modeling

expectancy of future events, through storing onset times weighted by their proximity to expected

metrical times. Synchronization between the oscillator and driving signal is achieved by the sine

circle map, which provides updated phase values for the oscillator as a function of previous phase

values, oscillator and driving signal periods, and a continuous coupling term. The coupling term

defines the expected range of association of onsets with metrical subdivisions, and is represented

as a distribution similar to that of a Gaussian, parameterized by a relative oscillator phase and an

expectancy curve. The result is a non-linear adaptation, in which the oscillator will modify phase

and period more readily to those onsets that are closer to metrical subdivisions. Onsets are

associated with either strong or weak beats, according to the amount of expectancy for each

metrical level. Tempo and meter values must be initialized by the user, and updates are made

upon each new onset identified. An implementation of the adaptive oscillator is also used by

Pardo (2004) to follow Jazz and Blues accompaniment.
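
The adaptation mechanism can be illustrated with a minimal Python sketch. This is not Large and Kolen's formulation (no sine circle map is used); it simply nudges an oscillator's period and phase toward each incoming onset, weighting the corrections by a bell-shaped expectancy window so that onsets far from the predicted beat have little influence. The learning rates and window width are illustrative values.

    import numpy as np

    def adaptive_oscillator(onsets, period, phase, eta_p=0.3, eta_phi=0.5, width=0.15):
        """Track beats from a list of onset times (in seconds).

        period : initial period estimate in seconds (user-supplied, as in the model)
        phase  : time of the first expected beat
        width  : width of the expectancy window, as a fraction of the period
        """
        expected = phase
        history = []
        for t in onsets:
            # advance the expected beat time until it is near or beyond the onset
            while expected + 0.5 * period < t:
                expected += period
            error = t - expected                       # signed timing deviation
            # bell-shaped expectancy: near-expected onsets get weight ~1, distant ones ~0
            w = np.exp(-0.5 * (error / (width * period)) ** 2)
            period += eta_p * w * error                # period (frequency) adaptation
            expected += period + eta_phi * w * error   # phase adaptation toward next beat
            history.append((t, period, expected))
        return history

    # toy example: a slightly accelerating isochronous sequence
    onsets = np.cumsum(np.linspace(0.60, 0.55, 20))
    for t, p, nxt in adaptive_oscillator(onsets, period=0.6, phase=0.6):
        print(f"onset {t:5.2f} s  period {p:.3f} s  next beat {nxt:5.2f} s")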


Toiviainen (1999) identifies two problems with the Large and Kolen (1994) oscillator

representation: First, the rate of change for oscillator phase and period is subject to the driving

signal, causing abrupt changes at onset times. Second, the oscillator makes no distinction between

IOIs of different duration. Toiviainen adapts Large and Kolen’s model to provide continuous

adaptation and proportional emphasis on IOI length. The model addresses local timing deviations

and global tempo changes with separate short- and long-term adaptation models. Toiviainen

applies this model to extract a control signal for real-time interactive MIDI accompaniment.

Large and Palmer (2002) propose that listeners may use several of these oscillators in

tandem, each tuned to different metrical levels to establish regularity within temporally

fluctuating signals. Coupling between oscillators is shown to improve pulse tracking of complex

signals through forced phase modification of the lower-level oscillators by those of a higher level.

Multiple Agents

Multiple-agent methods from the field of Artificial Intelligence have been used extensively for

beat tracking and tempo induction. These methods create a set of agents, each maintaining an independent hypothesis regarding the possible period and phase of the input signal; the agents then compete for superiority.

Among the first to use multiple agents for beat tracking were Allen and Dannenberg

(1990), who used a real-time beam search method to explore multiple concurrent interpretations

of a symbolic performance. For each onset received, the beat tracker generates several beat and

phase hypotheses that attempt to describe the most recent IOI and metrical position. A measure of

credibility is assigned to each hypothesis. Improbable hypotheses are pruned immediately through

the use of musical rules (e.g., quarter notes must begin on beats). A similarity measure is used to

combine states exceeding a chosen similarity threshold. Credibility for the remaining states is

established jointly from the credibility of the parent, the tempo deviation between the parent and

present state, and a penalty measure created by a set of heuristic music rules (e.g., quarter notes


will generally not cross beat points, or short notes are unlikely at downbeats). The tracker was

evaluated subjectively using several expressive performances and was found to track tempo

changes reliably; however, the user must supply the first metrical note of a performance.

Perhaps the most elaborate of the multiple-agent strategies for beat tracking is the Beat

Tracking System (BTS) by Goto and Muraoka (1995, 1999). This system was the first audio beat

tracker capable of tracking complex audio reliably, and is able to do so for music with or without

drum sounds. For music containing drums, BTS identifies patterns of detected bass drums and

snare drums and associates them with stored pattern templates. The system operates under the assumptions that the tempo of the input audio is between 61 and 185 M.M. (i.e., Mälzel's Metronome) and that the music is in a 4/4 time signature. Beat times and types (i.e., strong or weak) are realized

in a two-stage process of frequency analysis and beat prediction. Frequency analysis is performed

using both onset finders and specialized bass drum and snare drum localization processes. Input

audio undergoes subband decomposition, and onsets are extracted by fourteen independent onset

finders. BTS learns the characteristic frequencies for the bass drum tones for each input signal

through the lowest accumulated frequency peak in a histogram function. Each onset is then

surveyed for its frequency content, and if it contains a spectral peak corresponding to the learned

bass tone, the onset is assumed to represent presence of the bass drum tone. Snare drums are

detected by determining the degree of widespread noise components, and a simple peak picking

function is then used to extract these onsets.

Twenty-eight multiple agents track beat period, beat phase, and beat type hypotheses.

Agents are grouped into fourteen pairs; each pair shares a common periodicity, and each agent is

one-half cycle out of phase. An expectancy curve is established between the two agents, in which

the predicted beat time of one agent inhibits the expectancy of the other. Onset information

extracted during frequency analysis is used to form the basis of these hypotheses. Beat period

estimates are created by selecting the maximum peak of an autocorrelation function of onset times.

Beat phase is then selected as the maximum peak of the cross-correlation of onset times and a


generated pulse sequence in which the IOI is equivalent to the estimated beat period. A reliability

measure is awarded to each hypothesis by testing the continuity provided by the current estimate

with the two prior beat times. A manager selects the most likely hypothesis based on the

reliability. Beat type is established for the extracted drum pattern via pattern matching. The

detected pattern is associated with the most similar of a set of stored patterns, and each onset is

then labeled as either strong or weak (according to its location in the pattern). Goto and Muraoka

(1995) evaluate the BTS system using a database of forty-four popular songs, of which forty-two were judged to have correct beat periods and phases after only a few measures.
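
The phase-selection step can be illustrated with a simplified fragment (not BTS itself): onset times are binned into a frame-level impulse train and compared against pulse sequences of the estimated beat period at every possible offset; the offset yielding the strongest correlation is taken as the beat phase. The frame rate is an illustrative value.

    import numpy as np

    FPS = 100  # analysis frames per second (illustrative)

    def beat_phase(onset_times, period_frames, duration):
        """Select beat phase by correlating an onset impulse train with a pulse sequence."""
        x = np.zeros(int(duration * FPS))
        x[np.round(np.asarray(onset_times) * FPS).astype(int)] = 1.0
        # score each candidate offset by how many onsets its pulse sequence hits
        scores = [np.sum(x[offset::period_frames]) for offset in range(period_frames)]
        return int(np.argmax(scores))

    # toy example: beats every 0.5 s (a period of 50 frames), first beat 0.2 s in
    onsets = 0.2 + 0.5 * np.arange(20)
    offset = beat_phase(onsets, period_frames=50, duration=11.0)
    print(f"estimated beat phase: {offset / FPS:.2f} s")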

Goto and Muraoka (1999) extend the BTS model to determine the quarter-note, half-note

and measure periodicities in music without drums. Frequency analysis and hypothesis

formulation is largely unchanged from the drum-based model (Goto & Muraoka 1995).

Reliability of each hypothesis is assessed as follows: The beat period and phase estimates are

tested for continuity with the prior beat times. The estimates are also evaluated using top-down

eighth-note and quarter-note change possibilities. Using the present estimate, the input signal is

split into eighth notes (or quarter notes for the quarter-note change possibility) and the spectrum

is analyzed for significant changes. Within each slice, magnitudes in each bin are tallied across all

time points, creating a single envelope that characterizes the entire slice. Spectral peaks are then

extracted from the envelope and regularized with a gain function. A measure of chord change

possibility is then derived through comparison of these peaks between successive eighth notes (or

quarter notes). As with the earlier model (Goto & Muraoka 1995), hypotheses are handled by a

manager, which groups and selects winning hypotheses according to beat period and phase.

In an evaluation on forty popular music songs devoid of drum sounds, the system

correctly tracked quarter-note times with 87.5% accuracy. These results are vastly improved over

those of the previous system, which scored 22.5% on the same test set. There are two possible

reasons for such attractive results achieved by the system: First, 4/4 is the only possible time

signature, resulting in metrical timing without ternary or odd time signatures. Second, possible


tempi were between 61 and 120 M.M., eliminating the possibility of errors relating to tempo

doubling or halving. These are, however, reasonable assumptions to make, given the proposed

application. Goto and Muraoka use the Beat Tracking System alongside a melody and bassline

transcription method (Goto and Hayamizu 1999) for automated analysis and similarity

determination of musical sections (e.g., verse, chorus) within popular music (Goto 2003).

BeatRoot (Dixon 2001) is a multiple-agent beat tracker used to study expressive timing

within performances. This system was originally designed to accept MIDI input, but has recently

been extended with a discrete onset detection stage to operate with audio as well (Dixon 2006).

Input audio is transformed into a simplified continuous onset detection representation by the

spectral flux calculation (Duxbury 2004). A local maxima function selects onsets from the

continuous representation if the following considerations are met: first, the amplitude of the

present sample must exceed that of the prior and subsequent samples; second, it must exceed that

of a threshold above a fixed local mean; third, it must exceed another threshold derived from the

combined amplitude of prior and present samples.
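
A minimal sketch of this style of peak picking is given below; it implements only the local-maximum and local-mean conditions (the third, amplitude-based threshold is omitted), and the window length and threshold are illustrative values rather than BeatRoot's parameters.

    import numpy as np

    def pick_onsets(odf, window=10, delta=0.1):
        """Select onset peaks from a continuous detection function (simplified)."""
        onsets = []
        for n in range(1, len(odf) - 1):
            local_max = odf[n] > odf[n - 1] and odf[n] >= odf[n + 1]
            lo, hi = max(0, n - window), min(len(odf), n + window + 1)
            above_mean = odf[n] > odf[lo:hi].mean() + delta
            if local_max and above_mean:
                onsets.append(n)
        return onsets

    # toy detection function: low-level noise with clear peaks every 50 frames
    rng = np.random.default_rng(1)
    odf = 0.05 * rng.random(500)
    odf[25::50] += 1.0
    print("onset frames:", pick_onsets(odf))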

IOIs are then clustered together; however, the clustering scheme is not limited to adjacent

intervals. Rather, intervals between non-adjacent onsets, typically spanning more than one IOI, are also considered. This is done to reduce the effect of errors in the onset detection stage. Clustering results in a

categorization of intervals, the most populated of which are assumed to correspond to durations

of metrical hierarchy. Hypotheses are then generated regarding durations present in the clusters,

creating a ranked tempo hypothesis list used for beat detection. Multiple agents are used to test

the validity of each hypothesis initialized. Each agent is assigned a tempo hypothesis and initial

onset, from which it determines the time of the next beat. If the next onset is found within a small

window surrounding this predicted time, the system identifies this time as a beat. If the onset is

outside of this small window, but still within a larger window surrounding the predicted time, the

system assumes it to be a potential beat. Agents are created as needed to follow beat period and


phase possibilities that are not being investigated by other agents, and are terminated when

similar hypotheses are being followed concurrently, or if no beat times are reported.
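
A compact sketch in the spirit of such an agent (not Dixon's implementation) is shown below: the agent holds a tempo hypothesis and a predicted next beat time, accepts onsets inside a narrow inner window as beats (using them to correct period and phase), accepts onsets inside a wider outer window tentatively, and ignores everything else. Window sizes and the correction factor are illustrative.

    import numpy as np

    class Agent:
        def __init__(self, period, first_beat, inner=0.04, outer=0.15):
            self.period = period               # tempo hypothesis (seconds per beat)
            self.prediction = first_beat + period
            self.beats = [first_beat]
            self.inner = inner                 # inner window, in seconds
            self.outer = outer * period        # outer window, relative to the period

        def observe(self, onset):
            """Feed one onset time; accept it firmly, tentatively, or not at all."""
            while onset > self.prediction + self.outer:     # interpolate missed beats
                self.beats.append(self.prediction)
                self.prediction += self.period
            error = onset - self.prediction
            if abs(error) <= self.inner:                    # firm beat: correct the tempo
                self.period += 0.5 * error
                self.beats.append(onset)
                self.prediction = onset + self.period
            elif abs(error) <= self.outer:                  # tentative beat
                self.beats.append(onset)
                self.prediction = onset + self.period

    # toy example: one agent tracking a 0.5 s pulse with small timing jitter
    rng = np.random.default_rng(0)
    onsets = 0.5 * np.arange(1, 30) + rng.normal(0.0, 0.01, 29)
    agent = Agent(period=0.5, first_beat=0.0)
    for t in sorted(onsets):
        agent.observe(t)
    print(f"tracked {len(agent.beats)} beats, final period {agent.period:.3f} s")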

BeatRoot received the highest performance score among five state of the art beat tracking

systems in the MIREX 2006 Audio Beat Tracking Evaluation (McKinney et al. 2007). The

systems were evaluated on 140 files (each file is annotated by approximately forty people) of

varied genre. The performance score is derived from a cross-correlation between an impulse train

generated from beat times, and an impulse train from each of the annotations (the cross-

correlation is then normalized by the maximum number of impulses between the two vectors).

Tempogram

Cemgil et al. (2001) seek to identify a hidden tempo process by formulating a probability

distribution based on a note onset list. Onsets are first convolved with a Gaussian function,

creating a continuous signal with peaks at onset positions. A tempogram representation is

generated by the inner product of this continuous signal and a tempo basis function. The

tempogram provides the probability of the onset list given different tempo tracks, and is similar in

response to that of a comb filter bank. The output of the tempogram is assumed to be ‘noisy’, and

Cemgil et al. smooth it with a Kalman filter. The Kalman filter takes advantage of the relative

strength of each tempo observation, and provides a balance between noise removal and

flexibility. Cemgil et al. implement a switching-state Kalman filter that operates in two modes

produced by a mixture of two Gaussians, which explicitly labels outlier tempo observations as

such—effectively reducing their effect on the overall tempo process. When the observation

resides within the expected range the filter operates in its normal mode. However, to prevent

outliers from skewing tempo predictions, the second Gaussian is set with a wider variance.
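
A minimal sketch of a tempogram-like representation is given below (illustrative frame rate and Gaussian width; the Kalman smoothing stage is omitted): onsets are smoothed with a Gaussian, and the inner product of the resulting signal with pulse-train basis functions at candidate tempi, maximized over phase, yields a salience value per tempo.

    import numpy as np

    FPS = 100  # frames per second (illustrative)

    def smoothed_onsets(onset_times, duration, sigma=0.02):
        """Convolve an onset impulse train with a Gaussian (sigma in seconds)."""
        t = np.arange(int(duration * FPS)) / FPS
        x = np.zeros_like(t)
        for onset in onset_times:
            x += np.exp(-0.5 * ((t - onset) / sigma) ** 2)
        return x

    def tempogram(x, bpms):
        """Inner product of x with pulse-train basis functions, maximized over phase."""
        salience = []
        for bpm in bpms:
            period = int(round(60 * FPS / bpm))
            salience.append(max(np.sum(x[phase::period]) for phase in range(period)))
        return np.array(salience)

    onsets = 0.1 + 0.5 * np.arange(20)          # toy input at 120 BPM
    x = smoothed_onsets(onsets, duration=11.0)
    bpms = np.arange(60, 200, 2)
    tg = tempogram(x, bpms)
    print("most salient tempo:", bpms[int(np.argmax(tg))], "BPM")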

Quantization


Desain and Honing (1989) focus on the problem of timing variability between pairs of notes,

towards a bottom-up representation of rhythm. Their method, termed quantization, identifies

integer durations from inter-onset interval (IOI) lengths using compound networks similar to

neural networks (without a learning process). Bidirectional cells are used to model the individual

IOI values, which may be increased or reduced, depending on the initial durations. Sum cells

allow for modification of several grouped cells at once. Desain and Honing’s method allows for

the existence of both discrete timing intervals between adjacent notes, as well as continuous

tempo modifications that occur over larger time scales.

Desain (1992) extends this approach with continuous expectancy curves for modeling

anticipation. These curves are comprised of multiple Gaussian distributions placed at future

integers and subdivisions of each inter-onset interval. For each input onset, a distribution is

placed at a distance of one IOI from the most recent onset. Distributions are also placed at one-

third, one-half, three-halves, and twice the interval length. As onsets are received, intervals

are calculated and their corresponding distributions are combined. The fractions and multiples

used are in keeping with perceptual studies from both speech and music research. In more

complex examples, syncopation is modeled through the use of virtual events. Beat periodicity

may then be achieved by determining global maxima of the expectancy curve (Desain & Honing

1994).
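
The following fragment sketches such an expectancy curve under illustrative parameters (the ratio weights and Gaussian widths are not Desain's values): Gaussians are projected forward from the most recent onset at the listed fractions and multiples of each IOI, summed, and the global maximum indicates the most expected future event time.

    import numpy as np

    # fractions/multiples of the IOI and illustrative expectancy weights
    RATIOS = {1/3: 0.4, 1/2: 0.6, 1.0: 1.0, 3/2: 0.6, 2.0: 0.8}

    def expectancy_curve(onsets, horizon=2.0, resolution=0.001, width=0.03):
        """Sum Gaussian expectancies projected forward from the most recent onset."""
        onsets = np.asarray(onsets, dtype=float)
        t = np.arange(onsets[-1], onsets[-1] + horizon, resolution)
        curve = np.zeros_like(t)
        for ioi in np.diff(onsets):
            for ratio, weight in RATIOS.items():
                mu = onsets[-1] + ratio * ioi        # expected future event time
                sigma = width * ratio * ioi          # wider for longer projections
                curve += weight * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
        return t, curve

    # toy example: onsets every 0.5 s; the strongest expectancy falls one IOI ahead
    t, curve = expectancy_curve([0.0, 0.5, 1.0, 1.5])
    print(f"peak expectancy at {t[int(np.argmax(curve))]:.3f} s")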

Raphael (2001) presents a method for rhythm transcription that simultaneously extracts

tempo and rhythmic structure either from a MIDI sequence, or from audio, using an onset

detector as a front end. Each observed IOI is assumed to be the noisy (i.e., unquantized) result of

the product of a hidden metrical duration and hidden tempo. The tempo is initially modeled as a

Gaussian, centered on a user-specified expected value. Upon future observations, the tempo is

updated by the previous tempo, plus a zero-mean Gaussian noise term with a variance given by

the length of the metrical note. This results in a slowly-evolving tempo process with an increased

range of possible tempi for longer notes. Probabilities of all possible metrical values for a given


IOI are tested for validity by overlaying Gaussian kernels representing the joint probability of the

metrical position, IOI, and tempo, with the assumption that, at the end of the sequence, the path with the highest score will represent the most likely path followed. Raphael identifies this

most likely path by computing maximum a posteriori (MAP) estimates from the joint probability

density function comprising metrical position, tempo, and onset interval. To keep the number of states tractable, he introduces a selection method that discards kernels that do not

add to the overall likelihood calculation (a process termed “thinning”).
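
The core of this formulation can be illustrated with a simplified fragment (not Raphael's MAP computation): each candidate metrical duration for an observed IOI is scored by how plausible the beat period it implies is under the current tempo estimate, with variance growing with note length, and the best candidate is used to update a slowly evolving tempo. Candidate durations, variances, and the update rule are illustrative.

    import numpy as np

    CANDIDATES = np.array([0.25, 1/3, 0.5, 2/3, 0.75, 1.0, 1.5, 2.0])  # beats per IOI

    def quantize(iois, tempo=0.5, tempo_var=0.004, obs_var=0.0004):
        """Assign a metrical duration (in beats) to each IOI while tracking a slow tempo drift.

        tempo     : current estimate of the beat period in seconds
        tempo_var : variance of the tempo random walk, scaled by note length
        obs_var   : variance of the timing noise on each observed IOI
        """
        durations = []
        for ioi in iois:
            implied = ioi / CANDIDATES                 # beat period implied by each candidate
            # Gaussian scores: prefer candidates whose implied tempo is near the current one
            scores = np.exp(-0.5 * (implied - tempo) ** 2 /
                            (tempo_var * CANDIDATES + obs_var))
            best = int(np.argmax(scores))
            durations.append(float(CANDIDATES[best]))
            tempo = 0.8 * tempo + 0.2 * implied[best]  # slowly evolving tempo process
        return durations, tempo

    # toy example: quarter, quarter, eighth, eighth, half note at ~0.5 s per beat
    iois = [0.51, 0.49, 0.26, 0.24, 1.02]
    durations, tempo = quantize(iois)
    print("metrical durations (beats):", durations)
    print(f"final tempo estimate: {tempo:.3f} s per beat")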

Cemgil et al. (2003) formulate their method similarly, but instead use Markov chain

Monte Carlo and sequential Monte Carlo techniques to sample from the distributions, and discard

nodes with low likelihoods. Hainsworth and Macleod (2004) present a similar approach for audio,

with the main distinctions being the use of particle filters in place of the Monte Carlo methods,

and a finer-grain metrical subdivision, to combat possible errors from onset detection.

NON-TRANSCRIPTIVE APPROACHES

The models discussed thus far seek to infer musical structure (e.g., rhythm, or meter) either

directly from symbolic data or from segments of input audio, which are assumed to relate to

specific metrical durations. The following non-transcriptive approaches to beat and tempo make

no such assumption regarding discrete values prior to beat or tempo determination. Instead, beat

or tempo is generated exclusively from signal trends and characteristics.

Comb Filters

Scheirer (1996) states that the transcriptive metaphor is unfounded; he believes note primitives

(i.e., building blocks of a mental representation) are not valid targets for reduction prior to

calculating beat period and phase. Instead, he proposes a direct calculation of beat period from a

continuous audio signal, using psychoacoustically relevant transforms (Scheirer 1998). The input

signal is first divided into six subbands in order to isolate rhythmic activity into frequency bands


that best characterize them (e.g., kick drums in lower bands, and cymbals in high bands). This

provides a more controlled analysis by reducing the amount of interference from instrumentation

across bands. Next, each channel is convolved with half-Hanning window resulting in a half-

wave enveloped signal. Each envelope is then sent through a comb filterbank and resonances are

measured. The filterbank is comprised of 150 logarithmically spaced feedback delays for each

channel, equivalent to tempi between 60 and 240 BPM. Energy produced from the interaction

between the input signal and filters is indicative of a correspondence between the input sequence

and filter tempo. Output energies across the subbands are summed, and the winning periodicity is

given by the filter exhibiting the highest response across the bank. Beat phase is

retrieved in two steps: first, delay vectors from the internal states of winning filters in each

subband are summed to create a single vector; and second, the maximum value of this vector is

chosen as the next predicted beat location. Beat period and phase are reevaluated every 25 ms to

provide for possible tempo modulations.
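
The resonance measurement can be illustrated with a single-band sketch (Scheirer's system uses six subbands and 150 filters per band; the feedback gain and frame rate here are illustrative): each candidate tempo is represented by a feedback comb filter, and the filter producing the most output energy when driven by the amplitude envelope indicates the periodicity of the input.

    import numpy as np

    FPS = 200  # envelope sample rate (illustrative)

    def comb_filter_energy(envelope, delay, alpha=0.9):
        """Output energy of the feedback comb filter y[n] = alpha*y[n-delay] + (1-alpha)*x[n]."""
        y = np.zeros_like(envelope)
        for n in range(len(envelope)):
            feedback = y[n - delay] if n >= delay else 0.0
            y[n] = alpha * feedback + (1 - alpha) * envelope[n]
        return float(np.sum(y ** 2))

    def estimate_tempo(envelope, bpms):
        energies = [comb_filter_energy(envelope, int(round(60 * FPS / bpm)))
                    for bpm in bpms]
        return bpms[int(np.argmax(energies))]

    # toy envelope: a burst of energy every 0.5 s (120 BPM)
    env = np.zeros(10 * FPS)
    env[::FPS // 2] = 1.0
    print("winning tempo:", estimate_tempo(env, np.arange(60, 241, 5)), "BPM")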

Scheirer’s beat tracker was tested on 60 hand-annotated examples, and achieved an

accuracy of 68%, a partial accuracy of 18%, and was judged incorrect on 13% of the examples. In

certain circumstances the model has been reported to exceed human ability.

Several authors have since used Scheirer’s (1998) framework. Jehan (2005) uses comb

filters towards segmentation for a beat-level feature extraction. Kurth et al. (2006) also extract

features; however, recognizing the prominence of halving and doubling errors often produced in

tempo estimations, they propose an octave-invariant measure of tempo. Their process is

performed in three steps: first a beat spectrogram is generated from the output of a comb

filterbank. Second, this spectrogram is divided into tempo octaves (e.g., 10–20 BPM, 20–40

BPM, 40–80 BPM, etc.). Third, the octaves are summed, enhancing tempi that are in harmonic

relation, and minimizing those exhibiting an inharmonic relationship. Kurth et al. generate

rhythmic and metrical features from this system and use these towards identification of time-

scaled audio within an information retrieval context.
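
The octave-folding step can be sketched as follows (illustrative bin count and reference octave; this is not Kurth et al.'s cyclic beat spectrum computation): a tempo salience curve indexed in BPM is folded into a single reference octave by repeated halving or doubling, so that harmonically related tempi such as 60, 120, and 240 BPM accumulate in the same octave-invariant class.

    import numpy as np

    def fold_to_octave(bpm, lo=60.0, hi=120.0):
        """Map a tempo into the reference octave [lo, hi) by halving or doubling."""
        while bpm >= hi:
            bpm /= 2.0
        while bpm < lo:
            bpm *= 2.0
        return bpm

    def octave_invariant_salience(bpms, salience, bins=36, lo=60.0, hi=120.0):
        """Sum a tempo salience curve into octave-invariant bins within [lo, hi)."""
        folded = np.zeros(bins)
        for bpm, s in zip(bpms, salience):
            folded[int(bins * np.log2(fold_to_octave(float(bpm), lo, hi) / lo))] += s
        return folded

    # toy salience curve with peaks at the octave-related tempi 60, 120, and 240 BPM
    bpms = np.arange(40, 300)
    salience = sum(np.exp(-0.5 * ((bpms - c) / 2.0) ** 2) for c in (60, 120, 240))
    folded = octave_invariant_salience(bpms, salience)
    print(f"strongest octave-invariant class: ~{60 * 2 ** (np.argmax(folded) / 36):.0f} BPM")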


Klapuri et al. (2006) have extended the comb filter approach with a probabilistic model

that jointly determines the tatum, tactus, and meter. The incoming audio signal undergoes a 36-

channel subband decomposition, from which power is extracted across each channel. Following

compression and low-pass filtering, adjacent channels are summed, resulting in four distinct

subbands. These signals are sent to a bank of comb filters. Unlike Scheirer's (1998) method, no

discrete determination of periodicity or phase is made. Instead, the output of the filters is used as

an observation sequence for a hidden Markov model. For every input frame, the state sequence

(comprised of hidden states representing tatum, tactus, and measure periods) is determined based

on the prior state and the observed filterbank output. Klapuri et al. assume that both the measure

period and tatum are functions of the tactus, and simplify the model by encoding this dependency.

Transition probabilities are formed from the product of a prior probability and a Gaussian

function set to limit large period changes. The prior places emphasis on the range of likely

periods given by tempi from a database, and is modeled using Parncutt's (1994) log-Gaussian

distribution. Dependence between concurrent periods is modeled in order to capture the

relationship between the different levels of metrical hierarchy, and is performed with a Gaussian

mixture model. Viterbi back tracing is used to achieve the optimal state sequence. To limit the

potential state explosion that could result from standard Viterbi, Klapuri et al. (2006) choose the

five most likely candidates of each frame using beam search.
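
The structure of these transition probabilities can be illustrated with a simplified fragment (the prior parameters and spreads below are illustrative, not Klapuri et al.'s values): the probability of moving from one tactus period to another is the product of a log-Gaussian prior over absolute period, in the spirit of Parncutt's distribution, and a Gaussian term penalizing large relative period changes.

    import numpy as np

    def period_prior(period, center=0.55, spread=0.28):
        """Log-Gaussian prior over tactus period in seconds (illustrative parameters)."""
        return np.exp(-0.5 * (np.log(period / center) / spread) ** 2)

    def transition_prob(prev_period, new_period, change_spread=0.1):
        """Unnormalized transition probability between successive tactus periods."""
        change = np.exp(-0.5 * (np.log(new_period / prev_period) / change_spread) ** 2)
        return period_prior(new_period) * change

    # candidate periods for the next frame, given a current tactus period of 0.50 s
    candidates = np.array([0.25, 0.48, 0.50, 0.52, 0.75, 1.00])
    probs = np.array([transition_prob(0.50, p) for p in candidates])
    probs /= probs.sum()
    for p, pr in zip(candidates, probs):
        print(f"period {p:.2f} s -> probability {pr:.3f}")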

Once the period is established, the tatum, tactus, and measure phases are determined

through a two-tiered method. First, an additional HMM identifies the tactus periodicity as a

hidden state variable, and recognizes the frame-length output of the winning filters as an

observation. The state-conditional observational likelihoods are then approximated as the

weighted sum of the observation filter responses. Second, measure-level phases are addressed

using a template matching technique. Phase estimation is achieved through selecting one of two

patterns that best fits the data. Transition probabilities between successive possible beat times are

modeled by a Gaussian distribution as a function of prediction error. Given the present period,


and previous beat time, the prediction error measures the deviation of new phase from the

predicted beat time.

In the MIREX 2006 Audio Beat Tracking Evaluation, the system by Klapuri et al. (2006)

(KEA) placed third among the five entered systems (McKinney et al. 2007). While the MIREX results

determined Dixon’s method (2006) to be the most accurate, the results also demonstrate (1) there

was no statistical difference between the top-scoring algorithms; and (2) the KEA system

demonstrated the least octave switching of the tested algorithms.

Autocorrelation

An autocorrelation function assesses the identity and strength of periodicities within a signal, and

is computed as the inner product of a signal with a time-shifted version of itself. Indices

of the autocorrelation output, or lags, represent the length of the periodicities found in the input

signal. Brown (1993) applied autocorrelation to symbolic melodies of Classical music to extract

musical meter. It has since been used towards detection of metrical periodicity and tempo in onset

detection functions derived from audio signals (Alonso et al. 2004; Davies & Plumbley 2007;

Foote & Uchihashi 2001; Leue & Izmirli 2006; Goto & Muraoka 1995, 1999). Autocorrelation

methods are attractive due to both their computational simplicity and good results.
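
A minimal example of this calculation, assuming an onset detection function sampled at a fixed frame rate, is given below: the lag of the strongest autocorrelation peak within a plausible range is converted directly to beats per minute.

    import numpy as np

    def tempo_from_odf(odf, fps, min_bpm=40, max_bpm=240):
        """Estimate tempo (BPM) from an onset detection function sampled at fps frames/s."""
        odf = odf - odf.mean()                              # remove the DC component
        acf = np.correlate(odf, odf, mode="full")[len(odf) - 1:]
        min_lag = int(60 * fps / max_bpm)                   # lag of the fastest tempo
        max_lag = int(60 * fps / min_bpm)                   # lag of the slowest tempo
        lag = min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))
        return 60.0 * fps / lag

    # toy onset detection function: a peak every 0.5 s at 100 frames per second
    fps = 100
    odf = np.zeros(20 * fps)
    odf[::fps // 2] = 1.0
    print(f"estimated tempo: {tempo_from_odf(odf, fps):.1f} BPM")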

Davies and Plumbley (2007) present a hybridized approach to audio beat tracking that uses both autocorrelation and comb filtering. The algorithm is a lightweight signal processing

technique that extracts periodicity through one of two possible states. Upon initialization, the

tracker begins in the General State, in which parameters are generalized to expect a wide range of

periodicities. Once a few regularized periods have been determined, the tracker switches to the

Context-Dependent State, which seeks to use information such as meter and previously

determined periodicities to generate new periods.

Input audio is first transformed into an onset detection function using the complex

spectral difference method (Duxbury 2004). Autocorrelation is performed on segments of the


signal. In the General State, the output of the autocorrelation calculation is weighted by a

Rayleigh distribution (Papoulis 1984) representing the likely range of musical periods with a

center lag representing 120 BPM. The weighted autocorrelation is then sent as an input to a bank

of comb templates resembling Scheirer's (1998) filterbank. In all, 512 comb templates comprise

the template bank, representing tempi between 40 and 240 BPM. Each template is comprised of

four weighted delta functions (i.e., impulses) located at integer multiples of the associated period.

The template exhibiting the most energy is selected as the winning period. Once a few regular

periods are received, the system switches to the Context-dependent State. Much of the processing

remains the same; however, two main alterations are made: first, the Rayleigh distribution is

replaced by a tight Gaussian, centered on the most recent period; and second, the time signature is

established as either binary or ternary using a method first described by Gouyon and Herrera

(2003). Peak energy of the autocorrelation at odd and even multiples of the beat period is

extracted and compared. If ternary energy is determined to be more relevant, the comb template is

reformulated to contain three elements.
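
A condensed sketch of the General State periodicity estimate is given below (illustrative parameters, not the authors' implementation): the autocorrelation of an onset detection function frame is weighted by a Rayleigh distribution peaking at the lag of 120 BPM, comb templates of four impulses at integer multiples of each candidate period are applied, and the template gathering the most energy gives the beat period. In the Context-Dependent State the Rayleigh weighting would be replaced by a narrow Gaussian centered on the previously chosen period, as described above.

    import numpy as np

    def rayleigh(lags, beta):
        """Rayleigh weighting over lag, peaking at beta (the lag of the preferred tempo)."""
        return (lags / beta ** 2) * np.exp(-lags ** 2 / (2 * beta ** 2))

    def general_state_period(odf_frame, fps, min_bpm=40, max_bpm=240):
        """Select a beat period from one detection-function frame via weighted ACF and combs."""
        odf_frame = odf_frame - odf_frame.mean()
        acf = np.correlate(odf_frame, odf_frame, mode="full")[len(odf_frame) - 1:]
        lags = np.arange(len(acf))
        weighted = acf * rayleigh(lags, beta=60 * fps / 120)   # prefer periods near 120 BPM
        best_period, best_energy = 0, -np.inf
        for period in range(int(60 * fps / max_bpm), int(60 * fps / min_bpm) + 1):
            comb = period * np.arange(1, 5)                    # four impulses per template
            comb = comb[comb < len(weighted)]
            energy = weighted[comb].sum()
            if energy > best_energy:
                best_period, best_energy = period, energy
        return best_period

    fps = 100
    odf = np.zeros(6 * fps)
    odf[::fps // 2] = 1.0                                      # peaks every 0.5 s
    period = general_state_period(odf, fps)
    print(f"beat period: {period / fps:.2f} s ({60 * fps / period:.0f} BPM)")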

Beat alignment is achieved through a similar comb template. Elements are set to the

determined beat period, and each template is offset one sample from the last. The comb template

demonstrating the maximum product between the current onset detection function frame and the

bank of templates is selected as containing the correctly aligned initial beat time for that frame.

Additional beat times are identified by iteratively adding the chosen period to this initial time.

Davies and Plumbley (2007) evaluate the performance of their algorithm (DP) against

several state of the art systems including those of Klapuri et al. (2006) (KEA) and Dixon (2001),

using a database of 222 files of various genres. Results indicate that there is no statistical

difference between performance of the KEA and DP systems, demonstrating the proficiency of

the DP two-state model, given the relative simplicity as compared to the KEA system. The DP

beat tracker has also been evaluated in the MIREX 2006 Audio Beat Tracking Evaluation, in

which it finished in second place among five tested algorithms (McKinney et al. 2007). This beat


tracker was originally implemented in a system for automated percussive accompaniment (Davies

2007), and has also found use within content-aware effects processing (Stark et al. 2007).

Seppänen et al. (2006) present an efficient beat tracking algorithm that simultaneously

tracks tactus and tatum. Much of the architecture (e.g., onset detection function representation

and likelihood formulation) is derived from Klapuri et al. (2006), yet simplifications have been

made to reduce computational complexity, as it is intended for use on cellular phones. After

subband decomposition and enveloping stages, the signal is autocorrelated. The discrete cosine

transform (DCT) is then used to approximate the underlying periodicities exhibited in the

autocorrelation. A weighted product of the DCT and autocorrelation frame power forms the

observation matrix. A Gaussian mixture model is used to approximate likelihood functions that

characterize tatum and tactus dependence. The observation matrix is weighted by a prior

(established similarly to that by Klapuri et al. [2006]) and is multiplied by the likelihood

functions. The maximum value then defines the winning beat and tatum period pairs.

Beat phase estimation is performed only for the tactus, as tatum phase can be derived given

knowledge of tactus phase (but not vice versa). A comb filter is adapted to the currently

determined tactus period, and a cursory estimation is made based on the previous beat time and

the new period. The detection function segment under analysis is sent through the comb filter,

and the output is scored based on peak height and the distance from the predicted phase value.

Seppänen et al. (2006) perform an evaluation of the beat period and phase finding

functionality of their system against those of Scheirer (1998) and Klapuri et al. (2006), on a set of

192 songs of varied genre. Their system achieves slightly lower accuracy than the Klapuri et al. method, and better results than Scheirer's system.

Ellis (2006) presents a beat tracking method that is used as a first stage in a cover song

identification system (Ellis & Poliner 2007). An onset detection function is achieved as follows:

first, the signal is transformed into a log-magnitude Mel-frequency spectrogram representation;

second, the first-order difference is performed across the temporal axis of the spectrogram; third,


the resulting signal is half-wave rectified; finally, the signal is summed across the frequency axis,

resulting in a single continuous waveform. Towards determination of a global tempo, Ellis

performs autocorrelation on the onset detection function, and weights the output waveform with a

Gaussian centered at the lag corresponding to 120 BPM. The maximum value is selected from

this vector and used by a dynamic programming function to determine beat times. A beat tracking

score is provided for each detection function sample, and is given by the sum of the sample's

detection function value and a product of the previous sample's score and a log-time Gaussian

window. Viterbi back tracing is then used to determine the most likely sequence of beat times

starting from a large score near the end of the file.
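
A compact sketch of dynamic-programming beat selection in this spirit is given below (simplified: the log-time Gaussian enters here as an additive penalty on the best previous cumulative score, and the period and tightness values are illustrative): each frame's cumulative score combines its onset detection value with the best penalized score at a lag near the beat period, and beats are read off by backtracing from a high score near the end of the excerpt.

    import numpy as np

    def dp_beats(odf, period, tightness=100.0):
        """Dynamic-programming beat tracking over an onset detection function.

        period    : target beat period in frames (e.g., from an autocorrelation stage)
        tightness : how strongly deviations from the target period are penalized
        """
        n = len(odf)
        score = odf.astype(float).copy()
        backlink = np.full(n, -1)
        for t in range(period // 2, n):
            lo, hi = max(0, t - 2 * period), max(0, t - period // 2)
            prev = np.arange(lo, hi)                  # plausible previous-beat locations
            if len(prev) == 0:
                continue
            # log-time Gaussian penalty on the deviation from the ideal period
            penalty = -tightness * np.log((t - prev) / period) ** 2
            candidate = score[prev] + penalty
            best = int(np.argmax(candidate))
            if candidate[best] > 0:
                score[t] = odf[t] + candidate[best]
                backlink[t] = prev[best]
        # backtrace from the highest cumulative score near the end of the signal
        beats = [int(np.argmax(score[-2 * period:])) + n - 2 * period]
        while backlink[beats[-1]] >= 0:
            beats.append(int(backlink[beats[-1]]))
        return sorted(beats)

    fps = 100
    odf = np.zeros(10 * fps)
    odf[25::fps // 2] = 1.0                           # onsets every 0.5 s, offset by 0.25 s
    beats = dp_beats(odf, period=50)
    print("first beats (s):", [round(b / fps, 2) for b in beats[:5]])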

Ellis' (2006) tracker was evaluated in the MIREX 2006 Audio Beat Tracking Evaluation,

in which it finished in fourth place of the five algorithms entered (McKinney et al. 2007). Xiao et

al. (2008) have improved the performance of Ellis' beat tracker with the addition of a statistical

model to investigate the association between tempo and timbre characteristics. A Gaussian

mixture model is trained using vectors comprised of hand-labeled tempi and frame-wise Mel-

frequency cepstral coefficients (MFCCs). Tempo predictions are then made by maximizing the

likelihood function of possible tempi provided an extracted MFCC calculation. Xiao et al.

provide two evaluations demonstrating the improvements due to this method. Using the Ellis

(2006) beat tracker as a base, the timbre association model demonstrates an improvement of

10.7% and 26.51% on two tested databases. Results from both databases demonstrate that errors

relating to tempo halving and doubling were significantly reduced.


REFERENCES

Allen, P., and R. Dannenberg. 1990. Tracking musical beats in real time. In Proceedings of the 1990 International Computer Music Conference. 140–3.
Alonso, M., B. David, and G. Richard. 2004. Tempo and beat estimation of musical signals. In Proceedings of the 5th International Conference on Music Information Retrieval. 158–63.
Bello, J., L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler. 2005. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing 13(5): 1035–47.
Bello, J., and J. Pickens. 2005. A robust mid-level representation for harmonic content in music signals. In Proceedings of the 6th International Conference on Music Information Retrieval. 304–11.
Bello, J., and M. Sandler. 2003. Phase-based note onset detection for music signals. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing. 49–52.
Berry, W. 1985. Metric and rhythmic articulation in music. Music Theory Spectrum 7(Spring): 7–33.
Brown, J. 1993. Determination of the meter of musical scores by autocorrelation. Journal of the Acoustical Society of America 94(4): 1953–7.
Cemgil, A., B. Kappen, P. Desain, and H. Honing. 2001. On tempo tracking: Tempogram representation and Kalman filtering. Journal of New Music Research 28(4): 259–73.
Cemgil, A., and B. Kappen. 2003. Monte Carlo methods for tempo tracking and rhythm quantization. Journal of Artificial Intelligence Research 18: 45–81.
Cooper, G., and L. Meyer. 1960. The rhythmic structure of music. Chicago: University of Chicago Press.
Davies, M. 2007. Towards automatic rhythmic accompaniment. PhD diss., Queen Mary, University of London.
Davies, M., and M. Plumbley. 2007. Context-dependent beat tracking of musical audio. IEEE Transactions on Audio, Speech, and Language Processing 15(3): 1009–20.
Desain, P. 1992. A (de)composable theory of rhythm perception. Music Perception 9(4): 439–54.
Desain, P., and H. Honing. 1989. The quantization of musical time: A connectionist approach. Computer Music Journal 13(3): 56–66.
Desain, P., and H. Honing. 1994. Advanced issues in beat induction modeling: Syncopation, tempo, and timing. In Proceedings of the 1994 International Computer Music Conference. 92–4.
Dixon, S. 2001. Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research 30(1): 39–58.
Dixon, S. 2006. Onset detection revisited. In Proceedings of the 9th International Conference on Digital Audio Effects. 133–7.
Dixon, S., and W. Goebl. 2002. Pinpointing the beat: Tapping to expressive performances. In Proceedings of the 7th International Conference on Music Perception and Cognition. 617–20.
Dixon, S., F. Gouyon, and G. Widmer. 2004. Towards characterization of music via rhythmic patterns. In Proceedings of the 5th International Conference on Music Information Retrieval. 509–16.
Drake, C., and D. Bertrand. 2001. The quest for universals in temporal processing in music. Annals of the New York Academy of Sciences 930: 17–27.
Drake, C., and M. Botte. 1993. Tempo sensitivity in auditory sequences: Evidence for a multiple-look model. Perception and Psychophysics 54(3): 277–86.
Duxbury, C. 2004. Signal models for polyphonic music. PhD diss., Queen Mary, University of London.
Ellis, D. 2006. Beat tracking with dynamic programming. http://www.music-ir.org/evaluation/MIREX/2006_abstracts/TE_BT_ellis.pdf (accessed May 3, 2009).
Ellis, D., and G. Poliner. 2007. Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. 1429–32.
Engström, D., J. Kelso, and T. Holroyd. 1996. Reaction-anticipation transitions in human perception-action patterns. Human Movement Science 15: 809–32.
Foote, J., and S. Uchihashi. 2001. The beat spectrum: A new approach to rhythm analysis. In Proceedings of the 2001 IEEE International Conference on Multimedia and Expo. 881–4.
Fraisse, P. 1982. Rhythm and tempo. In The Psychology of Music, ed. D. Deutsch, 149–80. Orlando, FL: Academic Press.
Fraisse, P. 1984. Perception and estimation of time. Annual Review of Psychology 35: 1–36.
Friberg, A., and A. Sundström. 2002. Swing ratios and ensemble timing in jazz performance: Evidence for a common rhythmic pattern. Music Perception 19(3): 333–49.
Glass, L., and M. Mackey. 1988. From clocks to chaos: The rhythms of life. Princeton, NJ: Princeton University Press.
Goto, M. 2003. SmartMusicKIOSK: Music listening station with chorus-search function. In Proceedings of the 2003 ACM Symposium on User Interface Software and Technology. 31–40.
Goto, M., and S. Hayamizu. 1999. A real-time music scene description system: Detecting melody and bass lines in audio signals. In Working Notes of the IJCAI Workshop on Computational Auditory Scene Analysis. 31–40.
Goto, M., and Y. Muraoka. 1995. A real-time beat tracking system for audio signals. In Proceedings of the 1995 International Computer Music Conference. 171–4.
Goto, M., and Y. Muraoka. 1999. Real-time beat tracking for drumless audio signals: Chord change detection for musical decisions. Speech Communication 27(3): 311–35.
Gouyon, F., and P. Herrera. 2003. Determination of the meter of musical audio signals: Seeking recurrences in beat segment descriptors. In Proceedings of the 114th Convention of the Audio Engineering Society. http://www.aes.org/e-lib/browse.cfm?elib=12583 (accessed May 3, 2009).
Hainsworth, S., and M. Macleod. 2004. Particle filtering applied to musical tempo tracking. EURASIP Journal on Applied Signal Processing 2004(15): 2385–95.
Jehan, T. 2005. Creating music by listening. PhD diss., Massachusetts Institute of Technology.
Jones, M. 1976. Time, our lost dimension: Toward a new theory of perception, attention, and memory. Psychological Review 83(5): 323–35.
Jones, M., and M. Boltz. 1989. Dynamic attending and responses to time. Psychological Review 96(3): 459–91.
Klapuri, A., A. Eronen, and J. Astola. 2006. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing 14(1): 342–55.
Kurth, F., T. Gehrmann, and M. Müller. 2006. The cyclic beat spectrum: Tempo-related audio features for time-scale invariant audio identification. In Proceedings of the 7th International Conference on Music Information Retrieval. 621–6.
Large, E., and J. Kolen. 1994. Resonance and the perception of musical meter. Connection Science 6(1): 177–208.
Large, E., and C. Palmer. 2002. Perceiving temporal regularity in music. Cognitive Science 26: 1–37.
Lerdahl, F., and R. Jackendoff. 1983. A generative theory of tonal music. Cambridge, Massachusetts: MIT Press.
Leue, I., and Ö. Izmirli. 2006. Tempo tracking with a periodicity comb kernel. In Proceedings of the 7th International Conference on Music Information Retrieval.
Longuet-Higgins, H., and C. Lee. 1982. The perception of musical rhythms. Perception 11: 115–28.
Mach, E. 1886. Beiträge zur Analyse der Empfindungen. Jena, Germany: Gustav Fischer.
McKinney, M., D. Moelants, M. Davies, and A. Klapuri. 2007. Evaluation of audio beat tracking and music tempo extraction algorithms. Journal of New Music Research 36(1): 1–16.
Michon, J. 1967. Timing in temporal tracking. PhD diss., Leiden University.
Moelants, D. 2002. Preferred tempo reconsidered. In Proceedings of the 7th International Conference on Music Perception and Cognition. 580–3.
Palmer, C. 1997. Music performance. Annual Review of Psychology 48: 115–38.
Papoulis, A. 1984. Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Pardo, B. 2004. Tempo tracking with a single oscillator. In Proceedings of the 5th International Conference on Music Information Retrieval. 154–7.
Parncutt, R. 1994. A perceptual model of pulse salience and metrical accent in musical rhythms. Music Perception 11(4): 409–64.
Pöppel, E. 1997. A hierarchical model of temporal perception. Trends in Cognitive Sciences 1(2): 56–61.
Povel, D., and P. Essens. 1985. Perception of temporal patterns. Music Perception 2(4): 411–40.
Raphael, C. 2001. A Bayesian network for real-time music accompaniment. Neural Information Processing Systems 14.
Repp, B. 2003. Rate limits in sensorimotor synchronization with auditory and visual sequences: The synchronization threshold and the benefits and costs of interval subdivision. Journal of Motor Behavior 35(4): 355–70.
Repp, B. 2005. Sensorimotor synchronization: A review of the tapping literature. Psychonomic Bulletin and Review 12(6): 969–92.
Rowe, R. 2001. Machine musicianship. Cambridge, Massachusetts: MIT Press.
Scheirer, E. 1996. Bregman’s chimerae: Music perception as auditory scene analysis. In Proceedings of the 4th International Conference on Music Perception and Cognition. http://eprints.kfupm.edu.sa/28758/ (accessed May 4, 2009).
Scheirer, E. 1998. Tempo and beat analysis of acoustical musical signals. Journal of the Acoustical Society of America 103(1): 588–601.
Seppänen, J., A. Eronen, and J. Hiipakka. 2006. Joint beat and tatum tracking from music signals. In Proceedings of the 7th International Conference on Music Information Retrieval. 685–8.
Sethares, W. 2007. Rhythm and transforms. New York: Springer.
Snyder, J., and C. Krumhansl. 2001. Tapping to ragtime: Cues to pulse finding. Music Perception 18(4): 455–89.
Stark, A., M. Davies, and M. Plumbley. 2007. Real-time beat-synchronous audio effects. In Proceedings of the 2007 Conference on New Interfaces for Musical Expression. 344–5.
Steedman, M. 1977. The perception of musical rhythm and metre. Perception 6(5): 555–69.
Temperley, D., and D. Sleator. 1999. Modeling meter and harmony: A preference-rule approach. Computer Music Journal 23(1): 10–27.
Toiviainen, P. 1999. An interactive MIDI accompanist. Computer Music Journal 22(4): 63–75.
Toiviainen, P., and J. Snyder. 2003. Tapping to Bach: Resonance-based modeling of pulse. Music Perception 21(1): 43–80.
Tzanetakis, G., and P. Cook. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5): 293–302.
Xiao, L., A. Tian, W. Li, and J. Zhou. 2008. Using a statistic model to capture the association between timbre and perceived tempo. In Proceedings of the 9th International Conference on Music Information Retrieval. 659–62.