
EURASIP Journal on Applied Signal Processing 2005:9, 1292–1304
© 2005 Steven van de Par et al.

A Perceptual Model for Sinusoidal Audio Coding Based on Spectral Integration

Steven van de Par
Digital Signal Processing Group, Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands
Email: [email protected]

Armin Kohlrausch
Digital Signal Processing Group, Philips Research Laboratories, 5656 AA Eindhoven, The Netherlands
Department of Technology Management, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands
Email: [email protected]

Richard Heusdens
Department of Mediamatics, Delft University of Technology, 2600 GA Delft, The Netherlands
Email: [email protected]

Jesper Jensen
Department of Mediamatics, Delft University of Technology, 2600 GA Delft, The Netherlands
Email: [email protected]

Søren Holdt Jensen
Department of Communication Technology, Institute of Electronic Systems, Aalborg University, DK-9220 Aalborg, Denmark
Email: [email protected]

Received 31 October 2003; Revised 22 July 2004

Psychoacoustical models have been used extensively within audio coding applications over the past decades. Recently, parametric coding techniques have been applied to general audio and this has created the need for a psychoacoustical model that is specifically suited for sinusoidal modelling of audio signals. In this paper, we present a new perceptual model that predicts masked thresholds for sinusoidal distortions. The model relies on signal detection theory and incorporates more recent insights about spectral and temporal integration in auditory masking. As a consequence, the model is able to predict the distortion detectability. In fact, the distortion detectability defines a (perceptually relevant) norm on the underlying signal space which is beneficial for optimisation algorithms such as rate-distortion optimisation or linear predictive coding. We evaluate the merits of the model by combining it with a sinusoidal extraction method and compare the results with those obtained with the ISO MPEG-1 Layer I-II recommended model. Listening tests show a clear preference for the new model. More specifically, the model presented here leads to a reduction of more than 20% in the number of sinusoids needed to represent signals at a given quality level.

Keywords and phrases: audio coding, psychoacoustical modelling, auditory masking, spectral masking, sinusoidal modelling, psychoacoustical matching pursuit.

1. INTRODUCTION

The ever-increasing growth of application areas such as consumer electronics, broadcasting (digital radio and television), and multimedia/Internet has created a demand for high-quality digital audio at low bit rates.

This is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Over the last decade, this has led to the development of new coding techniques based on models of human auditory perception (psychoacoustical masking models). Examples include the coding techniques used in the ISO/IEC MPEG family, for example, [1], the MiniDisc from Sony [2], and the digital compact cassette (DCC) from Philips [3]. For an overview of recently proposed perceptual audio coding schemes and standards, we refer to the tutorial paper by Painter and Spanias [4].


A promising approach to achieve low bit rate coding of digital audio signals with minimum perceived loss of quality is to use perception-based hybrid coding schemes, where audio signals are decomposed and coded as a sinusoidal part and a residual. In these coding schemes, different signal components occurring simultaneously are encoded with different encoders. Usually, tonal components are encoded with a specific encoder aimed at signals composed of sinusoids, and the remaining signal components are coded with a waveform or noise encoder [5, 6, 7, 8, 9]. To enable the selection of the perceptually most suitable sinusoidal description of an audio signal, dedicated psychoacoustical models are needed, and this will be the topic of this paper.

One important principle by which auditory perception can be exploited in general audio coding is that the modelling error generated by the audio coding algorithm is masked by the original signal. When the error signal is masked, the modified audio signal generated by the audio coding algorithm is indistinguishable from the original signal.

To determine what level of distortion is allowable, an auditory masking model can be used. Consider, for example, the case where the masking model is used in a transform coder. Here the model will specify, for each spectro-temporal interval within the original audio signal, what distortion level can be allowed within that interval such that it is perceptually just not detectable. With an appropriate signal transformation, for example, an MDCT filter bank [10, 11], it is possible to selectively adapt the accuracy with which each spectro-temporal interval is described, that is, the number of bits used for quantisation. In this way, the spectro-temporal characteristics of the error signal can be adapted such that auditory masking is exploited effectively, leading to the lowest possible bit rate without perceptible distortions.

Most existing auditory masking models are based on the psychoacoustical literature, which predominantly studied the masking of tones by noise signals (e.g., [12]). Interestingly, for subband coders and transform coders the nature of the signals is just the reverse: the distortion is noise-like, while the masker, or original signal, is often tonal in character. Nevertheless, based on this psychoacoustical literature, dedicated psychoacoustical models have been developed for audio coding for the situation where the distortion signal is noise-like, such as the ISO MPEG model [1].

Masking models are also used for sinusoidal coding, where the signal is modelled by a sum of sinusoidal components. Most existing sinusoidal audio coders, for example, [5, 6, 13], rely on masking curves derived from spectral-spreading-based perceptual models in order to decide which components are masked by the original signal and which are not. As a consequence of this decision process, a number of masked components are rejected by the coder, resulting in a distortion signal that is sinusoidal in nature. In this paper, a model is introduced that is specifically designed for predicting the masking of sinusoidal components. In addition, the proposed model takes into account some new findings in the psychoacoustical literature about spectral and temporal integration in auditory masking.

This paper is organised as follows. In Section 2 we discuss the psychoacoustical background of the proposed model. Next, in Section 3, the new psychoacoustical model will be introduced, followed by Section 4, which describes the calibration of the model. Section 5 compares predictions of the model with some basic psychoacoustical findings. In Section 6, we apply the proposed model in a sinusoidal audio modelling method, and in Section 7 we compare, in a listening test, the resulting audio quality to that obtained with the ISO MPEG model [1]. Finally, in Section 8, we will present some conclusions.

2. PSYCHOACOUSTICAL BACKGROUND

Auditory masking models that are used in audio coding are predominantly based on a phenomenon known as simultaneous masking (see, e.g., [14]). One of the earlier relevant studies goes back to Fletcher [15], who performed listening experiments with tones that were masked by noise. In his experiments, the listeners had to detect a tone that was presented simultaneously with a bandpass noise masker that was spectrally centred around the tone. The threshold level for detecting the tones was measured as a function of the masker bandwidth while the power spectral density (spectrum level) was kept constant. Results showed that an increase of bandwidth, thus increasing the total masker power, led to an increase of the detection thresholds. However, this increase was only observed when the bandwidth was below a certain critical bandwidth; beyond this critical bandwidth, thresholds were independent of bandwidth. These observations led to the critical band concept, which is the spectral interval across which masker power is integrated to contribute to the masking of a tone centred within the interval.

An explanation for these observations is that the signal processing in the peripheral auditory system, specifically by the basilar membrane in the cochlea, can be represented as a series of bandpass filters which are excited by the input signal and which produce parallel bandpass-filtered outputs (see, e.g., [16]). The detection of the tone is thought to be governed by the bandpass filter (or auditory filter) that is centred around the tone. When the power ratio between the tone and the masker at the output of this filter exceeds a certain criterion value, the tone is assumed to be detectable. With these assumptions, the observations of Fletcher can be explained: as long as the masker has a bandwidth smaller than that of the auditory filter, an increase in bandwidth will also lead to an increase in the masker power seen at the output of the auditory filter, which, in turn, leads to an increase in detection threshold. Beyond the auditory filter bandwidth, the added masker components will not contribute to the masker power at the output of the auditory filter because they are rejected by the bandpass characteristic of the auditory filter. Whereas in Fletcher's experiments the tone was centred within the noise masker, later experiments were conducted in which the masker did not spectrally overlap with the tone to be detected (see, e.g., [17]). Such experiments reveal more information on the auditory filter characteristic, specifically about the tails of the filters.


The implications of such experiments should be treated with care. When different maskers and signals are chosen, the resulting conclusions about the auditory filter shape are quite different. For example, a tonal masker proves to be a much poorer masker than a noise signal [17]. In addition, the filter shapes seem to depend on the masker type as well as on the masker level. These observations suggest that the basic assumptions of linear, that is, level-independent, auditory filters and an energy criterion that defines audibility of distortion components are only a first-order approximation, and that other factors play a role in masking. For instance, it is known that the basilar membrane behaves nonlinearly [18], which may explain, for instance, the level dependence of the auditory filter shape. For a more elaborate discussion of auditory masking and auditory filters, the reader is referred to [19, 20, 21].

Despite the fact that the assumption of a linear auditory filter and an energy detector can only be regarded as a first-order approximation of the actual processing in the auditory system, we will proceed with this assumption because it proves to give very satisfactory results in the context of audio coding with relatively simple means in terms of computational complexity.

Along similar lines as outlined above, the ISO MPEG model [1] assumes that the distortion or noise level that is allowed within a specific critical band is determined by the weighted power addition of all masker components spread on and around the critical band containing the distortion. The shape of the weighting function that is applied is based on auditory masking data and essentially reflects the underlying auditory filter properties. These "spectral-spreading"-based perceptual models have been used in various parametric coding schemes for sinusoidal component selection [5, 6, 13]. It should be noted that in these models, it is assumed that only the auditory filter centred around the distortion determines the detectability of the distortion. When the distortion-to-masker ratio is below a predefined threshold value in each auditory filter, the distortion is assumed to be inaudible. On the other hand, when one single filter exceeds this threshold value, the distortion is assumed to be audible. This assumption is not in line with more recent insights in the psychoacoustical literature on masking and will later in the paper be shown to have a considerable impact on the predicted masking curves. Moreover, in the ISO MPEG model [1], a distinction is made between masking by noisy and tonal spectral components to be able to account for the difference in masking power of these signal types. For this purpose a tonality detector is required which, in the Layer I model, is based on a spectral peak detector.

Threshold measurements in the psychoacoustical literature consistently show that a detection threshold is not a rigid threshold. A rigid threshold would imply that if the signal to be detected were just above the detection threshold, it would always be detected, while it would never be detected when just below the threshold. Contrary to this pattern, it is observed in detection threshold measurements that the percentages of correct detection as a function of signal level follow a sigmoid psychometric function [22]. The detection threshold is defined as the level for which the signal is detected correctly with a certain probability of, typically, 70%–75%.

In various theoretical considerations, the shape of the psychometric function is explained by assuming that within the auditory system some variable, for example, the stimulus power at the output of an auditory filter, is observed. In addition, it is assumed that noise is present in this observation due to, for example, internal noise in the auditory system. When the internal noise is assumed to be Gaussian and additive, the shape of the sigmoid function can be predicted. For the case that a tone has to be detected within broadband noise, the assumption of a stimulus power measurement with additive Gaussian noise leads to good predictions of the psychometric function. When the increase in the stimulus power caused by the presence of the tonal signal is large compared to the standard deviation of the internal noise, high percentages of correct detection are expected, while the reverse is true for small increases in stimulus power. The ratio between the increase in stimulus power and the standard deviation of the internal noise is defined as the sensitivity index $d'$ and can be calculated from the percentage of correct responses of the subjects. This theoretical framework is based on signal detection theory and is described more extensively in, for example, [23].
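To make the link between percent correct and $d'$ concrete, the sketch below inverts the psychometric function. The two-interval forced-choice (2AFC) assumption and the function name are ours for illustration; the paper does not specify the task paradigm.

```python
from scipy.stats import norm

def d_prime_from_percent_correct(pc):
    """Sensitivity index d' for an assumed 2AFC task, where Pc = Phi(d'/sqrt(2))."""
    return 2 ** 0.5 * norm.ppf(pc)

# At the conventional 70%-75% correct threshold, d' is close to 1:
print(d_prime_from_percent_correct(0.75))  # ~0.95
```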

In several more recent studies it is shown that the audibility of distortion components is not determined solely by the critical band with the largest audible distortion [24, 25]. Buus et al. [24] performed listening tests where tone complexes had to be detected when presented in a noise masker. They first measured the threshold levels of several tones separately, each of which was presented simultaneously with wideband noise. Due to the specific spectral shape of the masking noise, thresholds for individual tones were found to be constant across frequency. In addition to the threshold measurements for a single tone, thresholds were also measured for a complex of 18 equal-level tones. The frequency spacing of the tones was such that each auditory critical band contained only a single tone. If the detectability of the tones were determined only by the filter with the best detectable tone, the complex of tones would be just audible when one individual component of the complex had the same level as the measured threshold level of the individual tones. However, the experiments showed that thresholds for the tone complex were considerably lower than expected based on the best-filter assumption, indicating that information is integrated across auditory filters.

In the paper by Buus et al. [24], a number of theoretical explanations are presented. We will discuss only the multiband detector model [23]. This model assumes that the changes in signal power at the output of each auditory filter are degraded by additive internal noise that is independent in each auditory filter. It is then assumed that an optimally weighted sum of the signal powers at the outputs of the various auditory filters is computed, which serves as a new decision variable. Based on these assumptions, it can be shown that the sensitivity index of a tone complex, $d'_{\text{total}}$, can be derived from the individual sensitivity indices $d'_n$ as follows:

$$ d'_{\text{total}} = \sqrt{\sum_{n=1}^{K} {d'_n}^{2}}, \qquad (1) $$

where $K$ denotes the number of tones and where each individual sensitivity index is proportional to the tone-to-masker power ratio [22]. According to such a framework, each doubling of the number of auditory filters that can contribute to the detection process will lead to a reduction of 1.5 dB in threshold. The measured thresholds by Buus et al. are well in line with this prediction. In their experiments, the complex of 18 tones leads to a reduction of 6 dB in detection threshold as compared to the detection threshold of a single tone. Based on (1), a change of 6.3 dB was expected. More recently, Langhans and Kohlrausch [25] performed similar experiments with complex tones having a constant spacing of 10 Hz presented in a broadband noise masker, confirming that information is integrated across auditory filters. In addition, results obtained by van de Par et al. [26] indicate that also for bandpass noise signals that had to be detected against the background of wideband noise maskers, the same integration across auditory filters is observed.
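As a worked check of (1), under the assumption stated above that each $d'_n$ is proportional to the tone-to-masker power ratio, the following sketch reproduces the threshold predictions quoted in the text:

```python
import math

def threshold_reduction_db(num_tones):
    """Per-tone threshold drop predicted by eq. (1) for equal-level tones.

    With d'_total = sqrt(K) * d'_n at threshold, each tone may be a factor
    sqrt(K) weaker in power, i.e. 10*log10(sqrt(K)) dB lower in level.
    """
    return 10 * math.log10(math.sqrt(num_tones))

print(threshold_reduction_db(2))   # ~1.5 dB per doubling of contributing filters
print(threshold_reduction_db(18))  # ~6.3 dB, the prediction cited for 18 tones
```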

As indicated, integration of information across a wide range of frequencies is found in auditory masking. Similarly, integration across time has been shown to occur in the auditory system. Van den Brink [27] investigated the detection of tones of variable duration that were presented simultaneously with a noise masker with a fixed duration that was always longer than that of the tone. Increasing the duration of the tone reduced the detection thresholds up to a duration of about 300 milliseconds. While this result is an indication of integration across time, it also shows that there is a limitation in the interval for which temporal integration occurs.

The above findings with respect to spectral and temporal integration of information in auditory masking have implications for audio coding which have not been considered in previous studies. On the one hand, they influence the masking properties of complex signals, as will be discussed in Section 5; on the other hand, they have implications for rate-distortion optimisation algorithms. To understand this, consider the case where for one particular frequency region a threshold level is determined for distortions that can be introduced by an audio coder. For another frequency region a threshold can be determined similarly. When both distortions are presented at the same time, the total distortion is expected to become audible due to the spectral integration given by (1). This is in contrast to the more conventional models, such as the ISO MPEG model [1], which would predict this simultaneous distortion to be inaudible.

The effect of spectral integration, of course, can easily be compensated for by reducing the level of the masking thresholds such that the total distortion will be inaudible. But, based on (1), assuming that it holds for masking by complex audio signals, there are many different solutions to this equation which lead to the same $d'_{\text{total}}$.

Figure 1: Block diagram of the masking model. (The input $x$ passes through the outer- and middle-ear filter $h_{\mathrm{om}}$ and the gammatone filters $\gamma_i$; the constant $C_a$ is added, within-channel distortion detectabilities $D_i$ are computed, and their sum, scaled by $C_s$, gives the total distortion detectability $D$.)

In other words, many different distributions of distortion levels per spectral region will lead to the same total sensitivity index. However, not every distribution of distortion levels will lead to the same number of bits spent by the audio coder. Thus, the concept of a masking curve which determines the maximum level of distortion allowed within each frequency region is too restrictive and can be expected to lead to suboptimal audio coders. In fact, spectral distortion can be shaped such that the associated bit rate is minimised. For more information, the reader is referred to a study where these ideas were confirmed by listening tests [28].

3. DESCRIPTION OF THE MODEL

In line with various state-of-the-art auditory models that have been presented in the psychoacoustical literature, for example, [29], the structure of the proposed model follows the various stages of auditory signal processing. In view of the computational complexity, the model is based on frequency-domain processing and consequently neglects some parts of peripheral processing, such as the hair cell transformation, which performs inherently nonlinear time-domain processing.

A block diagram of the model is given in Figure 1. The model input $x$ is the frequency-domain representation of a short windowed segment of audio. The window should lead to sufficient rejection of spectral side lobes in order to facilitate adequate spectral resolution of the auditory filters. The first stage of the model resembles the outer- and middle-ear transfer function $h_{\mathrm{om}}$, which is related to the filtering of the ear canal and the ossicles in the middle ear. The transfer function is chosen to be the inverse of the threshold-in-quiet function $h_{\mathrm{tq}}$. This particular shape is chosen to obtain an accurate prediction of the threshold-in-quiet function when no masker signal is present.

The outer- and middle-ear transfer function is followed by a gammatone filter bank (see, e.g., [30]) which resembles the filtering property of the basilar membrane in the inner ear. The transfer function of an $n$th-order gammatone filter has a magnitude spectrum that is approximated well by

$$ \gamma(f) = \left(1 + \left(\frac{f - f_0}{k\,\mathrm{ERB}(f_0)}\right)^{2}\right)^{-n/2}, \qquad (2) $$

where $f_0$ is the centre frequency of the filter, $\mathrm{ERB}(f_0)$ is the equivalent rectangular bandwidth of the auditory filter centred at $f_0$ as suggested by Glasberg and Moore [31], $n$ is the filter order, which is commonly assumed to be 4, and $k = 2^{n-1}(n-1)!/\bigl(\pi(2n-3)!!\bigr)$ is a factor needed to ensure that the filter indeed has the specified ERB. The centre frequencies of the filters are uniformly spaced on an ERB-rate scale and follow the bandwidths as specified by the ERB scale [31]. The power at the output of each auditory filter is measured, and a constant $C_a$ is added to this output as a means to limit the detectability of very weak signals at or below the threshold in quiet.
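A small numerical sketch of (2) and the ERB scale: the ERB and ERB-rate formulas are the standard published ones of Glasberg and Moore [31], while the 1-ERB filter spacing is our assumption, since the paper does not state the spacing density.

```python
import math
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore [31])."""
    return 24.7 * (4.37 * f0 / 1000.0 + 1.0)

def gammatone_mag(f, f0, order=4):
    """Magnitude response of an nth-order gammatone filter, eq. (2)."""
    dfact = lambda m: math.prod(range(m, 0, -2))  # double factorial (2n-3)!!
    # k ensures the filter has the specified ERB; k ~ 1.019 for order 4.
    k = 2 ** (order - 1) * math.factorial(order - 1) / (math.pi * dfact(2 * order - 3))
    return (1.0 + ((f - f0) / (k * erb(f0))) ** 2) ** (-order / 2.0)

def centre_frequencies(f_lo=50.0, f_hi=20000.0):
    """Centre frequencies uniformly spaced on the ERB-rate scale (1-ERB steps assumed)."""
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    inv_erb_rate = lambda E: (10 ** (E / 21.4) - 1.0) * 1000.0 / 4.37
    return inv_erb_rate(np.arange(erb_rate(f_lo), erb_rate(f_hi)))
```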

In the next stage, within-channel distortion detectabilities are computed and are defined as the ratios between the distortion and the masker-plus-internal noise seen at the output of each auditory filter. In fact, the within-channel distortion detectability $D_i$ is proportional to the sensitivity index $d'$ as described earlier. This is an important step; the distortion detectability (or $d'$) will be used as a measure of perceptual distortion. This perceptual distortion measure can be interpreted as a measure of the probability that subjects can detect a distortion signal in the presence of a masking signal. The masker power within the $i$th filter due to an original (masking) signal $x$ is given by

$$ M_i = \frac{1}{N} \sum_f \bigl|h_{\mathrm{om}}(f)\bigr|^2 \bigl|\gamma_i(f)\bigr|^2 \bigl|x(f)\bigr|^2, \qquad (3) $$

where $N$ is the segment size in number of samples. Equivalently, the distortion power within the $i$th filter due to a distortion signal $\varepsilon$ is given by

$$ S_i = \frac{1}{N} \sum_f \bigl|h_{\mathrm{om}}(f)\bigr|^2 \bigl|\gamma_i(f)\bigr|^2 \bigl|\varepsilon(f)\bigr|^2. \qquad (4) $$

Note that $(1/N)|x(f)|^2$ denotes the power spectral density of the original, masking signal in sound pressure level (SPL) per frequency bin, and similarly $(1/N)|\varepsilon(f)|^2$ is the power spectral density of the distorting signal. The within-channel distortion detectability $D_i$ is given by

$$ D_i = \frac{S_i}{M_i + (1/N)\,C_a}. \qquad (5) $$

From this equation, two properties of the within-channel distortion detectability $D_i$ can be seen. First, when the distortion-to-masker ratio $S_i/M_i$ is kept constant while the masker power is much larger than $(1/N)C_a$, the distortion detectability is also constant; in other words, at medium and high masker levels the detectability $D_i$ is mainly determined by the distortion-to-masker ratio. Secondly, when the masker power is small compared to $(1/N)C_a$, the distortion detectability is independent of the masker power, which resembles the perception of signals near the threshold in quiet.
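A direct transcription of (3)-(5) as frequency-domain array operations; the function and variable names, the array layout, and the assumption that SPL scaling of the spectra happens upstream are all ours.

```python
import numpy as np

def within_channel_detectabilities(x_spec, eps_spec, h_om, gammas, Ca):
    """Within-channel distortion detectabilities D_i, eqs. (3)-(5).

    x_spec, eps_spec : length-N DFTs of the masker and the distortion
    h_om             : outer/middle-ear transfer function on the DFT bins
    gammas           : (num_filters, N) gammatone magnitudes gamma_i(f)
    """
    N = len(x_spec)
    w = np.abs(h_om) ** 2 * np.abs(gammas) ** 2       # |h_om|^2 |gamma_i|^2
    M = (w * np.abs(x_spec) ** 2).sum(axis=1) / N     # masker powers, eq. (3)
    S = (w * np.abs(eps_spec) ** 2).sum(axis=1) / N   # distortion powers, eq. (4)
    return S / (M + Ca / N)                           # eq. (5)
```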

In line with the multiband energy detector model [23], we assume that within-channel distortion detectabilities $D_i$ are combined into a total distortion detectability by an additive operation. However, we do not add the squared sensitivity indices as in (1), but simply add the indices directly. Although this may introduce inaccuracies, these will later turn out to be small. A benefit of this choice is that the distortion measure derived from this assumption has properties that allow a computationally simple formulation of the model (see (11)). In addition, recent results [26] show that, at least for the detection of closely spaced tones (20 Hz spacing) masked by noise, the reduction in thresholds when increasing the signal bandwidth is more in line with a direct addition of distortion detectabilities than with (1). Therefore, we state that

$$ D(x, \varepsilon) = C_s L_{\mathrm{eff}} \sum_i D_i \qquad (6) $$

$$ \hphantom{D(x, \varepsilon)} = C_s L_{\mathrm{eff}} \sum_i \sum_f \frac{ \bigl|h_{\mathrm{om}}(f)\bigr|^2 \bigl|\gamma_i(f)\bigr|^2 \bigl|\varepsilon(f)\bigr|^2 }{ N M_i + C_a }, \qquad (7) $$

where $D(x,\varepsilon)$ is the total distortion detectability as predicted for a human observer given an original signal $x$ and a distortion signal $\varepsilon$. The calibration constant $C_s$ is chosen such that $D = 1$ at the threshold of detectability. To account for the dependency of distortion detectability on the duration of the distortion signal (in line with [27]), a scaling factor $L_{\mathrm{eff}}$ is introduced, defined as

$$ L_{\mathrm{eff}} = \min\left( \frac{L}{300\ \mathrm{ms}},\, 1 \right), \qquad (8) $$

where $L$ is the segment duration in milliseconds. Equation (8) resembles the temporal integration time of the human auditory system, which has an upper bound of 300 milliseconds [27].¹

Equation (7) gives a complete description of the model. However, it defines only a perceptual distortion measure, not a masking curve such as is widely used in audio coding, nor a masked threshold such as is often used in psychoacoustical experiments.

In order to derive a masked threshold, we assume that the distortion signal is $\varepsilon(f) = A\tilde{\varepsilon}(f)$. Here, $A$ is the amplitude of the distortion signal and $\tilde{\varepsilon}$ the normalised spectrum of the distortion signal, such that $\|\tilde{\varepsilon}\|_2 = 1$, which is assumed to correspond to a sound pressure level of 0 dB. Without yet making an assumption about the spectral shape of $\tilde{\varepsilon}$, and assuming that $D = 1$ at the threshold of detectability, we can derive that the masked threshold $A^2$ for the distortion signal $\tilde{\varepsilon}$ is given by

$$ \frac{1}{A^2} = C_s L_{\mathrm{eff}} \sum_i \sum_f \frac{ \bigl|h_{\mathrm{om}}(f)\bigr|^2 \bigl|\gamma_i(f)\bigr|^2 \bigl|\tilde{\varepsilon}(f)\bigr|^2 }{ N M_i + C_a }. \qquad (9) $$

When deriving a masking curve, it is important to consider exactly what type of signal is masked.

¹An alternative definition would be to state that $L_{\mathrm{eff}} = N$, the total duration of the segment in number of samples. According to this definition, it is assumed that distortions are integrated over the complete excerpt at hand, which is not in line with perceptual masking data, but which in our experience still leads to very satisfactory results [32].


When a masking model is used in the context of a waveform coder, the distortion signal introduced by the coder is typically assumed to consist of bands of noise. For a sinusoidal coder, however, the distortion signal contains the sinusoids that are rejected by the perceptual model. Thus, the components of the distortion signal are in fact more sinusoidal in nature. Assuming now that a distortion component is present in only one bin of the spectrum, we can derive the masked thresholds for sinusoidal distortions. We assume that $\varepsilon(f) = v(f_m)\delta(f - f_m)$, with $v(f_m)$ being the sinusoidal amplitude and $f_m$ the sinusoidal frequency. Together with the assumption that $D = 1$ at the threshold of detectability, $v$ can be derived such that the distortion is just not detectable. In this way, by varying $f_m$ over the entire frequency range, $v^2$ constitutes the masking curve for sinusoidal distortions in the presence of a masker $x$. By substituting the above assumptions in (7), we obtain

$$ \frac{1}{v^2(f_m)} = C_s L_{\mathrm{eff}} \sum_i \frac{ \bigl|h_{\mathrm{om}}(f_m)\bigr|^2 \bigl|\gamma_i(f_m)\bigr|^2 }{ N M_i + C_a }. \qquad (10) $$

Substituting (10) in (7), we get

$$ D(x, \varepsilon) = \sum_f \frac{\bigl|\varepsilon(f)\bigr|^2}{v^2(f)}. \qquad (11) $$

This expression shows that the computational load for calculating the perceptual distortion $D(x,\varepsilon)$ can be very low once the masking curve $v^2$ has been calculated. This simple form of the perceptual distortion, as given in (11), arises due to the specific choice of the addition defined in (6).
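The two-step structure suggested by (10) and (11), namely precomputing the masking curve once per frame and then evaluating any candidate distortion cheaply, might be sketched as follows (names and array layout as in the earlier sketch, and ours):

```python
import numpy as np

def masking_curve(x_spec, h_om, gammas, Cs, Ca, L_eff):
    """Masking curve v^2(f) for sinusoidal distortions, eq. (10)."""
    N = len(x_spec)
    w = np.abs(h_om) ** 2 * np.abs(gammas) ** 2
    M = (w * np.abs(x_spec) ** 2).sum(axis=1) / N           # masker powers, eq. (3)
    inv_v2 = Cs * L_eff * (w / (N * M + Ca)[:, None]).sum(axis=0)
    return 1.0 / inv_v2

def perceptual_distortion(eps_spec, v2):
    """Total distortion detectability D, eq. (11); cheap once v^2 is known."""
    return (np.abs(eps_spec) ** 2 / v2).sum()
```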

4. CALIBRATION OF THE MODEL

For the purpose of calibration of the model, the constants $C_a$ for absolute thresholds and $C_s$ for the general sensitivity of the model in (7) need to be determined. This will be done using two basic findings from the psychoacoustical literature, namely the threshold in quiet and the just noticeable difference (JND) in level of about 0.5–1 dB for sinusoidal signals [33].

When considering the threshold in quiet, we assume that the masking signal is equal to zero, that is, $x = 0$, and that the just detectable sinusoidal distortion signal is given by $\varepsilon(f) = h_{\mathrm{tq}}(f_m)\delta(f - f_m)$ for some $f_m$, where $h_{\mathrm{tq}}$ is the threshold-in-quiet curve. By substituting these assumptions in (7) (assuming that $D = 1$ corresponds to a just detectable distortion signal), we obtain

$$ C_a = C_s L_{\mathrm{eff}} \sum_i \bigl|\gamma_i(f_m)\bigr|^2. \qquad (12) $$

Note that (12) only holds if $\sum_i |\gamma_i(f_m)|^2$ is constant for all $f_m$, which is approximately true for gammatone filters.

We assume a 1 dB JND which corresponds to a masking condition where a sinusoidal distortion is just detectable in the presence of a sinusoidal masker at the same frequency, say $f_m$. For this to be the case, the distortion level has to be 18 dB lower than the masker level, assuming that the masker and distortion are added in phase. This specific phase assumption is made because it leads to similar thresholds as when the masker and signal are slightly off-frequency with respect to one another, the case which is most likely to occur in audio coding contexts. We therefore assume that the masker signal is $x(f) = A_{70}\delta(f - f_m)$ and the distortion signal $\varepsilon(f) = A_{52}\delta(f - f_m)$, with $A_{70}$ and $A_{52}$ being the amplitudes for a 70 and 52 dB SPL sinusoidal signal, respectively. Using (3) and (7), this leads to the expression

$$ \frac{1}{C_s} = L_{\mathrm{eff}} \sum_i \frac{ \bigl|h_{\mathrm{om}}(f_m)\bigr|^2 \bigl|\gamma_i(f_m)\bigr|^2 A_{52}^2 }{ \bigl|h_{\mathrm{om}}(f_m)\bigr|^2 \bigl|\gamma_i(f_m)\bigr|^2 A_{70}^2 + C_a }. \qquad (13) $$

When (12) is substituted into (13), an expression is obtained in which $C_s$ is the only unknown. A numerical solution to this equation can be found using, for example, the bisection method (cf. [34]). A suitable choice for $f_m$ would be $f_m = 1$ kHz, since it is in the middle of the auditory range. This calibration at 1 kHz does not significantly reduce the accuracy of the model at other frequencies. On the one hand, the incorporation of a threshold-in-quiet curve prefilter provides the proper frequency dependence of thresholds in quiet. On the other hand, JNDs do not differ much across frequency, both in the model predictions and in humans.
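A numerical sketch of this calibration: substituting (12) into (13) leaves $C_s$ as the only unknown, which a bracketing root-finder can solve. The use of scipy's brentq (a bisection-type method) and the bracketing interval are our choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import brentq

def calibrate(h_om_fm, g_fm, L_eff, A70, A52):
    """Solve eqs. (12)-(13) jointly for Cs and Ca at fm (e.g., 1 kHz).

    h_om_fm : |h_om(fm)| (scalar); g_fm : |gamma_i(fm)| over all filters.
    A70/A52 : amplitudes of 70 and 52 dB SPL sinusoids.
    """
    g2 = np.abs(g_fm) ** 2
    w = abs(h_om_fm) ** 2 * g2

    def residual(Cs):
        Ca = Cs * L_eff * g2.sum()                                    # eq. (12)
        return 1.0 / Cs - L_eff * (w * A52**2 / (w * A70**2 + Ca)).sum()  # eq. (13)

    Cs = brentq(residual, 1e-12, 1e9)   # generous bracket; an assumption
    return Cs, Cs * L_eff * g2.sum()
```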

5. MODEL EVALUATION AND COMPARISON WITH PSYCHOACOUSTICAL DATA

To show the validity of the model, some basic psychoacoustical data from listening experiments will be compared to model predictions. We will consider two cases, namely sinusoids masked by noise and sinusoids masked by sinusoids.

Masking of sinusoids has been measured in several experiments for both (white) noise maskers [12, 35] and for sinusoidal maskers [36]. Figure 2a shows masking curves predicted by the model for a white noise masker with a spectrum level of 30 dB/Hz for a long-duration signal (solid line) and a 200-millisecond signal (dashed line), with corresponding listening test data represented by circles [12] and asterisks [35], respectively. Figure 2b shows the predicted masking curve (solid line) for a 1 kHz 50 dB SPL sinusoidal masker along with corresponding measured masking data [36]. The model predictions are well in line with data for both sinusoidal and noise maskers, despite the fact that no tonality detector was included in the model such as is conventionally needed in masking models for audio coding (e.g., [1]). Only at lower frequencies is there a discrepancy between the data for the noise masker and the predictions by the model. The reason for this discrepancy may be that in psychoacoustical studies, running noise generators are used to generate the masker signal rather than a single noise realisation, as is done in audio coding applications. The latter case has, according to several studies, a lower masking strength [37]. This difference in masking strength is due to the inherent masker power fluctuations when a running noise is presented, which depend inversely on the product of time and bandwidth seen at the output of an auditory filter. The narrower the auditory filter (i.e., the lower its centre frequency), the larger these fluctuations will be and the larger the difference is expected to be.


Figure 2: (a) Masking curves predicted by the model for a white noise masker with a spectrum level of 30 dB/Hz for a long-duration signal (solid line) and a 200-millisecond signal (dashed line), with corresponding listening test data represented by the circles [12] and asterisks [35], respectively. (b) Masking curves for a 1 kHz 50 dB SPL sinusoidal masker. The dashed line is the threshold in quiet. Circles show data from [36]. (Both panels: masked threshold in dB SPL versus frequency in Hz.)

As can be seen in Figure 2, the relatively weaker masking power of a sinusoidal signal is predicted well by the model without the need for explicit assumptions about the tonality of the masker, such as those included in, for example, the ISO MPEG model [1]. Indeed, in the case of a noise masker (Figure 2a), the masker power within the critical band centred around 1 kHz (bandwidth 132 Hz) is approximately 51.2 dB SPL, whereas the sinusoidal masker (Figure 2b) has a power of 50 dB SPL. Nevertheless, predicted detection thresholds are considerably lower for the sinusoidal masker (35 dB SPL) than for the noise masker (45 dB SPL). The reason why the model is able to predict these data well is that for the tonal masker, the distortion-to-masker ratio is constant over a wide range of auditory filters. Due to the addition of within-channel distortion detectabilities, the total distortion detectability will be relatively large. In contrast, for a noise masker, only the filter centred on the distortion component will contribute to the total distortion detectability, because the off-frequency filters have a very low distortion-to-masker ratio. Therefore, the wideband noise masker will have a stronger masking effect. Note that for narrowband noise signals, the predicted masking power, in line with the argumentation for a sinusoidal masker, will also be weak. This, however, seems to be too conservative [38].

Figure 3: Masked thresholds predicted by the model (solid line) and psychoacoustical data (circles) [25], as a function of the number of components. Masked thresholds are expressed in dB SPL per component.

A specific assumption in this model is the integration of distortion detectabilities over a wide range of auditory filters. This should allow the model to predict correctly the threshold difference between narrowband distortion signals and more wideband distortion signals. For this purpose, an experiment is considered where a complex of tones had to be detected in the presence of masking noise [25]. The tone complex consisted of equal-level sinusoidal components with a frequency spacing of 10 Hz centred around 400 Hz. The masker was a 0–2 kHz noise signal with an overall level of 80 dB SPL. The number of components in the complex was varied from one up to 41. The latter case corresponds to a bandwidth of 400 Hz, which implies that the tone complex covers more than one critical band. Equation (9) was used to derive masked thresholds. As can be seen in Figure 3, there is a good correspondence between the model predictions and the data from [25]. Therefore, it seems that the choice of the linear addition made in (6) did not lead to large discrepancies between psychoacoustical data and model predictions.

To conclude this section, a comparison is made between predictions of the MPEG-1 Layer I model [1] and the model presented in this study, which incorporates spectral integration in masking. The MPEG model is one of a family of models used in audio coding that are based on spectral-spreading functions to model spectral masking. When the masking of a narrowband distortion signal is considered, it is assumed that the auditory filter that is spectrally centred on this distortion signal determines whether the distortion is audible or not. When the energy ratio between distortion signal and masking signal as seen at the output of this auditory filter is smaller than a certain criterion value, the distortion is inaudible. In this manner, the maximum allowable distortion signal level at each frequency can be determined, which constitutes the masking curve. An efficient implementation for calculating this masking curve is a convolution between the masker spectrum and a spreading function, both represented on a Bark scale. The Bark scale is a perceptually motivated frequency scale similar to the ERB-rate scale [39].

The spectral integration model presented here does not consider only a single auditory filter to contribute to the detection of distortions, but potentially a whole range of filters.


Figure 4: Masked thresholds predicted by the spectral integration model (dashed line) and the ISO MPEG model (solid line). The masking spectrum (dotted line) is for (a) a 1 kHz sinusoidal signal and (b) a short segment of a harpsichord signal. (Both panels: level in dB versus frequency in Hz.)

This can have a strong impact on the predicted masking curves. Figure 4a shows the masking curves for a sinusoidal masker at 1 kHz for the MPEG model (solid line) and the spectral integration model (dashed line). The spectrum of the sinusoidal signal is also plotted (dotted line), but scaled down for visual clarity. As can be seen, there is a reasonable match between both models, showing some differences at the tails. Figure 4b shows the masking curves in a similar way, but now resulting from a complex spectrum (part of a harpsichord signal). It can be seen that the masking curves differ systematically, with much smoother masking curves for the spectral integration model as compared to the MPEG model. For the spectral integration model, masking curves are considerably higher in spectral valleys. This effect is a direct consequence of the spectral integration assumption that was adopted in our model (cf. (6)). In the spectral valleys of the masker, distortion signals can only be detected using the auditory filter centred on the distortion, which will lead to relatively high masked thresholds. This is so because off-frequency filters will be dominated by the masker spectrum. However, detection of distortion signals at the spectral peaks of the masker is mediated by a range of auditory filters centred around the peak, resulting in relatively low masked thresholds. In this case the off-frequency filters will reveal similar distortion-to-masker ratios as the on-frequency filter. Thus, in the model proposed here, detection differences between peaks and troughs are smaller, resulting in smoother masking curves as compared to those observed in a spreading-based model such as the ISO MPEG model.

The smoothing effect is observed systematically in complex signal spectra typically encountered in practical situations, and represents the main difference between the spectral integration model presented here and existing spreading-based models.

6. APPLICATION TO SINUSOIDAL MODELLING

Sinusoidal modelling has proven to be an efficient technique for the purpose of coding speech signals [40]. More recently, it has been shown that this method can also be exploited for low-rate audio coding, for example, [41, 42, 43]. To account for the time-varying nature of the signal, the sinusoidal analysis/synthesis is done on a segment-by-segment basis, with each segment being modelled as a sum of sinusoids. The sinusoidal parameters have been selected with a number of methods, including spectral peak-picking [44], analysis-by-synthesis [41, 43], and subspace-based methods [42].

In this section we describe an algorithm for selecting sinusoidal components using the psychoacoustical model described in the previous section. The algorithm is based on the matching pursuit algorithm [45], a particular analysis-by-synthesis method. Matching pursuit approximates a signal by a finite expansion into elements (functions) chosen from a redundant dictionary. In the example of sinusoidal modelling, one can think of such functions as (complex) exponentials or as real sinusoidal functions. Matching pursuit is a greedy, iterative algorithm which searches the dictionary for the function that best matches the signal and subtracts this function (properly scaled) to form a residual signal to be approximated in the next iteration.

In order to determine which is the best matching function or dictionary element at each iteration, we need to formalise the problem. To do so, let $\mathcal{D} = (g_\xi)_{\xi\in\Gamma}$ be a complete dictionary, that is, a set of elements indexed by $\xi \in \Gamma$, where $\Gamma$ is an arbitrary index set. As an example, consider a dictionary consisting of complex exponentials $g_\xi = e^{i2\pi\xi(\cdot)}$. In this case, the index set is given by $\Gamma = [0, 1)$. Obviously, the indexing parameter $\xi$ is nothing more than the frequency of the complex exponential. Given a dictionary $\mathcal{D}$, the best matching function can be found by computing, for each and every function, the best approximation and selecting the function whose corresponding approximation is "closest" to the original signal.

In order to facilitate the following discussion, we assume without loss of generality that $\|g_\xi\| = 1$ for all $\xi$. Given a particular function $g_\xi$, the best possible approximation of the signal $x$ is obtained by the orthogonal projection of $x$ onto the subspace spanned by $g_\xi$ (see Figure 5). This projection is given by $\langle x, g_\xi\rangle g_\xi$. Hence, we can decompose $x$ as

$$ x = \langle x, g_\xi \rangle g_\xi + Rx, \qquad (14) $$

where $Rx$ is the residual signal after subtracting the projection $\langle x, g_\xi\rangle g_\xi$. The orthogonality of $Rx$ and $g_\xi$ implies that

$$ \|x\|^2 = \bigl|\langle x, g_\xi \rangle\bigr|^2 + \|Rx\|^2. \qquad (15) $$


Figure 5: Orthogonal projection of $x$ onto $\mathrm{span}(g_\xi)$.

We can do this decomposition for each and every dictionary element, and the best matching one is then found by selecting the element $g_{\xi'}$ for which $\|Rx\|$ is minimal or, equivalently, for which $|\langle x, g_\xi\rangle|$ is maximal. A precise mathematical formulation of this phrase is

$$ \xi' = \arg\sup_{\xi\in\Gamma} \bigl|\langle x, g_\xi \rangle\bigr|. \qquad (16) $$

It must be noted that the matching pursuit algorithm is only optimal for a particular iteration. If we subtract the approximation to form a residual signal and approximate this residual in a similar way as we approximated the original signal, then the two dictionary elements thus obtained are not jointly optimal; it is in general possible to find two different elements which together form a better approximation. This is a direct consequence of the greedy nature of the algorithm. The two dictionary elements which together are optimal could be obtained by projecting the signal $x$ onto all possible two-dimensional subspaces. This, however, is in general computationally very complex. An alternative solution to this problem is to apply, after each iteration, a Newton optimisation step [46].

To account for human auditory perception, the unit-norm dictionary elements can be scaled [43], which is equivalent to scaling the inner products in (16). We will refer to this method as the weighted matching pursuit (WMP) algorithm. While this method performs well, it can be shown that it does not provide a consistent selection measure for elements of finite time support [47]. Rather than scaling the dictionary elements, we introduce a matching pursuit algorithm where psychoacoustical properties are accounted for by a norm on the signal space. We will refer to this method as psychoacoustical matching pursuit (PAMP). As mentioned in Section 3 (see (11)), the perceptual distortion can be expressed as

$$ D = \sum_f \frac{\bigl|\varepsilon(f)\bigr|^2}{v^2(f)} = \sum_f a(f) \bigl|\varepsilon(f)\bigr|^2, \qquad (17) $$

where $a = v^{-2}$. It follows from (10) that

$$ a(f) = C_s L_{\mathrm{eff}} \sum_i \frac{ \bigl|h_{\mathrm{om}}(f)\bigr|^2 \bigl|\gamma_i(f)\bigr|^2 }{ N M_i + C_a }. \qquad (18) $$

Figure 6: Perceptual distortion associated with the residual signal after sinusoidal modelling as a function of the number of sinusoidal components that were extracted.

By inspection of (18), we conclude that $a$ is real and positive so that, in fact, the perceptual distortion measure (17) defines a norm

$$ \|x\|^2 = \sum_f a(f) \bigl|x(f)\bigr|^2. \qquad (19) $$

This norm is induced by the inner product

$$ \langle x, y \rangle = \sum_f a(f)\, x(f)\, y^*(f), \qquad (20) $$

facilitating the use of the distortion measure in selecting the perceptually best matching dictionary element in a matching pursuit algorithm. In Figure 6, the perceptual distortion associated with the residual signal is shown as a function of the number of real-valued sinusoids that have been extracted for a short segment of a harpsichord excerpt (cf. (11)). As can be seen, the perceptually most relevant components are selected first, resulting in a fast reduction of the perceptual distortion for the first components. For a detailed description, the reader is referred to [47, 48]. The fact that the distortion detectability defines a norm on the underlying signal space is important, since it allows for incorporating psychoacoustics in optimisation algorithms. Indeed, rather than minimising the commonly used $\ell_2$-norm, we can minimise the perceptually relevant norm given by (19). Examples include rate-distortion optimisation [32], linear predictive coding [49], and subspace-based modelling techniques [50].
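A compact sketch of the greedy selection loop under the perceptual inner product (20); the dictionary layout, the unit perceptual norm of the atoms, and the per-frame fixed masking curve are our assumptions, not details taken from [47, 48].

```python
import numpy as np

def pamp(x_spec, dictionary, a, num_components):
    """Greedy psychoacoustical matching pursuit using eqs. (16) and (20).

    dictionary : (num_atoms, N) complex atom spectra, assumed unit-norm under (19)
    a          : perceptual weights a(f) = 1/v^2(f), fixed from the original frame
    """
    ip = lambda u, v: (a * u * np.conj(v)).sum()     # inner product, eq. (20)
    residual = x_spec.astype(complex)
    selected = []
    for _ in range(num_components):
        coeffs = np.array([ip(residual, g) for g in dictionary])
        best = int(np.argmax(np.abs(coeffs)))        # best match, eq. (16)
        residual = residual - coeffs[best] * dictionary[best]   # eq. (14)
        selected.append((best, coeffs[best]))
    return selected, residual
```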

7. COMPARISON WITH THE ISO MPEG MODEL IN A LISTENING TEST

In this section we assess the performance of the proposed perceptual model in the context of sinusoidal parameter estimation. The PAMP method for estimating perceptually relevant sinusoids relies on the weighting function $a$ which, by definition, is the inverse of the masking curve. Equation (18) describes how to compute the masking curve for the proposed perceptual model. We compare the use of the proposed perceptual model in PAMP to the situation where the masking curve is computed using the MPEG-1 Layer I-II (ISO/IEC 11172-3) psychoacoustical model [1]. There are several reasons for comparison with the MPEG psychoacoustic model:


it provides a well-known reference and, because of its frequent application, it is still a de facto state-of-the-art model.

Using the MPEG-1 psychoacoustic model masking curve directly in the PAMP algorithm for sinusoidal extraction is not reasonable, because the MPEG-1 psychoacoustic model was developed to predict the masking curve in the case of noise maskees (distortion signals). It predicts for every frequency bin how much distortion can be added within the critical band centred around that frequency bin. This prediction is, however, too conservative in the case that distortions are sinusoidal in nature, since in this case the distortion energy is not spread over a complete critical band but is concentrated in one frequency bin only. Hence, we can adapt the MPEG-1 model by scaling the masking function with the critical bandwidth such that the model now predicts the detection thresholds in the case of sinusoidal distortion. The net effect of this compensation procedure is an increase of the masking curve at high frequencies by about 10 dB, thereby de-emphasizing high-frequency regions during sinusoidal estimation. In fact, this masking power increase at higher frequencies reduces the gap between the masking curves of the ISO MPEG model and the proposed model (cf. Figure 4). By applying this modification to the ISO MPEG model, and by extending the FFT order to the size of the PAMP dictionary, it is suited to be used in the PAMP method. The dictionary elements in our implementation of the PAMP method were real-valued sinusoidal functions windowed with a Hanning window, identical to the window used in the analysis-synthesis procedure described below.

In the following, we present results obtained by listening tests with audio signals. The signals are mono, sampled at 44.1 kHz, with each sample represented by 16 bits. The test excerpts are Carl Orff, Castanet, Celine Dion, Harpsichord Solo, contemporary pop music, and Suzanne Vega.

The excerpts were segmented into fixed-length frames of 1024 samples (corresponding to 23.2 milliseconds) with an overlap of 50% between consecutive frames, using a Hanning window. From each signal frame, a fixed number of perceptually relevant sinusoids was extracted using the PAMP method described above, where the perceptual weighting functions $a$ were generated from the masking curves derived from the proposed perceptual model (see (18)) and from the modified MPEG model described above, respectively. For the MPEG model we made use of the recommendations of MPEG Layer II, since these support input frame lengths of 1024 samples. The masking curves were calculated from the Hanning-windowed original signal contained within the same frame that is being modelled using the PAMP method. Finally, modelled frames were synthesized from the estimated sinusoidal parameters and concatenated to form modelled test excerpts, using a Hanning window-based overlap-add procedure.
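A minimal sketch of this segmentation and reconstruction, assuming the synthesized frames already carry the analysis window; the (approximately) constant overlap-add property of the Hanning window at 50% overlap then makes plain addition of the frames sufficient.

```python
import numpy as np

FRAME, HOP = 1024, 512          # 23.2 ms at 44.1 kHz, 50% overlap
window = np.hanning(FRAME)

def analysis_frames(x):
    """Hanning-windowed analysis frames with 50% overlap."""
    for start in range(0, len(x) - FRAME + 1, HOP):
        yield window * x[start:start + FRAME]

def overlap_add(modelled_frames, total_len):
    """Concatenate modelled (already windowed) frames by overlap-add."""
    y = np.zeros(total_len)
    for i, frame in enumerate(modelled_frames):
        y[i * HOP : i * HOP + FRAME] += frame
    return y
```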

To evaluate the performance of the proposed method, we used a subjective listening test procedure which is somewhat comparable to the MUSHRA test (multistimulus test with hidden reference and anchors) [51]. For each test excerpt, listeners were asked to rank 6 different versions: 4 excerpts modelled using the modified MPEG masking curve and fixed model orders (i.e., the number of sinusoidal components per segment) of K = 20, 25, 30, and 35, and one excerpt modelled using the proposed perceptual model with K = 25. In addition, to have a low-quality reference signal, an excerpt modelled with K = 30, but using the unmodified MPEG masking curve, was included. As a reference, the listeners also had the original excerpt available, which was identified to the subjects. Unlike the MUSHRA test, no hidden reference and no anchors were presented to the listeners.

Table 1: Scores used in subjective test.

Score    Equivalent
5        Best
4        Good
3        Medium
2        Poor
1        Poorest

The test excerpts were presented in a “parallel” way, using the interactive benchmarking tool described in [52] as an interface to the listeners. For each excerpt, listeners were requested to rank the different modelled signals on a scale from 1 to 5 (in steps of 0.1), as outlined in Table 1. The listeners were instructed to use the complete scale, such that the poorest-quality excerpt was rated with 1 and the highest-quality excerpt with 5. The excerpts were presented through high-quality headphones (Beyer-Dynamic DT990 PRO) in a quiet room, and the listeners could listen to each signal version as often as needed to determine the ranking. A total of 12 listeners participated in the listening test, of which 6 listeners worked in the area of acoustic signal processing and had previously participated in such tests. The authors did not participate in the test.

Figure 7 shows the overall scores of the listening test, averaged across all listeners and excerpts. The circles represent the median score, and the error bars depict the 25 and 75 percent ranges of the total response distributions. As can be seen, the excerpts generated with the proposed perceptual model (SiCAS@25) show better average subjective performance than any of the excerpts based on the MPEG psychoacoustic model, except for the MPEG case using a fixed model order of 35 (MPEG@35). As expected, the MPEG-based excerpts have decreasing quality scores for decreasing model order. Furthermore, the low-quality anchor (MPEG@30nt, i.e., the MPEG model without spectral tilt modification) received the lowest quality score on average. The statistical difference between the quality scores was analysed using a paired t-test at a significance level of p < 0.01, working on the score differences between the proposed perceptual model and each of the MPEG-based methods. The H0 hypothesis was that the mean of this difference distribution was zero (µ∆ = 0), while the alternative hypothesis H1 was that µ∆ > 0.

The statistical analysis supports the quality ordering suggested by Figure 7. In particular, there is a statistically significant improvement in using the proposed perceptual model (SiCAS@25) over any of the MPEG-based methods, except for MPEG@35, which performs better than SiCAS@25 (p < 7.0 · 10⁻³). In fact, the model presented here leads to a reduction of more than 20% in terms of the number of sinusoids needed to represent signals at a given quality level.

Figure 7: Subjective test results averaged across all listeners and excerpts. (Bar chart: median scores with 25/75 percent ranges for SiCAS@25, MPEG@35, MPEG@30, MPEG@25, MPEG@20, and MPEG@30nt, on the 1-5 scale from Poorest to Best.)
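Such a paired, one-sided test can be sketched as follows with SciPy (version 1.6 or later for the alternative keyword); the score arrays are invented placeholders, not the experimental data.

```python
import numpy as np
from scipy import stats

# Invented per-listener mean scores for two conditions (placeholders
# only, not the data reported above).
sicas25 = np.array([4.6, 4.2, 4.8, 4.1, 4.5, 4.3, 4.7, 4.0, 4.4, 4.2, 4.6, 4.1])
mpeg25  = np.array([3.9, 3.6, 4.1, 3.5, 3.8, 3.7, 4.0, 3.4, 3.8, 3.6, 3.9, 3.5])

# Paired t-test on the score differences: H0: mean difference is zero,
# H1: mean difference is greater than zero, as in the analysis above.
t_stat, p_value = stats.ttest_rel(sicas25, mpeg25, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.2e}")
```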

As mentioned already in Section 5, the most relevant difference between the proposed model and the ISO MPEG model is the incorporation of spectral integration properties in the proposed model. This leads to systematically smoother masking curves, as predicted by our model for complex masker spectra (cf. Figure 4). The effect is that fewer sinusoidal components are used for modelling the spectral valleys of a signal with the proposed perceptual model than with the ISO MPEG model. We think that this difference accounts for the improvement in modelling efficiency that we observed in the listening tests, and we expect that similar improvements would be observed if our approach were compared to other perceptual models based on the spectral-spreading approach, such as the one used in the ISO MPEG model.

8. CONCLUSIONS

In this paper we presented a psychoacoustical model that is suited for predicting masked thresholds for sinusoidal distortions. The model relies on signal detection theory and incorporates more recent insights about spectral and temporal integration in auditory masking. We showed that, as a consequence, the model is able to predict distortion detectabilities. In fact, the distortion detectability defines a (perceptually relevant) norm on the underlying signal space which is beneficial for optimisation algorithms such as rate-distortion optimisation or linear predictive coding. The model proves to be very suitable for application in the context of sinusoidal modelling, although it is also applicable in other audio coding contexts such as transform coding. A comparative listening test using a sinusoidal analysis method called psychoacoustical matching pursuit showed a clear preference for the model presented here over the ISO MPEG model [1]. More specifically, the model presented here leads to a reduction of more than 20% in terms of the number of sinusoids needed to represent signals at a given quality level.

ACKNOWLEDGMENTS

The authors would like to thank Nicolle H. van Schijndel, Gerard Hotho, Jeroen Breebaart, and the reviewers for their helpful comments on this manuscript. Furthermore, the authors thank the participants in the listening test. The research was supported by Philips Research, the Technology Foundation STW, Applied Science Division of NWO, the Technology Programme of the Dutch Ministry of Economic Affairs, and the EU project ARDOR, IST-2001-34095.

REFERENCES

[1] ISO/MPEG Committee, Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - part 3: Audio, 1993, ISO/IEC 11172-3.

[2] T. Yoshida, “The rewritable minidisc system,” Proc. IEEE, vol. 82, no. 10, pp. 1492–1500, 1994.

[3] A. Hoogendoorn, “Digital compact cassette,” Proc. IEEE, vol. 82, no. 10, pp. 1479–1489, 1994.

[4] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proc. IEEE, vol. 88, no. 4, pp. 451–515, 2000.

[5] K. N. Hamdy, M. Ali, and A. H. Tewfik, “Low bit rate high quality audio coding with combined harmonic and wavelet representation,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’96), vol. 2, pp. 1045–1048, Atlanta, Ga, USA, May 1996.

[6] S. N. Levine, Audio representations for data compression and compressed domain processing, Ph.D. thesis, Stanford University, Stanford, Calif, USA, 1998.

[7] H. Purnhagen and N. Meine, “HILN—the MPEG-4 parametric audio coding tools,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS ’00), vol. 2000, pp. 201–204, Geneva, Switzerland, May 2000.

[8] W. Oomen, E. Schuijers, B. den Brinker, and J. Breebaart, “Advances in parametric coding for high-quality audio,” in Proc. 114th AES Convention, Amsterdam, The Netherlands, March 2003, preprint 5852.

[9] F. P. Myburg, Design of a scalable parametric audio coder, Ph.D. thesis, Technische Universiteit Eindhoven, Eindhoven, The Netherlands, 2004.

[10] H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Boston, Mass, USA, 1992.

[11] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall Signal Processing Series, Prentice Hall, Englewood Cliffs, NJ, USA, 1993.

[12] J. E. Hawkins and S. S. Stevens, “The masking of pure tones and of speech by white noise,” Journal of the Acoustical Society of America, vol. 22, pp. 6–13, 1950.

[13] T. S. Verma, A perceptually based audio signal model with application to scalable audio coding, Ph.D. thesis, Stanford University, Stanford, Calif, USA, 1999.

[14] R. L. Wegel and C. E. Lane, “The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear,” Phys. Rev., vol. 23, pp. 266–285, 1924.

[15] H. Fletcher, “Auditory patterns,” Reviews of Modern Physics, vol. 12, no. 1, pp. 47–65, 1940.

[16] P. M. Sellick, R. Patuzzi, and B. M. Johnstone, “Measurements of BM motion in the guinea pig using the Mössbauer technique,” Journal of the Acoustical Society of America, vol. 72, pp. 131–141, 1982.

[17] J. P. Egan and H. W. Hake, “On the masking pattern of a simple auditory stimulus,” Journal of the Acoustical Society of America, vol. 22, pp. 622–630, 1950.

[18] G. K. Yates, I. M. Winter, and D. Robertson, “Basilar membrane nonlinearity determines auditory nerve rate-intensity functions and cochlear dynamic range,” Hearing Research, vol. 45, no. 3, pp. 203–220, 1990.

[19] R. D. Patterson, “Auditory filter shapes derived with noise stimuli,” Journal of the Acoustical Society of America, vol. 59, pp. 640–654, 1976.

[20] M. van der Heijden and A. Kohlrausch, “The role of envelope fluctuations in spectral masking,” Journal of the Acoustical Society of America, vol. 97, no. 3, pp. 1800–1807, 1995.

[21] M. van der Heijden and A. Kohlrausch, “The role of distortion products in masking by single bands of noise,” Journal of the Acoustical Society of America, vol. 98, no. 6, pp. 3125–3134, 1995.

[22] J. P. Egan, W. A. Lindner, and D. McFadden, “Masking-level differences and the form of the psychometric function,” Perception and Psychophysics, vol. 6, pp. 209–215, 1969.

[23] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics, Krieger, New York, NY, USA, 1974.

[24] S. Buus, E. Schorer, M. Florentine, and E. Zwicker, “Decision rules in detection of simple and complex tones,” Journal of the Acoustical Society of America, vol. 80, no. 6, pp. 1646–1657, 1986.

[25] A. Langhans and A. Kohlrausch, “Spectral integration of broadband signals in diotic and dichotic masking experiments,” Journal of the Acoustical Society of America, vol. 91, no. 1, pp. 317–326, 1992.

[26] S. van de Par, A. Kohlrausch, J. Breebaart, and M. McKinney, “Discrimination of different temporal envelope structures of diotic and dichotic target signals within diotic wide-band noise,” in Proc. 13th International Symposium on Hearing, pp. 334–340, Dourdan, France, August 2003.

[27] G. van den Brink, “Detection of tone pulse of various durations in noise of various bandwidths,” Journal of the Acoustical Society of America, vol. 36, pp. 1206–1211, 1964.

[28] S. van de Par and A. Kohlrausch, “Application of a spectrally integrating auditory filterbank model to audio coding,” in Fortschritte der Akustik, Plenarvorträge der 28. Deutschen Jahrestagung für Akustik, DAGA-02, pp. 484–485, Bochum, Germany, 2002.

[29] T. Dau, D. Püschel, and A. Kohlrausch, “A quantitative model of the ‘effective’ signal processing in the auditory system. I. Model structure,” Journal of the Acoustical Society of America, vol. 99, no. 6, pp. 3615–3622, 1996.

[30] R. D. Patterson, “The sound of a sinusoid: spectral models,” Journal of the Acoustical Society of America, vol. 96, no. 3, pp. 1409–1418, 1994.

[31] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990.

[32] R. Heusdens, J. Jensen, W. B. Kleijn, V. Kot, O. Niamut, S. van de Par, N. H. van Schijndel, and R. Vafin, “Sinusoidal coding of audio and speech,” in preparation for Journal of the Audio Engineering Society, 2005.

[33] B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, London, UK, 3rd edition, 1989.

[34] G. Charestan, R. Heusdens, and S. van de Par, “A gammatone based psychoacoustical modeling approach for speech and audio coding,” in Proc. ProRISC/IEEE: Workshop on Circuits, Systems and Signal Processing, pp. 321–326, Veldhoven, The Netherlands, November 2001.

[35] A. J. M. Houtsma, “Hawkins and Stevens revisited at low frequencies,” Journal of the Acoustical Society of America, vol. 103, no. 5, p. 2848, 1998.

[36] E. Zwicker and A. Jaroszewski, “Inverse frequency dependence of simultaneous tone-on-tone masking patterns at low levels,” Journal of the Acoustical Society of America, vol. 71, pp. 1508–1512, 1982.

[37] A. Langhans and A. Kohlrausch, “Differences in auditory performance between monaural and diotic conditions. I. Masked thresholds in frozen noise,” Journal of the Acoustical Society of America, vol. 91, pp. 3456–3470, 1992.

[38] S. van de Par and A. Kohlrausch, “Dependence of binaural masking level differences on center frequency, masker bandwidth and interaural parameters,” Journal of the Acoustical Society of America, vol. 106, pp. 1940–1947, 1999.

[39] E. Zwicker and H. Fastl, Psychoacoustics—Facts and Models, Springer, Berlin, Germany, 2nd edition, 1999.

[40] R. J. McAulay and T. F. Quatieri, “Sinusoidal coding,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., chapter 4, pp. 121–173, Elsevier Science B.V., Amsterdam, The Netherlands, 1995.

[41] M. Goodwin, “Matching pursuit with damped sinusoids,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’97), vol. 3, pp. 2037–2040, Munich, Germany, April 1997.

[42] J. Nieuwenhuijse, R. Heusdens, and E. F. Deprettere, “Robust exponential modeling of audio signals,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’98), vol. 6, pp. 3581–3584, Seattle, Wash, USA, May 1998.

[43] T. S. Verma and T. H. Y. Meng, “Sinusoidal modeling using frame-based perceptually weighted matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’99), vol. 2, pp. 981–984, Phoenix, Ariz, USA, May 1999.

[44] R. J. McAulay and T. F. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[45] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.

[46] K. Vos and R. Heusdens, “Rate-distortion optimal exponential modeling of audio and speech signals,” in Proc. 21st Symposium on Information Theory in the Benelux, pp. 77–84, Wassenaar, The Netherlands, May 2000.

[47] R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,” IEEE Signal Processing Lett., vol. 9, no. 8, pp. 262–265, 2002.

[48] R. Heusdens and S. van de Par, “Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’02), vol. 2, pp. 1809–1812, Orlando, Fla, USA, May 2002.

[49] R. C. Hendriks, R. Heusdens, and J. Jensen, “Perceptual linear predictive noise modelling for sinusoid-plus-noise audio coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP ’04), vol. 4, pp. 189–192, Montreal, Quebec, Canada, May 2004.

[50] J. Jensen, R. Heusdens, and S. H. Jensen, “A perceptual subspace approach for modeling of speech and audio,” IEEE Trans. Speech Audio Processing, vol. 12, no. 2, pp. 121–132, 2004.

[51] ITU, ITU-R BS 1534. Method for subjective assessment of intermediate quality level of coding systems, 2001.

[52] O. A. Niamut, Audio codec Benchmark manual, Department of Mediamatics, Delft University of Technology, January 2003.


Steven van de Par studied physics at the Eindhoven University of Technology (TU/e), and received his Ph.D. degree in 1998 from the Institute for Perception Research on a topic related to binaural hearing. As a Postdoctoral Researcher at the same institute, he studied auditory-visual interaction, and he was a Guest Researcher at the University of Connecticut Health Center. In the beginning of 2000 he joined Philips Research, Eindhoven. His main fields of expertise are auditory and multisensory perception and low-bit-rate audio coding. He has published various papers on binaural detection, auditory-visual synchrony perception, and audio-coding-related topics. He participated in several projects on low-bit-rate audio coding based on sinusoidal techniques and is presently participating in the EU Adaptive Rate-Distortion Optimized Audio codeR (ARDOR) project.

Armin Kohlrausch studied physics at the University of Göttingen, Germany, and specialized in acoustics. He received his M.S. degree in 1980 and his Ph.D. degree in 1984, both on perceptual aspects of sound. From 1985 until 1990 he worked at the Third Physical Institute, University of Göttingen, and was responsible for research and teaching in the fields of psychoacoustics and room acoustics. In 1991 he joined the Philips Research Laboratories, Eindhoven, and worked in the Speech and Hearing Group of the Institute for Perception Research (IPO). Since 1998, he has combined his work at Philips Research Laboratories with a Professor position for multisensory perception at the TU/e. In 2004 he was appointed a Research Fellow of Philips Research. He is a member of a great number of scientific societies, both in Europe and the USA. Since 1998 he has been a Fellow of the Acoustical Society of America and currently serves as an Associate Editor for the Journal of the Acoustical Society of America, covering the areas of binaural and spatial hearing. His main scientific interests are in the experimental study and modelling of auditory and multisensory perception in humans and the transfer of this knowledge to industrial media applications.

Richard Heusdens is an Associate Professor in the Department of Mediamatics, Delft University of Technology. He received his M.S. and Ph.D. degrees from the Delft University of Technology, the Netherlands, in 1992 and 1997, respectively. In the spring of 1992 he joined the Digital Signal Processing Group, Philips Research Laboratories, Eindhoven, the Netherlands. He has worked on various topics in the field of signal processing, such as image/video compression and VLSI architectures for image-processing algorithms. In 1997, he joined the Circuits and Systems Group, Delft University of Technology, where he was a Postdoctoral Researcher. In 2000, he moved to the Information and Communication Theory (ICT) Group, where he became an Assistant Professor responsible for the audio and speech processing activities within the ICT Group. Since 2002, he has been an Associate Professor. The research projects he is involved in cover subjects such as audio and speech coding, speech enhancement, and digital watermarking of audio.

Jesper Jensen received the M.S. and Ph.D. degrees from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively, both in electrical engineering. From 1996 to 2001, he was with the Center for PersonKommunikation (CPK), Aalborg University, as a Researcher, Ph.D. student, and Assistant Research Professor. In 1999, he was a Visiting Researcher at the Center for Spoken Language Research, University of Colorado at Boulder. Currently, he is a Postdoctoral Researcher at Delft University of Technology, Delft, the Netherlands. His main research interests are in digital speech and audio signal processing, including coding, synthesis, and enhancement.

Søren Holdt Jensen received the M.S. degree in electrical engineering from Aalborg University, Denmark, in 1988, and the Ph.D. degree from the Technical University of Denmark in 1995. He has been with the Telecommunications Laboratory of Telecom Denmark, the Electronics Institute of the Technical University of Denmark, the Scientific Computing Group of the Danish Computing Center for Research and Education (UNI-C), the Electrical Engineering Department of Katholieke Universiteit Leuven, Belgium, and the Center for PersonKommunikation (CPK) of Aalborg University, and he is currently an Associate Professor in the Department of Communication Technology, Aalborg University. His research activities are in digital signal processing, communication signal processing, and speech and audio processing. He is a Member of the Editorial Board of the EURASIP Journal on Applied Signal Processing, and a former Chairman of the IEEE Denmark Section and the IEEE Denmark Section's Signal Processing Chapter.