
Perceptual Distortion Maps for Room Reverberation

Thomas Zarouchas1, John Mourjopoulos1,

1 Audio and Acoustic Technology Group, Wire Communications Laboratory, Electrical Engineering and Computer Engineering Department, University of Patras, 26500, Greece

[email protected], [email protected]

ABSTRACT

From reverberated audio signals, and using the input (anechoic) audio as reference, a number of distortion maps are extracted, indicating how room reverberation distorts perceived features of the received signal across time-frequency scales. These maps are simplified to describe the monaural time-frequency / level distortions and the distortion of the spatial cues (i.e., inter-channel cues and coherence), which are very important for sound localization in a reverberant environment. These maps are studied here as functions of room parameters (size, acoustics, source-receiver distance, etc.) as well as of input signal properties. Overall perceptual distortion ratings are produced and reverberation-resilient signal features are extracted.

1. INTRODUCTION

Room acoustics introduce reverberation to audio signals, which is usually described formally by linear system response functions (e.g., convolutional input/output relationships using appropriate room impulse responses). Such an approach describes, up to a certain degree, features of reverberation that are important from a signal processing perspective [1, 2, 3]. However, the perception of reverberation is a very complex phenomenon, resulting from time-frequency, delay, level, directional and signal-dependent cues [4, 5, 6, 7]. Currently, there is a significant gap between the objective and the subjective approaches for analyzing such phenomena. This work extends earlier published results on signal processing-based methodologies for dealing with room reverberation [8, 9], as well as recent attempts to introduce perceptually motivated models for similar applications. Specifically, in this work, a Computational Auditory Masking Model (CAMM) [10, 11], complemented by a novel Inter-channel Cue Mapping Module (ICMM), is used for the perceptual description of reverberation distortions in audio signals and of the degradations of the stereo image in a typical audio reproduction setup.

The proposed method requires as inputs the anechoic audio signal and the corresponding reverberant signal. According to this approach, it is possible to locate, from the evaluated "internal representations" of the corresponding auditory model, time-frequency regions with significant degradation due to reverberation. Furthermore, the output of the inter-channel cue processing module indicates the modification of the relevant spatial cues due to reverberation. In both cases, the outputs of the CAMM and the ICMM are presented in the form of time-frequency (2-D) maps [12].

The paper is organized as follows: in Section 2, the analysis scheme for the extraction of the distortion maps of room reverberation is presented. In Section 3, the utilization of the Computational Auditory Masking Model to derive the distortion maps due to reverberation is described. In Section 4, the Inter-channel Cue Mapping Module is presented. Simulation results are given in Section 5 and, finally, conclusions are drawn in Section 6.

2. DISTORTION MAPS FOR REVERBERATION

The proposed structure for the evaluation of the distortion maps of room reverberation is shown in Fig. 1 for stereo reproduction. The concept can be extended to multi-channel audio signal reproduction.


Zarouchas, Mourjopoulos REVERBERATION DISTORTION MAPS


Figure 1: Analysis scheme for the assessment of the perceptual distortion cues

The scheme shown in Fig. 1 employs a monaural masking model for the estimation of the perceived distortion due to reverberation, complemented by an Inter-channel Cue Mapping Module (ICMM) for the evaluation of the alterations in the relevant spatial cues.

Inputs to the proposed method are the source audio signals and estimates or measurements of the reverberant audio signals generated in any real room. For the analysis of both the monaural (CAMM) and the spatial (ICMM) cues, the signals are transformed into the time-frequency domain. For this processing, a novel filterbank with near-perfect reconstruction properties is utilized, enabling flexible signal modification. This filterbank provides non-uniform analysis bands, with sufficient frequency resolution to capture the perceptually relevant cues at low frequencies, following closely the critical band / ERB scale. The sub-band domain signals s_k(n) are obtained as:

    s_k(n) = \sum_{m=0}^{M-1} s(n-m) \, h(m) \, \cos[\Theta(k, K, m, \varphi)]    (1)

where s(n) is the input signal, M is the length of the prototype filter h(n), and the modulation phase \Theta can be considered as a function of the sub-band index k, the number of sub-bands K and a phase parameter \varphi [15, 16].
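As an illustration, a minimal cosine-modulated analysis bank in the spirit of Eq. (1) can be sketched as follows. This is not the paper's actual filterbank (which follows [15, 16] and the ERB scale); the sine-window prototype, the number of bands K, the filter length M and the phase offset are illustrative assumptions.

```python
import numpy as np

def analysis_filterbank(s, K=32, M=512, phase=16):
    """Cosine-modulated analysis filterbank sketch of Eq. (1).
    The prototype h and the phase term are illustrative assumptions."""
    m = np.arange(M)
    h = np.sin(np.pi * (m + 0.5) / M)          # toy prototype lowpass filter
    subbands = np.zeros((K, len(s)))
    for k in range(K):
        # modulate the prototype up to the centre of the k-th band
        g = h * np.cos(np.pi / K * (k + 0.5) * (m - phase))
        subbands[k] = np.convolve(s, g, mode="full")[:len(s)]
    return subbands

# a 1 kHz tone at fs = 44100 Hz should concentrate its energy
# in the low band whose centre frequency is closest to 1 kHz
fs = 44100
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 1000 * t)
bands = analysis_filterbank(x)
print(bands.shape)  # (32, 4096)
```

With K = 32 uniform bands this sketch ignores the non-uniform ERB-like spacing of the paper's filterbank, but it shows the sub-band decomposition that the later equations operate on.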

3. DISTORTION MAPS BASED ON MONAURAL MASKING MODEL

The Computational Auditory Masking Model used here can successfully emulate many aspects of the monaural signal processing of the auditory system. Inputs to the auditory model are a single channel of the original (anechoic) audio signal and the corresponding reverberant audio signal generated in any real room, as can be measured via an omnidirectional microphone; the model was described in detail in [14]. Alternatively, a simulated reverberant signal may be obtained via convolution with a measured room impulse response. The CAMM derives the monaural "internal representations" of the audio signal in a number of frequency bands, which are input to a Decision-Threshold Device (DTD). The output of the DTD represents time-frequency maps with significant reverberation distortions. The concept of the DTD is based on the Just Noticeable Intensity Difference of the internal signal representations.

The detailed structure for the evaluation of the distortions due to reverberation is shown in Figure 2. Inputs to the filterbank are the original (source) single-channel audio signal x(n) and the corresponding reverberant single-channel audio signal \tilde{x}(n). The sub-band signals x_k(n) and \tilde{x}_k(n) (see Eq. (1)) are then input to the CAMM, producing the monaural "internal representations" z_k(n) and \tilde{z}_k(n) respectively.

The Decision-Threshold Device (DTD), together with a set of thresholds T_k(n), is utilized to extract the difference \Delta_k(n), according to

    \Delta_k(n) = |z_k(n) - \tilde{z}_k(n)|    (2)

and therefore to derive the parameter

    D_k(n) = \Delta_k(n) - T_k(n)    (3)

The parameter D_k(n) indicates the degree of the perceived distortion due to reverberation above the specified threshold, when D_k(n) > 0, in the time-frequency domain for single-channel audio signals, and generally

    0 \le D_k(n) \le 1    (4)

It is clear that the CAMM can easily be extended to evaluate the parameters D_k(n) separately for the two channels of a stereo signal. In such a case, however, any binaural masking mechanisms are not considered.
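A minimal sketch of the threshold comparison of Eqs. (2)-(4), assuming the internal representations are already available as time-frequency arrays, and assuming (as one plausible reading of Eq. (4)) that the supra-threshold values are clipped to [0, 1]:

```python
import numpy as np

def dtd_distortion_map(z, z_rev, T):
    """Decision-Threshold Device sketch (Eqs. 2-4).
    z, z_rev: internal representations of the anechoic and reverberant
    signals, shape (bands, samples); T: per-band, per-sample thresholds.
    Clipping D to [0, 1] is an assumption to match the range of Eq. (4)."""
    delta = np.abs(z - z_rev)          # Eq. (2): representation difference
    D = delta - T                      # Eq. (3): supra-threshold distortion
    return np.clip(D, 0.0, 1.0)       # Eq. (4): 0 <= D_k(n) <= 1

rng = np.random.default_rng(0)
z = rng.random((4, 8))                 # toy "internal representation"
D = dtd_distortion_map(z, z + 0.3, np.full((4, 8), 0.1))
print(D.min(), D.max())
```

Here the reverberant representation differs from the reference by a constant 0.3, so with a threshold of 0.1 the whole map sits at a distortion level of 0.2.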


Figure 2: Analysis scheme for processing of reference and reverberant audio signals in order to derive the time-frequency map of parameter D_k(n)

4. DISTORTION MAPS BASED ON INTER-CHANNEL CUE MAPPING

Inputs to the Inter-channel Cue Mapping Module are the stereo channels of both the source audio signal and the corresponding reverberant signal. The relevant spatial cues [4, 13] examined here are the inter-channel level difference (ICLD), the inter-channel time difference (ICTD) and the inter-channel coherence (ICC). These are derived for all signals and channels independently, in each frequency band and as a function of time (see Figure 3).

Figure 3: Analysis scheme of the Inter-channel Cue Mapping Module

Short-time estimates of the power of each channel (i.e., left and right for a typical stereo setup) and of each sub-band are computed over a window of N samples, according to:

    p_{x^R}(k,n) = \sum_{m=1}^{N} [x_k^R(n-m)]^2
    p_{x^L}(k,n) = \sum_{m=1}^{N} [x_k^L(n-m)]^2    (5)

and a cross-power estimate between the two channels (left and right) is also computed, according to

    p_{x^R x^L}(k,n) = \sum_{m=1}^{N} x_k^R(n-m) \, x_k^L(n-m)    (6)

As is known [4, 13], the Inter-Channel Level Difference (ICLD, in dB) denotes the level/intensity difference between the two (left and right) channels:

    ICLD(k,n) = 10 \log_{10} \frac{p_{x^R}(k,n)}{p_{x^L}(k,n)}    (7)

with a typical level range of

    -\Delta \le ICLD(k,n) \le \Delta    (8)
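For a single sub-band, Eqs. (5) and (7) reduce to a windowed power ratio, which can be sketched as follows; the window length N and the small eps regularization guarding the logarithm are implementation assumptions:

```python
import numpy as np

def icld(xr, xl, n, N=1024, eps=1e-12):
    """Short-time ICLD in dB for one sub-band (Eqs. 5 and 7).
    xr, xl: right/left sub-band signals; n: current sample index;
    N: window length; eps guards the log against empty channels."""
    pr = np.sum(xr[n - N:n] ** 2)    # Eq. (5), right-channel power
    pl = np.sum(xl[n - N:n] ** 2)    # Eq. (5), left-channel power
    return 10.0 * np.log10((pr + eps) / (pl + eps))  # Eq. (7)

t = np.arange(4096)
right = np.sin(0.1 * t)
left = 0.5 * np.sin(0.1 * t)         # left channel at half amplitude
print(round(icld(right, left, 4096), 1))  # 6.0
```

Halving the amplitude of one channel quarters its power, so the ICLD comes out at 10 log10(4) ≈ 6 dB, as expected.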

The Inter-Channel Time Difference (ICTD, in samples) describes the time difference between the two channels, and is the time lag at which the maximum of a short-time estimate of the normalized cross-correlation function occurs. The normalized cross-correlation is given by:

    \Phi_{x^R x^L}(k,n,n_0) = \frac{p_{x^R x^L}(k,n,n_0)}{\sqrt{p_{x^R}(k,n) \, p_{x^L}(k,n)}}    (9)

where p_{x^R x^L}(k,n,n_0) denotes the cross-power estimate of Eq. (6) evaluated at a relative lag of n_0 samples between the channels, so that ICTD(k,n) is the lag n_0 maximizing \Phi_{x^R x^L}(k,n,n_0). The ICTD has a typical time range (in samples) of

    -n_{max} \le ICTD(k,n) \le n_{max}    (10)

The Inter-Channel Coherence (ICC) defines the coherence between the two channels (x^R and x^L), and can be expressed as:

    ICC(k,n) = \max_{n_0} |\Phi_{x^R x^L}(k,n,n_0)|    (11)

considering the maximum absolute value of the instantaneous normalized cross-correlation. The ICC has a range of:

    0 \le ICC(k,n) \le 1    (12)

where 1 indicates that x^R and x^L are perfectly coherent.


Based on the definitions of Equations (7)-(12), the spatial cues are evaluated for both the source and the reverberant signal (the latter denoted by a tilde), with:

    -\Delta_1 \le ICLD(k,n) \le \Delta_1   and   -\Delta_2 \le \widetilde{ICLD}(k,n) \le \Delta_2
    -n_1 \le ICTD(k,n) \le n_1   and   -n_2 \le \widetilde{ICTD}(k,n) \le n_2
    0 \le ICC(k,n) \le 1   and   0 \le \widetilde{ICC}(k,n) \le 1    (13)

From these maps, which correspond to the source and the received signals, distortion maps are evaluated (in the time-frequency domain), defined by the differences between them. Therefore, a differential metric (distortion map) is introduced for each cue, according to:

    \Lambda_k(n) = ICLD(k,n) - \widetilde{ICLD}(k,n)
    \Lambda_k^t(n) = ICTD(k,n) - \widetilde{ICTD}(k,n)
    \Lambda_k^c(n) = ICC(k,n) - \widetilde{ICC}(k,n)    (14)

Based on Equations (8), (10), (12) and (14), the typical level, time and coherence ranges for the differential metrics are:

    -(\Delta_1 + \Delta_2) \le \Lambda_k(n) \le \Delta_1 + \Delta_2
    -(n_1 + n_2) \le \Lambda_k^t(n) \le n_1 + n_2
    -1 \le \Lambda_k^c(n) \le 1    (15)

Hence, the output of the ICMM, according to Equation (15), indicates the variation of the inter-channel cues in the time-frequency domain for both signals, in the form of differential cue maps. Typical differential maps for a number of test cases are shown in Figures 6 and 7. For the differential metrics (\Lambda_k(n), \Lambda_k^t(n), \Lambda_k^c(n)) and for the distortion parameter D_k(n), the mean values can also be evaluated on a frame-by-frame basis for each test case. Additionally, a logarithmic expression of the corresponding mean values (except for the level differential \Lambda_k(n), which is already expressed in dB) can be estimated, according to:

    X_{dB}(i) = 20 \log_{10} |\bar{X}_k(i)|,   i = 1, ..., M    (16)

where \bar{X}_k(i) is the mean value of the corresponding differential metric or distortion parameter in frame i, for a number of M frames and a typical frame length of 1024 samples, leading to a simplified interpretation of the overall perceived, signal-dependent distortion which can be assigned to each map.
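A sketch of this frame-averaged summary, for one sub-band of any of the metrics; the eps term and the absolute value inside the logarithm are implementation assumptions made to keep the expression defined for zero or negative frame means:

```python
import numpy as np

def overall_metric_db(metric_map, frame_len=1024, eps=1e-12):
    """Frame-mean dB summary of a differential metric (Eq. 16 sketch).
    metric_map: one sub-band's metric over time; frames of frame_len
    samples are averaged and converted to dB."""
    M = len(metric_map) // frame_len
    frames = metric_map[:M * frame_len].reshape(M, frame_len)
    means = np.abs(frames.mean(axis=1))        # per-frame mean value
    return 20.0 * np.log10(means + eps)        # Eq. (16), in dB

x = np.full(4096, 0.1)                         # constant metric of 0.1
print(np.round(overall_metric_db(x), 1))       # [-20. -20. -20. -20.]
```

A constant metric of 0.1 over four 1024-sample frames maps to 20 log10(0.1) = -20 dB per frame, matching the order of magnitude of the per-room values reported in Tables 2 and 3.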

5. TESTS AND RESULTS

Preliminary tests were conducted using as reference typical stereo, 16-bit resolution signals at fs = 44100 Hz. These tests used input (reference) audio signals of different categories, i.e. big band jazz (JAZZ), solo classical piano (PIANO) and male speech (SPEECH), and, as a second input, the corresponding signals recorded under various reverberation conditions in different real enclosures (see Table 1), ranging from an acoustically treated laboratory to a large sports hall.

From this set of distortion maps, the local variations and the overall distortion metrics for each specific test case are evaluated, for typical audio signals and reverberation conditions.

Room | Dimensions L x W x H (m) | RT (s, freq. avg.) | Type
R1   | 7.15 x 4.60 x 2.90       | 0.368              | Laboratory
R2   | 10.20 x 7.05 x 2.65      | 1.1                | Classroom
R3   | 60 x 42 x 13.8           | 6.4                | Sports hall

Table 1: Properties of rooms used for tests

Table 2 indicates the variation of the monaural masking distortion parameter D_k(n) (dB) for different audio signals, recorded in three different enclosures with varying acoustical properties. As shown, room R3 (the large sports hall) exhibits a higher degree of perceived distortion for all types of signal.

Signal | R1     | R2    | R3
JAZZ   | -12.57 | -9.26 | -5.03
PIANO  | -7.15  | -5.12 | -3.93
SPEECH | -3.72  | -1.05 | 0.73

Table 2: Monaural masking distortion parameter D_k(n) (dB) for different real enclosures and different audio signals

The variation of the distortion parameter D_k(n) (dB) and the corresponding CAMM-based distortion map, for a reverberant audio signal segment recorded in room R3, are shown in Figure 4. As can be observed in Fig. 4(c), the frequency-averaged distortion metric D_k(n) increases during the reverberant decay of the piano note (shown without reverberation in Fig. 4(a)), indicating the increase of the perceived distortion due to the reverberant tail. The corresponding 2-D perceptually motivated map of Fig. 4(d) gives a more detailed illustration of the corresponding time-frequency distortions, indicating in red signal regions with a higher degree of perceived distortion.

The effect of different room acoustics on the variation of the distortion parameter D_k(n) is shown in Figure 5. Note that the dashed line in each case indicates the frequency-averaged mean value of the perceived distortion for the corresponding audio signal segment. It is clear that the mean metric increases with reverberation time, and the perceived effects of reverberation are more pronounced for the larger rooms (Fig. 5(a) and 5(b)) than for the acoustically treated room (Fig. 5(c)). Furthermore, heavy reverberation seems to lower the frequency and depth of the modulations of the perceived distortion.

Table 3 shows the variation of the inter-channel differential metrics for the above enclosures, using PIANO as the input signal. As shown, the divergence between rooms R1 and R3 in the inter-channel coherence differential \Lambda_k^c(n) and the inter-channel time differential \Lambda_k^t(n) metrics is close to 3 dB.

Differential metric      | R1     | R2     | R3
Coherence \Lambda_k^c(n) | -32.15 | -30.52 | -29.30
Level \Lambda_k(n)       | 0.08   | 1.64   | 2.09
Time \Lambda_k^t(n)      | -26.88 | -23.54 | -23.99

Table 3: Differential metrics for different real enclosures and PIANO as a test signal

For the differential inter-channel metrics, similar overall trends as with the previously described monaural metric can be observed in Figures 6 and 7, corresponding to the acoustically treated room (Fig. 6) and the large sports hall (Fig. 7). For low reverberation, all differential inter-channel metrics display low dispersion (the distortion maps have large time-frequency regions close to green, i.e. 0 dB for the \Lambda_k(n) metric), and deviations around this value are low. Under heavy reverberation conditions, higher dispersion can be observed for each differential inter-channel metric (Fig. 7).

6. CONCLUSIONS

This work illustrates the efficiency of the proposed Computational Auditory Masking Model and of the novel Inter-channel Cue Mapping Module in describing, with appropriate 2-D maps, the perceived distortion and the general degradation of audio signals due to reverberation. As was shown, signal-dependent perceived distortions can be isolated into specific time-frequency regions.


Figure 4: (a) source audio signal segment (PIANO), (b) corresponding reverberant signal recorded in room R3, (c) distortion parameter D_k(n) (dB), (d) distortion map based on CAMM, for the reverberant signal

These distortion maps illustrate different aspects of perceived degradations, from monaural masking due to reverberant decay, to inter-channel level, time and coherence variations in stereo signals.

Figure 5: Distortion parameter for PIANO audio segment, (a) room R3, (b) room R2, (c) room R1. Dashed line indicates overall mean value

This detailed identification of the distortions can allow novel signal processing methods to evolve, so that such distortions can be suppressed either on their own or in conjunction with the more traditional inverse-filter-based methods [14].

It is also promising that both the short-term and the long-term trends of the proposed distortion metrics seem to follow the trends in the established physical acoustical parameters of the recorded space. However, unlike existing acoustical measurements, the proposed distortion maps vary dynamically with the signal evolution and are dependent on the specific audio signal. Hence, such maps may help to reconsider the problem of reverberation from a signal processing perspective that is closer to perception and to the specific signal reproduced inside such an enclosure.


Figure 6: Differential cue mapping, (a) inter-channel coherence, (b) inter-channel level difference, (c) inter-channel time difference, for room R1 and JAZZ as test signal

Future work will examine the variation and sensitivity of the differential metrics introduced here with respect to different source-receiver positions, leading to a hierarchical structure of the relative importance of each differential metric.

Figure 7: Differential cue mapping, (a) inter-channel coherence, (b) inter-channel level difference, (c) inter-channel time difference, for room R3 and JAZZ as test signal

7. REFERENCES

[1] M. R. Schroeder, B. F. Logan, “Colorless Artificial Reverberation”, Journal of the Audio Engineering Society, Vol. 9, p. 192, 1961


[2] S. T. Neely, J. B. Allen, “Invertibility of a Room Impulse Response”, Journal of the Acoustical Society of America, Vol. 66, pp. 165-169, 1979.

[3] J. N. Mourjopoulos , “Digital Equalization of Room Acoustics”, Journal of the Audio Engineering Society, Vol. 42, No 11, pp. 884-900, 1994.

[4] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization (Revised Edition), The MIT Press, Cambridge, MA, USA, 1997.

[5] J. M. Buchholz, J. Mourjopoulos, J. Blauert, “Room Masking: Understanding and Modeling the Masking of Room Reflections”, 110th AES Convention, Amsterdam, May 2001, preprint (5312).

[6] R. H. Bolt, A. D. MacDonald, “Theory of Speech Masking by Reverberation”, Journal of the Acoustical Society of America, Vol. 21(6), pp. 577-580, 1949.

[7] F. E. Toole, “Loudspeakers and Rooms for Sound Reproduction – A Scientific Review”, Journal of the Audio Engineering Society, Vol. 54, No. 6, June 2006.

[8] J. L. Flanagan, R. C. Lummis, “Signal Processing to Reduce Multipath Distortions in Small Rooms”, Journal of the Audio Engineering Society, Vol. 47, pp. 1475-1481, 1970.

[9] J. B. Allen, D. A. Berkley, J. Blauert, “Multimicrophone Signal Processing Technique to Remove Room Reverberation from Speech Signals”, Journal of the Acoustical Society of America, Vol. 64(2), pp. 912-915, 1977.

[10] J. M. Buchholz, J. Mourjopoulos, “A Computational Auditory Masking Model Based on Signal-Dependent Compression. I. Model Description and Performance Analysis”, Acta Acustica United with Acustica, Vol. 90, pp.873-886, (2004).

[11] J. M. Buchholz, J. Mourjopoulos, “A Computational Auditory Masking Model Based on Signal-Dependent Compression. II. Model Simulations and Analytical Approximations”, Acta Acustica United with Acustica, Vol. 90, pp.887-900, (2004).

[12] S. Harding, J. Barker, G. J. Brown, "Mask Estimation for Missing Data Speech Recognition Based on Statistics of Binaural Interaction", IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 1, January 2006.

[13] C. Faller, J. Merimaa, “Source Localization in Complex Listening Situations: Selection of Binaural Cues Based on Interaural Coherence”, Journal of the Acoustical Society of America, Vol. 116(5), pp. 3075-3089 November 2004.

[14] T. Zarouchas, J. Mourjopoulos, J. Buchholz, P. Hatziantoniou, “A Perceptual Measure for Assessing and Removing Reverberation from Audio Signals”, 120th AES Convention, Paris, May 2006, preprint (6702).

[15] ISO/IEC 11172-3, “Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part 3: Audio”.

[16] J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, "Parametric Coding of Stereo Audio", EURASIP Journal on Applied Signal Processing, Vol. 2005, Issue 9, pp. 1305-1322.