
Two-scale Auditory Feature Based Non-intrusive Speech Quality Evaluation

Kartik Audhkhasi and Arun Kumar¹

Electrical Engineering Department, University of Southern California, Los Angeles, California, USA; ¹Center for Applied Research in Electronics, Indian Institute of Technology Delhi, New Delhi - 110 016, India

ABSTRACT

This paper proposes a novel two-scale auditory feature based algorithm for non-intrusive evaluation of speech quality. The neuron firing probabilities along the length of the basilar membrane, from an explicit auditory model, are used to extract features from the distorted speech signal. This is in contrast to previous methods, which either use standard vocal tract based features, or incorporate only some aspects of the human auditory perception mechanism. The features are extracted at two scales, namely a global scale spanning all voiced frames in an utterance, and a local scale spanning voiced frames from contiguous voiced segments in the utterance. This is followed by a simple information fusion at the score level using Gaussian Mixture Models (GMMs). The use of an explicit auditory model to extract features is based on the premise that similar processing (in a qualitative sense) happens in human speech perception. In addition, auditory feature extraction at two scales incorporates the effects of both long term and short term distortions on speech quality. The proposed algorithm is shown to perform at least as well as the ITU-T Recommendation P.563.

Keywords: Cochlear model, Mean opinion score, Non-intrusive measurement, Objective speech quality measurement.

1. INTRODUCTION

In modern speech communication networks, Quality of Service (QoS) is a very important factor, both from the service provider’s and the customer’s perspective. It is important to maintain a certain minimum QoS to ensure customer satisfaction. The objective evaluation of speech quality is, therefore, an important research problem. Speech quality is ideally a subjective measure determined by factors such as naturalness, intelligibility etc. The task of a speech quality evaluation algorithm is to come up with an objective score to signify the overall quality of the speech signal. Equipped with this score, the receiver can take decisions to switch to different modes of operation, and can also direct the transmitter to do the same. For example, in an Adaptive Multi-Rate (AMR) codec [1], there are several modes of operation which allocate different proportions of the available bandwidth to source and channel coding. If the received speech quality falls below a prescribed level, the receiver can instruct the transmitter to allocate a greater share to channel coding. This can ensure that the speech quality is maintained above a certain minimum level. Generally, subjective measures are regarded as providing the best or “reference” estimates of the perceived quality of a speech utterance. However, they require human subjects, which makes the evaluation time consuming and unsuitable for autonomous applications. Objective measures, on the other hand, are computed using algorithms which do not require human intervention. There are two basic classes of objective speech quality evaluation algorithms, namely, intrusive, where both the clean and degraded utterances are required, and non-intrusive, where only the degraded utterance is needed, as shown in Figure 1 for a typical speech coding scenario. Clearly, non-intrusive measurement is suitable for autonomous monitoring of speech quality in a communication network, since only the distorted signal is available at the receiver. The earliest attempt at non-intrusive speech quality evaluation was reported by Liang and Kubichek [2], wherein perceptually based speaker independent parameters like Perceptual Linear Prediction (PLP) coefficients and the perceptually-weighted Bark spectrum are computed. A set of reference centroids is trained using these features extracted from a clean speech database. The Average Minimum Distance (AMD) between the degraded speech features and the reference centroids is used as an indicator of speech quality.

The authors improve upon their previous work by training Hidden Markov Models (HMMs) for clean speech, using a large corpus, and the degraded utterance [3]. A simple distance measure between two HMMs is used to generate a quality score for the degraded speech utterance. Au and Lam [4] present a novel approach based on the assumption that poor quality speech utterances have a very smeared-out spectrogram. The variance and dynamic range of energies are computed for each spectrogram block and these values are averaged over all blocks. A large variance and dynamic range indicate good speech quality. In another method, the likelihood that the speech signal has been produced by a human is computed using the PARCOR or log-area ratio coefficients [5]. If these coefficients exceed the physical limits for a normal human vocal tract, the speech signal is assigned a poor quality score. Werner, Junge and Vary [1] compute the Lp norms of some GSM transmission parameters, e.g. Bit Error Rate (BER), Adaptive Multi-Rate (AMR) mode, and Frame Error Rate (FER), over 480 ms frames. An optimal linear combination, in the Minimum Mean Squared Error (MMSE) sense, of functions of these norms is used to estimate the speech quality.
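To make the construction in [1] concrete, a minimal sketch follows: per-window Lp norms of transmission-parameter traces are fitted to quality scores by ordinary least squares, which is the MMSE-optimal linear combination. The parameter traces, window lengths and scores below are made up for illustration; this is not the authors' implementation.

```python
# Sketch of the idea in [1] (not the authors' code): L_p norms of per-frame
# transmission parameters over 480 ms windows, mapped to quality by a linear
# combination fitted in the MMSE (least-squares) sense. All data is hypothetical.
import numpy as np

def lp_norm(x, p):
    """L_p norm of a parameter trace over one 480 ms window."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
n_windows = 200
features = []
for _ in range(n_windows):
    # Hypothetical per-window traces of BER, FER and AMR mode (20 frames/window).
    ber, fer, mode = rng.random(20), rng.random(20), rng.integers(0, 8, 20)
    features.append([lp_norm(ber, 1), lp_norm(fer, 2), lp_norm(mode, 1)])
X = np.asarray(features)
mos = rng.uniform(1, 5, n_windows)              # hypothetical subjective scores

# MMSE-optimal linear combination = ordinary least squares on [1, X].
A = np.hstack([np.ones((n_windows, 1)), X])
w, *_ = np.linalg.lstsq(A, mos, rcond=None)
mos_hat = A @ w                                 # estimated quality per window
```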

The approach of Falk, Xu and Chan [6] involves training three Gaussian Mixture Models (GMMs) using PLP coefficients on a clean speech database, one each for inactive, voiced and unvoiced speech. The average log-likelihood score of the degraded speech generated by each GMM is used to construct a three-dimensional feature vector. This vector is then mapped onto the MOS scale using Multivariate Adaptive Regression Splines (MARS). This work is extended by training three GMMs based on degraded speech in addition to the ones trained on clean speech [7]. Another feature, namely segmental SNR, is also used to construct a seven-dimensional feature vector for each utterance, which is mapped to the MOS scale using a MARS model. In the Low Complexity Speech Quality Assessment (LCQA) algorithm [8], 11 per-frame features (e.g. spectral centroid, dynamics, flatness, pitch etc.) are computed and their mean, variance, skewness and kurtosis over the entire utterance are used to construct a 44-dimensional feature vector. Dimensionality reduction results in a 14-component global feature vector, which is used to train a joint GMM with the MOS score of the utterance. Given the global feature vector of a degraded utterance, its MOS is estimated using the MMSE criterion. This score is then mapped onto the MOS scale using a 3rd order monotonic polynomial mapping.

The ITU-T Recommendation P.563 [9] is used for non-intrusive quality assessment of narrow-band speech. A total of 51 features are computed from the speech signal. Eight of these features are used to ascertain a distortion class. Depending on the distortion class, an optimal linear combination of certain features out of the set of 51 is used to compute an intermediate quality. The overall speech quality is then computed by taking the linear combination of the intermediate quality with some key parameters.
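The class-then-combine control flow of P.563 can be sketched as follows. The distortion classes, feature indices and weights below are hypothetical placeholders; the actual feature set and coefficients are defined normatively in [9].

```python
# Schematic of the P.563 decision structure described above -- NOT the
# standard's actual features, classes or coefficients, which are normative.
import numpy as np

def p563_style_score(features, class_rules, class_weights):
    """features: length-51 vector; class_rules maps 8 key features to a
    distortion class; class_weights holds per-class (indices, weights, bias)."""
    dist_class = class_rules(features[:8])        # 8 key features pick the class
    idx, w, b = class_weights[dist_class]
    intermediate = features[idx] @ w + b          # class-specific linear comb.
    return intermediate

# Hypothetical illustration with two made-up distortion classes.
rules = lambda key_feats: "noise" if key_feats.mean() > 0.5 else "mutes"
weights = {
    "noise": (np.arange(10), np.full(10, 0.1), 3.0),
    "mutes": (np.arange(10, 25), np.full(15, -0.05), 4.0),
}
print(p563_style_score(np.random.rand(51), rules, weights))
```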

All the existing approaches to non-intrusive speech quality evaluation use vocal tract and speech signal based features, or incorporate only some aspects of the human auditory system. This is not consonant with the human perception of speech, which has the auditory system as the front-end, and the firing pattern of neurons as the features. An approach mimicking this process can give scores which are likely to be better correlated with the subjective speech quality scores as compared to existing approaches. Another important factor is the presence of both long-term (e.g. stationary noise) and short-term (e.g. burst errors) degradations in the speech signal. Thus, feature extraction at multiple scales may be better able to capture the quality than at a single scale.

The scope of this paper is as follows: Section 2 discusses the proposed two-scale auditory feature based non-intrusive speech quality evaluation algorithm. It also contains a brief overview of Lyon’s Cochlear model [10]. Section 3 describes the database, and the experiments and results in the form of a comparison with P.563 and the LCQA algorithms. The conclusions are presented in Section 4.

2. AUDITORY FEATURES BASED ALGORITHM

The ITU-T P.563 [9], LCQA [8] and all other approaches for non-intrusive speech quality evaluation are primarily based on vocal tract and speech signal based features. However, it is important to note that for non-intrusive speech quality evaluation, an approach which accounts for important aspects of human speech perception is desirable. This is the key motivation in exploring auditory features for non-intrusive speech quality evaluation.


Figure 1: Intrusive and non-intrusive speech quality evaluation in a speech coding application.


2.1 Lyon’s Cochlear Model

R. F. Lyon proposed a cochlear model as shown in Figure 2 [10]. Sound entering the outer and middle ear passes through the oval window into the cochlea. In the cochlear duct, the pressure wave travels down the basilar membrane. The stiffness of the basilar membrane varies along its length, so that each point resonates with a pressure wave of a particular frequency. At each stage of the cochlea, vibrations are sensed by the hair cells, which excite neurons, which in turn communicate with the higher levels in the brain.

Hence, the cochlea maps the frequency content of a pressure wave into the spatial domain. The cochlea is sensitive to high frequency sounds near the base, and low frequency sounds are captured as the wave travels down the cochlea. The block diagram in Figure 2 gives a high-level overview of Lyon’s model. The outer ear processing and middle ear processing are combined into a single pre-emphasis filter. The inner ear consists of a filter bank, where the bandwidth of each stage is a function of its center frequency, as shown in Figure 3. This filter bank models the frequency selectivity along the length of the cochlea. After passing through the filter bank, each band-pass signal is passed through a half-wave rectifier to model the detection nonlinearity of the inner hair cells. The output of each channel from the rectification stage passes through four Automatic Gain Control (AGC) stages. The gain of each stage depends on a time constant, the previous output sample, and the previous outputs of the neighboring channels as well. This models masking effects [11,12]. The final output of Lyon’s model is the firing rates/probabilities of neurons in the various frequency bands of the cochlear filter bank.
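A highly simplified sketch of this processing chain follows, assuming 8 kHz input. The filter shapes, channel spacing and AGC law here are illustrative stand-ins, not Lyon's actual filter cascade (for which see Slaney's implementation [18]); in particular, the coupling of each AGC stage to neighboring channels is omitted.

```python
# Schematic pre-emphasis -> band-pass filter bank -> half-wave rectification
# -> cascaded AGC chain, loosely following Figure 2. Illustrative only.
import numpy as np
from scipy.signal import butter, lfilter

def lyon_style_chain(x, fs=8000, n_channels=16, n_agc_stages=4, tau=0.02):
    # Outer and middle ear combined into one first-order pre-emphasis filter.
    x = lfilter([1.0, -0.95], [1.0], x)
    # Log-spaced band-pass filter bank; bandwidth grows with center frequency.
    cfs = np.geomspace(100.0, 0.3 * fs, n_channels)
    alpha = np.exp(-1.0 / (tau * fs))            # AGC smoothing constant
    outputs = np.empty((n_channels, len(x)))
    for c, cf in enumerate(cfs):
        b, a = butter(2, [0.7 * cf, 1.3 * cf], btype="band", fs=fs)
        y = np.maximum(lfilter(b, a, x), 0.0)    # half-wave rectifier (hair cells)
        for _ in range(n_agc_stages):            # four cascaded AGC stages
            env = lfilter([1.0 - alpha], [1.0, -alpha], y)  # envelope tracker
            y = y / (1.0 + env)                  # compressive gain control
        outputs[c] = y                           # ~ neuron firing rates
    return outputs
```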

Lyon’s cochlear model has been used in a variety of applications like robust pitch detection [13], analog electronic auditory models [14], and robust speech recognition. Another similar auditory model is Seneff’s mean rate and synchrony model [15]. The front end of this model has two outputs, namely the mean rate output, which captures the spectral magnitude information, and the synchrony output, which captures the match between the neuron firing at a particular location and the characteristic frequency of that location. In this work, we use Lyon’s cochlear model.

2.2 Single-scale Auditory Feature Based Approach

The core idea behind the proposed auditory feature based approach is to use the neuron firing probabilities from Lyon’s cochlear model for deriving features sensitive to speech quality. This approach is described in the following:

2.2.1 Training Stage

1) The speech signal is passed through a Voice Activity Detector (VAD), which classifies each frame as voiced, unvoiced or inactive (silence). Only voiced frames are used to estimate the quality of the speech signal here, since these frames have the maximum influence on a listener’s perception of quality.

2) For each voiced frame, 64-dimensional auditory feature vectors consisting of neuron firing probabilities in different auditory channels are generated by passing the signal frame through Lyon’s cochlear model [10]. In our implementation, these feature vectors are computed once every 0.5 ms in a 20 ms voiced frame. Let the feature vectors for the $i$th frame be denoted by $\{x_i(n)\}_{n=1}^{N_F}$, where $N_F$ is the number of feature vectors computed per frame.

3) The mean, variance, skewness and kurtosis of each dimension of this 64-dimensional vector are computed across all voiced frames in an utterance. The result of this computation is four 64-dimensional vectors for the utterance. Let the mean, variance, skewness and kurtosis vectors for the $j$th utterance be denoted by $X_M^{(j)}$, $X_V^{(j)}$, $X_S^{(j)}$ and $X_K^{(j)}$ respectively.

Figure 2: Lyon’s cochlear auditory model [10].

Figure 3: Magnitude response of every fifth auditory filter in Lyon’s cochlear auditory model for 8 kHz speech [10].


4) These four row vectors are concatenated to form a single 256-dimensional auditory feature vector for the utterance:

$$X_j = \left[ X_M^{(j)} \; X_V^{(j)} \; X_S^{(j)} \; X_K^{(j)} \right] \tag{1}$$

where $j = 1, \ldots, J$, and $J$ is the total number of training utterances.

5) During the training stage, the $J$ 256-dimensional auditory feature vectors from the training database are used for Principal Component Analysis (PCA). The first 14 principal components are used, which contain approximately 80% of the total energy. Let the resulting 14-dimensional feature vector for the $j$th utterance be denoted by $\Phi_j$.

6) A joint GMM is trained between this 14-dimensional feature vector and the MOS score $\theta_j$ available from a MOS-labeled speech database [16], using the Expectation Maximization (EM) algorithm [17]:

$$\Lambda\left( \{\mu^{(m)}\}_{m=1}^{M}, \{\omega^{(m)}\}_{m=1}^{M}, \{\Sigma^{(m)}\}_{m=1}^{M} \right) = \mathrm{EM}\left( \{[\Phi_j \; \theta_j]\}_{j=1}^{J}, M \right) \tag{2}$$

where $\Lambda$ is the joint GMM with $M$ mixture components; $\mu^{(m)}$, $\omega^{(m)}$ and $\Sigma^{(m)}$ are the mean, mixture weight and covariance matrix respectively of the $m$th mixture component; and $[\Phi_j \; \theta_j]$ is the 15-dimensional vector for the $j$th training utterance, consisting of the 14-dimensional auditory feature vector $\Phi_j$ and the MOS score $\theta_j$. $J$ signifies the total number of utterances used for training the joint GMM.
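A compact sketch of steps 3-6 follows, using sklearn's PCA and EM-based GaussianMixture as stand-ins for the paper's PCA and joint-GMM training. It assumes frame_feats[j] holds the (voiced frames x 64) firing-probability matrix of utterance j from the cochlear front end, and mos[j] its MOS label; these variable names are ours, not the paper's.

```python
# Sketch of training steps 3-6 under the stated assumptions.
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def utterance_vector(F):
    """Steps 3-4: four moment vectors over voiced frames -> 256-dim vector."""
    return np.concatenate([F.mean(0), F.var(0), skew(F, 0), kurtosis(F, 0)])

def train_joint_gmm(frame_feats, mos, n_pc=14, n_mix=8):
    X = np.vstack([utterance_vector(F) for F in frame_feats])   # (J, 256)
    pca = PCA(n_components=n_pc).fit(X)       # step 5: top 14 principal comps.
    Phi = pca.transform(X)                    # (J, 14) feature vectors
    joint = np.hstack([Phi, np.asarray(mos)[:, None]])          # (J, 15)
    gmm = GaussianMixture(n_mix, covariance_type="full").fit(joint)  # step 6: EM
    return pca, gmm
```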

2.2.2 Test Stage

1) The test utterance is processed in exactly the same way as in the training stage, to obtain a 14-dimensional auditory feature vector $\Phi$.

2) Given the feature vector $\Phi$ and the trained joint GMM $\Lambda\left( \{\mu^{(m)}\}_{m=1}^{M}, \{\omega^{(m)}\}_{m=1}^{M}, \{\Sigma^{(m)}\}_{m=1}^{M} \right)$, the estimate $\hat{\theta}$ of the MOS is obtained using the MMSE criterion as follows:

$$\hat{\theta}(\Phi) = \arg\min_{\hat{\theta}(\Phi)} E\left\{ \left(\theta - \hat{\theta}(\Phi)\right)^2 \right\} = E\{\theta \mid \Phi\} \tag{3}$$

$$E\{\theta \mid \Phi\} = \sum_{m=1}^{M} y_m(\Phi)\, \mu_{\theta|\Phi}^{(m)} \tag{4}$$

where

$$y_m(\Phi) = \frac{\omega^{(m)}\, \mathcal{N}\!\left(\Phi; \mu_{\Phi}^{(m)}, \Sigma_{\Phi\Phi}^{(m)}\right)}{\sum_{k=1}^{M} \omega^{(k)}\, \mathcal{N}\!\left(\Phi; \mu_{\Phi}^{(k)}, \Sigma_{\Phi\Phi}^{(k)}\right)} \tag{5}$$

and

$$\mu_{\theta|\Phi}^{(m)} = \mu_{\theta}^{(m)} + \Sigma_{\theta\Phi}^{(m)} \left( \Sigma_{\Phi\Phi}^{(m)} \right)^{-1} \left( \Phi - \mu_{\Phi}^{(m)} \right) \tag{6}$$

where $\mu_{\theta}^{(m)}$, $\mu_{\Phi}^{(m)}$, $\Sigma_{\Phi\Phi}^{(m)}$ and $\Sigma_{\theta\Phi}^{(m)}$ are the means, covariance and cross-covariance matrices of $\Phi$ and $\theta$ for the $m$th mixture component. $\mathcal{N}(\cdot)$ represents the Gaussian probability density function, and $E\{\cdot\}$ is the expectation operator.

3) This objective score is mapped onto the MOS scale using a 3rd order monotonic polynomial mapping.
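Under the same assumptions, Eqs. (3)-(6) reduce to a Gaussian conditional-mean computation. A sketch using the sklearn joint GMM trained above (where the 15th dimension is the MOS) might look like this:

```python
# Sketch of the test-stage MMSE estimate, Eqs. (3)-(6), assuming the sklearn
# GMM parameterization from the training sketch (last dimension = MOS).
import numpy as np
from scipy.stats import multivariate_normal

def mmse_mos(gmm, phi):
    """phi: 14-dim test feature vector; returns E{theta | phi} per Eq. (4)."""
    d = phi.size
    mu_f = gmm.means_[:, :d]                  # per-mixture mu_Phi
    mu_t = gmm.means_[:, d]                   # per-mixture mu_theta
    S_ff = gmm.covariances_[:, :d, :d]        # Sigma_PhiPhi blocks
    S_tf = gmm.covariances_[:, d, :d]         # Sigma_thetaPhi blocks
    # Eq. (5): posterior mixture responsibilities y_m(phi).
    lik = np.array([multivariate_normal.pdf(phi, mu_f[m], S_ff[m])
                    for m in range(gmm.n_components)])
    y = gmm.weights_ * lik
    y /= y.sum()
    # Eq. (6): per-mixture conditional means mu_{theta|Phi}^{(m)}.
    cond = np.array([mu_t[m] + S_tf[m] @ np.linalg.solve(S_ff[m], phi - mu_f[m])
                     for m in range(gmm.n_components)])
    return float(y @ cond)                    # Eq. (4): weighted average
```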

2.3 Two-scale Auditory Feature Based Approach

In the simple auditory feature-based approach given above, the statistics (i.e. mean, variance, skewness and kurtosis) of the 64-dimensional auditory feature vector are computed over all voiced frames. This approach may have a limitation. Short duration/transient distortions, e.g. frame erasures, noise spikes etc., are as important to speech quality as longer duration ones, e.g. additive Gaussian noise. By averaging features over the entire utterance, the effect of any transient distortion gets buried. Hence, both global and local statistics should be incorporated while estimating the quality of a speech utterance. This is the basic idea behind the proposed two-scale auditory feature based approach.

Figure 4a and b show the two stages of this approach. In the first stage [Figure 4a], the steps proceed as in the auditory feature-based approach explained in the previous sub-section. The difference is that the statistics are computed at two scales instead of just one. These are the global scale, spanning all voiced frames in the utterance, and the local scale, spanning the N contiguous voiced segments. Hence, two joint GMMs with MOS are trained, namely $\Lambda_G$ for the global scale features, and $\Lambda_C$ for the local scale features. The training for $\Lambda_G$ proceeds in exactly the same way as in the single-scale approach. However, in the case of the local scale, N 14-dimensional feature vectors are obtained for each utterance, where N is the number of contiguous voiced segments. Each of these N feature vectors is associated with the MOS score of the utterance while training the joint GMM $\Lambda_C$.

Figure 4: The two-scale auditory feature-based approach: stages 1 (a) and 2 (b).

In the second stage [Figure 4b], two estimates of the MOS, using the GMMs $\Lambda_G$ and $\Lambda_C$, are found. These two estimates need to be combined to get a final score. It was observed that the MOS score does not always lie between these two estimates. Hence, a weighted average would not work well in this scenario. So, a joint GMM $\Lambda$ of these two scores with the MOS was trained. During the test stage, the GMMs $\Lambda_G$ and $\Lambda_C$ are used to estimate two scores. These two scores are combined using the third GMM $\Lambda$, and this final estimate is mapped onto the MOS scale using a 3rd order monotonic polynomial mapping. The next section discusses the implementation details, the experiments designed and performed, and the results obtained.
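Sketching this second stage concretely requires one detail the text leaves open: how the N per-segment estimates from $\Lambda_C$ are reduced to a single local score. The sketch below simply averages them, which is our assumption, and reuses the mmse_mos() estimator defined earlier; gmm_fuse denotes the third joint GMM $\Lambda$.

```python
# Sketch of the second-stage score fusion. ASSUMPTION: the per-segment MMSE
# estimates from the local GMM are averaged into one local score; the paper
# does not spell out this reduction. mmse_mos() is the estimator sketched above.
import numpy as np

def two_scale_mos(gmm_G, gmm_C, gmm_fuse, phi_G, phi_C_list):
    mos_G = mmse_mos(gmm_G, phi_G)                  # global-scale estimate
    mos_C = np.mean([mmse_mos(gmm_C, p) for p in phi_C_list])  # local estimate
    # Score-level fusion: joint GMM of [MOS_G, MOS_C] with MOS, then MMSE.
    return mmse_mos(gmm_fuse, np.array([mos_G, mos_C]))
```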

3. EXPERIMENTS AND RESULTS

We begin with a description of the speech database, followed by the experiments and results.

3.1 ITU-T P. Supplement 23 Database

Supplement 23 to the P series of ITU-T Recommendations [16] is a database of coded and source speech material used in the ITU-T 8 kbit/s codec (Recommendation G.729) characterization tests. The purpose of this database is to provide source, pre-processed and processed speech material, and related subjective test plans and scores, for the development of new and revised ITU Recommendations relating to objective voice quality measures.

The database is an ACR/DCR labeled database. Experiments 1 and 3 are ACR-labeled whereas Experiment 2 is DCR-labeled. However, since this work involves evaluation of listening quality of a speech utterance, only experiments 1 and 3 were used. All speech files are recorded in 16-bit linear PCM with a low-byte-first format at a sampling rate of 22.05 kHz. Experiment 1 is divided into three sub-experiments (A, D and O) and experiment 3 into four sub-experiments (A, C, D, O). These sub-experiments correspond to the various laboratories where the recording and subjective quality evaluation of speech utterances was done. The total number of utterances in the two experiments is 1326. Three versions of each utterance are available, namely original (clean speech), pre-processed (clean speech with IRS filtering) and coded (speech subjected to various classes of coder/channel degradations).

The following pre-processing is performed on the ITU-T P. Supp 23 [16] database for the experiments conducted in this work:

1) All the speech utterances are down-sampled to 8 kHz.

2) The opinion scores provide the ACR scores of the speech utterances. Each utterance has 24 ACR scores given by 24 different subjects. The MOS score for the $i$th utterance is computed as the average of the 24 ACR scores.

3) The previous step results in one MOS score for each utterance. However, for non-intrusive speech quality evaluation, per-condition MOS scores are needed [9], since the distortion condition is rated for quality, and not the specific utterance. The condition-averaged MOS is computed by averaging the MOS scores over all the four files for a given distortion condition.
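A small sketch of steps 2 and 3 follows, with a hypothetical data layout: acr maps each utterance id to its 24 ACR scores, and condition_of maps each utterance to its distortion condition. These names are ours, not the database's.

```python
# Per-utterance MOS (step 2), then condition-averaged MOS (step 3).
import numpy as np
from collections import defaultdict

def condition_averaged_mos(acr, condition_of):
    """acr: {utt_id: list of 24 ACR scores}; condition_of: {utt_id: condition}."""
    per_utt = {u: float(np.mean(s)) for u, s in acr.items()}    # step 2
    per_cond = defaultdict(list)
    for u, m in per_utt.items():
        per_cond[condition_of[u]].append(m)
    # Average over the four files of each distortion condition (step 3).
    return {c: float(np.mean(v)) for c, v in per_cond.items()}
```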

3.2 Correlation Analysis of Auditory Features

As a preliminary investigation into the efficacy of auditory features, the correlation coefficient between the features and the MOS score was computed. Each utterance was divided into 20 ms frames, overlapping by 10 ms. A MATLAB implementation of Lyon's cochlear model [18] was used for computing the 64-dimensional auditory feature vector once every 0.5 ms for each voiced frame. The mean, variance, skewness and kurtosis of these features were computed over the entire utterance, resulting in four 64-dimensional global feature vectors. The correlation coefficient of each dimension of these global feature vectors was computed with the MOS score. Figure 5 shows the center frequency of Lyon's auditory filters as a function of the filter index, while Figure 6a-d show the absolute values of the correlation coefficients as a function of the center frequency of the auditory filter.
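The per-dimension correlation computation is straightforward; a sketch follows, assuming for each moment a (J x 64) matrix of global features and a length-J MOS vector (variable names are ours):

```python
# Absolute Pearson correlation of each of the 64 feature dimensions with MOS.
import numpy as np

def per_dimension_correlation(global_feats, mos):
    """global_feats: (J, 64) matrix for one moment; mos: length-J vector."""
    f = global_feats - global_feats.mean(0)       # center each dimension
    m = mos - mos.mean()                          # center the MOS scores
    r = (f * m[:, None]).sum(0) / (
        np.sqrt((f ** 2).sum(0)) * np.sqrt((m ** 2).sum()))
    return np.abs(r)                              # 64 absolute correlations
```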

The following observations can be made from these plots:

1) The mean of the auditory features has the maximum average absolute correlation coefficient (0.148), followed by variance (0.124), skewness (0.109) and kurtosis (0.060). This highlights the importance of the lower order moments, i.e. mean and variance, of the auditory features as compared to the higher order ones, i.e. skewness and kurtosis.

2) Features from some frequency bands show a systematically higher correlation with MOS than the others. These include frequencies around 150, 600 and 1500 Hz. One possible explanation for this is that these frequencies roughly correspond to the fundamental frequency and the dominant formant frequencies of voiced speech.

Figure 5: The center frequency of Lyon’s auditory filters as a function of stage index for 8 kHz speech.

3.3 Division of the Database into Test and Training Sets

The ITU-T P.Supp 23 database contains 1326 MOS-labeled utterances. A simple method, along the lines of a leave-one-out test, is adopted to evaluate the performance of the algorithms.

1) Each distortion condition has four utterances. Of the total 330 distortion conditions, 55 are randomly selected for the test set. The remaining 275 conditions are used for training.

2) Step 1 is repeated two more times to get three randomly selected training and test sets. Each training set contains files corresponding to 275 distortion conditions, while the test sets contain files corresponding to 55 distortion conditions.

3) The performance of the non-intrusive speech quality evaluation algorithms is averaged across these three randomly selected test sets.
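A sketch of this split procedure follows (condition ids are represented here as hypothetical integers 0-329):

```python
# Three random splits of the 330 distortion conditions: 55 test, 275 train.
import numpy as np

def make_splits(conditions, n_test=55, n_repeats=3, seed=0):
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_repeats):
        test = set(rng.choice(conditions, size=n_test, replace=False))
        train = [c for c in conditions if c not in test]  # files follow conditions
        splits.append((train, sorted(test)))
    return splits

splits = make_splits(np.arange(330))
```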

3.4 Performance of the Auditory Feature Based Approaches and Comparisons

An ITU-T standard ANSI C implementation of P.563 was used for the performance evaluation. The Pearson correlation coefficient, $\rho$, between the estimated scores and the subjective MOS scores was computed after a 3rd order monotonic polynomial mapping:

$$\rho = \frac{\sum_{i=1}^{N} (Q_i - \bar{Q})(\hat{Q}_i - \bar{\hat{Q}})}{\sqrt{\sum_{i=1}^{N} (Q_i - \bar{Q})^2 \; \sum_{i=1}^{N} (\hat{Q}_i - \bar{\hat{Q}})^2}} \tag{7}$$

where $N$ is the number of test conditions, $Q_i$ is the condition-averaged MOS and $\hat{Q}_i$ is its estimate. The correlation coefficients are averaged across the three randomly selected test sets, and their variance is computed.

Figure 6: The absolute correlation coefficient of (a) mean, (b) variance, (c) skewness, and (d) kurtosis, of neuron firing probabilities with MOS.
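A sketch of this evaluation step: the objective scores are regressed onto the condition MOS with a 3rd order polynomial, and $\rho$ of Eq. (7) is then computed. Note that, unlike the paper's mapping, monotonicity of the fitted polynomial is not enforced in this sketch.

```python
# 3rd order polynomial mapping followed by the Pearson correlation of Eq. (7).
import numpy as np

def evaluate(q_hat, q):
    """q_hat: objective scores; q: condition-averaged MOS (same length)."""
    coeffs = np.polyfit(q_hat, q, deg=3)      # 3rd order polynomial regression
    q_map = np.polyval(coeffs, q_hat)         # mapped objective scores
    rho = np.corrcoef(q_map, q)[0, 1]         # Eq. (7)
    return rho
```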

For the performance evaluation of LCQA, an 8 mixture component GMM with a full covariance matrix was trained. A simple PCA based dimensionality reduction was adopted in our implementation of LCQA, to ensure a fair comparison between the LCQA features and the proposed auditory features. Each randomly selected training set of 1100 files was used for training the joint GMM. For the auditory feature based approach, an 8 mixture component joint GMM of the 14-dimensional auditory feature vector and the MOS is trained using the EM algorithm [17]. For the two-scale auditory feature-based approach, two 8 mixture component GMMs were trained for the global auditory features and the contiguous voiced frame auditory features. An 8 mixture component GMM was also trained for mapping the two objective scores (corresponding to the global and contiguous voiced frame auditory features) to the MOS scale. Table 1 shows the average results for the four algorithms using condition-averaged MOS. Figure 7 pictorially shows the average performance results for the four algorithms.

The two-scale auditory feature-based algorithm is seen to perform as well as the ITU standard P.563. Moreover, it improves upon the performance of LCQA. The following conclusions can be drawn from the results presented in Table 1.

1) Auditory features are as important in representing speech quality as vocal tract based features, such as the ones used in LCQA and P.563.

2) Analysis of the speech signal at multiple scales leads to an improvement in the quality estimate.

This is evident from Table 1, where the performance is better for the two-scale auditory feature-based approach as compared to the single-scale one.

4. CONCLUSION

This paper presents a novel two-scale auditory feature based non-intrusive speech quality evaluation algorithm, which performs as well as ITU-T P.563. This is achieved because of two factors.

Firstly, the use of neuron firing probabilities (along the length of the basilar membrane) to extract quality-sensitive auditory features mimics the human auditory mechanism. This naturally leads to better evaluation of speech quality as compared to methods based on features from the vocal tract or the speech signal. Secondly, the use of multiple scales instead of one for computing these auditory features is able to take into account distortions at different temporal spans, i.e. both short-term and long-term.

Figure 7: Comparison of the performance of P.563, LCQA, the simple auditory feature based approach and the two-scale auditory feature based approach for condition-averaged MOS.

Table 1: Correlation coefficients of scores obtained from P.563, LCQA, simple auditory feature based and two-scale auditory feature based algorithms with the condition-averaged MOS.

Experiment   P.563    LCQA     Single-scale         Two-scale
                               auditory features    auditory features
1(A)         0.9058   0.7723   0.9262               0.9622
1(D)         0.8746   0.8723   0.9871               0.9720
1(O)         0.9212   0.9300   0.9103               0.9355
3(A)         0.9522   0.9404   0.8854               0.8849
3(C)         0.9666   0.9431   0.9410               0.9524
3(D)         0.9832   0.8674   0.9256               0.9565
3(O)         0.9434   0.9110   0.9236               0.9428
Average      0.9353   0.8909   0.9285               0.9438
Std. dev.    0.0570   0.0783   0.0547               0.0487


REFERENCES

1. M Werner, T Junge, and P Vary, Quality control for AMR speech channels in GSM networks, Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Vol. 3, pp. 1076-9, 2004.

2. J Liang, and R Kubichek, Output-based objective speech quality, Proc. IEEE Vehicular Technology Conference, Vol. 3, pp. 1719-23, 1994.

3. W Li, and R Kubichek, Output-based objective speech quality measurement using continuous Hidden Markov Models, Proc. IEEE Intl. Symposium on Signal Processing and its Applications, Vol. 1, pp. 389-92, 2003.

4. O Au, and K Lam, A novel output-based objective speech quality measure for wireless communications, Proc. Intl. Conference on Signal Processing, Vol. 1, pp. 666-9, 1998.

5. P Gray, M Hollier, and R Massara, Non-intrusive speech-quality assessment using vocal-tract models, Proc. Inst. Elect. Eng. Vision, Image and Signal Processing, Vol. 147, pp. 493-501, 2000.

6. T Falk, Q Xu, and W Y Chan, Non-intrusive GMM-based speech quality measurement, Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 125-8, 2005.

7. T Falk, and W Y Chan, Enhanced non-intrusive speech quality measurement using degradation models, Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 837-40, 2006.

8. V Grancharov, D Y Zhao, J Lindblom, and W B Kleijn, Low-complexity, nonintrusive speech quality assessment, IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 6, pp. 1948-56, Nov. 2006.

9. ITU-T Rec. P.563, Single-ended method for objective speech quality assessment in narrow-band telephony applications, 2004.

10. R F Lyon, A computational model of filtering, detection, and compression in the cochlea, Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, pp. 1282-5, 1982.

11. L R Rabiner, and B H Juang, Fundamentals of speech recognition, Pearson Education, 2003.

12. E Zwicker, and H Fastl, Psycho-acoustics: Facts and models, 2nd edn, Springer-Verlag, 1999.

13. M Slaney, and R F Lyon, A perceptual pitch detector, Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 357-60, 1990.

14. R F Lyon, and C Mead, An analog electronic cochlea, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 36, pp. 1119-33, 1988.

15. S Seneff, A computational model for the peripheral auditory system: Application to speech recognition research, Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Apr. 1986.

16. ITU-T Rec. P series Supplement 23, ITU-T coded-speech database, 1998.

17. A P Dempster, N M Laird, and D B Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statistical Society, Ser. B, Vol. 39, No. 1, pp. 1-38, 1977.

18. M Slaney, Lyon’s cochlear model, Advanced Technology Group, Apple Technical Report No. 13, Apple Computer Inc., 1988.

Audhkhasi K and Kumar A: Auditory Feature-based Speech Quality Evaluation

AUTHORS

Kartik Audhkhasi received his B.Tech. in Electrical Engineering and M.Tech. in Information and Communication Technology from the Indian Institute of Technology, Delhi (IITD) in 2008. At present, he is a Ph.D. student at the Signal Analysis and Interpretation Laboratory (SAIL) within the Electrical Engineering Department at the University of Southern California (USC). He is broadly interested in signal processing and machine learning, with an emphasis on speech processing, recognition and human language technologies.

E-mail: [email protected]

DOI: 10.4103/0377-2063.63087; Paper No JR 449_09; Copyright © 2010 by the IETE

Arun Kumar received the B.Tech, M.Tech and PhD degrees in Electrical Engineering from the Indian Institute of Technology (IIT), Kanpur. He was a Visiting Researcher at the University of California, Santa Barbara, from 1994 to 1996. Since 1997, he has been with the Centre for Applied Research in Electronics (CARE), IIT Delhi, where he is currently working as a Professor. His research interests span the areas of digital signal processing, underwater acoustics, communications, and voice technologies for man-machine interaction. In these areas, he has introduced new courses at the Masters level, and has supervised several Masters and PhD theses at IIT Delhi. He has also supervised over 25 funded research and development projects from Indian and foreign industries, as well as various government organizations. He has received the Young Scientist award of the International Union of Radio Science (URSI) in the Netherlands.

E-mail: [email protected]