
A Rohde & Schwarz Company

Voice Quality with ITU-T P.863 ‘POLQA’ Application Note

July 2012

SwissQual AG Allmendweg 8 CH-4528 Zuchwil Switzerland

t +41 32 686 65 65 f +41 32 686 65 66 e [email protected] www.swissqual.com

Part Number: 12-070-200912-4

SwissQual has made every effort to ensure that eventual instructions contained in the document are adequate and free of errors and omissions. SwissQual will, if necessary, explain issues which may not be covered by the documents. SwissQual’s liability for any errors in the documents is limited to the correction of errors and the aforementioned advisory services.

Copyright 2000 - 2012 SwissQual AG. All rights reserved.

No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system, or translated into any human or computer language without the prior written permission of SwissQual AG.

Confidential materials.

All information in this document is regarded as commercial valuable, protected and privileged intellectual property, and is provided under the terms of existing Non-Disclosure Agreements or as commercial-in-confidence material.

When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo trademark somewhere in your text.

SwissQual®, Seven.Five®, SQuad®, QualiPoc®, NetQual®, VQuad®, Diversity® as well as the following logos are registered trademarks of SwissQual AG.

Diversity Explorer™, Diversity Ranger™, Diversity Unattended™, NiNA+™, NiNA™, NQAgent™, NQComm™, NQDI™, NQTM™, NQView™, NQWeb™, QPControl™, QPView™, QualiPoc Freerider™, QualiPoc iQ™, QualiPoc Mobile™, QualiPoc Static™, QualiWatch-M™, QualiWatch-S™, SystemInspector™, TestManager™, VMon™, VQuad-HD™ are trademarks of SwissQual AG.

SwissQual acknowledges the following trademarks for company names and products:

Adobe®, Adobe Acrobat®, and Adobe Postscript® are trademarks of Adobe Systems Incorporated.

Apple is a trademark of Apple Computer, Inc.

DIMENSION®, LATITUDE®, and OPTIPLEX® are registered trademarks of Dell Inc.

ELEKTROBIT® is a registered trademark of Elektrobit Group Plc.

Google® is a registered trademark of Google Inc.

Intel®, Intel Itanium®, Intel Pentium®, and Intel Xeon™ are trademarks or registered trademarks of Intel Corporation.

INTERNET EXPLORER®, SMARTPHONE®, TABLET® are registered trademarks of Microsoft Corporation.

Java™ is a U.S. trademark of Sun Microsystems, Inc.

Linux® is a registered trademark of Linus Torvalds.

Microsoft®, Microsoft Windows®, Microsoft Windows NT®, and Windows Vista® are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

NOKIA® is a registered trademark of Nokia Corporation.

Oracle® is a registered US trademark of Oracle Corporation, Redwood City, California.

SAMSUNG® is a registered trademark of Samsung Corporation.

SIERRA WIRELESS® is a registered trademark of Sierra Wireless, Inc.

TRIMBLE® is a registered trademark of Trimble Navigation Limited.

U-BLOX® is a registered trademark of u-blox Holding AG.

UNIX® is a registered trademark of The Open Group.


Contents

Voice Quality with ITU-T P.863 ‘POLQA’

1 Introduction – What is P.863 ‘POLQA’?

More complex telecommunication networks and handsets

Demand for wideband audio transmission

Narrowband vs. super-wideband tests

2 Technical Details of POLQA

POLQA as a model of a subjective listening test

POLQA in narrow-band and super-wideband mode

POLQA internal processing steps

Time Alignment

Psycho-acoustic model

POLQA prediction performance and typical results

3 Narrow-band Voice Quality measurements with P.863 ‘POLQA’ in Diversity

Idea of the narrowband test

Speech reference signals for narrow-band tests

What are the differences to the previous ITU P.862 ‘PESQ’?

Test definition and result presentation

4 Wideband Voice Quality measurements with P.863 ‘POLQA’ in Diversity

Idea of the wideband test

Wideband speech reference signals

What are the differences to narrowband?

Where wideband quality can be assessed

Wideband analysis in Diversity

5 Real field measurements

Results in GSM / UMTS networks compared to P.862 ‘PESQ’

Results in real field networks compared to P.862 ‘PESQ’

Sample dependency of P.863 ‘POLQA’ scores in real field measurements

Results in real field networks in super-wideband mode of P.863 ‘POLQA’

6 Conclusion




Figures

Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks

Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model

Figure 3: Basic scheme of the main components of P.863 ‘POLQA’

Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts

Figure 5: Illustration of assigned signal parts and the optimal ‘path’ of signal correspondences

Figure 6: Example of an aligned pair of reference and degraded signal

Figure 7: Block-scheme of POLQA as in ITU-T P.863

Figure 8: Application of masking slopes to the Bark spectrum

Figure 9: Consideration of fully and partially masked spectral parts

Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking

Figure 11: Insertion and capturing in a speech test setup

Figure 12: IRS in send and receive direction as specified in ITU-T P.48

Figure 13: P.863 ‘POLQA’ narrowband main result representation in NQDI

Figure 14: P.863 ‘POLQA’ narrowband detail result representation in NQDI

Figure 15: P.863 ‘POLQA’ test selection in NQDI

Figure 16: P.863 ‘POLQA’ statistical report in MS EXCEL

Figure 17: P.863 ‘POLQA’ wideband main result representation in NQDI

Figure 18: P.863 ‘POLQA’ wideband audio bandwidth representation in NQDI

Figure 19: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI

Figure 20: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI with signal interruptions

Figure 21: Distribution of predicted MOS scores by P.862.1 ‘PESQ’

Figure 22: Distribution of predicted MOS scores by P.863 ‘POLQA’

Figure 23: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode and NB mode

Figure 24: Distribution of predicted MOS scores by P.863 ‘POLQA’ SWB mode in wideband networks

Tables

Table 1: Improvement in performance of P.863 ‘POLQA’ to P.862 ‘PESQ’

Table 2: Typical predicted MOS-LQ values for common transmission techniques

Table 3: Typical P.863 ‘POLQA’ scores for common transmission techniques

Table 4: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in high qualitative UMTS/GSM setups

Table 5: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in common real field setups

Table 6: Comparison of different speech samples in common real field setups

Table 7: Comparison of the NB and SWB mode of P.863 ‘POLQA’ in common real field setups




1 Introduction – What is P.863 ‘POLQA’?

SwissQual has been driving the development of new objective perceptual quality prediction algorithms since it was founded in 2000. Immediately after its foundation, SwissQual developed its voice quality predictor SQuad specifically to meet the requirements of mobile and Voice-over-IP scenarios. SQuad still forms the backbone of SwissQual’s entire voice quality suite. From the beginning, it overcame disadvantages of ITU-T P.862 ‘PESQ’ in these application areas. To keep up with the latest advancements in network and processing technologies, SQuad has been continuously maintained and improved over the years to deliver precise quality scores to the customer.

In 2005, ITU-T started a project to standardize a new objective voice quality model. This project, called P.OLQA, was intended to extend the scope of the existing ITU-T P.862 ‘PESQ’ and to overcome its disadvantages and known problems.

The P.OLQA project was finalized in 2010 with a competition between six candidate models, including the latest SQuad algorithm. In a detailed analysis based on more than 45’000 speech files, the SQuad algorithm was selected as one of the winning models that passed the challenging thresholds set by ITU-T.

Together with the two other selected models from Opticom and TNO, SQuad was integrated into a Joint Model ‘POLQA’ that combines the strengths of the three underlying algorithms and now forms the new ITU-T P.863 ‘POLQA’ approved in January 2011.

SwissQual is one of the most active drivers for the development of objective measures in international standardization bodies. As a consequence SwissQual leads the corresponding working group at ITU-T and both initiated and set several standards over the last years such as ITU-T P.563 (a no-reference voice quality measure), ITU-T J.341 (a full-reference measure for HDTV), ETSI TR 102 506 (a method for an ‘Estimation of Quality per Call’) and now the brand-new ITU-T P.863 ‘POLQA’.

POLQA is becoming an integral and central part of SwissQual’s voice quality analysis suite and will be the recommended voice quality predictor for both narrowband and wideband speech.

The existing and widely introduced SQuad algorithm remains a part of Diversity and can still be used if desired. SQuad can still be combined with the previous ITU-T P.862 ‘PESQ’. This gives all customers the possibility to continue their ongoing measurement campaigns and to plan a transition to ITU-T P.863 ‘POLQA’ on their own schedule.

More complex telecommunication networks and handsets

Telecommunication networks are increasingly equipped with highly non-linear components, and long-distance calls usually pass through several such components and even through different networks.

Today, speech quality is no longer determined by the speech codec used or by lost frames alone. Other components interact as well: they perform automatic signal level adjustment, smart loss concealment and similar strategies in order to increase intelligibility in critical situations. Unfortunately, such components are not used just once in a connection; usually there are several of them, and they can interfere with each other.

In addition to that, we also saw some progress in the standardization of speech codecs, with recently standardized coding schemes now being integrated in the networks. These new schemes, such as EVRC and EVRC-B used in CDMA networks as well as wideband coding schemes such as AMR-WB and EVRC-WB were considered from the very beginning of the development of the latest SQuad version and are now covered by POLQA as well. In addition to traditional schemes for voice source coding, audio compression methods (e.g. MP3, AAC, …) are increasingly being used in telecommunication services as well.

Besides more complex handsets for traditional telephony applications, new packet-based transmission technologies and multi-stage connections are becoming widespread. This leads to many new distortion types that did not appear in traditional quasi circuit-switched networks. We will see an increasing amount of time-warping and asynchronous re-sampling effects as well as non-exact replacements of missed packets or speech frames. All of this considerably changes the physical signal without necessarily affecting its perceived quality. Rating these types of signal distortion correctly is a clear shortcoming of PESQ that POLQA now resolves.

Demand for wideband audio transmission

The telecom industry is now initiating the evolution from narrowband telephony to wideband speech transmission. The codecs for wideband are ready and approved by the standardization bodies, the handsets are not restricted in processing power, and the core networks are being upgraded.

Of course, narrowband speech is the normal experience for telephone users and has been accepted for decades. As the mobile telephone becomes an increasingly multi-media based device, the traditional telephone “sound” seems less and less acceptable. The expectation of the consumer is changing while, at the same time, the increased processing power allows wider audio bandwidths. The standardization bodies provide the corresponding coding schemes and the core networks are being upgraded. The first step, wideband transmission up to 7’000Hz, is already being overtaken by the emergence of so-called super-wideband transmission technology, which opens the band up to 14’000Hz, and of course by the family of audio codecs including MP3 and AAC, which even allow transmission above the hearing threshold.

SwissQual upgraded the entire audio processing chain in its products to support wideband audio transmission in 2009, and ‘wideband’ test applications using the SQuad algorithm have been available in SwissQual products since early 2010.

SwissQual’s Diversity platform was the first measurement platform for mobile network testing and benchmarking to integrate a real-time test of wideband speech. Therefore, SwissQual has been able to gain the most experience with wideband testing in the field and can use this superior know-how for a smooth and transparent integration of P.863 ‘POLQA’ in their products.

Narrowband vs. super-wideband tests

P.863 ‘POLQA’ offers the same two operational modes supported by SQuad: one for traditional narrowband telephony and one for super-wideband (which translates to no audio bandwidth limitation in practice).

Choosing the operational mode is less a matter of how the channel is set up. The question is rather to which kind of reference the received signal shall be compared and what is inserted into the channel.

For narrowband tests, the reference signal is a perfect signal in telephony bandwidth (in principle 300 – 3400Hz). This signal is inserted into the channel as well. The resulting MOS prediction gives a quality that is relative to the narrowband signal. This means that a perfectly transmitted signal will be scored close to MOS = 5 (practically 4.5). A potential wideband or super-wideband capability of the channel is not tested, since the input signal is just narrowband.

This narrowband test case has been well defined for years; it is compatible with existing measures such as P.862 ‘PESQ’ and SQuad in narrowband mode. The entire quality scale is used for these narrowband conditions.

In super-wideband mode, the reference signal is a perfect signal in almost full bandwidth (50 – 14’000Hz). This full-bandwidth signal is also inserted into the channel. The resulting MOS prediction gives a quality that is relative to the super-wideband signal. This means that a perfectly transmitted super-wideband(!) signal will be scored close to MOS = 5 (practically 4.75). If the actual channel is ‘just’ narrowband, the channel limits the audio bandwidth of the super-wideband signal, and the score relative to the super-wideband reference is correspondingly lower.

This super-wideband mode should be used in all scenarios where super-wideband or wideband systems are to be compared against each other or qualified against a narrowband network.




2 Technical Details of POLQA

POLQA as a model of a subjective listening test

ITU-T P.863 ‘POLQA’ is a so-called full-reference model. The quality estimation is based on the comparison of the transmitted signal with the high quality reference signal.

[Figure 1 diagram. Block labels: high quality speech signal; transmission channel; copy of high quality speech signal; full reference measurement.]

Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks

POLQA follows the approach common to other measures such as SQuad: it compares the received and potentially degraded signal with an undistorted reference signal. This allows a very detailed and fine analysis of any kind of difference between the two signals. To consider human perception, a model of the listening device (i.e. a handset or a headphone) is applied first. In this way, the analysis uses the signal exactly as it would be heard through such a device.

Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model

The more important step, however, is the application of a psycho-acoustic model that transforms the signal into an internal sound representation under consideration of frequency and intensity warping and masking effects. In this internal sound representation plane, the differences between the degraded and reference speech are calculated. These differences correspond to what would be perceptible in a direct subjective comparison. Since speech perception and recognition is more than just listening to sound stimuli, a cognitive model is the last step of the quality prediction. Here individual distortions are weighted according to speech perception. For example, in the case of a human voice, a listener is more tolerant of certain distortion types as long as they can be considered ‘natural’, even if they differ significantly from the reference signal.

POLQA predicts the voice quality as it is perceived in an ITU-T P.800 subjective listening-only test (LOT). These tests are the most widely used listening tests in telecommunications. A listener scores the quality of a presented voice sample on a so-called Absolute Category Rating (ACR) scale from 1 (bad) to 5 (excellent). The listener does not compare the signal directly to a reference but to an internal reference, i.e. his or her expectation of ‘how it should sound if it were perfect’.
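As a minimal sketch of how ACR votes turn into the mean opinion score (MOS) that POLQA tries to predict, the following snippet averages the votes of several listeners for one sample. The votes are invented for illustration, and `mean_opinion_score` is a hypothetical helper name, not part of any SwissQual product or of the recommendation.

```python
from statistics import mean

# ACR category labels of an ITU-T P.800 listening-only test.
ACR_LABELS = {1: "bad", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}


def mean_opinion_score(votes):
    """Average a list of ACR votes (integers 1..5) into a MOS."""
    if not votes:
        raise ValueError("at least one vote is required")
    for vote in votes:
        if vote not in ACR_LABELS:
            raise ValueError(f"invalid ACR vote: {vote!r}")
    return mean(votes)


# Hypothetical votes from eight listeners for one speech sample.
votes = [4, 4, 3, 5, 4, 3, 4, 4]
mos = mean_opinion_score(votes)
```

An objective model such as POLQA replaces the listening panel: it estimates this average directly from the reference and degraded signals.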

[Figure 2 diagram. Block labels: reference signal and distorted signal, each passed through a model of the device (i.e. handset) and a psycho-acoustic model (frequency and intensity warping, masking); distance / similarity; cognitive model; MOS-LQO.]






POLQA in narrow-band and super-wideband mode

Please note that POLQA has two different operational modes, one for narrowband and one for super-wideband. The structure of the model is the same; the main difference is: ‘What is the reference?’

In narrowband mode the reference signal is a narrowband (telephony band) signal. This simply means that all degradations are rated in relation to this reference. The assumed listening situation is – corresponding to narrowband telephony – an ordinarily shaped handset on one ear.

The band limitation of the telephony band itself is not considered as degradation, since the expectation of the listener is exactly that limited signal. Therefore POLQA compares to a telephony band signal too. In addition, POLQA in narrowband mode does not take distortions into account that are outside of the spectral range of a telephone handset; usually frequencies below 200Hz are not transmitted anymore.

As a consequence, a signal that is ‘only’ band-limited but otherwise undistorted is scored by POLQA with a high value in this context. A perfect narrow-band signal will receive a POLQA score of 4.5. This narrow-band mode of POLQA and the maximum value of 4.5 make it backwards compatible with ITU-T P.862 ‘PESQ’ as well as with SQuad with respect to the scale use and the targeted telephony test scenario.

Additionally, POLQA supports super-wideband signals in a super-wideband operational mode. Here – of course – the reference signal which POLQA compares to is super-wideband. In this mode, POLQA scores like a human listener wearing headphones and expecting HiFi quality. Consequently, for getting high scores with POLQA the signal needs to be a clean signal and almost unlimited in bandwidth. In case of an ideal super-wideband signal, POLQA scores with 4.75.

It is very important to understand that a narrowband telephony signal scored in super-wideband mode will get significantly lower scores than it would get in narrowband mode. This is logical, since due to the comparison to a super-wideband reference, all missing spectral components are considered as distortions. That is the same as listening to telephony speech through a headphone: the listener expects HiFi quality, but only telephony bandwidth is received.

ITU-T P.863 ‘POLQA’ is the first model recommended by ITU-T for super-wideband speech. In the context of speech, super-wideband can even be considered ‘unlimited’, since there are no relevant speech components above 14’000Hz.

An interesting intermediate case is the traditional wideband that is limited to 7’000Hz. However, wideband is a term that has been used in different ways. The first trials for extended audio bandwidths started in the 1980s and opened the band up to 7’000Hz (a sampling frequency of 16kHz was used here). The lower band limitation was sometimes 50Hz, sometimes 100Hz. An early coding standard was based on an ADPCM scheme (ITU-T G.722) and remained untouched for many years. Now wideband speech transmission is coming to mobile networks and will be enabled by AMR-WB and EVRC-WB. There is still an upper limit of 7’000Hz; the lower end, however, is limited only by the electro-acoustical components in the mobile phones.

These traditional wideband scenarios will be scored by POLQA in its super-wideband mode as well. This means that the wideband signal is compared against a super-wideband reference. Consequently, the parts above 7’000Hz are missing in the signal, leading to a measured degradation. However, the parts above 7’000Hz only contribute to a lesser extent to the perceived speech quality. Therefore, an ideal signal just limited at 7’000Hz will be scored by POLQA at around 4.5, making it backwards compatible with ITU-T P.862.2 ‘PESQ-WB’.

It is important to note that a traditional wideband channel must be scored in POLQA super-wideband mode. It is not in line with ITU-T P.863 to use a 7’000Hz reference signal.
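The mode rules described above can be summarised in a small lookup. This is only an illustrative sketch: `select_polqa_mode` and `IDEAL_SCORE` are hypothetical names, and the scores are the ideal-case values quoted in this note, not values computed by the code.

```python
# Ideal-case scores quoted in this application note, keyed by
# (operational mode, bandwidth of the transmitted signal).
IDEAL_SCORE = {
    ("narrowband", "narrowband"): 4.5,
    ("super-wideband", "super-wideband"): 4.75,
    ("super-wideband", "wideband"): 4.5,  # approximate ("around 4.5")
}

CHANNELS = ("narrowband", "wideband", "super-wideband")


def select_polqa_mode(channel, compare_with_wider_band=False):
    """Pick the operational mode for a test.

    A traditional wideband (7 kHz) or super-wideband channel is always
    scored in super-wideband mode. A narrowband channel is scored in
    narrowband mode unless it is to be compared against wider-band systems.
    """
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel type: {channel!r}")
    if channel == "narrowband" and not compare_with_wider_band:
        return "narrowband"
    return "super-wideband"


mode = select_polqa_mode("wideband")
expected_ideal = IDEAL_SCORE[(mode, "wideband")]
```

Note that there is deliberately no "wideband" mode in the lookup, in line with the statement that a 7’000Hz reference signal is not foreseen by ITU-T P.863.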






POLQA internal processing steps

The following block scheme provides a brief overview of POLQA. There is a differentiation between the time-alignment part and the psycho-acoustic ‘perceptual’ and cognitive model.

[Figure 3 diagram. Block labels: input (reference) and output (degraded); time alignment; idealization; perceptual model (applied to both signals), producing the internal representation of the ideal and the internal representation of the output; difference in internal representation; cognitive model; quality.]

Figure 3: Basic scheme of the main components of P.863 ‘POLQA’

Time Alignment

Why does POLQA perform time alignment?

POLQA and other objective measures following the same base structure compare the (spectral) short-term characteristics of the reference signal and the degraded signal frame by frame. The alignment marks corresponding sections in both signals. Only this way can the correct frames be compared to each other.

What makes it challenging?

Aligning two signals is simple if the delay between them is constant and the transmission is linear; only an offset has to be compensated. Unsynchronized devices (clock drift) are more complicated: they lead to a constantly increasing or decreasing ‘delay’. Here the compensation is not constant, but it at least changes linearly over time. Even more challenging are processing components that transmit individual parts of the signal with different delays. These can stretch or compress speech pauses as well as speech parts. The stretching or compressing can preserve the pitch or simply ‘warp’ the entire signal part.
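For the simplest case, a constant delay, the offset can be found by searching for the lag that maximises the cross-correlation of the two signals. The sketch below illustrates that textbook technique on a toy signal; it is not the alignment procedure of P.863, and `estimate_constant_delay` is a hypothetical helper.

```python
def estimate_constant_delay(reference, degraded, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation.

    A positive result means the degraded signal lags behind the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(degraded):
                score += x * degraded[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: the degraded signal is the reference delayed by 3 samples.
reference = [0.0, 1.0, -0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0] + reference[:5]
delay = estimate_constant_delay(reference, degraded, max_lag=5)
```

A single global lag like this fails as soon as the delay drifts or jumps within the file, which is why a piecewise approach is needed for real networks.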

In all these cases, each individual short frame of the degraded signal (usually 32ms in length) has to be assigned to a corresponding frame in the reference signal.
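To make the frame assignment concrete, the following sketch cuts a degraded signal into 32ms frames and maps each frame start to a position in the reference under a linearly changing delay (the clock-drift case). Non-overlapping frames, the 8kHz sampling rate and the drift value are assumptions chosen for illustration only.

```python
FRAME_MS = 32  # frame length mentioned in the text


def frame_starts(num_samples, sample_rate, frame_ms=FRAME_MS):
    """Start indices of consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    return list(range(0, num_samples - frame_len + 1, frame_len))


def map_to_reference(start, initial_delay, drift):
    """Reference position of a degraded frame for a linearly changing delay.

    The delay in samples at degraded position `start` is modelled as
    initial_delay + drift * start.
    """
    return round(start - (initial_delay + drift * start))


starts = frame_starts(num_samples=1024, sample_rate=8000)
# Hypothetical channel: 40 samples of initial delay, 1 % clock drift.
assigned = [map_to_reference(s, initial_delay=40, drift=0.01) for s in starts]
```

Frames whose mapped position is negative (here the first one) have no counterpart in the reference and would have to be treated as unassigned.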

How can it be done in a robust and fast way?

First, POLQA identifies signal parts where the delay can be assumed to be constant and flags them as ‘landmarks’. These parts can be of different length; in the simplest case one single part covers the entire signal (if there is a constant delay over the entire file).






[Figure 4 diagram. Labels: REFERENCE; PROCESSED; correspondence with confidence.]

Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts

In a second step, the areas between these landmarks are analyzed. For this purpose, the signal is successively sub-divided into a series of smaller parts. Each part is assigned a corresponding part in the other signal.

Each assigned signal part is given a value that rates the confidence of the assignment. In less confident areas a wider signal range is analyzed, whereas the assignment correspondences of parts with a high confidence are considered as fixed.

This approach allows a very efficient and robust search structure since the search range becomes more and more restricted as more landmarks are set. The result is a kind of matrix with corresponding signal parts and associated search ranges.

Figure 5: Illustration of assigned signal parts and the optimal ‘path’ of signal correspondences

A Viterbi-like algorithm then calculates the most likely ‘path’ through this matrix and fixes the corresponding signal parts.
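The idea of finding the most likely path can be illustrated with a small dynamic-programming sketch. It assumes that each signal part has a few candidate delays with a confidence value and that large delay jumps between neighbouring parts are penalised; the scoring rule, the penalty and the data are assumptions for illustration, not the cost function of P.863.

```python
def best_delay_path(candidates, jump_penalty=0.01):
    """Viterbi-style search over per-part delay candidates.

    candidates: one list per signal part, each a list of (delay, confidence).
    Returns the delays of the path with the highest total score, where the
    score is the sum of confidences minus jump_penalty * |delay change|.
    """
    # score[j]: best score of any path ending in candidate j of the current part
    score = [conf for _, conf in candidates[0]]
    back = []  # back[i][j]: best predecessor (in part i) of candidate j in part i+1
    for prev, cur in zip(candidates, candidates[1:]):
        new_score, pointers = [], []
        for delay, conf in cur:
            options = [
                score[k] - jump_penalty * abs(delay - prev_delay)
                for k, (prev_delay, _) in enumerate(prev)
            ]
            k_best = max(range(len(options)), key=options.__getitem__)
            new_score.append(options[k_best] + conf)
            pointers.append(k_best)
        score = new_score
        back.append(pointers)
    # Trace back from the best final candidate.
    j = max(range(len(score)), key=score.__getitem__)
    path = [candidates[-1][j][0]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(candidates[i][j][0])
    path.reverse()
    return path


# Hypothetical candidates (delay in ms, confidence) for three signal parts.
candidates = [
    [(40, 0.9), (120, 0.2)],
    [(40, 0.5), (120, 0.6)],
    [(40, 0.8), (48, 0.7)],
]
path = best_delay_path(candidates)
```

In the second part, the locally most confident candidate is 120ms, but the path search still selects 40ms because it is consistent with the neighbouring parts. This is the benefit of a global path over independent per-part decisions.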

The end result of the time alignment step is a correspondence table with the start and end times of each signal part and of its counterpart in the reference. Parts of the degraded signal with no correspondence in the reference (i.e. inserted or added parts), as well as parts of the reference signal that are missing in the degraded signal, are marked as well. The following figure illustrates a practical example. The upper graph shows the (complete) reference signal, the lower graph shows the received and degraded signal.
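One possible shape of such a correspondence table is sketched below. The field names, the use of `None` for absent parts and the example times are assumptions made for illustration; the recommendation does not prescribe this data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Correspondence:
    """One row of a hypothetical correspondence table (times in seconds)."""
    deg_start: Optional[float]  # None if the part is missing in the degraded signal
    deg_end: Optional[float]
    ref_start: Optional[float]  # None if the part was inserted by the channel
    ref_end: Optional[float]
    confidence: float = 0.0

    @property
    def kind(self):
        if self.deg_start is None:
            return "missing"
        if self.ref_start is None:
            return "inserted"
        return "matched"


def missing_duration(table):
    """Total duration of reference parts that were lost in transmission."""
    return sum(c.ref_end - c.ref_start for c in table if c.kind == "missing")


table = [
    Correspondence(0.00, 1.20, 0.00, 1.20, confidence=0.95),
    Correspondence(None, None, 1.20, 1.50),  # lost in transmission
    Correspondence(1.20, 2.40, 1.50, 2.70, confidence=0.60),
]
lost = missing_duration(table)
```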

Figure 6: Example of an aligned pair of reference and degraded signal






The green areas denote signal parts assigned with high confidence; the blue ones are those with lower confidence. The red signal part indicates a part of the reference signal that was lost during transmission and is no longer present in the degraded signal. Unassigned silent parts (white) are not used for direct comparison but rather for analyzing how annoying the noise floor in them is.


Psycho-acoustic model

As in other models that follow the same basic approach, POLQA’s psycho-acoustic model starts with a global level alignment followed by a frame-wise spectral analysis of overlapping frames. As usual in these models, a short-term level scaling is applied as well, and a cosine-based window followed by an FFT converts the audio signal from the time domain to the spectral domain.
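The frame-wise spectral analysis can be sketched as follows. The example uses a Hann window as one common cosine-based window, 50 % frame overlap and a plain DFT in pure Python for readability; window type, frame length and overlap are assumptions, and the level alignment and short-term scaling steps are omitted.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window, a typical cosine-based window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_power(frame):
    """Power spectrum (bins 0 .. N/2) of one frame via a direct DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def power_spectra(signal, frame_len, hop):
    """Windowed power spectra of overlapping frames."""
    window = hann(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        spectra.append(dft_power(frame))
    return spectra


# Toy input: a 1 kHz tone sampled at 8 kHz, analysed with 50 % overlap.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(128)]
spectra = power_spectra(tone, frame_len=64, hop=32)
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
peak_hz = peak_bin * sample_rate / 64
```

A production implementation would use an FFT library instead of the direct DFT; the result is the same, only faster.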

The block scheme of the POLQA psycho-acoustic model is shown in the figure below.

[Figure 7 diagram. Recoverable block labels: for both the reference input and the degraded output, scaling (towards playback level and towards the degraded signal), windowed FFT, frequency warping to pitch scale, masking and intensity warping to loudness scale; idealization of the reference; frequency response compensation, noise suppression, and partial local and global scaling; perceptual subtraction, asymmetry processing and Lp time integration, yielding the disturbance indicators ‘D’ and ‘Da’ (disturbances in speech); the separate indicators FRQ (spectral shaping, band limitation), NOI (stationary and switched noises) and RVB (room reverberations); and a cognitive model (combination of individual indicators, training on subjective reference scores, mapping into the MOS scale) that outputs the predicted listening quality MOS-LQO.]

Figure 7: Block-scheme of POLQA as in ITU-T P.863

The basic approach of the psycho-acoustic model, which means the use of critical bands and the loudness compression, looks similar to well-known state-of-the-art models.
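The two ingredients named here, critical bands and loudness compression, can be illustrated with textbook formulas. The sketch uses Traunmüller's approximation of the Bark scale and a simple power law with a Zwicker-style exponent of 0.23; both are stand-ins chosen for illustration, since this note does not state the exact warping functions used by POLQA.

```python
def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark (critical-band rate) scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_spectrum(power_bins, sample_rate, n_fft):
    """Sum FFT bin powers into one-Bark-wide bands (band index = floor of Bark)."""
    bands = {}
    for k, p in enumerate(power_bins):
        f = k * sample_rate / n_fft
        band = max(0, int(hz_to_bark(f)))
        bands[band] = bands.get(band, 0.0) + p
    return [bands.get(b, 0.0) for b in range(max(bands) + 1)]


def compress_loudness(power, exponent=0.23):
    """Power-law intensity compression (illustrative exponent)."""
    return power ** exponent


# Flat toy spectrum: 33 bins of a 64-point FFT at 8 kHz.
flat = [1.0] * 33
bands = bark_spectrum(flat, sample_rate=8000, n_fft=64)
loudness = [compress_loudness(p) for p in bands]
```

Because the Bark scale is compressed towards high frequencies, the upper bands collect more FFT bins than the lower ones, even for a flat input spectrum.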






However, there are three parts that make P.863 ‘POLQA’ different from established standards such as P.862 ‘PESQ’.

Removal / reduction of individual distortion types and their separate consideration

Idealization of the reference signal

Sharpened loudness spectra

Removal / reduction of individual distortion types

It would be too simple to assume that every difference between the signals, once transformed to an internal psycho-acoustic representation, is weighted correctly by this purely physiological view. Some kinds of distortion are not well covered by the established psycho-acoustic models. Examples are certain linear distortions such as frequency responses (leading to a colored or shaped spectral distribution), echoes and reverberations, and strong additive background noises.

The reason for these shortcomings becomes clearer when we look at how psycho-acoustic models were designed and evaluated. These models mainly describe the spectral integration into critical bands (due to the so-called frequency-to-place transformation on the basilar membrane in the human ear), the sensitivity in different spectral areas, the non-linear perception of intensity, and spectral and temporal masking effects. They were largely developed and validated with signals such as sine waves and narrow-band noises and do not include any assumptions about speech recognition. For example, an echo, which creates a distortion through a slightly delayed repetition of the same speech, cannot be distinguished from a pure noise of the same intensity using this approach.

Therefore, it makes sense to detect and quantify those distortion types prior to the application of the psycho-acoustic model. The long-term frequency response is calculated and compensated in the signals. An indicator ‘FRQ’ is calculated separately and considered in the final MOS prediction. The same applies to background noises: they are measured and largely removed from the signal, and the amount of noise is later considered through a ‘NOI’ indicator. In a similar way, echoes and reverberations are quantified and used to correct the final predicted MOS.
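The idea of estimating and compensating a long-term frequency response can be illustrated with a minimal sketch. It is not the P.863 procedure: the function names, the choice to apply the compensation to the reference, and the spread-based stand-in for the ‘FRQ’ indicator are all assumptions made for illustration. Each frame is assumed to be a list of band powers.

```python
import math

def long_term_response_db(ref_frames, deg_frames, floor=1e-12):
    """Per-band ratio of time-averaged power (degraded over reference) in dB."""
    n_bands = len(ref_frames[0])
    response = []
    for b in range(n_bands):
        ref_avg = sum(f[b] for f in ref_frames) / len(ref_frames)
        deg_avg = sum(f[b] for f in deg_frames) / len(deg_frames)
        response.append(10.0 * math.log10((deg_avg + floor) / (ref_avg + floor)))
    return response

def compensate(ref_frames, response_db):
    """Apply the estimated response to the reference (an assumption of this sketch)."""
    gains = [10.0 ** (r / 10.0) for r in response_db]
    return [[p * g for p, g in zip(frame, gains)] for frame in ref_frames]

def frq_indicator(response_db):
    """Hypothetical stand-in for 'FRQ': spread of the response around its mean, in dB."""
    mean = sum(response_db) / len(response_db)
    return math.sqrt(sum((r - mean) ** 2 for r in response_db) / len(response_db))

# Toy example: the channel attenuates the middle band by 10 dB.
ref = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
deg = [[1.0, 0.1, 1.0], [1.0, 0.1, 1.0]]
resp = long_term_response_db(ref, deg)
ref_comp = compensate(ref, resp)
```

After compensation the spectral shaping no longer appears as a frame-by-frame difference; it is carried separately by the indicator, which is the separation the paragraph above describes.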

By applying these corrections, the signal is now much closer to a signal to which the psycho-acoustic model can be applied. It has been freed of spectral shaping and strong noises.

Idealization of the reference signal

A truly new element compared to existing and established methods is the so-called idealization of the reference signal. The idea is to remove slight distortions such as noises and to align the spectral shape and timbre towards an ideal. This makes sense, since a listener in a scoring situation such as a P.800 ACR test does not compare the degraded signal to the input signal (the reference actually used), but rather to a conception of how that talker should sound. This step is modeled for the first time in P.863 ‘POLQA’.

Common objective models compare the signal to be scored with a reference and weight all (perceptually relevant) differences as distortion. As a consequence, if the signal for scoring is identical to the reference (totally transparent transmission or just a copy of the reference signal), no differences will be found and the predicted MOS is at the maximum (e.g. 4.5 for narrow-band).

POLQA is different. The internal representation of the ideal reference signal is not equal to the internal representation of the signal used as reference. This means that a non-optimal reference, e.g. having a low noise floor, will have that noise removed. If POLQA gets this (noisy) reference signal for scoring, it compares that signal with the internally calculated ‘ideal’ and may provide a score lower than the maximum. POLQA is not a model that rates differences between two signals; it rates an absolute quality and uses the idealized reference as an upper bound. Absolute quality is the difference to an imagined or expected ideal, just like in subjective absolute category rating tests.

This may be surprising for a technical user, but POLQA is simply a consistent model of the subjective listening test. In a listening test, too, listeners will never give a very high score to a signal that is a bit noisy.






Sharpened loudness spectra

The usual approach for transforming a signal to an internal representation is:

1. Application of a time-to-frequency transformation by estimation of a short-term spectral power density. This can be done by a filter-bank or a Fourier transformation. POLQA uses an FFT approach after applying a Hann window. Window length is 32ms with 50% overlap.

2. Subdivision of the spectrum into bands. This is usually motivated by the (in principle) logarithmic perception of frequency. It may be done in a simplified manner using 3rd octave bands or, as is more common and used in POLQA too, using critical bands according to Zwicker. At the end of this aggregation, the hearing spectrum is sub-divided into 24 bands with increasing bandwidth towards higher frequencies. For each band a power or intensity value is computed. This frequency scale using critical bands is known as the ‘Bark’ scale.

3. The intensity is transformed to a perceived loudness scale. Basically, the intensity is compressed at higher sound intensities, similarly to a decibel scale. In addition, the varying sensitivity at different frequencies is taken into account. Intensities below the hearing threshold are discarded as well. This loudness scale is known as ‘Sone’ scale.
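The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the P.863 implementation: the band edges are the commonly tabulated Zwicker critical-band edges, the power-law exponent of 0.23 and the constant hearing threshold are placeholders, the frequency-dependent sensitivity is omitted, and a plain DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math

# Critical-band edges in Hz after Zwicker: 25 edges delimit 24 Bark bands.
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def frames(signal, fs, frame_ms=32.0, overlap=0.5):
    """Step 1a: cut the signal into Hann-windowed frames (32 ms, 50% overlap)."""
    size = int(fs * frame_ms / 1000.0)
    hop = int(size * (1.0 - overlap))
    win = hann(size)
    for start in range(0, len(signal) - size + 1, hop):
        yield [s * w for s, w in zip(signal[start:start + size], win)]

def power_spectrum(frame):
    """Step 1b: one-sided power spectrum via a plain DFT (slow, illustration only)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec

def bark_band_powers(spec, fs):
    """Step 2: sum the spectral power inside each critical band."""
    bin_hz = (fs / 2.0) / (len(spec) - 1)
    bands = [0.0] * (len(BARK_EDGES_HZ) - 1)
    for k, p in enumerate(spec):
        f = k * bin_hz
        for b in range(len(bands)):
            if BARK_EDGES_HZ[b] <= f < BARK_EDGES_HZ[b + 1]:
                bands[b] += p
                break
    return bands

def to_loudness(bands, threshold=1e-6, exponent=0.23):
    """Step 3: compress intensity; discard values below an assumed threshold."""
    return [(p ** exponent) if p > threshold else 0.0 for p in bands]

# A 1 kHz tone sampled at 8 kHz ends up in the 920-1080 Hz band (index 8).
fs = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(512)]
first = next(frames(tone, fs))
loudness = to_loudness(bark_band_powers(power_spectrum(first), fs))
```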

Of course, all objective models following this approach will apply the standard range of signal compensations in addition to the plain psycho-acoustic transformation. These compensations include further and individual short-term level scaling, spectral compensation and weighting functions. However, the psycho-acoustic models almost always follow the approach outlined above.

Common models such as P.862 ‘PESQ’ apply the spectral masking thresholds directly to this internal representation. The result is a so-called smeared spectrum. In principle this models the self-masking effects of the signal: quieter parts are masked by louder parts at neighboring frequencies. This effect is widely used and described for audio coding, e.g. in MP3, where masked parts of the signal are considered redundant and are not transmitted. Furthermore, the quantization noise is shaped so as to be masked by the signal and thus not to add perceptible distortions.

In POLQA this approach was revised, since we are less interested in the self-masking effects of the signal but rather in the perception of remaining (or unmasked) differences between two signals. The chosen approach can be imagined – compared to a ‘smeared loudness spectrum’ – as a ‘sharpened loudness spectrum’.

In a first step the masking slopes are calculated (Figure 8):

Figure 8: Application of masking slopes to the Bark spectrum

The second step consists of analyzing which masking slopes mask other parts of the spectrum, either fully or partially (Figure 9):


Figure 9: Consideration of fully and partially masked spectral parts

In a third step the fully masked spectral parts are removed and the partially masked parts are reduced in their loudness (Figure 10):

Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking

Finally, we get a loudness spectrum that represents the individual spectral parts as they contribute to perception. This means that fully masked parts are taken out, while partially masked parts are attenuated.
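The three steps can be illustrated with a toy sketch. Everything numeric here is an assumption: real masking slopes are asymmetric and level-dependent, whereas this sketch uses a symmetric geometric slope per band, a hypothetical ‘partial zone’ factor, and a simple subtraction for the partially masked parts. It only shows the classification into masked, partially masked and unmasked parts.

```python
def masking_threshold(loudness, slope_per_band=0.5):
    """Step 1: highest masking slope cast onto each band by all other bands."""
    n = len(loudness)
    thresholds = [0.0] * n
    for i in range(n):
        for j, lj in enumerate(loudness):
            if j != i:
                thresholds[i] = max(thresholds[i],
                                    lj * slope_per_band ** abs(i - j))
    return thresholds

def sharpen(loudness, slope_per_band=0.5, partial_zone=2.0):
    """Steps 2 and 3: classify each band and remove or attenuate it."""
    thresholds = masking_threshold(loudness, slope_per_band)
    out = []
    for value, thr in zip(loudness, thresholds):
        if value <= thr:
            out.append(0.0)          # fully masked: removed
        elif value < partial_zone * thr:
            out.append(value - thr)  # partially masked: reduced in loudness
        else:
            out.append(value)        # unmasked: kept as is
    return out

# Five bands: two are fully masked, one is partially masked, two survive.
sharpened = sharpen([0.1, 4.0, 3.0, 0.2, 2.0])
# sharpened == [0.0, 4.0, 1.0, 0.0, 2.0]
```

Unlike a smeared spectrum, the output keeps one value per band, so neighboring bands are not blurred into each other.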

These modified spectra of the reference and the degraded signal are then compared, and differences are considered as ‘perceptible’ differences. The big advantage of the ‘sharpened’ approach is that the high resolution of the spectrum is preserved. This is required, e.g., for a valid qualitative assessment of how compression algorithms reproduce fine spectral structures in the upper bands.

Aggregation and Cognitive Effects

The steps above are performed for a short-term spectral comparison across all frames in the speech signal. At the end of this ‘main loop’ across all frames, we have a quality indicator for each frame window. This quality indicator is based on the differences of the short-term representations of the reference and degraded signal. It is dimensionless, but may be imagined as a kind of signal similarity over time.

These individual quality indicators represent a quality at a certain point in the audio signal. They are aggregated over all frames. This aggregation is not a plain averaging; it contains non-linear weightings and slopes in time. These aggregated quality indicators are then weighted and corrected using the previously calculated descriptors for spectral shaping (FRQ), additive noises (NOI) and echoes and reverberations (RVB).
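One common form of non-linear aggregation is an Lp norm over the per-frame values; the block scheme in Figure 7 names an ‘Lp time integration’ stage. The sketch below shows only the principle. The exponents and the per-frame values are assumptions, and the actual weightings used in P.863 are not reproduced here.

```python
def lp_norm(values, p):
    """Lp-norm average: larger p gives short, strong disturbances more weight."""
    n = len(values)
    return (sum(abs(v) ** p for v in values) / n) ** (1.0 / p)

# Hypothetical per-frame disturbances: three clean frames and one bad frame.
disturbance = [0.0, 0.0, 0.0, 4.0]
plain_mean = lp_norm(disturbance, 1)   # 1.0
l2 = lp_norm(disturbance, 2)           # 2.0
l4 = lp_norm(disturbance, 4)           # about 2.83
```

With a plain mean the single bad frame is diluted by the clean ones; with higher exponents the aggregate moves towards the worst frame.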

Finally, the aggregated overall quality indicator is mapped to the MOS scale. In the narrowband operational mode the indicator is mapped to a range from 1.0 to 4.5, in case of super-wideband to a range of 1.0 to 4.75. The upper bound represents the typical maximal MOS obtained in subjective listening tests.


POLQA prediction performance and typical results

The main trigger for the development of ITU-T P.863 was that the existing P.862 ‘PESQ’ did not cover today’s quality variations in telecommunication networks. The decreasing prediction accuracy of P.862 ‘PESQ’ required an evolved voice quality prediction model to cope with

New types of speech codecs and codecs not yet used in telecommunications, e.g. audio codecs

Enhanced frame loss concealment techniques

Voice Quality Enhancement (VQE) systems, non-linear processing for increasing intelligibility

Re-sampling, time-warping

In addition the P.863 development should extend the scope of P.862 mainly by

Extension to super-wideband (50 to 14’000Hz)

Qualitative prediction of intermediate bandwidth, changes in audio bandwidth, bandwidth extension

Acoustical ‘interfaces’, echoes, reverberations

Sound presentation level

Due to the wide scope of P.863, the development and evaluation required a huge amount of test data. Test data means speech samples covering this variety of degradations, scored by human listeners in defined subjective experiments. In the end, a total of 62 subjectively scored data sets containing more than 45’000 voice samples were used for the evaluation of P.863 ‘POLQA’.

These data sets¹ were used for calculating the prediction performance by means of residual square errors or correlation coefficients. The residual square error or, as in earlier evaluations, Pearson’s correlation coefficient indicates the accuracy of the objective measure; it reflects the remaining prediction error relative to the ‘true’ scores obtained in the subjective tests.

These values give an overview of the performance in general. However, the numbers actually reached depend on the construction of the data set and the kind of conditions it contains. There are always test conditions that a model can predict ‘easily’ and accurately (e.g. noises, waveform codecs and so on) and others where the deviation is higher (usually combinations of distortions). The occurrence of such conditions in a data set has a strong influence on these figures. This is due not only to the objective prediction method but also to uncertainties of the listeners in the auditory tests.

For the P.863 ‘POLQA’ evaluation, ITU-T chose a statistical approach that is based on an r.m.s.e. calculation but takes the uncertainty of the subjectively derived MOS values into account. The performance of P.863 ‘POLQA’ was compared to that of P.862.1 and P.862.2 ‘PESQ’ on the basis of these figures.
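The text does not give the formula. A minimal sketch, assuming the epsilon-insensitive form described in ITU-T P.1401, is shown below: the prediction error of each condition is reduced by the 95% confidence interval of its subjective MOS and is only counted where it exceeds that interval. The function name, the `dof` parameter (degrees of freedom removed by a preceding mapping step) and the sample values are illustrative assumptions.

```python
import math

def rmse_star(mos_subjective, mos_predicted, ci95, dof=1):
    """r.m.s.e. that ignores errors inside the subjective confidence interval."""
    if not (len(mos_subjective) == len(mos_predicted) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    errors = [max(0.0, abs(s - o) - c)
              for s, o, c in zip(mos_subjective, mos_predicted, ci95)]
    return math.sqrt(sum(e * e for e in errors) / (len(errors) - dof))

# Hypothetical per-condition values, not taken from any real data set.
subjective = [4.0, 3.0, 2.0, 1.5]
predicted = [4.1, 3.4, 2.0, 1.2]
ci = [0.15, 0.15, 0.15, 0.15]
score = rmse_star(subjective, predicted, ci)
```

In this example the first and third conditions contribute nothing, because their prediction error lies within the listeners’ own uncertainty.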

Table 1: Improvement in performance of P.863 ‘POLQA’ to P.862 ‘PESQ’

rmse*                        P.862.1 'PESQ'       P.863 'POLQA' in NB mode     Improvement by
Classical narrowband exp.    0.157                0.123                        22%
Advanced narrowband exp.     0.227                0.154                        32%

rmse*                        P.862.2 'PESQ-WB'    P.863 'POLQA' in SWB mode    Improvement by
Wideband experiments         0.345                0.150                        57%
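The ‘Improvement by’ column is consistent with the relative reduction of rmse*, as the following short check shows. The function name is illustrative; the input values are those of Table 1.

```python
def improvement_percent(old_rmse, new_rmse):
    """Relative reduction of the prediction error in percent."""
    return 100.0 * (old_rmse - new_rmse) / old_rmse

classical = improvement_percent(0.157, 0.123)  # about 21.7, reported as 22%
advanced = improvement_percent(0.227, 0.154)   # about 32.2, reported as 32%
wideband = improvement_percent(0.345, 0.150)   # about 56.5, reported as 57%
```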

¹ A data set, also often called experiment or database, is a set of speech files processed or transmitted under different real field or simulated conditions and scored subjectively. A data set usually consists of about 200 individual speech samples. The prediction accuracy is calculated by comparing the MOS scores given by the listeners with the prediction by the objective measure, e.g. P.863 ‘POLQA’.






The so-called classical set of narrowband experiments covers 22 data sets already used in ITU-T for standardization efforts from the mid 90’s until about 2003. They contain common codec and noise distortions, mobile channels of the 2nd and 3rd generation as well as VoIP as it was state of the art around the millennium. Even though these databases cover distortions that were already used during the development of P.862 ‘PESQ’, the new method P.863 ‘POLQA’ shows even higher prediction accuracy here.

The advanced set of narrowband experiments is more focused on the latest coding technologies, frame loss concealment strategies, noise reduction and of course 3rd and 4th generation mobile as well as the newest VoIP implementations. This set is based on 15 data sets. The improvement reached with the new method P.863 ‘POLQA’ is evident. This set covers a wide range of test conditions of the latest technologies, which P.863 was designed for.

Finally, there was a set of common wideband data as well. It covers 7 different data sets. Here the improvement over P.862.2 ‘PESQ-WB’ is extremely high.




3 Narrow-band Voice Quality measurements with P.863 ‘POLQA’ in Diversity

Idea of the narrowband test

The idea of a narrowband test is a test situation in which ‘a listener’ listens to a speech signal using a conventionally shaped telephone handset. This means that he or she is restricted to the telephony bandwidth and also expects such a band-limited signal. As a consequence, a perfect-sounding but band-limited signal will get a high score, since it exactly matches that listener’s expectation of excellent quality in this context. Even if the channel had a wider audio bandwidth, the listener would not experience it, since the transfer characteristic of the handset limits the bandwidth.

Therefore, a narrowband speech test, independent of whether it is performed with P.863 ‘POLQA’, with P.862 ‘PESQ’ or with SQuad in narrowband mode, always models a conventional narrowband telephony situation.

This test approach has been very commonly used over the last years or even decades and is perfectly suited for the characterization of narrowband networks and systems. However, no rating relative to wideband or super-wideband systems is possible.

Of course, neither P.863 ‘POLQA’ nor SQuad ‘listens’ to the speech signal using a handset. The handset transfer characteristic is modeled in the algorithms themselves. That means only spectral parts that are perceptible via such a handset will be analyzed by the quality predictor. Both P.863 ‘POLQA’ and SQuad analyze signals as they are recorded at an electrical interface to the network. This can be an ISDN or PSTN line, but also the headphone connector of a mobile device. The specification of such a receiving handset is called IRS receive (IRS: Intermediate Reference System) and is described in ITU-T P.48. The IRS receive characteristic can be imagined as a weak telephone band-pass with a slight preference for higher frequencies towards 3kHz.

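[Diagram. Upper part: reference circuit with the assumption of a channel gain of 0dB and an Overall Loudness Rating of 10dB: 89 dB(A) SPL at the sending side, -26dB ovl at the electrical interfaces on both sides of the network / real channel, 79 dB(A) SPL at the receiving side. Lower part: the reference speech signal passes a model of the microphone and is inserted at the electrical interface of the network / real channel; in the MOS predictor (e.g. POLQA), the signal captured at the receiving electrical interface and a copy of the reference speech signal each pass a model of the handset before entering the psycho-acoustic model.]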

Figure 11: Insertion and capturing in a speech test setup

(SPL: sound pressure level, OVL relative to overload point)

Similar to this, the sending direction is modeled in this narrow-band setup as well. The source speech signal is inserted into an electrical interface, either a PSTN or ISDN line or the microphone input of a mobile device. In reality, the signal has already passed the microphone and some voice processing components at this point. To emulate this part of the signal path, a model of a typical narrowband microphone is applied. This is called IRS send, since it models the device in sending direction. It can also be imagined as a weak telephony band-pass but with a quite strong pre-emphasis up to 3kHz. This makes the speech sound a bit ‘sharp’ but with higher intelligibility in background noise situations.²

Figure 11 shows the idea behind a narrow-band test schematically. The modeled sending device allows a direct electrical coupling to the channel under test and guarantees reproducible results independent of the microphone actually used.

The frequency responses for the two filters modeling the device are given in Figure 12. It is clearly visible that there is a bandwidth limitation to the telephony band, although a slightly wider band can pass than just 300 to 3400Hz.

[Two plots: ‘IRS send direction (ITU-T P.48)’ and ‘IRS rcv direction (ITU-T P.48)’. Horizontal axis: f / Hz, 0 to 4000; vertical axis: a / dB, -30 to 10.]

Figure 12: IRS in send and receive direction as specified in ITU-T P.48

While defined level and impedance requirements are given for ISDN and PSTN interfaces and are fulfilled by the interface devices, only the headset connector is available as a proprietary interface for mobile phones. SwissQual’s connector interface for mobile phones is adjusted for this type of interface. It applies the correct level, adjusts the frequency response, matches the impedance of each individual phone type and thus enables a quasi-standard electrical network termination point even for mobile handsets.

Speech reference signals for narrow-band tests

ITU-T P.863 ‘POLQA’ is a so-called full-reference model. Its basic approach is the same as for other measures of this kind, e.g. SQuad or P.862 ‘PESQ’: it compares the received and potentially degraded signal with an undistorted reference signal. This allows a very detailed and fine analysis of any kind of difference between the two signals. To take human perception into account, a model of the listening device (a handset in the narrow-band case) is first applied within the model itself. This way, the signal is compared exactly as it would be heard using such a device. In a narrow-band test case the signal is compared to an optimal narrow-band reference.

ITU-T P.800 and P.862.3 give constraints and requirements for the speech samples to be used. These mainly concern the temporal structure and the signal level of the speech signal. SwissQual’s measurement systems provide a set of speech samples in different languages. All speech samples follow the same rules for composition and pre-processing; they are all composed of a meaningful female and a male sentence. The reference speech file is 6s in length and contains more than 3.5s of active speech. There is a speech pause between the two sentences as required by P.862.3 and P.863. The signal is adjusted to a speech r.m.s. level of -26dB rel. OVL³, which corresponds to an analogue level of -20dBm at a 600 Ohms four-wire interface.
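The two level figures can be made concrete with a short sketch. It assumes 16-bit samples with a full scale of 32768 and uses a plain r.m.s. over all samples; a proper speech level measurement would exclude the pauses, which this sketch does not attempt. All function names are illustrative.

```python
import math

FULL_SCALE = 32768.0  # overload point of 16-bit samples (assumed reference)

def rms_level_db_ovl(samples):
    """Plain r.m.s. level in dB relative to the overload point."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / FULL_SCALE)

def scale_to_level(samples, target_db_ovl=-26.0):
    """Apply one constant gain so that the r.m.s. level meets the target."""
    gain = 10.0 ** ((target_db_ovl - rms_level_db_ovl(samples)) / 20.0)
    return [s * gain for s in samples]

def dbm_to_volts_rms(level_dbm, impedance_ohm=600.0):
    """Convert a power level in dBm to an r.m.s. voltage across an impedance."""
    power_watt = 1e-3 * 10.0 ** (level_dbm / 10.0)
    return math.sqrt(power_watt * impedance_ohm)

# A quiet test tone is raised to -26 dB rel. OVL.
tone = [1000.0 * math.sin(2.0 * math.pi * 440.0 * i / 8000.0) for i in range(8000)]
adjusted = scale_to_level(tone)

# -20 dBm at 600 Ohms corresponds to roughly 77.5 mV r.m.s.
volts = dbm_to_volts_rms(-20.0)
```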

In addition, P.862.3 defines the insertion and capturing process for the speech signal. These definitions describe the insertion point and the expected spectral characteristics. In narrow-band channels the signals are inserted after pre-filtering according to the IRSsend characteristic, as shown and explained above.

² This characteristic is taken from older carbon microphones: the pre-emphasis was meant to compensate the low-pass characteristic of the inductively loaded analogue lines of that time.

³ The value of -26dB relates to an overload point of 32767/-32768 as used with 16bit resolution in the digital signal domain.






What are the differences to the previous ITU P.862 ‘PESQ’?

Actually the differences are quite small for common applications in cellular networks. A customer may only see a slightly changing MOS-LQ value for error-free or high quality transmission using EFR or AMR with higher bitrates. Instead of a typical value in the range of 4.0 for EFR, they may now obtain values in the range of 3.9. This is mainly due to the fact that the actual bandwidth limitation in the narrowband channel is also considered by P.863 ‘POLQA’, i.e. limitations relative to the IRSrcv frequency response. In case the actually used bandwidth is slightly narrower, the MOS will be lower by a very narrow margin as well.

An improvement will be seen for EVRC-type codecs as used in CDMA. The new P.863 ‘POLQA’ shows an even better comparability to EFR/AMR codecs.⁴

Furthermore, POLQA is trained for scoring complex channels including more than just a codec, e.g. noise reduction, variable gain and filtering as well as strong time warping.

The following table shows the main differences in scores between SQuad version ‘08’, P.862.1 ‘PESQ’ and P.863 ‘POLQA’.

The results are based on typical speech samples and are an average across six speech samples (i.e. American English as used in SwissQual Diversity). Except for the ‘transparent transmission’ conditions, all samples were pre-filtered by IRSsend.

Table 2: Typical predicted MOS-LQ values for common transmission techniques for SQuad ‘08’, as well as P.862.1 ‘PESQ’ and P.863 ‘POLQA’.

                                                      P.862.1         SQuad-LQ ‘08’   P.863
                                                      (narrowband)    (narrowband)    (narrowband)

Linear distortions
Transparent transmission ~40 – ~3800 Hz               4.50            4.50            4.50
Transparent transmission ~180 – ~3500 Hz (G.712)      4.40            4.50            4.30
Transparent transmission ~200 – ~3500 Hz (IRSsend)    4.50            4.50            4.40
Transparent transmission 300 – 3400 Hz (box block)    4.10            4.30            3.60
IRSsend + G.711 (A-Law standard PCM)                  4.40            4.40            4.30

Codec conditions
IRSsend + EFR / AMR 12.2kbps                          4.15            4.15            4.20
IRSsend + EFR (real loss-free connection)             4.10            4.15            4.10
IRSsend + QCELP 13kbps                                3.90            4.00            4.00
IRSsend + EVRC 9.5 kbps                               3.75            3.90            3.90
IRSsend + EVRC-B 9.3 kbps                             3.75            4.00            3.90
IRSsend + AMR 7.95 kbps                               3.90            4.00            3.95
IRSsend + AMR 6.70 kbps                               3.75            3.90            3.85
AMR 4.75 kbps                                         3.40            3.70            3.65

⁴ ITU-T and 3GPP do not recommend the use of the P.862 family for EVRC-type codecs.






The codecs are used as reference SW implementations. In addition one EFR condition is shown as it behaves in a real loss-free channel, using a commercial Nokia handset as access device to the network. The channel was terminated by an ISDN card device running G.711 A-Law.

Firstly, P.863 ‘POLQA’ gives a very slightly more pessimistic prediction than SQuad08. However, for practical use cases this absolute difference is negligible. Compared to P.862.1, the higher AMR rates match very well, whereas the lower rates are scored higher by P.863. In addition, the EVRC-type codecs are scored higher and more realistically by P.863, and especially by SQuad08, than by P.862.1.

P.863 ‘POLQA’ considers linear distortions and bandwidth limitations in its score. For super-wideband mode this is obvious: there, a signal is always compared to a super-wideband reference (50 to 14000 Hz). It is important to note that P.863 ‘POLQA’ in narrow-band mode considers a ‘full narrow-band’ signal (~50 to 3800 Hz) as reference. An IRSrcv filter is applied to this signal in P.863 ‘POLQA’ itself. That means limitations that narrow this bandwidth will lead to a predicted distortion. With P.863 ‘POLQA’, the actual channel filters and band-pass characteristics in the microphone and loudspeaker path of the mobile phone used are taken into account more than was the case for P.862 ‘PESQ’.⁵

SwissQual’s SQuad08 also considers linear distortion in narrow-band mode; however, it is less sensitive than P.863 ‘POLQA’ and is expected to be less dependent on the phone actually used and its internal filtering.

SwissQual’s speech quality suite offers two methods for predicting listening quality: The known SQuad08 and the new ITU-T P.863 ‘POLQA’. Both models may be combined with ITU-T P.862 ‘PESQ’ as an option.

The entire framework as known from SQuad including the voice samples, the insertion and capturing procedure and – of course – all of the additional signal analysis results are used and available for P.863 ‘POLQA’ in the same way.

Test definition and result presentation

The definition of tests, the timing and the selection of speech files are exactly the same as for ‘Speech’ tests using SQuad. The only difference is the naming: speech tests with P.863 are called ‘Speech POLQA’.

P.863 ‘POLQA’ is embedded in the same framework that is used for SQuad. For P.863 ‘POLQA’ tests as well, all additional information such as levels, noise analysis, delay variations and frequency response are calculated and are available. Consequently, the obtained results are presented in the same format in SwissQual’s NQDI. This underlines the close relationship between the speech quality measures. Only the MOS prediction is either made by SQuad or by P.863 ‘POLQA’; the value measured by both is ‘Listening Quality’ as indicated by the type of test.

Figure 13: P.863 ‘POLQA’ narrowband main result representation in NQDI

To differentiate P.863 ‘POLQA’ tests from SQuad and P.862 ‘PESQ’, the method actually used is given in parentheses after the label ‘Listening Quality’. For immediate visual feedback, the POLQA logo is shown right below the predicted MOS score.

⁵ Since P.863 ‘POLQA’ measures the actual spectral loss of the speech signal, the actual impact of band limitations depends on the spectral power distribution of the speech sample. That means some samples are affected more and others less by this filtering due to their spectral characteristic, e.g. by losing more or fewer high-frequency parts.






In addition to the global values for the entire speech sample, graphs illustrate the quality profile over the sample duration, the signal envelopes and the signal gain.

Figure 14: P.863 ‘POLQA’ narrowband detail result representation in NQDI

P.863 ‘POLQA’ is treated as a separate method for listening quality measurements in NQDI. The test selection tab sheet in NQDI can be used to select individual P.863 ‘POLQA’ tests.

Figure 15: P.863 ‘POLQA’ test selection in NQDI

For reporting, the group of ‘Voice’ reports in NQDI includes an ‘LQ narrowband statistic’ report. It reports not only the P.863 results but also the results of all other algorithms such as SQuad and P.862 ‘PESQ’ in the same table. The results for each algorithm are given in a separate column.

Figure 16: P.863 ‘POLQA’ statistical report in MS EXCEL




4 Wideband Voice Quality measurements with P.863 ‘POLQA’ in Diversity

Idea of the wideband test

The idea of a wideband or – in correct terms – a super-wideband test is a test situation in which ‘a listener’ listens to a speech signal using HiFi headphones. This means that he or she is not restricted to any bandwidth. The headphone is able to transmit the entire perceptible audio bandwidth.

As a consequence, a perfect-sounding signal that is not band-limited will get a high score, since it exactly matches the listener’s expectation of excellent quality in such a setup. On the one hand, the headphone equipment itself sets a high expectation; on the other hand, the listener ‘knows’ the unlimited speech signal, since it is presented in this experimental context.

The modeling is similar to the narrowband telephony case as shown in Figure 11 but is adapted to the changed setup. This means that the MOS predictor, e.g. P.863 ‘POLQA’, will not ‘listen’ through a telephony handset, but rather models a headphone as listening device. It is modeled in a simplified manner as a flat filter from 50 to 14’000Hz.⁶

Similarly, a wideband or super-wideband device does not follow the IRS send filter characteristic in the microphone path anymore. It is also close to a flat filter with a band limitation at a higher point in frequency.

This has the consequence that the channel or system under test receives a super-wideband input signal. In case there is a narrowband channel or device, this channel or device will restrict the bandwidth. At the other end the predictor ‘listens with a headset’ and compares the received signal to the unlimited reference signal.

This leads to the recognition of missing spectral parts, and this ‘missing information’ consequently leads to a drop in quality. It can be imagined as listening to HiFi signals through a headphone and suddenly being presented with a narrow-band signal. A human listener will also perceive this as lower in quality.

The transmitted bandwidth becomes much more important for a super-wideband test. Restrictions in audio bandwidth are always compared relative to a super-wideband reference signal.

Note: The use of the super-wideband test scenario is NOT restricted to wideband or super-wideband channels or devices. The scenario just defines the reference, which is super-wideband in this case. Test scenarios in super-wideband will be the common test case in the near future. They are required not only for a valid evaluation of wideband systems but also for a correct ranking of systems or networks with different bandwidths down to narrowband.

A super-wideband test scenario implies some technical requirements. Within SwissQual’s product lines, the whole audio processing chain, from the handset’s audio connector across the analogue circuits in Diversity to the digital signal processing, has been designed for higher sampling frequencies and audio bandwidths from the beginning.

With Diversity Release 10.2.0, SwissQual launched a super-wideband test application for the first time, together with SQuad. Now, in Release 10.6.5, the wideband test application has been completed by the integration of the new ITU-T Recommendation P.863 ‘POLQA’.

Wideband speech reference signals

As already mentioned, ITU-T P.863 ‘POLQA’ is a so-called full-reference model. It compares the received and potentially degraded signal with an undistorted reference signal; in this case, however, the reference is practically unlimited in bandwidth.

⁶ For narrow-band mode P.863 ‘POLQA’ applies an IRS receive filter that emulates a narrow-band handset (see Figure 12: IRS in send and receive direction as specified in ITU-T P.48).






This is the difference to the narrow-band case. The recorded signal is compared relative to a super-wideband reference. Likewise, the recorded signal is not post-filtered, so that no band limitation is introduced; this models a receiving HiFi headphone.

That means that in case of a ‘full-band’ audio channel (e.g. a VoIP connection using full audio bandwidth or an application using MP3 with sufficient bitrate as in video or audio streaming), the recorded signal matches the reference in its bandwidth. In case of a common wideband or even a narrow-band channel or device, the bandwidth becomes limited during transmission. When this signal is recorded and compared to the full reference, the spectral loss is weighted as degradation.

Of course, the exploration of a wideband channel also requires the insertion of a signal with sufficient bandwidth. To actually feed wideband signals into the channel, new voice samples were recorded. They have no perceptible bandwidth limitation and are stored at 32kHz sampling frequency in a separate reference folder ‘Speech-Wideband’ or ‘Speech-Wideband POLQA’ respectively. As usual, the samples are constructed out of a male and a female spoken sentence and have a constant length of 6s. Thus, continuity with the narrowband tests is fully maintained.
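The two stated properties of these reference files, 32kHz sampling frequency and 6s length, can be checked with a few lines. This is an illustrative helper, not part of any SwissQual tool; the function name and the tolerance are assumptions.

```python
import wave

def check_reference(path, expected_rate=32000, expected_seconds=6.0, tolerance=0.05):
    """Return True if a WAV file has the expected sampling rate and duration."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
    return rate == expected_rate and abs(duration - expected_seconds) <= tolerance
```

For example, `check_reference("GE_fm_wide.wav")` would be expected to return True for a sample that follows the description above.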

For the time being SwissQual provides samples in

German (German pronunciation)

German (Swiss pronunciation)

British English

Italian

Dutch

Each language sample is provided without any pre-filtering (except for a 50 – 14’000 Hz band-pass) and is named, for example, GE_fm_wide.wav. As specified for wideband devices, the microphone path is considered flat in the transmission band. This means that no IRSsend filter is applied, as is required for narrow-band. The signal remains ‘flat’, without any further band limitation and without the pre-emphasis found in the IRS.

What are the differences to narrowband?

In traditional telephony scenarios, the listener expects a perfect but narrowband voice signal. A signal that is close or identical to this ideal is scored with a high quality value (usually a MOS-LQ of around 4.5 on a five-point scale; see footnote 7). Additional degradations decrease the quality value down to a minimum of 1.0.

Within a wideband scenario, the expectation of excellent quality is a perfect wideband speech signal. Since the same scale is used here, such a perfect wideband signal is scored with 4.5 too. Obviously, a narrowband signal in the same context will not fulfill the expectation of high quality due to its band limitation. Consequently, it will be scored lower in this context.

This is, roughly speaking, the main difference. There are other effects, such as a different perception of noise: noise components in the higher frequency ranges are masked less by the voice, or not at all. But the main difference remains that narrowband signals are scored lower.

Most important for customers are the typical values obtained with the wideband application compared to narrowband measurements.

The following table shows typical values obtained in the two test scenarios for the same type of conditions by averaging the predicted scores of five different speech samples.

7 For more detailed information, please refer to ‘White Paper – About MOS and Quality Measurements’, published by SwissQual AG in 2011.






Table 3: Typical P.863 ‘POLQA’ scores for common transmission techniques in a wideband and a narrowband context

Condition                                                       P.863 in super-wideband   P.863 in narrowband
                                                                (50-14000 Hz)             (300-3400 Hz)
Transparent transmission 50 – 14000 Hz or wider                 4.75                      -
Transparent transmission 50 – 7000 Hz (‘common’ wideband)       4.3                       -
AMR-WB 12.65 kbps (50 – 7000 Hz)                                3.8                       -
Transparent transmission 50 – 3800 Hz (‘Full Narrowband’)       3.6                       4.5
Transparent transmission ~250 – 3500 Hz (‘IRSsend’)             3.5                       4.4
Transparent transmission 300 – 3400 Hz (‘telephony box block’)  3.0                       3.6
IRSsend + G.712 + G.711 (A-Law standard PCM channel)            3.5                       4.3
IRSsend + EFR / AMR 12.2 kbps                                   3.2                       4.2
IRSsend + EVRC 9.5 kbps                                         3.0                       3.9
IRSsend + EVRC-B 9.3 kbps                                       3.0                       3.9
IRSsend + AMR 7.95 kbps                                         2.9                       3.9

It can be seen that the rank order of the systems remains independent of the test scenario. The upper range of the wideband scale is used only for high-quality wideband voice samples. The common narrowband scenarios are compressed into the lower 60% of the scale and thus show a smaller gradient as well.
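The smaller gradient can be illustrated with the narrowband conditions of Table 3. The following sketch only restates the table values; the shortened condition labels and the helper name score_span are chosen here for illustration.

```python
# Narrowband conditions of Table 3:
# (score in super-wideband context, score in narrowband context)
TABLE_3_NARROWBAND_CONDITIONS = {
    "Transparent 50-3800 Hz": (3.6, 4.5),
    "Transparent ~250-3500 Hz (IRSsend)": (3.5, 4.4),
    "Transparent 300-3400 Hz": (3.0, 3.6),
    "IRSsend + G.712 + G.711": (3.5, 4.3),
    "IRSsend + EFR / AMR 12.2 kbps": (3.2, 4.2),
    "IRSsend + EVRC 9.5 kbps": (3.0, 3.9),
    "IRSsend + EVRC-B 9.3 kbps": (3.0, 3.9),
    "IRSsend + AMR 7.95 kbps": (2.9, 3.9),
}


def score_span(scores):
    """Distance between the best and the worst score, rounded to 0.01 MOS."""
    return round(max(scores) - min(scores), 2)


swb_scores = [swb for swb, _ in TABLE_3_NARROWBAND_CONDITIONS.values()]
nb_scores = [nb for _, nb in TABLE_3_NARROWBAND_CONDITIONS.values()]

swb_span = score_span(swb_scores)   # span used in the super-wideband context
nb_span = score_span(nb_scores)     # span used in the narrowband context

# Average amount by which the same condition scores lower in super-wideband.
mean_offset = round(
    sum(nb - swb for swb, nb in TABLE_3_NARROWBAND_CONDITIONS.values())
    / len(TABLE_3_NARROWBAND_CONDITIONS),
    3,
)

print(swb_span, nb_span, mean_offset)
```

For these eight conditions the scores span 0.7 MOS in the super-wideband context and 0.9 MOS in the narrowband context.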

For optimizing and benchmarking pure narrowband networks and applications, the common narrowband test application can be used without any problems. The individual systems are discriminated more clearly because a wider range of the scale is used.

For optimizing wideband applications and networks, and especially for benchmarking wideband networks against narrowband ones, a wideband test application is required.

Firstly, degradations in wideband mode can only be assessed in a wideband test application; secondly, a wideband signal can only ‘show’ its better quality compared to narrowband in wideband mode.

Note: Narrowband MOS-LQ values and wideband MOS-LQ values must never be mixed or directly compared. They refer to different interpretations of the MOS scale.
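In post-processing scripts, one way to respect this rule is to store each score together with its context and to refuse any operation that mixes contexts. The class and function names below are hypothetical and do not correspond to any SwissQual interface; this is only a sketch of the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Context(Enum):
    NARROWBAND = "narrowband"
    SUPER_WIDEBAND = "super-wideband"


@dataclass(frozen=True)
class MosLq:
    """A MOS-LQ value tagged with the context it was measured in."""
    value: float
    context: Context

    def __post_init__(self):
        if not 1.0 <= self.value <= 5.0:
            raise ValueError("MOS-LQ must lie on the five-point scale")

    def difference_to(self, other):
        """Difference between two scores of the same context only."""
        if self.context is not other.context:
            raise ValueError(
                "narrowband and wideband MOS-LQ values must not be compared")
        return self.value - other.value


def mean_mos(scores):
    """Average a non-empty list of scores that share a single context."""
    contexts = {score.context for score in scores}
    if len(contexts) != 1:
        raise ValueError("need one or more scores from a single context")
    return sum(score.value for score in scores) / len(scores)


narrowband_scores = [MosLq(4.2, Context.NARROWBAND),
                     MosLq(3.9, Context.NARROWBAND)]
print(mean_mos(narrowband_scores))
```

Mixing a narrowband and a super-wideband score in difference_to or mean_mos raises a ValueError instead of silently producing a meaningless number.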

Where wideband quality can be assessed

Although wideband audio is a normal part of everyday media such as TV and FM radio, it is still not widespread in telecommunications.

It was used in commercial video conferencing systems, but it was Internet telephony that first made wideband telephony available to normal users. Today, common VoIP clients support a wide range of wideband codecs and use them if a sufficient bit-rate is available for the service.

Now the next step in wideband telephony is the evolution of cellular networks and handsets. The networks and user devices are being equipped with AMR-WB, allowing an audio bandwidth up to 7000Hz while still remaining in the typical bit-rate range used for GSM and UMTS.

Typical applications where wideband is already in use are mobile connections in GSM / UMTS, where the first operators have enabled AMR-WB in TrFO mode. Usually, the AMR-WB bitrate is restricted to 12.65 kbps in GSM, while bitrates up to 23.85 kbps are used in UMTS. Another application that can be tested under real field conditions with Diversity today is VoIP connections, where even super-wideband transmission is possible. Different codecs are in use, both standardized and proprietary solutions. Both were considered in the huge training set for SQuad and P.863 ‘POLQA’.

The main focus of Diversity’s wideband test solution is of course the evaluation and benchmarking of wideband channels in cellular networks.

An additional application area for wideband voice testing in Diversity is video streaming. Video streaming usually uses audio codecs, which do not impose any bandwidth restriction except in very low bitrate conditions. Consequently, Speech Wideband has also been applied as a test case to video streaming, starting with Release 10.2 of Diversity and completed in Release 11.0 with the full support of ITU-T P.863 ‘POLQA’.

Wideband analysis in Diversity

The super-wideband test application forms a stand-alone test, ‘Speech Wideband POLQA’, whilst the tests ‘Speech’ and ‘Speech POLQA’ remain narrowband.

While ‘Speech Wideband’ runs the SQuad algorithm, ‘Speech Wideband POLQA’ enables P.863 as the voice quality estimator. The same relation holds between ‘Speech’ and ‘Speech POLQA’.

All these test types, ‘Speech’, ‘Speech Wideband’, ‘Speech POLQA’ and ‘Speech Wideband POLQA’, can be selected as separate tests with SwissQual’s Test Manager and in the post-processing tools NQDI and NQView. The presentation in NQDI looks almost the same; however, the test name identifies the test and the algorithm used.

Figure 17: P.863 ‘POLQA’ wideband main result representation in NQDI

The application type (highlighted in red) explains the modeled listening situation in detail. In addition, since a potential bandwidth reduction has a serious impact in a wideband scenario, the actual bandwidth of the channel is measured and reported as well (highlighted in green). There are three classes:

narrowband (up to ~3’800Hz)

wideband (up to 8’000Hz)

super-wideband (up to 14’000Hz).
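The three classes can be expressed as a simple threshold rule on the measured upper band edge. The sketch below only mirrors the limits listed above; how Diversity actually measures the band edge and decides borderline cases is not described here, so the exact comparison logic and the function name are assumptions.

```python
# Upper limits of the classes as listed above (the narrowband limit is
# given only approximately in the text).
NARROWBAND_LIMIT_HZ = 3_800
WIDEBAND_LIMIT_HZ = 8_000
SUPER_WIDEBAND_LIMIT_HZ = 14_000


def classify_bandwidth(upper_bound_hz):
    """Map a measured upper band edge in Hz to one of the three classes.

    Assumption: edges above the wideband limit count as super-wideband,
    including edges beyond 14'000 Hz.
    """
    if upper_bound_hz <= 0:
        raise ValueError("upper band edge must be positive")
    if upper_bound_hz <= NARROWBAND_LIMIT_HZ:
        return "narrowband"
    if upper_bound_hz <= WIDEBAND_LIMIT_HZ:
        return "wideband"
    return "super-wideband"


for edge_hz in (3_400, 7_000, 14_000):
    print(edge_hz, classify_bandwidth(edge_hz))
```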

The remaining values are the same as usual and well known for SQuad and are visible in narrowband tests as well. They provide information about the speech level, noise floor, the amount of missed voice and the gain applied by the channel.

The tab sheet ‘Speech Details’ clearly shows the audio bandwidth of the measured audio channel, in this case a ‘common’ wide band channel up to almost 8’000Hz (Figure 18).

Figure 18: P.863 ‘POLQA’ wideband audio bandwidth representation in NQDI

The lower and upper bounds are marked with blue lines. As is clearly visible, Diversity and ITU-T P.863 make use of real super-wideband signals. The frequency scale here ends at 16’000 Hz; this corresponds to an internal sampling frequency of 32’000 Hz.




5 Real field measurements

One of the most important questions is how P.863 ‘POLQA’ results relate to previous P.862 ‘PESQ’ measurements under real field conditions. Of course, P.862 ‘PESQ’ and P.863 ‘POLQA’ are different algorithms and treat distortions in the signal differently. In the end, however, the predicted MOS should accurately describe the quality of the voice or of the voice channel. This means that where P.862 ‘PESQ’ delivered accurate predictions, the newer and improved P.863 ‘POLQA’ should predict almost the same value. For distortions where P.862 ‘PESQ’ predicted less accurately, P.863 ‘POLQA’, as an improved method, will predict more accurately and therefore differently from P.862 ‘PESQ’ (see footnote 8).

In real field measurements the channel consists of more than just a codec. Even under perfect radio conditions there can be other factors that limit the maximum quality. These could be further bandwidth limitations that are due to the actual device used, or further speech processing steps such as noise and gain control that are applied in the device or in the network. There might also be transcoding, i.e. a second encoding/decoding step, for example in mobile-to-mobile connections or in special gateways from the mobile core to PSTN networks. For these reasons, the MOS scores obtained in a plain codec emulation as given in Table 3 are usually only reached in real field cases where the device and the network can be considered transparent and do not apply further speech signal processing such as noise or gain control.

Results in GSM / UMTS networks compared to P.862 ‘PESQ’

The following arbitrarily picked sample shows the correspondence. It is a perfect transmission from a perfectly matched PSTN line at the land side to a GSM network using AMR at 12.2 kbps. The audio level is perfectly adjusted, there are no perceptible audio bandwidth limitations and there is no speech missed (no temporal clipping, no interruptions). Most notably, the devices can be considered as quite transparent, as they don’t apply aggressive noise reduction or gain control mechanisms. The results can be considered as identical between the two measures.

Figure 19: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI

A good example of the difference between the two algorithms is the treatment of interruptions and lost speech. Here P.862 ‘PESQ’ is suspected of scoring inaccurately and usually too optimistically. In the example, almost 4% of the original speech was lost; nevertheless, P.862 ‘PESQ’ scores 3.2, while P.863 ‘POLQA’ predicts only 2.7, which appears closer to the perceived quality here.

Figure 20: P.863 ‘POLQA’ and P.862.1 ‘PESQ’ presentation in NQDI with signal interruptions

When a larger number of quality scores obtained in a drive test is analyzed, the picture remains almost the same. The following figures are based on a drive test and a collection of data from a European operator. The speech sample used was American English and each given number is based on a collection of around 100 individual scores.

8 P.862 ‘PESQ’ defines the algorithm technically. The actual transformation from the P.862 outcome to a MOS-like scale is defined in P.862.1. All predicted PESQ scores in this document were computed in accordance with P.862 and converted to the MOS domain according to P.862.1.

Table 4: Comparison of P.862.1 ‘PESQ’ scores to P.863 ‘POLQA’ in high-quality UMTS/GSM setups

                                  Average                      Maximum
Direction  Network    Device      P.862.1 PESQ  P.863 POLQA    P.862.1 PESQ  P.863 POLQA
Downlink   UMTS 2100  Device A    3.97          3.97           4.19          4.19
Downlink   UMTS 2100  Device B    4.04          4.06           4.17          4.17
Downlink   GSM 900    Device A    3.78          3.77           4.19          4.18
Downlink   GSM 900    Device B    3.87          3.87           4.17          4.20
Uplink     UMTS 2100  Device A    3.92          3.80           4.13          4.02
Uplink     UMTS 2100  Device B    4.01          3.83           4.12          4.04
Uplink     GSM 900    Device A    3.74          3.60           4.11          4.01
Uplink     GSM 900    Device B    3.78          3.59           4.10          3.99

Looking only at ‘Downlink’, which is usually the less critical direction, the PESQ and POLQA averages differ by just 0.02 on average, which is completely negligible. There are small differences in the averages between the phones and between the two technologies, GSM and UMTS. But the behavior is always the same for either method; for example, GSM 900 is scored lower by 0.2 MOS on average with both methods.

In Uplink the situation is slightly different. Here P.863 ‘POLQA’ scores slightly lower than PESQ, on average by 0.15 MOS. This effect has several causes, the main one being the more restricted audio bandwidth of the mobile device’s microphone path, which is used in Uplink. By contrast, the Downlink uses the (wider) loudspeaker path of the phone. P.862 ‘PESQ’ compensates for the frequency response of the channel and therefore largely ‘ignores’ that band limitation. P.863 ‘POLQA’ considers changes in bandwidth as they are perceived by a user, and consequently a limitation leads to a slightly lower score here.
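These statements can be retraced from the ‘Average’ columns of Table 4. The sketch below assumes that the eight device rows are grouped as Downlink UMTS 2100, Downlink GSM 900, Uplink UMTS 2100 and Uplink GSM 900; this grouping is inferred from the statements above and is not labeled row by row in the table. The helper names are illustrative.

```python
# (direction, technology, device): (PESQ average, POLQA average)
# Row grouping is an assumption inferred from the text.
TABLE_4_AVERAGES = {
    ("Downlink", "UMTS 2100", "A"): (3.97, 3.97),
    ("Downlink", "UMTS 2100", "B"): (4.04, 4.06),
    ("Downlink", "GSM 900", "A"): (3.78, 3.77),
    ("Downlink", "GSM 900", "B"): (3.87, 3.87),
    ("Uplink", "UMTS 2100", "A"): (3.92, 3.80),
    ("Uplink", "UMTS 2100", "B"): (4.01, 3.83),
    ("Uplink", "GSM 900", "A"): (3.74, 3.60),
    ("Uplink", "GSM 900", "B"): (3.78, 3.59),
}


def mean(values):
    return sum(values) / len(values)


def pesq_minus_polqa(direction):
    """PESQ average minus POLQA average for every row of one direction."""
    return [pesq - polqa
            for (d, _, _), (pesq, polqa) in TABLE_4_AVERAGES.items()
            if d == direction]


def technology_mean(direction, technology, method_index):
    """Mean over both devices; method_index 0 is PESQ, 1 is POLQA."""
    return mean([scores[method_index]
                 for (d, t, _), scores in TABLE_4_AVERAGES.items()
                 if d == direction and t == technology])


# No downlink row differs by more than roughly 0.02 MOS.
largest_downlink_gap = max(abs(gap) for gap in pesq_minus_polqa("Downlink"))

# In uplink, POLQA is lower than PESQ by roughly 0.16 MOS on average.
mean_uplink_gap = mean(pesq_minus_polqa("Uplink"))

# Downlink: GSM 900 is lower than UMTS 2100 by roughly 0.2 MOS
# with either method.
gsm_deficit_pesq = (technology_mean("Downlink", "UMTS 2100", 0)
                    - technology_mean("Downlink", "GSM 900", 0))
gsm_deficit_polqa = (technology_mean("Downlink", "UMTS 2100", 1)
                     - technology_mean("Downlink", "GSM 900", 1))

print(round(largest_downlink_gap, 3), round(mean_uplink_gap, 4),
      round(gsm_deficit_pesq, 3), round(gsm_deficit_polqa, 3))
```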

Besides the average values, the distribution of the predicted values provides information about the behavior of the measures. The following two graphs are based on the downlink scores of Device ‘A’ in UMTS 2100 as above.

[Chart: Listening Quality distribution (P.862.1 'PESQ'), PDF. Horizontal axis: Listening Quality in bins of 0.1 from 1-1.1 to 4.8-4.9. Vertical axis: PDF, Number of Values, 0% to 50%.]

Figure 21: Distribution of predicted MOS scores by P.862.1 ‘PESQ’ (Device A, UMTS, Downlink as in Table 4)






[Chart: Listening Quality distribution (P.863-NB 'POLQA'), PDF. Horizontal axis: Listening Quality in bins of 0.1 from 1-1.1 to 4.8-4.9. Vertical axis: PDF, Number of Values, 0% to 50%.]

Figure 22: Distribution of predicted MOS scores by P.863 ‘POLQA’ (Device A, UMTS, Downlink as in Table 4)

Both distribution functions are very close and concentrate a large majority of the scores in the range of 4.0 to 4.2, which corresponds to the best quality in error-free connections. It is logical that a certain quality cannot be exceeded: the limit is set by the coding scheme, the channel limits and the other voice processing involved, which insert a certain amount of degradation even in undistorted conditions. This defines the upper level that cannot be exceeded in this setup and causes the steep decline towards higher values on the right-hand side. Usually, the majority of scores lie in this region, which corresponds to error-free transmission.
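A distribution of this kind can be derived from a list of individual scores by counting them in bins of 0.1 MOS and expressing each count as a share of all values. The sketch below uses a short list of made-up scores purely to show the binning; they are not measurement data from the drive test, and the function names are illustrative.

```python
from collections import Counter


def bin_label(score):
    """Label of the 0.1-wide bin containing score, e.g. 4.19 -> '4.1-4.2'."""
    # Work in integer hundredths to avoid floating-point edge effects.
    tenths = round(score * 100) // 10
    return f"{tenths / 10:.1f}-{(tenths + 1) / 10:.1f}"


def mos_distribution(scores):
    """Return {bin label: share of all scores in percent}."""
    counts = Counter(bin_label(score) for score in scores)
    total = len(scores)
    return {label: 100.0 * count / total
            for label, count in sorted(counts.items())}


# Hypothetical scores, for illustration only.
example_scores = [4.19, 4.10, 4.05, 4.12, 3.97, 3.40, 4.18, 4.02, 2.75, 4.15]

for label, share in mos_distribution(example_scores).items():
    print(f"{label}: {share:.0f}%")
```

With these example values, half of the scores fall into the bin 4.1-4.2, and the few lower scores form the shallow tail towards the left.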

Towards lower values, the distribution falls off more gradually. Values in this region indicate degradations in addition to the unavoidable distortions. In cellular networks these problems are usually interruptions (due to handovers), fallbacks to lower bitrates in the case of AMR (due to bad radio conditions) and frame losses that were concealed artif
