
Principles of Speech Coding

Tokunbo Ogunfunmi

Madihally Narasimha

CRC Press
Taylor & Francis Group
Boca Raton   London   New York

CRC Press is an imprint of the Taylor & Francis Group, an Informa business


MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2010 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20110715

International Standard Book Number-13: 978-1-4398-8254-2 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com

and the CRC Press Web site at http://www.crcpress.com


To our families:

Teleola, Tofunmi, and Tomisin

Rama, Ajay, and Anil


Contents

Foreword
Preface
Acknowledgments
Authors

1. Introduction to Speech Coding
   1.1 Introduction
   1.2 Speech Signals
   1.3 Characteristics of Speech Signals
   1.4 Modeling of Speech
   1.5 Speech Analysis
   1.6 Speech Coding
      1.6.1 A Very Brief History of Speech Coding
      1.6.2 Major Classification of Speech Coders
      1.6.3 Speech Coding Standards
   1.7 Varieties of Speech Coders
      1.7.1 Varieties of Waveform Speech Coders
      1.7.2 Varieties of Parametric (Analysis-by-Synthesis) Speech Coders
   1.8 Measuring Speech Quality
      1.8.1 Mean Opinion Score
      1.8.2 Perceptual Evaluation of Speech Quality
      1.8.3 Enhanced Modified Bark Spectral Distance
      1.8.4 Diagnostic Rhyme Test
      1.8.5 Diagnostic Acceptability Measure
      1.8.6 E-Model
   1.9 Communication Networks and Speech Coding
   1.10 Performance Issues in Speech Communication Systems
      1.10.1 Speech Quality
      1.10.2 Communication Delay
      1.10.3 Computational Complexity
      1.10.4 Power Consumption
      1.10.5 Robustness to Noise
      1.10.6 Robustness to Packet Losses (for Packet-Switched Networks)
   1.11 Summary of Speech Coding Standards
   1.12 Summary
   Exercise Problems
   References
   Bibliography

2. Fundamentals of DSP for Speech Processing
   2.1 Introduction to LTI Systems
      2.1.1 Linearity
      2.1.2 Time Invariance
      2.1.3 Representation Using Impulse Response
      2.1.4 Representation of Any Continuous-Time (CT) Signal
      2.1.5 Convolution
      2.1.6 Differential Equation Models
   2.2 Review of Digital Signal Processing
      2.2.1 Sampling
      2.2.2 Shifted Unit Pulse: δ(n − k)
      2.2.3 Representation of Any DT Signal
      2.2.4 Introduction to Z Transforms
      2.2.5 Fourier Transform, Discrete Fourier Transform
      2.2.6 Digital Filter Structures
   2.3 Review of Stochastic Signal Processing
      2.3.1 Power Spectral Density
   2.4 Response of a Linear System to a Stochastic Process Input
   2.5 Windowing
   2.6 AR Models for Speech Signals, Yule–Walker Equations
   2.7 Short-Term Frequency (or Fourier) Transform and Cepstrum
      2.7.1 Short-Term Frequency Transform (STFT)
      2.7.2 The Cepstrum
   2.8 Periodograms
   2.9 Spectral Envelope Determination for Speech Signals
   2.10 Voiced/Unvoiced Classification of Speech Signals
      2.10.1 Time-Domain Methods
         2.10.1.1 Periodic Similarity
         2.10.1.2 Frame Energy
         2.10.1.3 Pre-Emphasized Energy Ratio
         2.10.1.4 Low- to Full-Band Energy Ratio
         2.10.1.5 Zero Crossing
         2.10.1.6 Prediction Gain
         2.10.1.7 Peakiness of Speech
         2.10.1.8 Spectrum Tilt
      2.10.2 Frequency-Domain Methods
      2.10.3 Voiced/Unvoiced Decision Making
   2.11 Pitch Period Estimation Methods
   2.12 Summary
   Exercise Problems
   References
   Bibliography

3. Sampling Theory
   3.1 Introduction
   3.2 Nyquist Sampling Theorem
   3.3 Reconstruction of the Original Signal: Interpolation Filters
   3.4 Practical Reconstruction
   3.5 Aliasing and In-Band Distortion
   3.6 Effect of Sampling Clock Jitter
   3.7 Sampling and Reconstruction of Random Signals
   3.8 Summary
   Exercise Problems
   Reference
   Bibliography

4. Waveform Coding and Quantization
   4.1 Introduction
   4.2 Quantization
   4.3 Quantizer Performance Evaluation
   4.4 Quantizer Transfer Function
   4.5 Quantizer Performance under No-Overload Conditions
   4.6 Uniform Quantizer
   4.7 Nonuniform Quantizer
      4.7.1 Nonuniform Quantizer Implementation Methods
      4.7.2 Nonuniform Quantizer Performance
   4.8 Logarithmic Companding
      4.8.1 Approximations to Logarithmic Companding
         4.8.1.1 μ-Law (Continuous Version)
         4.8.1.2 A-Law (Continuous Version)
      4.8.2 Companding Advantage
   4.9 Segmented Companding Laws
      4.9.1 Segmented Approximation to the Continuous μ-Law and A-Law Curves
   4.10 ITU G.711 μ-Law and A-Law PCM Standards
      4.10.1 Conversion between Linear and Companded Codes
         4.10.1.1 Linear to μ-Law Conversion
         4.10.1.2 μ-Law to Linear Code Conversion
         4.10.1.3 Linear to A-Law Conversion
         4.10.1.4 A-Law to Linear Conversion
   4.11 Optimum Quantization
      4.11.1 Closed Form Solution for the Optimum Companding Characteristics
      4.11.2 Lloyd–Max Quantizer
   4.12 Adaptive Quantization
   4.13 Summary
   Exercise Problems
   References
   Bibliography

5. Differential Coding
   5.1 Introduction
   5.2 Closed-Loop Differential Quantizer
   5.3 Generalization to Predictive Coding
      5.3.1 Optimum Closed-Loop Predictor
      5.3.2 Adaptive Prediction
   5.4 ITU G.726 ADPCM Algorithm
      5.4.1 Adaptive Quantizer
         5.4.1.1 Quantizer Scale Factor Adaption
         5.4.1.2 Quantizer Adaption Speed Control
      5.4.2 Predictor Structures and Adaption
   5.5 Linear Deltamodulation
      5.5.1 Optimum 1-Bit Quantizer
      5.5.2 Optimum Step Size and SNR
         5.5.2.1 Special Cases
         5.5.2.2 SNR for Sinusoidal Inputs with Perfect Integration
   5.6 Adaptive Deltamodulation
   5.7 Summary
   Exercise Problems
   Reference
   Bibliography

6. Linear Prediction
   6.1 Introduction
      6.1.1 Linear Prediction Theory and Wiener Filters
   6.2 Properties of the Autocorrelation Matrix, R
   6.3 Forward Linear Prediction
   6.4 Relation between Linear Prediction and AR Modeling
   6.5 Augmented Wiener–Hopf Equations for Forward Prediction
   6.6 Backward Linear Prediction
   6.7 Backward Prediction-Error Filter
   6.8 Augmented Wiener–Hopf Equations for Backward Prediction
   6.9 Relation between Backward and Forward Predictors
   6.10 Levinson–Durbin Recursion
      6.10.1 L-D Algorithm
      6.10.2 Forward Linear Prediction
      6.10.3 Backward Linear Prediction
      6.10.4 Inverse L-D Algorithm
      6.10.5 Summary of L-D Recursion
   6.11 Summary
   Exercise Problems
   References
   Bibliography

7. Linear Predictive Coding
   7.1 Introduction
   7.2 Linear Predictive Coding
      7.2.1 Excitation Source Models
   7.3 LPC-10 Federal Standard
      7.3.1 Encoder
      7.3.2 LPC Decoder
      7.3.3 FS-1015 Speech Coder
   7.4 Introduction to CELP-Based Coders
      7.4.1 Perceptual Error Weighting
      7.4.2 Pitch Estimation
      7.4.3 Closed-Loop Pitch Search (Adaptive Codebook Search)
   7.5 Summary
   Exercise Problems
   References
   Bibliography

8. Vector Quantization for Speech Coding Applications
   8.1 Introduction
   8.2 Review of Scalar Quantization
   8.3 Vector Quantization
      8.3.1 The Overall Distortion Measure
      8.3.2 Distortion Measures
      8.3.3 Codebook Design
   8.4 Lloyd's Algorithm for Vector Quantizer Design
      8.4.1 Splitting Method
   8.5 The Linde–Buzo–Gray Algorithm
   8.6 Popular Search Algorithms for VQ Quantizer Design
      8.6.1 Full Search VQ
      8.6.2 Binary Search VQ
   8.7 Other Suboptimal Algorithms for VQ Quantizer Design
      8.7.1 Multistage VQ
      8.7.2 Split VQ
      8.7.3 Conjugate VQ
      8.7.4 Predictive VQ
      8.7.5 Adaptive VQ
   8.8 Applications in Standards
   8.9 Summary
   Exercise Problems
   References
   Bibliography

9. Analysis-by-Synthesis Coding of Speech
   9.1 Introduction
   9.2 CELP AbS Structure
   9.3 Case Study Example: FS 1016 CELP Coder
   9.4 Case Study Example: ITU-T G.729/729A Speech Coder
      9.4.1 The ITU G.729/G.729A Speech Encoder
         9.4.1.1 The ITU G.729 Encoder Details
         9.4.1.2 Quantization of the Gains
      9.4.2 The ITU G.729/G.729A Speech Decoder
         9.4.2.1 The ITU G.729 Decoder Details
         9.4.2.2 Long-Term Postfilter
         9.4.2.3 Short-Term Postfilter
         9.4.2.4 High-Pass Filtering and Upscaling
         9.4.2.5 Tilt Compensation
         9.4.2.6 Adaptive Gain Control
   9.5 Summary
   Exercise Problems
   References
   Bibliography

10. Internet Low-Bit-Rate Coder
   10.1 Introduction
   10.2 Internet Low-Bit-Rate Codec
      10.2.1 Structure
      10.2.2 Advantages
      10.2.3 Algorithm
      10.2.4 CELP Coders versus iLBC
   10.3 iLBC's Encoding Process
   10.4 iLBC's Decoding Process
   10.5 iLBC's PLC Techniques
   10.6 iLBC's Enhancement Techniques
      10.6.1 Outline of Enhancer
   10.7 iLBC's Synthesis and Postfiltering
   10.8 MATLAB® Signal Processing Blockset iLBC Demo Model
   10.9 PESQ
   10.10 Evolution from PSQM/PSQM+ to PESQ
      10.10.1 PSQM+
   10.11 PESQ Algorithm
   10.12 PESQ Applications
   10.13 Summary
   Exercise Problems
   References
   Bibliography

11. Signal Processing in VoIP Systems
   11.1 Introduction
   11.2 PSTN and VoIP Networks
   11.3 Effect of Delay on the Perceived Speech Quality
   11.4 Line Echo Canceler
      11.4.1 Adaptive Filter
      11.4.2 Double-Talk Detector
      11.4.3 Nonlinear Processor
      11.4.4 Comfort Noise Generator
   11.5 Acoustic Echo Canceler
   11.6 Jitter Buffers
   11.7 Clock Skew
   11.8 Packet Loss Recovery Methods
      11.8.1 Transmitter-Based FEC Techniques
      11.8.2 Receiver-Based PLC Algorithms
   11.9 Summary
   Bibliography

12. Real-Time DSP Implementation of ITU-T G.729/A Speech Coder
   12.1 Introduction
   12.2 ITU-T G.729/A Speech Coding Standard
   12.3 TI TMS320C6X DSP Processors
   12.4 TI's RF and DSP Algorithm Standard
   12.5 G.729/A on RF3 on the TI C6X DSP
      12.5.1 IALG Interface
      12.5.2 ALGRF
   12.6 Running the RF3 Example on EVM
   12.7 RF3 Resource Requirements
      12.7.1 RF3 Memory Requirements
      12.7.2 RF3 Clock Cycle Requirements
   12.8 Details of Our Implementation
      12.8.1 Adapting, Building, and Running the G.729/A Code
      12.8.2 Defining the Data Type Sizes for the Vocoder
      12.8.3 Early Development Using Microsoft Visual Studio
      12.8.4 Microsoft Visual Studio Encoder Project
      12.8.5 Microsoft Visual Studio Decoder Project
         12.8.5.1 Comparing Test Vectors
         12.8.5.2 Measuring Performance Timing on Microsoft Visual Studio
         12.8.5.3 Automating the Test Vector Comparisons on Windows
   12.9 Migrating ITU-T G.729/A to RF3 and the EVM
      12.9.1 Creating a New Application
         12.9.1.1 Adapting the Vocoder Library Files for the EVM
         12.9.1.2 G.729/A Application for RF3
         12.9.1.3 algG729A and algInvG729A (Function Wrappers)
         12.9.1.4 appModules
         12.9.1.5 C67xEMV_RF3 (Application Project)
         12.9.1.6 Building the G.729/A Vocoder Application
   12.10 Optimizing G.729/A for Real-Time Execution on the EVM
      12.10.1 Project Settings
      12.10.2 DSP/BIOS Settings for Optimization
      12.10.3 Code Changes for Optimization
   12.11 Real-Time Performance for Two Channels
      12.11.1 Memory Requirements for G.729/A
      12.11.2 Clock Cycle Requirements for G.729/A
      12.11.3 Resource Requirements Summary
   12.12 Checking the Test Vectors on the EVM
   12.13 Going beyond a Two-Channel Implementation
      12.13.1 Adding Channels
      12.13.2 DSP/BIOS Changes for Adding Channels
         12.13.2.1 Source Code Changes for Adding Channels
      12.13.3 Running Seven Channels of the Vocoder on the EVM
      12.13.4 Getting Eight Channels on the G.729/A Application
      12.13.5 Going beyond Eight Channels in the G.729/A Application
         12.13.5.1 Profiling the Vocoder from the Top Level
         12.13.5.2 Profiling the Encoder
         12.13.5.3 Profiling ACELP_Code_A
   12.14 Conclusions
   References
   Bibliography

13. Conclusions and Future Directions for Speech Coding
   13.1 Summary
   13.2 Future Directions for Speech Research
   References
   Bibliography

Index


Foreword

The application of speech coding has a tremendous impact on our society. Speech coding is used every day. All mobile phones use speech coding algorithms to encode the analog signal from the microphone into a digital form, so that it can be transmitted through the cellular network, and then decode the digital signal back into an analog signal that can be heard through the mobile phone's speaker. More efficient codecs (a portmanteau of coder and decoder) mean that wireless carriers can handle more conversations in the same spectrum band, which has contributed to lowering the cost of phone calls over the last decade.

Voice-over-Internet Protocol, or VoIP for short, allows free voice communication for users with a broadband Internet connection. The low cost of international calls through VoIP has a significant societal impact, as people with friends and families in another country are now closer than before. Not only is there a cost advantage, but VoIP also allows higher quality speech than what is available even with landline phones because it can transmit frequencies above 3400 Hz and below 300 Hz that add "presence" to the phone call. In addition, transmitting voice through the Internet means you can leverage other advantages of the Internet, such as sharing a document, initiating a phone call by clicking on the phone number in a Web page, or transmitting video so that grandma can see her grandchildren thousands of miles away.

I recall the first time I encountered an answering machine and didn't leave a message; however, we cannot function without them now. Voicemail, the next generation of the answering machine, is prevalent nowadays and it would not be what it is today without speech codecs. Web sites have not only text and images, but also podcasts and other multimedia material, which also use speech codecs.

Although speech coding technology is everywhere, the underlying technology is fairly complex. This textbook by Drs. Ogunfunmi and Narasimha makes speech coding very accessible. This book does not shy away from equations, but they are there for a reason and only as needed. While the book covers the fundamentals, it also describes many speech coding standards in good detail, including source code for a popular codec. The authors have achieved a great balance between academic rigor and practical details, made possible by their in-depth experience at Stanford University and companies such as Qualcomm. The exercises and codes included will not only assist in learning the basic principles of speech coding but also will enable readers to understand the implementation nuances. Therefore, this book is accessible to both undergraduate students and practitioners.


In addition, the book not only encompasses the latest standards, such as the Internet low-bit-rate coder, but also describes in detail important practical techniques, such as mechanisms to handle packet losses, jitter, and clock drift, required for high-quality end-to-end voice communication. It is an invaluable resource for engineers who are involved in voice communication products today, and for students who will study voice communication technology in the future.

Alex Acero
Research Area Manager

Microsoft Research


Preface

The purpose of this book is to introduce readers to the field of speech coding. Speech is undoubtedly the most common form of human communication. It also plays a significant role in human–machine interactions. Efficient coding of speech waveforms is essential in myriad transmission and storage applications such as traditional telephony, wireless communications (e.g., mobile phones), Internet telephony, voice-over-Internet Protocol (VoIP), and voice mail. Many of these applications are currently going through an impressive growth phase.

Detailed books on the subject are somewhat rare. We present a detailed yet simple-to-understand exposition of the underlying signal processing techniques used in the area of speech coding. We discuss many of the current speech coding algorithms standardized by the International Telecommunication Union (ITU) and other organizations, and demystify them so that students can clearly understand the basic principles used in their formulation. In order to illustrate the complexity involved in the practical implementation of these speech coding algorithms, we delineate the realization of a popular standardized speech coder on a DSP processor.

It is becoming increasingly apparent that all forms of communication, including voice, will be transmitted through packet-switched networks based on the IP. Since the packetization of speech and its transmission through such networks introduces numerous impairments that degrade speech quality, we discuss in this book key signal processing algorithms that are necessary to combat these impairments. We also cover recent research results in the area of advanced speech coding algorithms from the authors and other researchers in this perpetually evolving field.

We present simple, concise, and easy-to-understand explanations of the principles of speech coding. We focus specifically on the principles by which all the modern speech coding methods that are detailed in the standards can be understood and applied. An in-depth comprehension of these principles is necessary for designing various modern devices that use speech interfaces such as cell phones, personal digital assistants (PDAs), telephones, video phones, speech recognition systems, and so on.

This book is not an encyclopedic reference on the subject. We have focused primarily on what we consider to be the basic principles underlying the vast subject of speech coding. There are other more complete references on the subject of speech in general. Furthermore, this book does not discuss audio coding standards.


The book is intended for senior-level undergraduate and graduate-level students in electrical and computer engineering. It is also suitable for engineers and researchers designing or utilizing speech coding systems for their work and for other technologists who wish to study the subject themselves.

This book grew out of our combined teaching of speech coding and related signal processing classes at Stanford University, Stanford, California, and at Santa Clara University, Santa Clara, California. The manuscript for the book has been used for three 10-week courses on speech coding and VoIP conducted by the authors at Santa Clara University and Stanford University over the last few years. It can also be used as a reference text in courses on multimedia signal processing or applications of digital signal processing, which focus on the processing of speech, image, and video signals.

The book is written so that a senior-level undergraduate or a first-year graduate student can read and understand it. Prerequisites include a knowledge of calculus and of some digital signal processing.

The book is organized as follows: Chapter 1 is a general introduction to the subject of speech processing. Chapter 2 is a basic review of some of the digital signal processing concepts that are used frequently in speech processing. Chapter 3 focuses on sampling theory and a few related topics as they apply to the subject of speech coding. Waveform coding and quantization are discussed extensively in Chapter 4. The main goal of this chapter is to explain the theoretical basis for the μ-law and A-law logarithmic quantizers that have been standardized for speech coding by the ITU. Chapter 5 presents the principles of differential coding and delineates the ITU G.726 adaptive differential pulse code modulation (ADPCM) standard. Deltamodulation, which is a particular differential coding system that uses just a 1-bit quantizer, is also discussed in this chapter. Chapter 6 addresses the subject of linear prediction (LP). In Chapter 7, LP is applied to speech coding using the linear predictive coding (LPC) model. Chapter 8 presents vector quantization, which forms the basis for many of the advanced and widely used speech coding methods, such as the analysis-by-synthesis systems described in Chapter 9. Chapter 10 presents the Internet low-bit-rate coder (iLBC), which is a popular speech coding standard in Internet speech applications. Chapter 11 addresses the issue of impairments to speech quality in VoIP networks and discusses signal processing algorithms to mitigate their effects. Chapter 12 presents a real-time implementation of a speech coder (ITU G.729A) on a digital signal processing chip. Finally, Chapter 13 concludes with a summary and some of our observations and predictions about the future of speech processing.

We hope the material presented here will help educate newcomers to the field (e.g., senior undergraduates and graduate students) and also help elucidate to practicing engineers and researchers the important principles of speech coding.


Any questions or comments about the book can be sent to the authors at the book's Web site http://www.principlesofspeechcoding.com or to either of the authors' email addresses: [email protected] or [email protected]

MATLAB® and Simulink® are registered trademarks of The MathWorks, Inc. For product information, please contact:

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098, USA
Tel: 508 647 7000
Fax: 508-647-7001
E-mail: [email protected]
Web: www.mathworks.com

Tokunbo Ogunfunmi
Madihally (Sim) Narasimha

Santa Clara, California


Acknowledgments

We thank the publishers, CRC Press (Taylor & Francis), for working with us through the challenges of our time constraints in writing a book such as this. We especially would like to thank Nora Knopka, Jill Jurgensen, and Ashley Gasque for their support and patience.

Dr. Ogunfunmi thanks Santa Clara University (SCU) for support of this project. He also thanks James Foote, a former SCU MSEE graduate student, for his work on the DSP implementation discussed in Chapter 12, and Juan Marsmela, another former SCU MSEE graduate student, for his help on the iLBC and PESQ material discussed in Chapter 10.

Dr. Narasimha gratefully acknowledges the class notes provided by Professor David Messerschmitt of the University of California at Berkeley. Chapter 4 is an extension of the original ideas presented in the notes.


Authors

Tokunbo Ogunfunmi is a professor at the Department of Electrical Engineering and director of the Signal Processing Research Laboratory (SPRL) at Santa Clara University, Santa Clara, California. His research interests include digital adaptive/nonlinear signal processing, speech and video signal processing, artificial neural networks, and VLSI design. He has published two books and over 100 refereed journal and conference papers in these and related application areas.

Dr. Ogunfunmi has been a consultant to industry and government, and a visiting professor at Stanford University and the University of Texas. He is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), a member of Sigma Xi (the Scientific Research Society), and a member of the American Association for the Advancement of Science (AAAS). He serves as the chair of the IEEE Signal Processing Society (SPS) Santa Clara Valley Chapter and as a member of several IEEE Technical Committees (TCs). He is also a registered professional engineer.

Madihally Narasimha is currently a senior director of technology at Qualcomm Inc. Prior to joining Qualcomm, he was vice president of technology at Ample Communications, where he directed the development of Ethernet physical layer chips. Before that, he served in technology leadership roles at several Voice-over-IP (VoIP) startup companies including IP Unity, Realchip Communications, and Empowertel Networks. He also held senior management positions at Symmetricom and Granger Associates (a subsidiary of DSC Communications Corporation), where he was instrumental in bringing many DSP-based telecommunications products to the market.

Dr. Narasimha is also a consulting professor in the Department of Electrical Engineering at Stanford University, Stanford, California, where he teaches telecommunications courses and carries out research in related areas. He is a fellow of the Institute of Electrical and Electronics Engineers (IEEE).


1. Introduction to Speech Coding

1.1 Introduction

Communication by speech is by far the most popular and one of the most effective means of transmitting information from one person to another. Speech signals form the basic method of human communication. The information communicated in this case is verbal or auditory information. The field of speech processing is very extensive and continuously evolving.

Speech analysis is the means by which speech is analyzed and the physical characteristics that define the speech can be extracted from the original speech.

Speech coding is the means by which the information-bearing speech signal is coded to remove redundancy. This helps to reduce transmission bandwidth requirements, improves storage efficiency, and makes possible myriad other applications that rely on speech coding techniques.

Speech synthesis is the means by which speech is generated by re-creating it from a set of model parameters obtained from speech analysis techniques.

All these procedures typically assume a particular model of speech production.

1.2 Speech Signals

For us to understand speech signals, we need to understand the mechanisms behind the generation of speech sounds (signals). It is possible to use several different models for speech signal communication: language models, cognitive models, aural models, or acoustic models. The most common model used for speech signal generation is the acoustic model [1,2]. The acoustic model gives us the vocal tract model of speech production. It is developed by study of the anatomy of speech production. In Figure 1.1, we see that the components of this model include the lungs, vocal tract, nasal cavity, lips, tongue, glottis, and soft palate.

FIGURE 1.1 Human vocal system. (From Rabiner and Schafer, Digital Processing of Speech Signals, pp. 53–60, Prentice Hall, Englewood Cliffs, NJ, 1978. With permission.) [Labeled components: nasal cavity, hard palate, soft palate (velum), tongue, lips, glottis, vocal tract.]

The method of speech signal generation with this model is described next. Air from the lungs serves as the excitation and is forced through a constriction through the glottis into the vocal tract, through the lips to the outside world. The vocal tract fluctuates to give the different speech sounds. Although it is an approximate model with many assumptions, it serves our purposes very well. Acoustic signals (which are interpreted as speech) are produced from the human vocal system. It is essential to differentiate here between speech (which contains audible information) and mere acoustic signals. The vocal tract is key to the production of speech that contains audible information.

Speech signals can be divided into

i. Voiced sounds
ii. Unvoiced sounds and fricatives

iii. Plosive sounds

Vowels and nasal sounds (such as m and n) are examples of voiced sounds, whereas fricative consonants (such as f, s, and the "sh" in "fish") are examples of unvoiced sounds. Plosive sounds are consonants such as p, t, and k, produced by briefly closing the vocal tract and then releasing the air. In many cases, the third class (plosive sounds) can frequently be classified as unvoiced. This leads to two main classes: (i) voiced and (ii) unvoiced. See Figures 1.2 and 1.3 for examples of speech signals. The voiced/unvoiced transition point is sometimes not very clear.

FIGURE 1.2 Male speech signals [from 5 s (40,000 samples) of the speech "The boy is in sixth grade"].

Notice that voiced speech sounds are periodic (or quasiperiodic) and the period is related to the pitch. For unvoiced speech sounds, there is no pitch and the signal looks like random white noise.
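One simple way to see this difference numerically is to compute the normalized autocorrelation of a short frame: a high peak in the expected pitch-lag range suggests a voiced frame, and its lag gives a rough pitch estimate. The following MATLAB sketch is illustrative only (it is not code from the book); the speech vector x, the 8 kHz sampling rate, the 30-ms frame, and the 0.3 threshold are all assumptions.

% Rough voiced/unvoiced decision and pitch estimate for one frame (illustrative sketch).
% Assumes x is a speech vector sampled at fs = 8000 Hz.
fs = 8000;
frame = x(1:240);                        % one 30-ms frame
frame = frame(:) - mean(frame);          % column vector with DC removed
maxlag = round(fs/80);                   % longest pitch period of interest (80 Hz)
r = zeros(maxlag+1, 1);
for k = 0:maxlag                         % short-time autocorrelation r(0)..r(maxlag)
    r(k+1) = sum(frame(1:end-k) .* frame(k+1:end));
end
r = r / r(1);                            % normalize so the zero-lag value is 1
lagRange = round(fs/400):maxlag;         % search lags for pitch between 400 and 80 Hz
[rmax, idx] = max(r(lagRange + 1));
if rmax > 0.3                            % strong periodicity suggests a voiced frame
    fprintf('Voiced frame, pitch approx. %.1f Hz\n', fs/lagRange(idx));
else
    fprintf('Unvoiced (noise-like) frame\n');
end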

1.3 Characteristics of Speech Signals

Voiced speech usually involves the opening and closing of the vocal cords, breaking the airstream into chains of pulses. Pitch is the repetition rate of these pulses and defines the fundamental frequency of the speech signal. The resonant frequencies of the speech signal are formed in the vocal tract and are known as formants. Formants are identified by number in order of increasing value with the sequence f1, f2, . . ., fn. Typically, pitch ranges between 80 and 160 Hz for male speakers, and between 160 and 400 Hz for female speakers. Formant frequencies are typically greater than the pitch frequency and can lie in the kilohertz range.

FIGURE 1.3 Female speech signals [from 5 s (40,000 samples) of the speech "The boy is in sixth grade"].

Estimation of pitch and formants finds extensive use in speech coding, synthesis, and recognition. Some well-known pitch detection methods employ the cepstrum [2], simplified inverse filter tracking (SIFT) [3], and other methods.

For estimating formant frequencies, the envelope of the log-magnitude spectrum plot is often used. However, more precise detection methods are needed to give satisfactory results across a wide range of speakers, applications, and operating environments. We discuss these and other methods in Chapter 2.

In Figure 1.4, we plot the log-magnitude spectra of the speech utterance in Figure 1.2. We can determine the pitch from the peaks of the spectra, which are repeated at multiples of the pitch (fundamental) frequency.

In Figure 1.5, we also plot the log-magnitude spectra of the speech utterance in Figure 1.3. Again, we can determine the pitch from the peaks of the spectra, which are repeated at multiples of the pitch frequency.
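Plots like Figures 1.4 and 1.5 can be reproduced by windowing a short voiced segment and taking its FFT for several window lengths, as in the MATLAB sketch below (illustrative only, not the book's code; the speech vector x, the segment start n0, and the window lengths are assumptions). Long windows resolve the pitch harmonics, while short windows mainly show the spectral envelope.

% Short-time log-magnitude spectra with different window lengths (illustrative sketch).
% Assumes x is a speech vector sampled at fs = 8000 Hz and n0 indexes a voiced region.
fs = 8000;
n0 = 8000;                                        % assumed start of a voiced segment
nfft = 1024;
figure; hold on;
for w = [51 101 201 401]                          % window lengths in samples
    win = 0.54 - 0.46*cos(2*pi*(0:w-1)'/(w-1));   % Hamming window as a column vector
    seg = x(n0:n0+w-1);
    S = 20*log10(abs(fft(seg(:).*win, nfft)) + eps);
    plot((0:nfft/2-1)*fs/nfft, S(1:nfft/2));
end
hold off;
xlabel('Frequency (Hz)'); ylabel('Log magnitude (dB)');
legend('w = 51', 'w = 101', 'w = 201', 'w = 401');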

As mentioned above, for unvoiced speech sounds, there is no pitch and the signal looks like random white noise. Pitch ranges between 80 and 160 Hz for male speakers, and between 160 and 400 Hz for female speakers. We demonstrate this as follows.

FIGURE 1.4 Log-magnitude spectra of the male speech in Figure 1.2 (pitch can be determined). [Short-time spectra with different window lengths, w = 51, 101, 201, and 401; log magnitude (dB) versus frequency (Hz).]

FIGURE 1.5 Log-magnitude spectra of the female speech in Figure 1.3 (pitch can be determined). [Short-time spectra with different window lengths, w = 51, 101, 201, and 401; log magnitude (dB) versus frequency (Hz).]

A male-uttered speech signal ("the fish swam in the water") with clear divisions between voiced and unvoiced sections is shown in Figure 1.6. A female-uttered version of the same speech is shown in Figure 1.7. Notice the differences between the two. The voiced portions of the male speech seem to have a lower frequency (lower pitch) than the female-uttered speech.

FIGURE 1.6 A male-uttered speech signal ("the fish swam in the water") with clear divisions between voiced and unvoiced sections. [4000 samples (500 ms) of male speech; amplitude versus sample index.]

FIGURE 1.7 A female-uttered speech signal ("the fish swam in the water") with clear divisions between voiced and unvoiced sections. [4000 samples (500 ms) of female speech; amplitude versus sample index.]

Pitch can also vary with languages. Pitch varies for male speakers of English versus male speakers of French, Spanish, Chinese, Japanese, Hindi, Yoruba, Arabic, or other languages. The same is the case for female speakers.

The frequency range of human hearing is approximately 20 Hz to 20 kHz. However, most of the energy of human speech signals is typically limited to the narrow bandwidth of 0.3–3.4 kHz. As an exercise, the reader is asked to plot the energy distribution of the speech utterance in Figure 1.2, for example. It will demonstrate that the energy is mostly distributed in the frequency range of 0.3–3.4 kHz.

Therefore, most speech processing systems limit the signal bandwidth to 4 kHz before sampling, which requires a sampling frequency of 8 kHz to satisfy the Nyquist sampling theorem.

Narrowband speech signal samples are typically represented with 8 bits/sample, 16 bits/sample, or 24 bits/sample depending on the amount of memory and processing power available. This translates to bit rates for sampled speech waveforms of 8 kHz × (8, 16, or 24 bits/sample) = 64, 128, or 192 kbps, respectively. The pulse-code modulated (PCM) speech signal is 64 kbps.

For high-fidelity speech and audio systems, the Nyquist sampling frequency is 2 × 22.05 kHz = 44.1 kHz. The extra bandwidth gives ample room for designing filters with the appropriate cutoff characteristics. The samples here can be represented by more than 8 bits (e.g., 16, 24, or 32 bits) for better precision, resulting in higher signal-to-noise ratio (SNR) and better speech fidelity.
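The bit-rate arithmetic in the last two paragraphs is easy to reproduce; the short MATLAB lines below (illustrative only) evaluate it for the narrowband and high-fidelity sampling rates mentioned in the text.

% Bit rate = sampling frequency x bits per sample (illustrative).
bits = [8 16 24];                         % bits per sample
narrowband_kbps = 8000  * bits / 1000     % 64, 128, and 192 kbps
hifi_kbps       = 44100 * bits / 1000     % 352.8, 705.6, and 1058.4 kbps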

1.4 Modeling of Speech

The acoustic model of speech generation models the vocal tract, glottis, and lip radiation combined together as a digital filter excited by an excitation signal.

Voiced sounds are generated by passing a periodic (actually quasiperiodic) sequence of pulses through this digital filter. The fundamental frequency of this periodic excitation of voiced speech is known as the pitch frequency, or simply the pitch.

Unvoiced sounds are generated by passing a white noise source through this digital filter.

This model is very simplified. Actually, the vocal tract shape is time-varying, which is why speech signals are nonstationary. However, we can assume that it is slowly time-varying and can assume stationarity within a limited time interval (typically 10–20 ms). Within this frame length, the stationarity assumption of speech statistics must be maintained. Therefore, we can apply linear, time-invariant (LTI) analysis results so that the output is a convolution of the input and the impulse response of the vocal tract digital filter.

FIGURE 1.8 A model of speech synthesis using the vocal tract: (a) time domain, where a periodic excitation signal e(t) with period T drives the vocal tract impulse response v(t) to give s(t) = e(t) * v(t); (b) frequency domain, where S(ω) = E(ω)V(ω) and the excitation harmonics are spaced 2π/T apart.

The digital signal processing of speech is usually accomplished frame by frame, where a frame is typically about 10 ms, 20 ms, or 30 ms long. This is because the speech signals are nonstationary.

Sometimes the frames are overlapped during processing. Other times, we have lookahead frames, subframes, and other frame configurations.
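A minimal sketch of this frame-by-frame segmentation is shown below (illustrative, not from the book); it cuts a speech vector x into 20-ms frames with 50% overlap, where the frame length, overlap, and 8 kHz rate are assumptions.

% Split a speech signal into overlapping analysis frames (illustrative sketch).
% Assumes x is a speech vector sampled at fs = 8000 Hz.
fs = 8000;
xc = x(:);                                   % work with a column vector
frameLen = round(0.020*fs);                  % 20-ms frames (160 samples)
hop = frameLen/2;                            % 50% overlap between adjacent frames
numFrames = floor((length(xc) - frameLen)/hop) + 1;
frames = zeros(frameLen, numFrames);
for m = 1:numFrames
    n1 = (m-1)*hop + 1;
    frames(:, m) = xc(n1:n1 + frameLen - 1); % each column holds one frame
end
% Each column of 'frames' can now be analyzed (pitch, LPC, etc.) independently.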

It is interesting to view the input and output of the vocal tract model filter in the frequency domain (Figure 1.8). We see that the output in the frequency domain is a multiplication of the Fourier transform of the excitation input with the Fourier transform of the impulse response of the vocal tract model. This means the vocal tract shapes the output spectrum. That is why the spectral information contained in speech is an important property for speech coding.
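This source-filter view can be demonstrated in a few lines of MATLAB (an illustrative sketch, not code from the book): an impulse train models the voiced excitation, white noise models the unvoiced excitation, and an all-pole filter stands in for the vocal tract. The pole locations below are arbitrary assumptions chosen only to place a few formant-like resonances in the spectrum.

% Source-filter synthesis of short voiced and unvoiced segments (illustrative sketch).
fs = 8000;                                    % sampling rate (Hz)
N = 1600;                                     % 200 ms of samples
pitchHz = 100;                                % assumed pitch for the voiced excitation
eV = zeros(N, 1);
eV(1:round(fs/pitchHz):N) = 1;                % impulse train (voiced excitation)
eU = randn(N, 1);                             % white noise (unvoiced excitation)
poles = 0.97*exp(1j*2*pi*[500 1500 2500]/fs); % resonances near 500, 1500, 2500 Hz (assumed)
a = real(poly([poles conj(poles)]));          % all-pole "vocal tract" denominator, V(z) = 1/A(z)
sV = filter(1, a, eV);                        % buzz-like voiced output
sU = filter(1, a, eU);                        % hiss-like unvoiced output
soundsc([sV; sU], fs);                        % listen to the two segments back to back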

1.5 Speech Analysis

Speech analysis is the means by which speech is analyzed and the physical characteristics that define the speech can be extracted from the original speech.

The techniques used include many digital signal processing (DSP) methods such as the short-time Fourier transform (STFT), linear predictive analysis (LPA), homomorphic methods, deconvolution, etc.

For example, the frequency information in a speech signal can be shown by the power spectral density (PSD), periodograms, and so on. Speech signals may be different in the time domain but similar in the frequency domain. Human ears may be insensitive to phase differences in speech signals.

However, phase information may help improve the perceived quality of speech. LPA uses linear prediction to extract the residual (prediction error) from speech signals and uses it to code the speech more efficiently. This is one of the most common analysis methods.
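As a sketch of what linear predictive analysis does (illustrative only, not the book's code), the lines below fit a 10th-order predictor to one frame by solving the autocorrelation normal equations and then obtain the residual by inverse filtering; the frame location, order, and 8 kHz rate are assumptions. Chapters 6 and 7 develop this machinery properly.

% Linear predictive analysis of one frame: predictor coefficients and residual (illustrative).
% Assumes x is a speech vector sampled at 8 kHz.
p = 10;                                      % prediction order
frame = x(1:240);  frame = frame(:);         % one 30-ms frame
r = zeros(p+1, 1);
for k = 0:p                                  % autocorrelation values r(0)..r(p)
    r(k+1) = sum(frame(1:end-k) .* frame(k+1:end));
end
R = toeplitz(r(1:p));                        % autocorrelation (Toeplitz) matrix
c = R \ r(2:p+1);                            % predictor coefficients: x(n) ~ sum_k c(k) x(n-k)
a = [1; -c];                                 % prediction-error filter A(z) = 1 - sum_k c(k) z^(-k)
res = filter(a, 1, frame);                   % residual (prediction error) signal
fprintf('Prediction gain: %.1f dB\n', 10*log10(sum(frame.^2)/sum(res.^2)));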

1.6 Speech Coding

The development of speech coders has been in response to the development of traditional communication networks such as the plain old telephone service (POTS) and the public switched telephone network (PSTN). More recently, there have been other different kinds of networks: wireline, wireless, Internet, cellular, and so on.

The main goal of speech coding is to provide algorithms that minimizethe bit rate in the digital representation of a speech signal without any annoy-ing loss of speech quality. High quality is attained at low bit rates by exploitingsignal redundancy as well as the knowledge that certain types of coding dis-tortion are imperceptible to humans because they are masked by the speechsignal. Rate-distortion theory applies to any coding algorithm. The goal is tominimize the distortion but increase the quality. In rate-distortion theory, lowbit rates mean higher distortion and high bit rates mean lower distortion.

Now we briefly review the history of speech coding.

1.6.1 A Very Brief History of Speech Coding

Alexander Graham Bell invented the telephone in 1876 based on a very simple concept of converting sound waves into electrical signals, which can be transmitted over a variety of channels including the twisted copper wires found in legacy telecommunications systems. This gave rise to the POTS and the PSTN. Later in this chapter, we discuss the impact of this on speech coding. We also discuss the impact of the myriad of channels now used for speech communications such as wireless, cellular, Internet, and satellite channels.

The other important milestone in the history of speech coding occurred with the invention of PCM. This has enabled DSP-based processing of sampled speech signals.

Later, in 1967, researchers at AT&T Bell Labs invented the idea of linear predictive coding (LPC) of speech. The historical account published by Atal [4] acknowledged that the idea was based on an earlier concept of predictive coding published by Elias [5,6]. Other researchers in Japan, notably Itakura and Saito [7,8], independently developed, at about the same time, the idea of partial correlation coefficients (PARCOR) for speech and also the idea of line spectrum pairs (LSPs). The development of LPC changed the way in which many narrowband speech codecs were designed, even though early LPC speech was intelligible but not of very high quality.


The U.S. government speech coding standard (LPC-10) [9] was based on LPC and had a low 2.4 kbps bit rate. Popular applications of speech processing such as the “Speak and Spell” learning device from Texas Instruments [10] were made possible by the introduction of LPC.

Bishnu Atal and his colleagues at AT&T Bell Labs later extended the LPC idea to multipulse LPC and to code-excited linear predictive (CELP) coding of speech in a series of papers [11–15] to produce better natural-sounding speech and to lower the bit rates. Most of the speech codecs proposed since 1994 have been based on the idea of CELP.

Today, there are many applications of voice communications that have been developed since. Examples are speech-recognition systems, secure speech communications (cryptography), voice-activated devices such as speech-to-text and text-to-speech systems, wireless communications, and voice-over Internet Protocol (VoIP) [16]. Wireless voice communications and VoIP are perhaps the biggest parts of this, as evidenced by the explosive growth in these two industries over the last few years.

More recently, there has been a need to develop speech coders that can perform well under packet-loss conditions, which are common in VoIP applications. One such coder is the iLBC, which stands for Internet low-bit-rate coder [17]. This coder is quite popular and claims to have better mean opinion scores (MOSs) than the traditional CELP coders, especially when used in the packet-loss environment of packet-switched networks like the Internet, wireless LAN networks, etc.

There are other speech coders developed for other communication networks, such as for cellular phones. Examples are the enhanced variable rate coder (EVRC) for IS95 Code Division Multiple Access (CDMA) telephony applications, the adaptive multirate coder (GSM-AMR) for the Global System for Mobile Communications (GSM), and others.

1.6.2 Major Classification of Speech Coders

Speech coding techniques can be broadly divided into two classes (see Figure 1.9), which form a classification of speech coders as follows:

i. Waveform coders: They aim at reproducing the speech waveform as faithfully as possible.

ii. Parametric coders (or vocoders): They preserve only the spectral or other statistical properties of speech in the encoded signal.

Waveform coders are able to produce high-quality speech at high-enough bit rates; vocoders produce intelligible speech at much lower bit rates, but the level of speech quality—in terms of naturalness and uniformity for different speakers—is also much lower. The applications of vocoders so far have been limited to low-bit-rate digital communication channels.

The combination of the principles of waveform coding and vocoding has led to significant new capabilities in recent speech coding technology. There are so-called hybrid speech coders, defined as those that combine waveform and parametric coding methods in a single coder. In hybrid coders, the speech is encoded using parametric coding and the excitation signals are also extracted and transmitted. The decoder acts like a waveform coder by applying the excitation to the speech production model to reproduce the speech. It also uses weighted perceptual filters to ensure similarity to the original speech signal waveform. Examples of hybrid speech coders are the CELP-based coders described in Chapters 7 and 9.

FIGURE 1.9 Broad classification of speech coders: waveform coders (time domain: PCM, ADPCM, ADM, and APC-based coders; frequency domain: sub-band coders and the adaptive transform coder (ATC)) and vocoders (linear predictive coders, formant coders, and CELP-based coders).

There are also multimode speech coders, which include those that combine two or more different methods of speech coding and switch between the methods depending on the segment of speech being coded. This leads to variable rates for the encoded speech. Examples are the TIA IS96 and ETSI AMR ACELP speech coders.

These coders can support applications over digital channels with bit rates ranging from 4 to 64 kbps.

Narrowband speech coding involves speech in the bandwidth from 200 Hz to about 3400 Hz and is used in POTS networks. It has been shown that most of the energy of speech signals is contained in this narrow bandwidth. Wideband speech coding involves bandwidths from 50 Hz to about 7000 Hz and is of higher quality than narrowband speech coding. High-fidelity audio coders exist for the full spectrum of audible sounds, that is, bandwidths from 20 Hz to about 20,000 Hz. Most of the speech coding techniques we present in this book are for narrowband and wideband speech.

In waveform coding, the speech signals are sampled, quantized, and then coded using various methods to reduce the required bit rate for speech.

See Figure 1.10 for a general block diagram of a parametric coder (vocoder).

The vocal tract filter parameters are determined for each frame of speech. The filter coefficients change from frame to frame. At the decoder, these parameters are used to synthesize speech by switching the excitation signal from a pulse train for voiced speech segments to random white noise for unvoiced speech segments. This is the basis for the LPC method of speech coding. See Figure 1.11 for a model block diagram. Therefore, to encode a speech signal (frame), we need to know whether the segment of speech is voiced or unvoiced, the pitch period, the vocal tract filter coefficients, and the level of gain of the speech signal.

FIGURE 1.10 General block diagram for a parametric speech coder (vocoder): an impulse train (voiced) or random white noise (unvoiced), selected by the V/UV decision at the pitch period and scaled by the signal power, excites the vocal tract filter to produce the synthesized speech.
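A minimal MATLAB sketch of this kind of synthesis for one frame is shown below; the filter coefficients, gain, and pitch period are made-up illustrative values, not taken from any standard.

fs = 8000;                         % sampling frequency (Hz)
N  = 160;                          % one 20 ms frame
a  = [1 -1.3 0.6];                 % hypothetical all-pole (vocal tract) coefficients
gain = 0.5;                        % hypothetical frame gain
pitchPeriod = 80;                  % hypothetical pitch period in samples (100 Hz)

excVoiced = zeros(N, 1);           % voiced excitation: impulse train at the pitch period
excVoiced(1:pitchPeriod:N) = 1;
excUnvoiced = randn(N, 1);         % unvoiced excitation: white noise

voiced = true;                     % V/UV decision for this frame
if voiced
    exc = excVoiced;
else
    exc = excUnvoiced;
end
s = gain * filter(1, a, exc);      % synthesized speech frame from the all-pole filter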

1.6.3 Speech Coding Standards

The task of developing speech coding standards has been undertaken by the following major bodies:

i. International Telecommunication Union (ITU)
ii. European Telecommunication Standards Institute (ETSI)
iii. Telecommunication Industry Association (TIA)
iv. Cellular phone companies (e.g., standards for GSM, Qualcomm, AT&T, etc.)
v. Internet Engineering Task Force (IETF)
vi. Video communications bodies (Moving Picture Experts Group (MPEG), International Telecommunication Union (ITU), etc.)
vii. Satellite communications companies (e.g., Intelsat)
viii. Military communications (e.g., U.S. Federal Standards)

FIGURE 1.11 Linear predictive model of speech: an impulse train generator (voiced) or a white noise generator (unvoiced), selected by the voicing decision and driven at the pitch period, is scaled by the gain and passed through the synthesis filter defined by the filter coefficients.

There are also nonstandardized or private coders. These do not have to interoperate with other coders that are deployed in the public communication networks, for example.

These standards development processes (e.g., for the ITU) begin with forming the organization, assembling proposals for the standard, debating, testing, and finally voting on the merits of each proposal before it can be included in the final standard.

More details about standardized speech coders, together with a summary, are given later in this chapter.

1.7 Varieties of Speech Coders

Here we discuss the different varieties of standardized speech coders available under the broad categories of waveform coders and parametric coders.

1.7.1 Varieties of Waveform Speech Coders

PCM is the simplest waveform coding method. It is based on a memoryless quantizer and codes telephone speech at 64 kbps. Using a simple adaptive predictor, adaptive differential PCM (ADPCM) provides high-quality speech at 32 kbps. The speech quality is slightly inferior to that of 64 kbps PCM. ADPCM at 32 kbps is widely used for expanding the number of speech channels by a factor of 2 using time-division multiplexing, particularly in private networks and international circuits. ADPCM is also the basis of low-complexity speech coding in several standards for personal communication networks, including CT2 (Europe), UDPCS (USA), and Personal Handyphone (Japan).

PCM and its variants (DPCM, ADPCM, DM, etc.) are based on companding and adaptive quantization of the speech waveform. They do not take advantage of the fact that speech is produced by a human vocal tract. However, in order to exploit this fact, we use two perspectives:

i. Long term (considers time-independent, average properties of speech, leading to nonadaptive or fixed speech coding strategies)

ii. Short term (considers slowly time-varying properties of speech caused by the mechanical properties of the vocal tract, leading to adaptive speech coding strategies)

Page 33: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

14 Principles of Speech Coding

Therefore, we have waveform coders such as adaptive predictive coding (APC) and its variants.

APC is the class of differential coders with adaptive predictors. They can be viewed as ADPCM systems that use an adaptive predictor to track the short-term stationary statistics of speech and achieve high coding gain by better prediction of the speech, or as waveform-excited vocoders called adaptive predictive coders.

APCs are a vital link between waveform coding and parametric coding. Differential coders exploit the short-term predictability of the speech signal. However, because of the nonstationarity of speech, differential coders with fixed predictors can only achieve limited prediction gain.

The waveform-based speech coders are covered in Chapters 3, 4, and 5.

1.7.2 Varieties of Parametric (Analysis-by-Synthesis) Speech Coders

There are many varieties of parametric vocoders that use analysis-by-synthesis (AbS) methods. Examples of parametric (AbS) speech coders are the multipulse-excited LPC (MPLPC), the regular-pulse-excited LPC (RPLPC), and CELP coding. These are covered in more detail in Chapters 7 and 9.

1.8 Measuring Speech Quality

The quality of speech involves (i) intelligibility, (ii) speaker identifiability, and (iii) degree of natural-sounding speech (versus machine-sounding speech). For the most part, intelligibility is paramount. Here are some commonly used subjective and objective measures of speech quality.

1.8.1 Mean Opinion Score

The most popular subjective measure of speech quality is the mean opinion score (MOS). It is measured by first gathering a group of (both male and female) listeners in a room, and then playing the original speech and the encoded and decoded version of the speech. The listeners then rate the decoded speech on a scale of 1–5 as follows: (5) excellent, (4) good, (3) fair, (2) poor, and (1) bad. The individual MOSs are then averaged over the number of listeners. A good speech codec can have a MOS of between 4.0 and 5.0. For example, the high-quality ITU G.729 codec has a MOS of 4.5. It is necessary to also measure the variance of the individual MOSs in order to ensure low variance, which indicates a reliable test. MOSs can vary from test to test, depending on the listeners and the language of the test. Also, MOSs do not test conversational speech but only static speech coding quality.


1.8.2 Perceptual Evaluation of Speech Quality

The perceptual evaluation of speech quality (PESQ) is a new objective measure of speech quality in a two-way conversational speech communication. The ITU standard for PESQ is ITU-T P.862 [18]. This means that the effect of the communication network involved in the two-way conversation is taken into account. The resulting PESQ score can be converted to the well-known MOS. PESQ has been shown to give good accuracy for factors such as speech input levels to a codec, transmission channel errors, packet loss and packet concealment with CELP-based codecs, bit rates for multiple-bit-rate codecs, transcodings, environmental noise, and time warping.

1.8.3 Enhanced Modified Bark Spectral Distance

The enhanced modified bark spectral distance (EMBSD) [19] is a newer objective measure of speech quality that is highly correlated with the MOS. It was developed to obviate the need for the expensive listening tests that are required for MOSs. It is a modification of the conventional bark spectral distortion (BSD) [20]. It uses a noise-masking threshold to improve the accuracy of the quality estimate.

1.8.4 Diagnostic Rhyme Test

The diagnostic rhyme test (DRT) tests a listener in order to determine which consonant was spoken when listening to a pair of rhyming words: for example, word pairs such as “meat–beat, pool–tool, saw–thaw, and caught–taught.” The DRT score is determined by computing

P = (R − W) × 100 / T,

where P is the percentage score, R is the number of correctly chosen responses, W is the number of incorrectly chosen responses, and T is the total number of word pairs tested. A good DRT score is 90; scores are typically in the range 75 ≤ DRT ≤ 95.
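For example, if a listener is tested on T = 100 word pairs and chooses R = 95 correctly and W = 5 incorrectly, then P = (95 − 5) × 100/100 = 90, which is a good score.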

1.8.5 Diagnostic Acceptability Measure

The diagnostic acceptability measure (DAM) [21] is a test designed to make the measurement of speech quality more systematic. It was developed by Dynastat. It is a listening test where listeners are presented with encoded sentences taken from the Harvard 1965 list of phonetically balanced sentences. The listener assigns a number between 0 and 100 to characterize the speech in three areas: (i) signal qualities, (ii) background qualities, and (iii) total effect. The ratings are weighted and applied to a multiple nonlinear regression model and then adjusted to compensate for listener performance. A good DAM score is between 45% and 50%.


TABLE 1.1

Mapping of E-Model into MOSs

User Satisfaction                  R          MOS
Users very satisfied               90–100     4.3–4.5
Users satisfied                    80–89      4.0–4.3
Some users dissatisfied            70–79      3.6–4.0
Many users dissatisfied            60–69      3.1–3.6
Nearly all users dissatisfied      50–59      2.6–3.1
Not recommended                    Below 50   Below 2.6

1.8.6 E-Model

The E-Model is a new objective measure standardized in ITU recommendations G.107 and G.108. Components of the E-Model include (i) effects of network equipment and (ii) different types of impairments. The components are summed to give an R-value. The R-value is between 0 and 100 and captures the level of user satisfaction as follows: (90–100) users very satisfied, (80–90) users satisfied, (70–80) some users dissatisfied, (60–70) many users dissatisfied, and (50–60) nearly all users dissatisfied. It can be mapped to MOS by using Table 1.1.

1.9 Communication Networks and Speech Coding

Over the last few decades, the telecommunication infrastructure has been undergoing several changes. The analog switches and central office equipment used in the traditional communication networks (e.g., POTS) have been replaced by digital switches and central office equipment. This network is based on circuit switching and requires a real-time end-to-end circuit connection for communication. This network was initially designed for the PCM speech coding method that has a rate of 64 kbps. This is the rate used in POTS and PSTN networks.

In addition, there is a convergence of communication networks based on circuit switching and of computer networks based on packet switching. Packet-switched networks form the backbone of the World Wide Web (WWW) and Internet services.

There are also wireless local area networks (WLANs) and wireless metropolitan area networks (WMANs). Examples are WiMAX (IEEE 802.16), Wi-Fi (IEEE 802.11), and so on. They have become popular as channels or networks for transmitting speech signals.

Satellite communication channels are also common for long-distance speech signals transmitted using satellites in geosynchronous orbits, such as Intelsat satellites.

When designing or choosing speech coding algorithms and applications, it is important to consider the communication network over which the speech will be communicated. Network issues such as end-to-end delay, transmission noise, and so on, are important considerations in the choices to be made.

Also, for these new networks that carry speech, the requirements for speech coding for such applications need to be addressed. An example involves the many cellular phone networks that have been put in place. The rapid growth of the cellular phone industry has also been a factor in speech coding algorithm development. Due to breakthroughs in narrowband speech coding, we can offer high-quality speech coders at 8 kbps, making this the standard rate for digital cellular service in North America. For lower-rate speech coders, research on high-quality speech transmission over digital cellular channels at 4 kbps or lower is ongoing.

1.10 Performance Issues in Speech Communication Systems

The coding efficiency (or bit rate) for speech is expressed in bits per second (bps). In addition, the performance of speech coders is usually judged by one or more of the following factors:

• Speech quality (intelligibility)
• Communication delay
• Computational complexity of implementation
• Power consumption
• Robustness to noise (channel noise, signal fading, and intersymbol interference)
• Robustness to packet losses (for packet-switched networks)

Due to the limitation of bandwidth for narrowband speech applications, speech coders are designed to minimize the bit rate for transmission or storage of speech, but at the same time provide acceptable levels of performance in one or more of the above areas.

Now, we briefly describe the above parameters of performance, with particular reference to speech.

1.10.1 Speech Quality

Speech quality measures are discussed in Section 1.8. MOS is one of the most popular methods. Speech quality is usually evaluated on a 5-point scale, known as the MOS scale, in speech quality testing, averaged over a large number of speech data, speakers, and listeners. The five points in order of quality are bad, poor, fair, good, and excellent. Quality scores of 3.5 or higher generally imply high levels of intelligibility, speaker recognition, and naturalness.

Some also rate speech quality as toll quality, less than toll quality, and so on.


1.10.2 Communication Delay

Modern speech coders often process speech in frames (or subframes). This inevitably introduces communication delay, in addition to the communication delays inherent in the channel. Depending on the application, the permissible total delay could be as low as 5 ms, as in network telephony, or as high as 500 ms, as in video telephony. This delay can be annoying in a real-time telephony conversation and is therefore undesirable beyond 200 ms for many speech coders.

1.10.3 Computational Complexity

The computational complexity of a speech coding algorithm is the processing effort required to implement the algorithm, and is typically measured in terms of arithmetic capability (multiplies and adds) and memory requirement (kilobytes of storage). Examples of this measure include million instructions per second and million floating-point operations per second (MIPS/MOPS).

1.10.4 Power Consumption

Power consumption is important especially since many applications of modern speech coders are in portable devices such as cellular phones and other appliances. It is related to computational complexity because a highly computationally complex algorithm typically requires more computations as measured by MIPS or MOPS and hence more power from the processor.

1.10.5 Robustness to Noise

The various types of channels for communication of speech are becoming more varied: satellite, POTS, cellular, and so on. Some of the issues are channel noise, signal fading, and intersymbol interference.

1.10.6 Robustness to Packet Losses (for Packet-Switched Networks)

The Internet has been very popular as a means for speech communication. Since the Internet is a packet-switched network (and not a circuit-switched network), issues about robustness to packet losses have become important. Newer speech coders designed for use on packet networks need to have this robustness designed into them. Later, in Chapter 10, we give examples of one of these newer speech coders and some of the issues involved.

1.11 Summary of Speech Coding Standards

Table 1.2 gives selected standardized speech coders and their comparisons with respect to the coding method, bit rate, MOS, and MIPS required, if available and known. The delay per frame size required is also sometimes used for comparison, but is not shown here. When known, the MIPS requirements are usually processor speed dependent.


TABLE 1.2
Performance and Complexity Comparisons of Selected Speech Coding Algorithms

Algorithm                        Coding Method   Bit Rate (kbps)                MOS                    MIPSa

ITU-T Speech Coders (Narrowband Coders) for PSTN
G.711 (PCM)                      PCM             64                             4.3                    0.01
G.721/G.726 (VBR-ADPCM)          ADPCM           16, 24, 32, 40                 4.1                    ~2
G.723.1 (MP-MLQ/ACELP)           CELP            5.3/6.3                        3.7/3.9                ~18
G.727 (ADPCM)                    Waveform        8                              4.3+                   ~19
G.728 (LD-CELP)                  CELP            16                             4.0+                   ~19
G.729 (CS-ACELP)                 CELP            8                              4.3+                   ~19
G.729A (CS-ACELP)                CELP            8                              4.3+                   ~19

ITU-T Speech Coders (Wideband Coders)
G.722 (ADPCM)                    Waveform        32                             4.1/94/68              ~2
G.722.1 (ADPCM)                  Waveform        32                             4.1/94/68              ~2
G.722.2 (ADPCM)                  Waveform        32                             4.1/94/68              ~2

Cellular Telephony-Based Speech Coders
ETSI GSM-FR 6.10 (RPE-LTP)       RPE-LTP         13                             3.47                   6
ETSI GSM-HR 6.20 (VSELP)         VSELP           5.6                                                   6
ETSI GSM-EFR (ACELP)             CELP            12.2                                                  6
ETSI GSM-AMR (ACELP)             CELP            12.2, 10.2, 7.95, 7.40,                               6
                                                 6.70, 5.90, 5.15, 4.75
EVRC (Qualcomm)                  CELP            4.5
Skyphone-MPLP                                    9.6                            3.4                    11
TIA IS-54 (VSELP)                VSELP           8                              3.45                   13.5
TIA IS-127 (RCELP/ACELP)         CELP            8                              3.45                   13.5
TIA IS-96 (VBR-QCELP)            CELP            0.8, 2, 4, 8.5                 4.2                    13.5
IS-893 (cdma2000)                                0.8, 2.0, 4.0, 8.5             3.93 at 3.6 kbps ADR   18
TIA IS-641 (ACELP)               CELP            7.4                                                   13.5
TIA IS-133 (ACELP)               CELP            7.4                                                   13.5

U.S. Government Standardized Speech Coders
FS 1015 (LPC-10e)                LPC-10          2.4                            2.3                    ~7
FS 1016 (CELP)                   CELP            4.8                            3.2                    16
FS DoD 2.4 MELP                  MELP-LPC        2.4
LPC-LSP                          LPC             800                                                   ~20
STC-1                                            4.8                            3.52                   13
STC-2                                            2.4                            2.9                    13

Satellite Communication-Based Speech Coders
INMARSAT-M                       IMBE            4.15                           3.4                    120/3
INMARSAT-Mini                    AMBE            3.6                            3.4

Other Internet-Based Speech Coders
IETF iLBC (CS-ACELP)             CELP

Notes: ~, estimated; +, low score reported.
a Processor-speed-dependent.

Figure 1.12 is a plot of speech quality in MOS versus bit rate for some popular speech codecs. Table 1.3 compares the ITU and ETSI speech codecs using many performance parameters of interest.

FIGURE 1.12 MOSs (quality) versus bit rate (kbps) for many popular speech codecs, including G.726, G.727, G.728, G.729, G.723.1, IS54, IS96, FS1015, FS1016, and MELP2.4.


TABLE 1.3

Performance Comparisons of ITU and ETSI Speech Coders

Standards Body        ITU         ITU        ITU        ITU        ITU        ITU              ETSI       ETSI       ETSI
Recommendation        G.711       G.726      G.728      G.729      G.729A     G.723.1          GSM (FR)   GSM (HR)   GSM (EFR)
Coder type            Companded   ADPCM      LD-CELP    CS-ACELP   CS-ACELP   MPC-MLQ &        RPE-LTP    VSELP      ACELP
                      PCM                                                     ACELP
Dates                 1972        1990       1992/4     1995       1996       1995             1987       1994       1995
Bit rate (kbps)       64          16–40      16         8          8          6.3 and 5.3      13         5.6        12.2
Peak quality          Toll        ≤Toll      Toll       Toll       Toll       ≤Toll            <Toll      =GSM       Toll
Background noise      Toll        ≤Toll      Toll       ≤Toll      ≤Toll      ≤Toll            <Toll      <GSM       Toll
Tandem                Toll        Toll       Toll       <Toll      <Toll      <Toll            <Toll      <GSM       Toll
Frame erasure (%)     No mechanism  No mechanism  3     3          3          3                3          3          3
Complexity (MIPS)     ≪1          ~1         ~30        ≤20        ≤11        ≤18              ~4.5       ~30        ~20
RAM                   1 byte      <50 bytes  2 KB       <2.5 KB    2 KB       2.2 KB           1 KB       12 KB      9 KB
Frame size (ms)       0.125       0.125      0.625      10         10         30               20         20         20
Lookahead (ms)        0           0          0          5          5          7.5              0          4.4        0
Codec delay (ms)      0.25        0.25       1.25       25         25         67.5             40         44.4       40


TABLE 1.4

Performance Comparisons of North American Wireless Speech Coders

Standards Body        TIA        TIA        TIA        TIA        TIA
Recommendation        IS-54      IS-641     IS-96      IS-127     IS-133
System                TDMA       TDMA       CDMA       CDMA       CDMA
Coder type            VSELP      ACELP      QCELP      ACELP      CELP
Dates                 1990       1995       1993       1997       1997
Bit rate (kbps)       7.95       7.4        0.8–8.5    0.8–8      0.8–13
Peak quality          =GSM       Toll       =GSM       Toll       Toll
Background noise      ≪Toll      <Toll      ≪Toll      <Toll      Toll?
Tandem                ≪Toll      <Toll      ≪Toll      <Toll      Toll?
Frame erasures (%)    3          3          3          3          3
Complexity (MIPS)     20         20         20         20         20
RAM                   2 KB       4 KB       2 KB       4 KB?      4 KB
Frame size (ms)       20         20         20         20         20
Lookahead (ms)        5          5          5          5          5
Codec delay (ms)      45         45         45         45         45

Table 1.4 compares the North American wireless speech codecs using several performance parameters. Table 1.5 compares the ITU G.729, G.729A, G.729D, and G.729E series of speech codecs. Table 1.6 compares the bandwidth attributes of several ITU G.7xx series speech codecs.

TABLE 1.5

Performance Comparisons of ITU G.729 Series Speech Coders

Standards Body        ITU        ITU        ITU        ITU
Recommendation        G.729      G.729A     G.729D     G.729E
Coder type            CS-ACELP   CS-ACELP   CS-ACELP   CS-ACELP
Dates                 1995       1996       1998       1998
Bit rate (kbps)       8          8          6.4        11.8
Peak quality          Toll       Toll       <Toll      Toll
Background noise      ≤Toll      ≤Toll      <Toll      Toll
Tandem                <Toll      <Toll      <Toll      Toll
Frame erasures (%)    3          3          3          3
Complexity (MIPS)     ≤20        ≤11        <20        <30
RAM (KB)              <2.5       2          <2.5       <4
Frame size (ms)       10         10         10         10
Lookahead (ms)        5          5          5          5
Codec delay (ms)      25         25         25         25


TABLE 1.6

Bandwidth Attribute Comparisons of ITU G.7xx Series Speech Coders

Codec               Bit Rate (kbps)   Bytes/10 ms   Frms/pckt   Bytes/pckt   Packets/s   Bytes/s   kbps
G.711 (10 ms)       64                80            1           120          100         12,000    96
G.711 (20 ms)       64                80            2           200          50          10,000    80
G.711 (30 ms)       64                80            3           280          33.33       9332.4    74.7
G.726.16 (10 ms)    16                20            1           60           100         6000      48
G.726.16 (20 ms)    16                20            2           80           50          4000      32
G.726.16 (30 ms)    16                20            3           100          33.33       3333      26.7
G.726.24 (10 ms)    24                30            1           70           100         7000      56
G.726.24 (20 ms)    24                30            2           100          50          5000      40
G.726.24 (30 ms)    24                30            3           130          33.33       4332.9    34.7
G.726.32 (10 ms)    32                40            1           80           100         8000      64
G.726.32 (20 ms)    32                40            2           120          50          6000      48
G.726.32 (30 ms)    32                40            3           160          33.33       5332.8    42.7
G.726.40 (10 ms)    40                50            1           90           100         9000      72
G.726.40 (20 ms)    40                50            2           140          50          7000      56
G.726.40 (30 ms)    40                50            3           190          33.33       6332.7    50.7
G.728 (10 ms)       16                20            1           60           100         6000      48
G.728 (20 ms)       16                20            2           80           50          4000      32
G.728 (30 ms)       16                20            3           100          33.33       3333      26.7
G.729A (10 ms)      8                 10            1           50           100         5000      40
G.729A (20 ms)      8                 10            2           60           50          3000      24
G.729A (30 ms)      8                 10            3           70           33.33       2333.1    18.7

1.12 Summary

In this chapter, we began by showing some example speech signals and describing the characteristics of speech signals. Then, we introduced the ideas of speech analysis and speech coding. We gave a brief history of speech coding algorithm development. We also classified speech coders as waveform and parametric coders, and briefly described the LPC model used for speech generation.

We described the various measures used for speech quality. Then, we discussed communication networks and speech coding and the various performance issues in speech communication systems. We concluded the chapter with a summary of speech coding standards and compared their performances.


EXERCISE PROBLEMS

1.1. Differentiate between speech analysis, speech coding, and speech synthesis.

1.2. Name the three different classifications of speech. Under what circumstances can the three classifications be reduced to only two? How accurate will the two classes be in modeling speech?

1.3. What is a vocoder? Explain. Differentiate a vocoder from a waveform coder.

1.4. What are the major characteristics of speech?

• Linear/nonlinear

• Stationary/nonstationary

1.5. Using MATLAB®, capture about 5 s of your speech. You will need a microphone attached to your PC. Use the relevant MATLAB functions to capture 5 s of your speech. The following is sample MATLAB code:

fs = 8000;                          % sampling frequency of 8000 Hz
x = wavrecord(5*fs, fs, 'double');  % record 5 s of speech as double-precision samples
wavplay(x, fs);                     % play back the recorded speech

Now apply many of the signal-processing functions to segments of your captured speech signal. Plot it, find the voiced/unvoiced parts, plot the spectral density function, plot the periodogram, and other interesting spectral functions.

1.6. Using MATLAB, plot the energy distribution versus frequency of the speech utterance in Figure 1.2 (or use the speech captured in Exercise Problem 1.5). Verify the energy frequency limits of 1.5–3.4 kHz mentioned in Section 1.3.

References

1. Flanagan, J.L., Speech Analysis, Synthesis and Perception, Springer, Berlin, 1983.
2. Rabiner, L.R. and R.W. Schafer, Digital Processing of Speech Signals, pp. 53–60, Prentice Hall, Englewood Cliffs, NJ, 1978.
3. Markel, J.D., The SIFT algorithm for fundamental frequency estimation, IEEE Transactions on Acoustics, Speech and Signal Processing, 20, 149–153, 1972.
4. Atal, B.S., The history of linear prediction, IEEE Signal Processing Magazine, 23(2), 154–161, March 2006.
5. Elias, P., Predictive coding I, IRE Transactions on Information Theory, IT-1(1), 16–24, 1955.
6. Elias, P., Predictive coding II, IRE Transactions on Information Theory, IT-1(1), 24–33, 1955.
7. Saito, S., Fukumura, and F. Itakura, Theoretical considerations of the statistical optimum recognition of the spectral density of speech, Journal of the Acoustic Society of Japan, 1967.
8. Itakura, F. and S. Saito, A statistical method for estimation of speech spectral density and formant frequencies, IEICE Transactions on Electronic Communication of Japan, 53-A, 36–43, 1970.
9. Tremain, T.E., The government standard linear predictive coding algorithm: LPC-10, Speech Technology, 1, 40–49, 1982.
10. Texas Instruments, Speak and Spell manual, 1981.
11. Atal, B.S. and M.R. Schroeder, Predictive coding of speech, Proceedings of the 1967 Conference on Communications and Proc., pp. 360–361, November 1967.
12. Atal, B.S. and M.R. Schroeder, Adaptive predictive coding of speech, Bell System Technical Journal, 49(8), 1973–1986, 1970.
13. Atal, B.S. and S.L. Hanauer, Speech analysis and synthesis by linear prediction of the speech wave, Journal of the Acoustic Society of America, 50(8), 637–655, 1971.
14. Atal, B.S. and J.R. Remde, A new model of LPC excitation for producing natural-sounding speech at low bit rates, Proceedings of the IEEE ICASSP Conference, pp. 614–617, 1982.
15. Schroeder, M.R. and B.S. Atal, Code-excited linear prediction (CELP): High-quality speech at very low bit rates, Proceedings of the IEEE ICASSP Conference, pp. 937–940, 1985.
16. Robert, M.G., The 1974 origins of VoIP, IEEE Signal Processing Magazine, 22(4), 87–90, 2005.
17. Andersen, S.V., et al., iLBC—a linear predictive coder with robustness to packet losses, Proceedings of the IEEE Speech Coding Workshop, 2002.
18. ITU-T, Perceptual evaluation of speech quality (PESQ), Recommendation P.862.
19. Wang, S., A. Sekey, and A. Gersho, An objective measure for predicting subjective quality of speech coders, IEEE Journal on Selected Areas in Communications, JSAC-10, 819–829, 1992.
20. Yang, W., M. Benbouchta, and R. Yantorno, Performance of the modified bark spectral distortion as an objective speech quality measure, Proceedings of the IEEE ICASSP Conference, pp. 541–544, 1998.
21. Gibson, J., Speech coding methods, standards and applications, IEEE Circuits and Systems Magazine, pp. 30–49, Fourth quarter, 2005.



2
Fundamentals of DSP for Speech Processing

2.1 Introduction to LTI Systems

The theory of linear, time-invariant (LTI) systems is well studied. Analysis techniques have been developed that are in use and amenable to computer tools. We briefly review the two requirements: linearity and time invariance. Other properties can be found in References [1–3].

2.1.1 Linearity

A linear system is one that obeys the two principles of (i) additivity and (ii) homogeneity. Let the system be described by yi(t) = S[xi(t)] (Figure 2.1). If the inputs xi(t), i = 1, 2, . . . , P lead to the outputs yi(t), i = 1, 2, . . . , P, respectively, then a linear combination of the inputs xi(t), i = 1, 2, . . . , P,

x(t) = Σ_{i=1}^{P} ai xi(t)   (2.1)

will lead to the output

y(t) = Σ_{i=1}^{P} ai yi(t).   (2.2)

2.1.2 Time Invariance

A time-invariant system is one that does not change with time. Let the system be described by y(t) = S[x(t)]. A delayed input then leads to a similarly delayed output; that is, given the input–output pair, the delayed input x(t − t0) produces the output z(t) = S[x(t − t0)] = y(t − t0) (Figure 2.2).

The linearity and time-invariance properties of the system lead to some very valuable properties of LTI systems. Such properties include the impulse response, convolution, duality, stability, scaling, etc.


FIGURE 2.1 A system S to be tested for LTI properties: input xi(t), output yi(t).

FIGURE 2.2 Time-invariance property of systems: x(t) → S → y(t) = S[x(t)] and x(t − t0) → S → z(t) = S[x(t − t0)] = y(t − t0).

2.1.3 Representation Using Impulse Response

The impulse function, δ(t), is a generalized function. It is an ideal function that does not exist in practice. It has many possible definitions. It is idealized because it has zero width and infinite amplitude. Let the rectangular pulse function p(t) be of width Δ and amplitude 1/Δ over −Δ/2 ≤ t ≤ Δ/2. Note that the area under p(t) is always 1 (Δ × 1/Δ = 1). As the width Δ goes to zero, the amplitude 1/Δ goes to infinity, and

δ(t) = lim_{Δ→0} p(t).

Note that ∫_{−∞}^{∞} δ(τ) dτ = 1 and ∫_{−∞}^{∞} δ(t − t0) dt = 1.

For all t: ∫_{−∞}^{t} δ(τ) dτ = u(t) and x(t)δ(t) = x(0)δ(t), where u(t) is the unit-step function.

Sifting: x(t)δ(t − t0) = x(t0)δ(t − t0), so that

∫_{−∞}^{∞} x(t)δ(t − t0) dt = x(t0) ∫_{−∞}^{∞} δ(t − t0) dt = x(t0).

Sifting property:

∫_{−∞}^{∞} x(t)δ(t − t0) dt = x(t0).   (2.3)


2.1.4 Representation of Any Continuous-Time (CT) Signal

It is a consequence of the sifting property that any arbitrary input signal x(t) can be represented as a sum of weighted x(τ) and shifted impulses δ(t − τ), as shown below:

x(t) = ∫_{−∞}^{∞} x(τ)δ(t − τ) dτ.   (2.4)

2.1.5 Convolution

Recall that the impulse response is the system’s response to an impulse.

An impulse δ(t) applied to the system produces the impulse response h(t): δ(t) → System → h(t).

Recall the property of the impulse function that any arbitrary function x(t) can be represented as

x(t) = ∫_{−∞}^{∞} x(τ)δ(t − τ) dτ.   (2.5)

At time t0, the CT signal can be represented as

x(t0) = ∫_{−∞}^{∞} x(τ)δ(t0 − τ) dτ.   (2.6)

If the system is S, an impulse input gives the response δ(t) → S → h(t) (the impulse response), and a shifted impulse input gives the response δ(t − τ) → S → h(t, τ).

If S is linear, x(τ)δ(t − τ) → S → x(τ)h(t, τ).

If S is time-invariant, then δ(t − τ) → S → h(t, τ) = h(t − τ).

If S is linear and time-invariant, then x(τ)δ(t − τ) → LTI → x(τ)h(t − τ).

Therefore, any arbitrary input x(t) = ∫_{−∞}^{∞} x(τ)δ(t − τ) dτ to an LTI system gives as output the linear convolution

y(t) = x(t) ∗ h(t),   (2.7)

y(t) = ∫_{−∞}^{∞} x(τ)h(t − τ) dτ.   (2.8)

Note: For any time t = t0, we can obtain

y(t0) = ∫_{−∞}^{∞} x(τ)h(t0 − τ) dτ;

for example, for t = 0, we obtain

y(0) = ∫_{−∞}^{∞} x(τ)h(−τ) dτ.

2.1.6 Differential Equation Models

An LTI system can be described by the following differential equation:

a0 y(t) + a1 dy(t)/dt + · · · + a_{n−1} d^{n−1}y(t)/dt^{n−1} + a_n d^{n}y(t)/dt^{n}
   = b0 x(t) + b1 dx(t)/dt + · · · + b_{m−1} d^{m−1}x(t)/dt^{m−1} + b_m d^{m}x(t)/dt^{m},  m ≤ n.   (2.9)

The order n determines the order of the differential equation. The equation can be written as

Σ_{k=0}^{n} a_k d^{k}y(t)/dt^{k} = Σ_{k=0}^{m} b_k d^{k}x(t)/dt^{k}.   (2.10)


2.2 Review of Digital Signal Processing

A discrete-time (DT) signal is a sequence of numbers obtained by sampling an analog signal at DT instants. For example, a CT signal x(t) sampled at a frequency of fs will give a DT signal

x(n) = x(t)|_{t=nTs} = x(nTs), where fs = 1/Ts.   (2.11)

2.2.1 Sampling

To obtain a DT digital signal from a CT analog signal, sample the CT signal regularly every Ts seconds. The sampling period is Ts seconds. The sampling frequency is fs = 1/Ts Hz (which is also ωs = 2π/Ts = 2πfs rad/s) (Figure 2.3).

DT signals

Unit pulse:

δ(n) = 1 for n = 0, and δ(n) = 0 for n ≠ 0.

2.2.2 Shifted Unit Pulse: δ(n − k)

δ(n − k) = 1 for n = k, and δ(n − k) = 0 for n ≠ k.

FIGURE 2.3 Digital signal processing of a converted analog-to-digital signal and back to analog: x(t) → A/D → x(n) → DSP processor → y(n) → D/A → y(t).


FIGURE 2.4 Unit step digital signal u(n).

Unit step (Figure 2.4):

u(n) = 0 for n < 0, and u(n) = 1 for n ≥ 0.

2.2.3 Representation of Any DT Signal

Any general arbitrary DT signal can be represented by the following formula:

x(n) = Σ_{k=−∞}^{∞} x(k)δ(n − k).   (2.12)

Example

A CT signal x(t) = cos(2πft) is converted to a DT signal:

x(n) = x(t)|_{t=nTs} = cos(2πft)|_{t=nTs} = cos(2πf n/fs).

The analog frequency f is converted to a digital frequency

ω = 2πf/fs, 0 ≤ ω ≤ π.

The advantages of digital signal processing over analog signal processing can be itemized as follows:

• Unlike analog circuits, the operations of digital circuits do not depend on the precise values of the digital signals. As a result, digital circuits are less sensitive to tolerances of component values and are fairly independent of temperature, aging, and most of the external parameters.


• Digital circuits can be reproduced easily in volume quantities and do not require any adjustments either during construction or later while in use. Digital circuits are amenable to full integration, and with recent advances in VLSI circuits, it has been possible to integrate highly sophisticated and complex digital signal processing schemes on a single chip.

• Unlike in analog processing, the digital processor signals and coefficients describing the processing operations are represented as binary words. Therefore, any desirable accuracy can be achieved by simply increasing the word length, subject to cost limitations. Moreover, dynamic ranges for signals and coefficients can be increased further by using floating-point arithmetic if necessary.

• Digital processing allows sharing of a given processor among a number of signals by time sharing (time-division multiplexing) or by frequency (frequency-division multiplexing), thus reducing the cost of processing per signal.

• Digital implementation permits easy adjustments of processor characteristics during processing, such as that needed for implementation of adaptive filters. Such adjustments can be simply carried out by periodically changing the coefficients of the algorithms representing the processor characteristics.

• Digital implementation allows the realization of certain characteristics not possible with analog implementation, such as exact linear phase, code-division multiplexing, and multirate signal processing.

• Digital signals can be cascaded without input/output loading problems, unlike analog signals.

• Digital signals can be stored almost indefinitely without any loss of information on various storage media such as magnetic tapes and discs and optical drives. On the other hand, stored analog signals deteriorate rapidly as time progresses and cannot be recovered in their original forms.

• Applicability of digital processing to very low-frequency signals such as those occurring in seismic applications (e.g., telluric signals), where the sizes of inductors and capacitors needed for analog processing would be physically very large.

• However, there are some disadvantages of digital signal processing over analog signal processing. Some are itemized here:
– Increased system complexity in the digital processing of analog signals because of the need for additional pre- and postprocessing devices such as A/D and D/A convertors and their associated filters and complex digital circuitry.
– The Nyquist criterion states that analog signals must be sampled at more than twice the maximum frequency of the analog signal. If this condition is not satisfied, then signal components with frequencies above half the sampling frequency appear as signal components below this particular frequency, totally distorting the input analog waveform. For digital signal processing, only a limited range of frequencies is available for processing. This property limits its application, particularly in the digital processing of analog signals.
– Digital systems are constructed using active devices that consume electrical power. On the other hand, a variety of analog processing algorithms are implemented using passive circuits employing inductors, capacitors, etc., that do not consume power. Moreover, active devices are less reliable than passive components.
– Another disadvantage of digital signal processing is due to the effects resulting from algorithms implemented with finite-precision arithmetic in hardware and software.

DT convolution is the output of an LTI system with impulse response h(n) and can be represented as

y(n) = Σ_{k=−∞}^{∞} h(k)x(n − k).

Therefore, any arbitrary input x(n) = Σ_{k=−∞}^{∞} x(k)δ(n − k) to an LTI system gives as output the linear convolution

y(n) = x(n) ∗ h(n),   (2.13)

y(n) = Σ_{k=−∞}^{∞} h(k)x(n − k).   (2.14)
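A small numerical MATLAB sketch of Equation 2.14, using arbitrary made-up sequences for x(n) and h(n):

h = [1 0.5 0.25];      % example impulse response h(n)
x = [1 2 3 4];         % example input sequence x(n)
y = conv(x, h);        % linear convolution y(n) = sum_k h(k) x(n - k)
disp(y)                % returns [1 2.5 4.25 6 2.75 1]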

The sampling process can be represented by multiplication of the analog signal x(t) by an impulse train (Figure 2.5) given by

p(t) = Σ_{n=−∞}^{∞} δ(t − nTs).


FIGURE 2.5 Impulse train p(t); Ts = sampling period, fs = 1/Ts = sampling frequency.

A sampled version of any signal x(t) is given by multiplying x(t) by p(t) (Figure 2.6), that is,

xs(t) = x(t)p(t) = sampled version of x(t)
      = x(t) Σ_{n=−∞}^{∞} δ(t − nTs) = Σ_{n=−∞}^{∞} x(t)δ(t − nTs) = Σ_{n=−∞}^{∞} x(nTs)δ(t − nTs).

xs(t) = x(nTs) is the sampled version of x(t). X(ω) is the Fourier transform of x(t). Note that X(ω) is bounded in frequency by ωc (Figure 2.7):

xs(t) = x(t)p(t),   (2.15)

Xs(ω) = (1/2π)[X(ω) ∗ P(ω)],   (2.16)

P(ω) = Σ_{k=−∞}^{∞} 2πCk δ(ω − kω0).   (2.17)

Ck are the Fourier coefficients of the periodic signal:

ω0 = 2π/Ts, Ck = 1/Ts,

Xs(ω) = (1/2π)[X(ω) ∗ Σ_{k=−∞}^{∞} ω0 δ(ω − kω0)] = Σ_{k=−∞}^{∞} (ω0/2π) X(ω − kω0).

FIGURE 2.6 The sampled signal xs(t) (impulses at t = nTs) and the frequency response X(ω) of x(t), bandlimited to |ω| ≤ ωc.


FIGURE 2.7 Bandlimited frequency response X(ω) of x(t), with cutoff ωc = 2πfmax.

If fs > 2fmax, then the original signal x(t) can be recovered by low-pass filtering, as shown in Figure 2.8.

If fs < 2fmax, then ALIASING will occur and the original signal x(t) CANNOT be recovered by low-pass filtering.
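A small MATLAB illustration of aliasing (with made-up frequencies): a 5 kHz tone sampled at 8 kHz produces exactly the same samples as a 3 kHz tone.

fs = 8000;                           % sampling frequency (Hz); Nyquist limit is 4000 Hz
f  = 5000;                           % tone frequency above fs/2
n  = 0:99;                           % 100 samples
x  = cos(2*pi*f*n/fs);               % sampled 5 kHz tone
xAlias = cos(2*pi*(fs - f)*n/fs);    % tone at the alias frequency fs - f = 3000 Hz
maxDiff = max(abs(x - xAlias))       % essentially zero: the two tones are indistinguishable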

2.2.4 Introduction to Z Transforms

The Z transform of the DT signal is defined as

X(z) = Σ_{n=−∞}^{+∞} x(n) z^{−n}.   (2.18)

The linear convolution of a DT signal x(n) with an impulse response h(n) is given by

x(n) ⊗ h(n) = Σ_{k=−∞}^{+∞} x(k)h(n − k).   (2.19)

Taking Z transforms, we obtain

y(n) = x(n) ⊗ h(n) = Σ_{k=−∞}^{+∞} x(k)h(n − k)  ⇒  Y(z) = X(z)H(z).   (2.20)

The output of an LTI system is the linear convolution of the system impulse response and the input signal. The output can be computed using various methods such as fast convolution (overlap-add or overlap-save), recursive convolution, and so on [1–5]. The recursive convolution methods discussed in Reference [4] are used in speech processing to save computations.

FIGURE 2.8 Frequency response Xs(ω) of the sampled signal xs(t): copies of X(ω) repeat at multiples of ω0 = 2πfs and do not overlap when fs > 2fmax (i.e., when ωc = 2πfmax < ω0/2).

2.2.5 Fourier Transform, Discrete Fourier Transform

The Fourier transform of a time-domain signal, f (t), is defined as

Analysis: F(ω) = ∫_{−∞}^{∞} f(t) e^{−jωt} dt,   (2.21)

Synthesis: f(t) = (1/2π) ∫_{−∞}^{∞} F(ω) e^{jωt} dω.   (2.22)

Example

Find the Fourier transform of the rectangular pulse x(t) of amplitude V for −T/2 ≤ t ≤ T/2 (and zero otherwise):

X(ω) = ∫_{−∞}^{∞} x(t) e^{−jωt} dt = ∫_{−T/2}^{T/2} V e^{−jωt} dt = 2V sin(ωT/2)/ω = TV sinc(ωT/2).

The discrete-time Fourier transform (DTFT) of a DT sequence x(n) is defined as

X(ω) = Σ_{n=0}^{N−1} x(n) e^{−jωn},   (2.23)

which is a continuous function of digital frequency ω.


The discrete Fourier transform (DFT) of a DT sequence x(n) is defined as

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j(2πnk/N)}  for k = 0, 1, 2, . . . , N − 1,   (2.24)

which is a discrete function of frequency k. Note that ω = 2πk/N. It can be related to the Z transform as

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j(2πnk/N)} = X(z)|_{z = e^{j(2πk/N)}}  for k = 0, 1, 2, . . . , N − 1.   (2.25)
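As a minimal MATLAB check of Equation 2.24 against the built-in FFT (using an arbitrary short test vector):

x = [1 2 3 4];
N = length(x);
n = 0:N-1;
Xfft = fft(x);                               % built-in DFT of the whole sequence
k = 3;                                       % check one bin, k = 3 (0-based index)
Xk = sum(x .* exp(-1j*2*pi*n*k/N));          % direct evaluation of Equation 2.24
err = abs(Xk - Xfft(k+1))                    % MATLAB indexing is 1-based, so bin k is Xfft(k+1)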

2.2.6 Digital Filter Structures

A general pole-zero digital filter is represented by the transfer function

H(z) = Y(z)/X(z) = B(z)/A(z) = (Σ_{i=0}^{M} bi z^{−i}) / (1 + Σ_{i=1}^{N} ai z^{−i}),

where ai, i = 1, 2, 3, . . . , N are the feedback filter coefficients and bi, i = 0, 1, 2, 3, . . . , M are the feedforward filter coefficients. This filter has M zeros and N poles. Note that when N = 0, the filter H(z) is a finite impulse response (FIR) filter; otherwise it is an infinite impulse response (IIR) filter. An FIR filter is shown in Figure 2.9. Also, when M = 0, the filter H(z) is an all-pole filter; otherwise it is a pole-zero filter.

This filter can be implemented in a variety of ways: direct-form I, direct-form II, transpose forms, parallel forms, cascade forms, and so on [1–3] (see Figures 2.10 and 2.11 for the direct forms). A lattice filter (Figure 2.12) can also be used to implement the general pole-zero digital filter. The lattice filter is typically used in speech applications.
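A minimal MATLAB sketch of passing a signal through such a pole-zero filter with the built-in filter function; the coefficients below are made up purely for illustration:

b = [0.5 0.2];           % feedforward coefficients b0, b1 (numerator B(z))
a = [1 -0.9 0.4];        % feedback coefficients 1, a1, a2 (denominator A(z))
x = randn(1, 100);       % test input signal
y = filter(b, a, x);     % implements y(n) = 0.5x(n) + 0.2x(n-1) + 0.9y(n-1) - 0.4y(n-2)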

FIGURE 2.9 FIR transversal filter structure: the input x(n) passes through a chain of unit delays z^{−1}; the samples x(n), x(n − 1), . . . , x(n − M) are weighted by b0, b1, . . . , bM and summed to form y(n).



FIGURE 2.10 Direct-form I implementation of the IIR digital filter.


FIGURE 2.11 Direct-form II implementation of the IIR digital filter.



FIGURE 2.12 Lattice filter implementation of the IIR digital filter.

An all-pole digital filter is represented by

1/A(z) = 1 / (1 + Σ_{i=1}^{N} ai z^{−i}),

where ai, i = 1, 2, 3, . . . , N are the filter coefficients. The filter has N poles. (Actually, the filter also has N zeros, all at zero.) This is equivalent to setting b0 = 1 and bi = 0 for all i = 1, 2, 3, . . . , M in the general pole-zero digital filter.

An all-zero digital filter is represented by

A(z) = 1 + Σ_{i=1}^{N} ai z^{−i},   (2.26)

where ai, i = 1, 2, 3, . . . , N are the filter coefficients. The filter has N zeros. (Actually, the filter also has N poles, all at zero.) An all-pole filter is the inverse of the all-zero filter.

Both all-pole and all-zero filters can be implemented in a variety of ways. The output of this FIR filter can be represented by the following:

y(n) = x(n) ⊗ h(n) = Σ_{k=−∞}^{+∞} x(k)h(n − k)  ⇒  Y(z) = X(z)H(z).   (2.27)


An FIR digital filter is represented by the transfer function

H(z) = Y(z)/X(z) = B(z) = Σ_{i=0}^{M} bi z^{−i}.

The coefficients represent the values of the FIR filter's impulse response, b0, b1, b2, . . . , bM−1, bM.

Note that a0 = 1.

2.3 Review of Stochastic Signal Processing

Consider two DT stochastic processes from which two sequences of random variables x(n) and y(m) are generated.

The statistical means (the first-order moments) of the random variables x(n) and y(m) are, respectively,

μx(n) = E[x(n)] = ∫_{−∞}^{∞} x px(x) dx   (2.28)

and

μy(m) = E[y(m)] = ∫_{−∞}^{∞} y py(y) dy,   (2.29)

where px(x) is the probability density function (pdf) of the random variable x(n) and py(y) is the pdf of the random variable y(m). For stationary DT stochastic processes, the first-order moments are not functions of time, that is,

μx(n) = μx and μy(m) = μy.

The cross-correlation function of a sequence of random variables sampled at time n, x(n), with another sequence of random variables sampled at time m, y(m), is defined as

rxy(n, m) = E[x(n)y∗(m)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y pxy(x, y) dx dy,   (2.30)

where the superscript in y∗(m) indicates the complex conjugate operation. If both samples are from stationary processes, then the correlation function will be a function of the time lag between the two time instants, that is,

rxy(n − m) = E[x(n)y∗(m)]. (2.31)


The autocorrelation function (the second-order moment) of a sequence of random variables sampled at time n, x(n), with another random variable from the same process sampled at time m, x(m), is defined as

rxx(n, m) = E[x(n)x∗(m)].

For a stationary DT stochastic process, it is a function only of the lag k = (n − m):

rx(k) = rxx(k) = E[x(n)x∗(n − k)]. (2.32)

Note that

rx(−k) = E[x(n)x∗(n + k)] = E[x(n − k)x∗(n)].

The cross-covariance function of a sequence of random variables sampled at time n, x(n), with another random variable from another process sampled at time m, y(m), is defined as

cxy(n, m) = E[(x(n) − μx(n))(y(m) − μy(m))∗]
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μx(n))(y − μy(m)) pxy(x, y) dx dy,   (2.33)

where μx(n) = E[x(n)] and μy(m) = E[y(m)] are the first-order moments (statistical means) of the random variables x(n) and y(m), respectively.

The autocovariance function of a sequence of random variables sampled at time n, x(n), with another random variable from the same process sampled at time m, x(m), is defined as

cxx(n, m) = E[(x(n) − μx(n))(x(m) − μx(m))∗].   (2.34)

For a stationary DT stochastic process, it is a function only of the lag k = (n − m):

cxx(n − m) = E[(x(n) − μx)(x(m) − μx)∗],
cxx(k) = E[(x(n) − μx)(x(n − k) − μx)∗].

For zero-mean processes, the autocovariance and autocorrelation are the same:

cxx(k) = E[x(n)x∗(n − k)] = rxx(k).   (2.35)
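A minimal MATLAB sketch of estimating rxx(k) from data (a biased sample estimate over a white-noise test signal; the signal and lag range are illustrative):

x = randn(1000, 1);                           % zero-mean white test signal
N = length(x);
maxLag = 10;
r = zeros(maxLag+1, 1);
for k = 0:maxLag
    r(k+1) = sum(x(1+k:N) .* x(1:N-k)) / N;   % estimate of rxx(k) = E[x(n)x(n-k)]
end
% r(1) approximates the variance rxx(0) (about 1 here); r(k+1) for k > 0 is near zero for white noise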


A wide-sense stationary (WSS) process is one that is stationary up to the second order, that is,

E[x(n)] = ∫_{−∞}^{∞} x px(x) dx = μx for all n,
E[x(n)x∗(n − k)] = rx(k) for all n, k.

Note that the variance is

σx² = E[(x(n) − μx)²] = E[x(n)x∗(n)] = rx(0) for zero mean.   (2.36)

2.3.1 Power Spectral Density

Power spectral density (PSD) is used to represent the statistical properties of a random signal in the frequency domain.

The Z transform of the autocorrelation sequence is defined by

Rxx(z) = Σ_{n=−∞}^{+∞} rxx(n) z^{−n}.   (2.37)

This is the complex PSD function. The Fourier transform of the autocorrelation sequence is defined by

Sxx(ω) = Σ_{n=−∞}^{+∞} rxx(n) e^{−jωn}.   (2.38)

This is the real PSD function. For a WSS signal x(n), the PSD function has the following properties:

• It is real and non-negative, that is, Sxx(ω) ≥ 0.
• It is periodic with period 2π, that is, Sxx(ω ± 2kπ) = Sxx(ω), for integer k.
• The power in the WSS signal x(n) in the frequency interval ω1 ≤ ω ≤ ω2 is given by

  Px(ω1, ω2) = ∫_{ω1}^{ω2} Sxx(ω) dω.


• The average power of the stochastic signal x(n) with PSD Sxx(ω) is

  P = (1/2π) ∫_{−π}^{π} Sxx(ω) dω.

• For a real input signal x(n), the spectral density function is symmetric, that is, Sxx(ω) = Sxx(−ω).

Taking the inverse Fourier transform of the real PSD function gives back the autocorrelation sequence

rxx(n) = (1/2π) ∫_{−π}^{π} Sxx(ω) e^{jωn} dω.   (2.39)

The spectral density function of speech signals is widely used in analysis, synthesis, and coding of speech signals, as we will see in later chapters.
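A rough MATLAB sketch of a periodogram-style PSD estimate of a single frame (a simple illustration, not one of the standardized estimators; the frame here is random test data standing in for speech):

fs = 8000;                                    % assumed sampling frequency (Hz)
x  = randn(1024, 1);                          % test frame (replace with a speech frame)
N  = length(x);
w  = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));    % Hamming window (see Section 2.5)
X  = fft(x .* w);                             % windowed DFT
Pxx = (abs(X).^2) / N;                        % periodogram estimate of the PSD
f   = (0:N-1)' * fs / N;                      % frequency axis (Hz); bins above fs/2 mirror below
plot(f(1:N/2), 10*log10(Pxx(1:N/2)))          % plot the first half of the spectrum in dB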

2.4 Response of a Linear System to a Stochastic Process Input

If a WSS input signal x(n) with PSD function Sxx(ω) is applied to a linear transversal system with impulse response h(n) and frequency response H(ω), then the output signal y(n) will have a PSD given by

Syy(ω) = Sxx(ω) |H(ω)|².

It is clear from this relationship that the system transfer function shapes the spectrum of the input to give the modified spectrum of the output signal. Figure 2.13 illustrates this relationship.

FIGURE 2.13 Response of an LTI system to a stochastic process input in the time domain and spectral domain: Syy(ω) = Sxx(ω)|H(ω)|².
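A quick numerical check of this relationship can be sketched in MATLAB by passing white noise through a simple filter and comparing an averaged periodogram of the output with |H(ω)|². The filter coefficients and segment length below are arbitrary illustrative choices, not values from the text.

% Sketch: verify Syy(w) = Sxx(w)|H(w)|^2 for a white-noise input (Sxx = 1)
N = 512; M = 200;                    % segment length and number of segments
b = [1 0.5 0.25];                    % an arbitrary FIR shaping filter h(n)
Syy = zeros(N,1);
for m = 1:M
    x = randn(N,1);                  % unit-variance white noise
    y = filter(b, 1, x);             % shaped output
    Syy = Syy + abs(fft(y)).^2 / N;  % accumulate periodograms of y
end
Syy = Syy / M;                       % averaged periodogram, an estimate of Syy(w)
H = fft(b.', N);                     % frequency response on the same grid
plot(abs(H).^2); hold on; plot(Syy, '--');
legend('S_{xx}|H(\omega)|^2', 'averaged periodogram of y');
xlabel('DFT bin'); ylabel('Power');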


2.5 Windowing

A time-window function w(n) is applied to the input signal x(n) and is used to modify its spectral properties. There are many window functions defined and used extensively in signal processing [1,3,5,6]. These include the following windows:

• Rectangular window
• Hamming window
• Hanning window
• Bartlett window
• Blackman window
• Kaiser window
• Chen window, and so on

The Rectangular window function is defined as

w(n) = 1 for 0 ≤ n ≤ N − 1, and 0 otherwise.

The Hamming window function is defined as

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)) for 0 ≤ n ≤ N − 1, and 0 otherwise.

The Hanning window function is defined as

w(n) = 0.5[1 − cos(2πn/(N − 1))] for 0 ≤ n ≤ N − 1, and 0 otherwise.

This is applied, for example, in LPC of speech (see Chapter 7). In that case we use 240 samples (a 30 ms frame): a symmetric Hanning window is applied to all samples, and then the coefficients are calculated using N = 240. The Hanning window and its frequency response are shown in Figure 2.14.

The Bartlett window function is defined as

w(n) = 2n/(N − 1) for 0 ≤ n ≤ (N − 1)/2,
w(n) = 2 − 2n/(N − 1) for (N − 1)/2 ≤ n ≤ N − 1,
and 0 otherwise.


FIGURE 2.14 Hanning window and its frequency response.

The Blackman window function is defined as

w(n) = 0.42 − 0.5 cos(2πn/(N − 1)) + 0.08 cos(2π(2n)/(N − 1)) for 0 ≤ n ≤ N − 1, and 0 otherwise.

The Kaiser window function is defined as

w(n) = I0(β√(1 − (2n/(N − 1) − 1)²)) / I0(β) for 0 ≤ n ≤ N − 1, and 0 otherwise,

where

I0(β) = Σ_{k=0}^{∞} (β/2)^{2k} / (k!)²

is the zeroth-order modified Bessel function of the first kind.


The Barnwell window function is a hybrid function that is defined as

w(n) = (n + 1)α^n for 0 ≤ n ≤ ∞, and 0 otherwise,

where 0 < α < 1. Alternatively, w(n) = (n + 1)α^n u(n).

The Chen window function is a hybrid function that is defined as

w(n) = 0 for n ≤ 0,
w(n) = sin(cn) for 0 < n ≤ N − 1,
w(n) = bα^{n−L−1} for n ≥ N + 1,

where α, b, and c are constants that must be found for a particular window specification.
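The commonly used fixed windows above can be generated directly from their defining formulas. The following MATLAB sketch builds a few of them (without relying on toolbox window functions) and overlays their magnitude spectra; the length N = 240 matches the LPC framing example mentioned earlier, but any length can be used.

% Sketch: generate rectangular, Hamming, Hanning, and Blackman windows from
% their defining formulas and compare their spectra.
N = 240;                                   % window length (30 ms at 8 kHz)
n = (0:N-1)';
w_rect  = ones(N,1);
w_hamm  = 0.54 - 0.46*cos(2*pi*n/(N-1));
w_hann  = 0.5*(1 - cos(2*pi*n/(N-1)));
w_black = 0.42 - 0.5*cos(2*pi*n/(N-1)) + 0.08*cos(4*pi*n/(N-1));
W   = abs(fft([w_rect w_hamm w_hann w_black], 4096));   % zero-padded spectra
WdB = 20*log10(W ./ max(W));               % normalize each column to 0 dB
plot(WdB(1:200,:)); grid on;
xlabel('Frequency bin'); ylabel('Magnitude (dB)');
legend('Rectangular','Hamming','Hanning','Blackman');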

2.6 AR Models for Speech Signals, Yule–Walker Equations

Earlier in this chapter, we recalled that if a WSS input signal x(n) with PSD function Sxx(ω) is applied to a linear transversal system with impulse response h(n) and frequency response H(ω), then the output signal y(n) will have a PSD given by

Syy(ω) = Sxx(ω) |H(ω)|².

This means that an autoregressive (AR) process can be generated by passing white noise through an all-pole filter. An AR process can also be analyzed by passing the AR signal through a linear predictor that is an FIR (or all-zero) filter. The output of the AR process analyzer is the prediction error. This signal will be like white noise if the order of the FIR filter is the same as or higher than the order of the AR process.
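As a sketch of this synthesis/analysis view, the MATLAB fragment below generates an AR process by filtering white noise through an all-pole filter, estimates the autocorrelation sequence, and solves the Yule–Walker normal equations for the predictor coefficients. The particular AR coefficients are arbitrary illustrative values, not taken from the text.

% Sketch: generate an AR(2) process and recover its coefficients by solving
% the Yule-Walker equations from the estimated autocorrelation.
a_true = [1 -1.5 0.7];                 % all-pole filter 1/A(z), arbitrary choice
x = filter(1, a_true, randn(10000,1)); % AR process = white noise -> 1/A(z)

p = 2;                                 % predictor (AR model) order
r = zeros(p+1,1);                      % biased autocorrelation estimates
for k = 0:p
    r(k+1) = sum(x(1+k:end) .* x(1:end-k)) / length(x);
end
R = toeplitz(r(1:p));                  % Yule-Walker system: R*a = r(2:p+1)
a_hat = R \ r(2:p+1);                  % predictor coefficients
A_est = [1; -a_hat]                    % estimated A(z); compare with a_true

e = filter(A_est, 1, x);               % prediction error (residual)
fprintf('Prediction gain: %.1f dB\n', 10*log10(var(x)/var(e)));

If the model order matches (or exceeds) that of the AR process, the residual e is approximately white, as stated above.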

2.7 Short-Term Frequency (or Fourier) Transform and Cepstrum

2.7.1 Short-Term Frequency Transform (STFT)

If short lengths (about 10 ms) of sampled speech are taken, we can assume stationarity.


A very important application of digital signal processing to speech signals is that time and frequency analysis methods are used in the study of human speech.

A time-window w(n) is applied to the input signal x(n). The Short-Term Frequency (or Fourier) Transform (STFT) is defined as

XSTFT(ω, n) = Σ_{k=−∞}^{+∞} x(n − k) w(k) e^{−jωk}.

Note that the STFT is a time-varying function of frequency ω. It is periodic with period 2π. For a rectangular window where w(n) = 1 over the duration, the STFT reduces to the DTFT of the input signal x(n). The STFT plays an important role in visualizing the time-varying frequency content of various signals such as speech. One can use a longer sliding time window to obtain higher spectral resolution and a shorter time window to achieve better temporal resolution.
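A direct implementation of this definition is sketched below: a Hamming window slides over the signal and a DFT is taken of each windowed frame, which is essentially what the spectrogram example later in this chapter displays. The frame length, hop size, and stand-in signal are illustrative choices.

% Sketch: short-term Fourier transform of a signal sampled at 8 kHz
fs  = 8000;
x   = randn(8000,1);                 % stand-in signal; substitute a speech vector
L   = 256;  hop = 128;  nfft = 512;  % frame length, hop, and DFT size
w   = 0.54 - 0.46*cos(2*pi*(0:L-1)'/(L-1));   % Hamming window
nFrames = floor((length(x) - L)/hop) + 1;
X = zeros(nfft, nFrames);
for m = 1:nFrames
    seg = x((m-1)*hop + (1:L)) .* w;         % windowed frame
    X(:, m) = fft(seg, nfft);                % one column per time position
end
imagesc((0:nFrames-1)*hop/fs, (0:nfft/2)*fs/nfft, ...
        20*log10(abs(X(1:nfft/2+1, :)) + eps));
axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)');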

2.7.2 The Cepstrum

The cepstrum of a signal is the Fourier transform of the logarithm of its power spectrum [7,8]. The term “cepstrum” (derived from the word “spectrum”) indicates the fact that this transform does not take us back to the time domain, but into a new domain named “quefrency” (derived from the word “frequency”).

The cepstrum can also be used for pitch determination. See References [7,8] for details.
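The following fragment sketches cepstral pitch determination on one frame: take the log magnitude spectrum, transform it again, and locate the dominant peak in the quefrency range of plausible pitch periods (assumed here to be 50–400 Hz). The synthetic impulse-train frame is only a stand-in for a voiced speech frame.

% Sketch: cepstrum-based pitch estimation for a single frame
fs = 8000;
T0 = 80;                               % true period of the synthetic frame (samples)
frame = filter(1, [1 -0.9], double(mod(0:511, T0)' == 0));  % crude voiced frame
c = real(ifft(log(abs(fft(frame)) + eps)));   % real cepstrum of the frame
qmin = round(fs/400);  qmax = round(fs/50);   % search the 50-400 Hz pitch range
[~, idx] = max(c(qmin:qmax));
pitchPeriod = qmin + idx - 1;                 % quefrency of the cepstral peak
fprintf('Estimated pitch period: %d samples (%.1f Hz)\n', ...
        pitchPeriod, fs/pitchPeriod);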

2.8 Periodograms

A time-window w(n), n = 0, 1, 2, . . . , N − 1, is applied to the input signal x(n). Let us define the windowed signal as

xN(n) = x(n)w(n), n = 0, 1, 2, . . . , N − 1.

The DTFT of xN(n) is

XN(ω) = Σ_{n=0}^{N−1} xN(n) e^{−jωn}, (2.40)

which is a continuous function of frequency ω.


The periodogram of xN(n) is

IN(ω) = (1/N) |XN(ω)|².

If we use a rectangular window w(n) = 1, n = 0, 1, 2, . . . , N − 1, so that xN(n) = x(n), n = 0, 1, 2, . . . , N − 1, then the sample autocorrelation function is

rN(n) = (1/N) Σ_{k=−∞}^{+∞} xN(n + k) xN(k) = (1/N) xN(n) ∗ xN(−n),
rN(n) = 0 for |n| ≥ N,
rN(0) = (1/N) Σ_{n=0}^{N−1} xN²(n) = (1/2π) ∫_{−π}^{π} IN(ω) dω,

and the periodogram can be seen as an asymptotically unbiased estimator of the PSD, S(ω), that is,

lim_{N→∞} E[IN(ω)] = S(ω).

Example

Compute and plot the periodogram of a portion of the speech signal in Figure 1.2. The MATLAB® code is shown here.

% Example Program
% Spectrogram of a Speech Signal
%
load speech
n = 1:4000;
plot(n, speech);
xlabel('Time index n');
ylabel('Amplitude');
pause
nfft = input('Type in the window length = ');
ovlap = input('Type in the desired overlap = ');
specgram(speech, nfft, 7418, hamming(nfft), ovlap)

The plots are shown in Figures 2.15 and 2.16 with different parameters.

2.9 Spectral Envelope Determination for Speech Signals

The auditory system in humans depends on the power of the speech signals in the frequency domain. Therefore, the spectral envelope of speech signals is very important.


FIGURE 2.15 PSD of a portion of speech signal from Figure 1.2 (spectrogram with window = 256, overlap = 50; frequency in Hz versus time in seconds). (See color insert following page 104.)

FIGURE 2.16 Spectrogram of a portion of speech signal from Figure 1.2 (window = 1024, overlap = 100; frequency in Hz versus time in seconds). (See color insert following page 104.)


The spectral envelope of speech signals can be determined by using the PSD discussed earlier, or by using the limit of the periodogram as N goes to infinity as an estimator of the PSD.

2.10 Voiced/Unvoiced Classification of Speech Signals

Speech can be classified into voiced and unvoiced parts for simplicity of processing [9]. The voiced–unvoiced classification of speech is a very important signal processing operation. Any errors in the classification will affect the quality and fidelity of the coded speech, and the synthesized speech may not be intelligible. Next, some time-domain methods are described.

2.10.1 Time-Domain Methods

There are many methods for speech voiced/unvoiced classification. Most of them can be achieved by computing some or all of these voicing parameters in the time domain:

i. Periodic similarity
ii. Frame energy
iii. Pre-emphasized energy ratio
iv. Low- to full-band energy ratio
v. Zero crossing
vi. Prediction gain
vii. Peakiness of speech
viii. Spectrum tilt

2.10.1.1 Periodic Similarity

Periodic similarity is defined as

PS[m] = [Σ_{n=m−N+1}^{m} s(n)s(n − T)]² / (Σ_{n=m−N+1}^{m} s²(n) · Σ_{n=m−N+1}^{m} s²(n − T)).

It measures the similarity between samples of speech signals in consecutive pitch cycles. A value of 0 means no similarity and a value of 1 means total similarity. Voiced speech has more similarity than unvoiced speech.


2.10.1.2 Frame Energy

The energy in a speech frame can be used to make the voiced/unvoiced decision. In voiced frames, the energy is typically many orders of magnitude higher than in an unvoiced frame.

For a speech signal s(n), the energy in a frame of length N ending at time instant m is given by

E[m] = Σ_{n=m−N+1}^{m} s²(n).

Alternatively, the magnitude sum function can be used instead of the energy function. It is given by

MSF[m] = Σ_{n=m−N+1}^{m} |s(n)|.

A low-pass filter of bandwidth 800 Hz can be used to bandlimit the speech before computing the energy or the magnitude sum function so as to reduce computations. This is valid because the energy of the speech signal is mostly below 3400 Hz and the highest pitch frequency is about 500 Hz.

2.10.1.3 Pre-Emphasized Energy Ratio

In voiced frames, the pre-emphasized energy ratio defined as

Pr[m] = Σ_{n=m−N+1}^{m} |s(n) − s(n − 1)| / Σ_{n=m−N+1}^{m} |s(n)|

is typically lower than in an unvoiced frame.

2.10.1.4 Low- to Full-Band Energy Ratio

In voiced frames, the ratio of low- to full-band energy of a frame of speech, defined as

LF[m] = Σ_{n=m−N+1}^{m} s²lpf(n) / Σ_{n=m−N+1}^{m} s²(n),

is close to 1, whereas in an unvoiced frame the ratio is substantially lower. The signal slpf(n) is the speech signal low-pass filtered at 1 kHz.

2.10.1.5 Zero Crossing

The number of zero crossings can be used to decide if a frame is voiced or unvoiced.


For voiced speech, the number of zero crossings in a frame is relatively low, whereas for unvoiced speech, the number of zero crossings in a frame is relatively high.

For a speech signal s(n), the number of zero crossings in a frame of length N ending at time instant m is given by

ZC[m] = (1/2) Σ_{n=m−N+1}^{m} |sgn(s(n)) − sgn(s(n − 1))|.

2.10.1.6 Prediction Gain

Voiced frames of speech signals, which are more predictable than unvoiced frames, typically achieve a prediction gain 3 dB or more higher than unvoiced frames. The prediction gain is defined as the ratio of the energy of the speech signal to the energy of the prediction error. It is given by

PG[m] = 10 log10 (Σ_{n=m−N+1}^{m} s²(n) / Σ_{n=m−N+1}^{m} e²(n)).

2.10.1.7 Peakiness of Speech

Voiced frames of speech signals usually contain regular pulses that are not present in unvoiced speech frames. The LPC residual (discussed in Chapters 6 and 7) can be used to compute the so-called peakiness of speech as follows:

PK[m] = √[(1/N) Σ_{n=m−N+1}^{m} r²(n)] / [(1/N) Σ_{n=m−N+1}^{m} |r(n)|],

where r(n) is the LPC residual signal.

2.10.1.8 Spectrum Tilt

The spectral tilt is defined as

St[m] = Σ_{n=m−N+1}^{m} s(n)s(n − 1) / Σ_{n=m−N+1}^{m} s²(n).

2.10.2 Frequency-Domain Methods

It is possible to use other frequency-domain parameters for speech voiced/unvoiced classification. Examples are


i. Normalized distance in frequency bands
ii. Normalized correlation between frequency bands

These are not as popular as the time-domain methods and will not be discussed further here.

2.10.3 Voiced/Unvoiced Decision Making

These voicing parameters are fed into a voiced/unvoiced decision-making algorithm. This can be as simple as a majority vote or a weighted combination of the voicing parameters, as shown in Figure 2.17.

2.11 Pitch Period Estimation Methods

In the linear predictive model of speech generation, we assume that the speech is generated by passing periodic (or quasiperiodic) pulses through a linear filter to generate the voiced sounds and passing white noise excitation through a linear filter to generate the unvoiced sounds. The period of the periodic pulses is the pitch period, T0. The inverse of the pitch period is the pitch fundamental frequency, F0.

Estimating the pitch of a speech signal is a very important operation. This is because the performance of many other speech signal processing tasks depends on the correct estimation of the pitch period. A bad estimate of the pitch period may lead to poor speech quality.

Bad estimates of the pitch period can occur because speech signals are nonperiodic and nonstationary. The pitch period may be quasiperiodic because the period, defined as the time between two consecutive excitation pulses, varies slowly.

The methods of pitch determination can be classified into three categories:

i. Time-domain methods
ii. Frequency-domain methods
iii. Mixed time- and frequency-domain methods

FIGURE 2.17 Voiced/unvoiced decision implementation: voicing parameters 1, 2, . . . , n feed a decision algorithm that outputs the V/UV decision.


For a completely stationary and periodic signal, many of the pitch estimation algorithms should produce the same result. However, since this is strictly not the case, some methods produce better results than others.

Examples of time-domain methods are

• Autocorrelation methods
• Average magnitude difference (AMDF) method

Examples of frequency-domain methods are

• Harmonic peak detection method
• Spectrum similarity method
• STFT method
• Cepstrum method

Examples of mixed time- and frequency-domain methods are

• Spectral estimation method
• Spectrum synthesis method
• Wigner–Ville distribution method (it is based on both time and frequency)

In References [8,10], the Wigner–Ville time-frequency distribution was also used for pitch determination with good results.

Next, we discuss a few of these examples, especially time-domain methods.

Time-domain methods of pitch determination

• AMDF method

The AMDF method depends on minimizing the function

A(τ) = Σ_{n=0}^{N−1} |x(n) − x(n + τ)|,

where τ is the lag. It is computed over a given predetermined range for τ. The value of τ minimizing A(τ) is the pitch period.

• Autocorrelation methods

The following metric is used:

E(τ) = (1/N) Σ_{n=0}^{N−1} [x(n) − x(n + τ)]².


For stationary speech input, x(n) = x(n + τ), this criterion can be rewritten in terms of the autocorrelation as

E(τ) = R(0) − R(τ),

where

r(τ) = Σ_{n=0}^{N−1} x(n)x*(n + τ).

Minimization of E(τ) in the above equation is equivalent to maximizing the autocorrelation function r(τ) of the speech signal. The value of the lag τ that gives the maximum r(τ) is the pitch.

A generalized version of the autocorrelation method and the AMDF method is given by the measure

E(τ) = (1/N) {Σ_{n=0}^{N−1} |x(n) − x(n + τ)|^k}^{1/k},

where k is an arbitrary constant but is typically chosen to be 1, 2, or 3. A choice of k = 2 implies the autocorrelation method, which is superior to the AMDF method. A normalized similarity version of this criterion that considers the effect of nonstationarity of speech signals is given by

E(τ) = (1/N) Σ_{n=0}^{N−1} [x(n) − βx(n + τ)]²,

where β is a pitch gain that controls the changes in the signal level. The optimum β is obtained by setting the derivative to zero, resulting in

β = Σ_{n=0}^{N−1} x(n)x(n + τ) / Σ_{n=0}^{N−1} x²(n + τ).

Using this optimum gain in the previous equation, the pitch can be estimated by minimizing the function

E(τ, β) = Σ_{n=0}^{N−1} x²(n) − [Σ_{n=0}^{N−1} x(n)x(n + τ)]² / Σ_{n=0}^{N−1} x²(n + τ),

which is equivalent to maximizing only the second term

Rn²(τ) = [Σ_{n=0}^{N−1} x(n)x(n + τ)]² / Σ_{n=0}^{N−1} x²(n + τ).
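The sketch below compares the AMDF and autocorrelation criteria over a range of candidate lags for one frame. The synthetic quasi-periodic input (period 80 samples) stands in for a voiced speech frame, and the 20–160 sample lag search range is an arbitrary illustrative choice.

% Sketch: pitch period estimation by AMDF and by autocorrelation
N = 400;  trueT = 80;
x = sin(2*pi*(0:N+200-1)'/trueT) + 0.1*randn(N+200,1);  % quasi-periodic frame
lags = 20:160;
A = zeros(size(lags));  R = zeros(size(lags));
for i = 1:numel(lags)
    tau  = lags(i);
    A(i) = sum(abs(x(1:N) - x(1+tau:N+tau)));   % AMDF criterion
    R(i) = sum(x(1:N) .* x(1+tau:N+tau));       % autocorrelation criterion
end
[~, iA] = min(A);  [~, iR] = max(R);
fprintf('AMDF pitch = %d, autocorrelation pitch = %d samples\n', ...
        lags(iA), lags(iR));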


2.12 Summary

In this chapter, we began by reviewing LTI system theory. Then we reviewed salient points in digital signal processing and stochastic signal processing theory. Our goal was to recall them and later apply these DSP and stochastic signal processing principles to speech processing in subsequent chapters.

EXERCISE PROBLEMS

2.1. Review of LTI Systems Consider a time-varying DT system described by

y(n) = 2n x(n − 1) + 2n² x(n − 2) − x(n − 3).

Determine if this system is linear. Explain why (or why not).

2.2. Review of LTI Systems For each of the systems described by the following input–output relationships, determine if the system (i) has memory, (ii) is invertible, (iii) is bounded-input–bounded-output (BIBO) stable, (iv) is time-invariant, and (v) is linear. Explain why (or why not).

a. y(t) = ∫_{−∞}^{+∞} 3x(τ − α) dτ, where α is a constant
b. y(t) = loge[2x(t)]
c. y(t) = 3x(2t − 5)
d. y(t) = cos[x(t − 5)]
e. y(t) = cos[(2t)x(t − 1)]
f. y(n) = |5x(n)|
g. y(n) = loge[x(n)]
h. y(n) = Σ_{k=−∞}^{n+1} x(k)
i. y(n) = cos[x(n)]
j. y(n) = x(n − 1)

2.3. Review of DSP The samples of a digital signal are spaced 0.2 × 10^{−5} s apart. What is the maximum possible frequency content of the original analog signal in order to avoid aliasing?

2.4. Review of DSP

a. What are any three advantages of digital signal processing over analog signal processing?

b. What are any three disadvantages of digital signal processing over analog signal processing?

2.5. Review of DSP

a. The DT signal x(n) = cos(3πn/8), −∞ ≤ n ≤ +∞, was obtained by sampling an analog signal x(t) = cos(2πft), −∞ ≤ t ≤ +∞, at a sampling rate of fs = 8 kHz.

What are any two possible values of the analog frequency, f, that could have resulted in the sequence x(n)?


b. A certain CT seismic signal is 3 min long. The spectrum of the signal ranges from dc to 400 Hz. This analog signal is to be sampled and converted to a digital signal for digital signal processing on a DSP processor system. (i) What is the theoretical minimum number of samples that must be taken? (ii) Assume that each sample is to be represented by an 8-bit binary number. What is the minimum storage in bits required to handle this signal?

2.6. Review of Stochastic Signal Processing Using MATLAB, plot the PSD of a portion of the speech signal in Figure 1.2.

2.7. Review of Stochastic Signal Processing Use MATLAB to explore the effect of window length on the spectrogram of speech signals.

2.8. Pitch Period Estimation Methods Using the sample speech signal in Figure 1.2, compute the pitch using the methods of both AMDF and autocorrelation. Try the generalized autocorrelation for k = 1. Compare the result with the AMDF method. Write a MATLAB program for doing this example.

References

1. Proakis, J. and D. Manolakis, Digital Signal Processing, 4th edition, Pearson Prentice Hall, Englewood Cliffs, NJ, 2007.

2. Phillips, P. and E. Riskin, Signals, Systems and Transforms, 4th edition, Prentice Hall, Englewood Cliffs, NJ, 2008.

3. Mitra, S.K., Digital Signal Processing: A Computer-Based Approach, 3rd edition, McGraw-Hill, New York, 2006.

4. Chu, W.C., Speech Coding Algorithms: Foundation and Evolution of Standardized Speech Coders, Wiley-Interscience, New York, 2003.

5. Rabiner, L.R. and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, pp. 53–60, 1978.

6. Harris, F.J., On the use of windows for harmonic analysis with the discrete Fourier transform, Proceedings of the IEEE, 66(1), 51–83, 1978.

7. Noll, A.M., Cepstrum pitch determination, Journal of the Acoustical Society of America, 41, 293–309, 1967.

8. Zhao, W., Speech processing applications using time-frequency distribution and cepstrum, PhD dissertation, Santa Clara University, Santa Clara, CA, December 1999.

9. Kondoz, A.M., Digital Speech: Coding for Low Bit-Rate Speech, 2nd edition, Wiley, New York, 2004.

10. Zhao, W. and T. Ogunfunmi, A pitch and formant estimation method based on the Wigner–Ville distribution, International Journal of Speech Technology, 3(1), 35–49, 1999.


Bibliography

1. Allen, J.B., Short term spectral analysis, synthesis and modification by discrete Fourier transform, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-25(3), 235–238, 1977.

2. Stephan, M.S., Pitch scaling using the Fourier transform, http://www.dspdimension.com/html/pscalestft.html, 1999.

3. Campbell, J.P., Jr. and T.E. Tremain, Voiced/unvoiced classification of speech with applications to the US government LPC-10E algorithm, Proceedings of IEEE ICASSP, 473–476, 1986.

4. Hess, W.J., Pitch Determination of Speech Signals: Algorithms and Devices, Springer, New York, 1983.

5. Rabiner, L.R., On the use of autocorrelation analysis for pitch detection, IEEE Transactions on Acoustics, Speech and Signal Processing, 25, 24–33, 1977.

6. Markel, J.D., The SIFT algorithm for fundamental frequency estimation, IEEE Transactions on Acoustics, Speech and Signal Processing, 20, 149–153, 1972.

7. Bracewell, R.N., The Fourier Transform and Its Applications, McGraw-Hill, New York, 1978.


3 Sampling Theory

3.1 Introduction

Before a speech waveform can be coded for transmission or storage purposes, it must first be represented by its sampled values. Since this topic is discussed extensively in most digital signal processing and communication textbooks, only a handful of specialized topics related to this area will be treated here. These include the Nyquist sampling theorem, the reconstruction of a CT waveform from its sampled values, the need for antialiasing filters before sampling, the effect of sampling clock jitter, and the sampling and reconstruction of random signals.

3.2 Nyquist Sampling Theorem

The Nyquist sampling theorem stipulates the minimum sampling frequency needed to represent a CT signal with an equivalent DT series. As such, it forms the cornerstone of all modern digital signal processing and communication systems. In essence, it states that a bandlimited signal can be recovered from its samples if the sampling frequency is at least twice the highest frequency of the signal. A straightforward way to derive this result is to find the spectrum of the sampled signal and determine the conditions under which its integrity is preserved.

Suppose we sample a CT signal x(t) with a clock of period T, which corresponds to a radian frequency Ω = 2π/T, and would like to find the conditions under which the original waveform can be recovered. Although the sampling apparatus, which normally resides within the analog-to-digital converter (ADC), yields a numerical sequence x(nT) corresponding to the sampled values of the signal x(t), it is analytically more convenient (and equivalent) to represent the resulting DT sequence as a CT impulse train:

xs(t) = Σn x(nT) δ(t − nT) = x(t) Σn δ(t − nT). (3.1)


FIGURE 3.1 Spectrum of the continuous signal x(t) and its sampled version. (X(ω) has peak value A and is confined to |ω| ≤ Ω/2; Xs(ω) has peak value A/T and repeats at intervals of Ω.)

Using the Fourier transform relationship

Σn δ(t − nT) ↔ Ω Σn δ(ω − nΩ), (3.2)

we can obtain the spectrum of xs(t) by applying the convolution theorem as follows:

Xs(ω) = (1/2π) ∫_{−∞}^{∞} X(ω − y) [Ω Σn δ(y − nΩ)] dy = (1/T) Σn X(ω − nΩ). (3.3)

Thus the spectrum of the sampled signal consists of the repetitions of the original signal spectrum X(ω) at intervals of Ω, as illustrated in Figure 3.1. It can be easily verified that if X(ω) is bandlimited to Ω/2 rad/s, or (1/2T) Hz, then these repetitions do not overlap, and the integrity of the original signal spectrum is preserved. In such a situation, it is possible to recover the original signal from the sampled values.

3.3 Reconstruction of the Original Signal: Interpolation Filters

Assume, for the time being, that we can actually generate the impulse sequence xs(t) corresponding to the sampled values. Let this sequence be passed through an ideal brick-wall low-pass filter (LPF) with the transfer function

H(ω) = T for |ω| ≤ π/T, and 0 otherwise, (3.4)


whose impulse response is given by

h(t) = sin(πt/T) / (πt/T). (3.5)

It is clear from the spectral plots of Figure 3.1 that the output of this filter yields the original signal if x(t) is properly bandlimited to start with. We can once again use the convolution theorem to compute the output of the filter as shown below:

x(t) = ∫_{−∞}^{∞} xs(τ) h(t − τ) dτ = ∫_{−∞}^{∞} [Σn x(nT) δ(τ − nT)] h(t − τ) dτ
= Σn x(nT) ∫_{−∞}^{∞} δ(τ − nT) h(t − τ) dτ = Σn x(nT) h(t − nT)
= Σn x(nT) sin(π(t − nT)/T) / (π(t − nT)/T). (3.6)

This is the classical interpolation formula for reconstructing the continuous signal x(t) from its sampled values x(nT).
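A direct numerical illustration of this formula is sketched below: a bandlimited signal is sampled and then rebuilt on a fine time grid by summing shifted sinc pulses. The signal frequency, sampling rate, and grid spacing are arbitrary illustrative values, and the finite block of samples causes some truncation error near the block edges.

% Sketch: reconstruct a bandlimited signal from its samples using Eq. 3.6
T  = 1e-3;                         % sampling period (1 kHz sampling)
n  = -20:20;                       % a finite block of sample indices
f0 = 150;                          % signal frequency, well below 1/(2T)
xn = cos(2*pi*f0*n*T);             % the samples x(nT)
t  = -0.02:1e-5:0.02;              % dense reconstruction grid
xr = zeros(size(t));
for k = 1:numel(n)                 % sum of shifted sinc interpolants
    arg = pi*(t - n(k)*T)/T;
    s   = ones(size(arg));  s(arg~=0) = sin(arg(arg~=0))./arg(arg~=0);
    xr  = xr + xn(k)*s;
end
plot(t, cos(2*pi*f0*t), t, xr, '--', n*T, xn, 'o');
legend('original', 'reconstructed', 'samples');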

3.4 Practical Reconstruction

Since it is not possible to generate the infinite-amplitude impulse sequence represented by xs(t), the usual method for reconstructing x(t) is shown in Figure 3.2. Here, a digital-to-analog converter (DAC) generates a staircase waveform corresponding to the sampled values. Thus the input to the reconstruction filter is different from the impulse sequence that was assumed in the previous section. The transfer function of this filter therefore has to be modified from the ideal brick-wall characteristics to account for the hold effect

FIGURE 3.2 Practical reconstruction of the continuous signal x(t) from its sampled values: the samples x(nT) drive a DAC whose staircase output xdac(t) is smoothed by the reconstruction filter.


of the DAC. Specifically, the DAC can be modeled as a zero-order hold with impulse response

hDAC(t) = 1 for 0 ≤ t < T, and 0 otherwise, (3.7)

and with the corresponding frequency response

|HDAC(ω)| = T |sin(ωT/2) / (ωT/2)|. (3.8)

The reconstruction filter must now equalize the high-frequency roll-off caused by the zero-order hold in addition to rejecting the alias components. Its spectral magnitude is thus given by

|H(ω)| = |(ωT/2) / sin(ωT/2)| for |ω| ≤ π/T, and 0 otherwise. (3.9)

It should be a linear-phase filter, with the absolute delay adjusted to yield a good approximation of the noncausal infinite-duration ideal response.

3.5 Aliasing and In-Band Distortion

If the input signal x(t) is not bandlimited to a radian frequency Ω/2, the reconstructed version will be distorted due to aliasing, where the higher frequency components of the input signal masquerade as low frequency components. This situation is illustrated in Figure 3.3. Assuming the ideal brick-wall reconstruction filter (Equation 3.4), the average distortion power due to aliasing can be computed in the frequency domain as (Problem 3.2)

(1/2π) Σ_{n≠0} ∫_{−Ω/2}^{Ω/2} |X(ω − nΩ)|² dω = (1/π) ∫_{Ω/2}^{∞} |X(ω)|² dω. (3.10)

In deriving the above equation, we have assumed that the distortion components folding into the Nyquist band from the various spectral replicas are uncorrelated, and hence they can be simply added up on a power basis. In addition to the aliasing distortion, there is also an in-band distortion component in the reconstructed signal, as the higher frequency elements of the original input signal are not faithfully recovered.


FIGURE 3.3 Spectrum of x(t) and its sampled version when the signal bandwidth exceeds the Nyquist limit (the spectral replicas overlap).

The average power of this in-band distortion is the power of the input signal at frequencies exceeding the Nyquist limit, and hence it is given by

(1/2π) ∫_{|ω|>Ω/2} |X(ω)|² dω = (1/π) ∫_{Ω/2}^{∞} |X(ω)|² dω. (3.11)

Comparing Equations 3.10 and 3.11, it is clear that these two distortion components are equal in magnitude. Thus the total distortion in the reconstructed signal will be (2/π) ∫_{Ω/2}^{∞} |X(ω)|² dω. But we know that the aliasing distortion component (Equation 3.10) can be completely eliminated if X(ω) is bandlimited to Ω/2 rad/s. Hence, by applying an LPF that attenuates frequencies higher than Ω/2 rad/s to the input signal, the total distortion can be limited to just the in-band component (Equation 3.11), thereby improving the SNR by 3 dB.

3.6 Effect of Sampling Clock Jitter

The effect of sampling jitter can be accounted for by assuming that the actual sampling point is a random variable about the ideal sampling instance. Let t0 and tj denote the ideal and the jittered sampling instances, respectively, of the input signal x(t). The amplitude error due to sampling at the jittered instance can be approximated as

e = x′(t0)(tj − t0), (3.12)

where x′(t0) is the derivative of the input signal at time t0. Since the sampling jitter is assumed to be a random variable independent of the input signal, the


mean square amplitude error due to it can be estimated as

E(e²) = E{[x′(t0)(tj − t0)]²} = E{[x′(t0)]²} E{[(tj − t0)]²}. (3.13)

If we assume that the jitter is zero-mean and uniformly distributed in the range [−τ/2, +τ/2], then

E{[(tj − t0)]²} = τ²/12 (3.14)

and the mean square error (MSE) is

E(e²) = (τ²/12) E{[x′(t0)]²}. (3.15)

To appreciate the usefulness of this expression, assume a sinusoidal input signal x(t) = A sin 2πft. In this case x′(t) = 2πfA cos 2πft and E(e²) = π²f²τ²A²/6. Since the power of the sine wave is A²/2, we can obtain an expression for the SNR due to jitter as

SNR = E(x²)/E(e²) = (A²/2)/(π²f²τ²A²/6) = 3/(π²f²τ²). (3.16)

It is more convenient to express this SNR in dB, where we define SNR (dB) = 10 log10(SNR). With this definition, the SNR for a sinusoidal signal of frequency f (Hz), due to jitter distributed uniformly in the range [−τ/2, +τ/2], simplifies to

SNR (dB) = −20 log( f τ) − 5.17. (3.17)

As an example, if f = 1 MHz and τ = 1 ns, the SNR due to jitter is about 55 dB. While digitizing this signal, we need to make sure that the quantization noise is well below this value. It is shown in Chapter 4 that the maximum achievable signal to quantization noise ratio (SQNR) for a sinusoidal signal, using an n-bit uniform ADC, is (6n + 1.8) dB. Therefore, a 10-bit ADC would be adequate in this case. Employing a more precise ADC does not yield any improvement in the performance.

If we assume that the jitter distribution is zero-mean Gaussian with standard deviation σ, instead of uniform, then the SNR for a sine wave of frequency f (Hz) can be shown to be

SNR (dB) = −20 log( f σ) − 15.96. (3.18)

Again, for example, if f = 1 MHz and σ = 1 ns, the SNR due to Gaussian-distributed jitter amounts to 44 dB. An 8-bit ADC would be satisfactory in this situation.
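These two cases can be reproduced with a few lines of MATLAB. The snippet simply evaluates Equations 3.17 and 3.18 and, as a rough guide only, converts each result into the smallest n for which the (6n + 1.8) dB SQNR rule quoted above is not below the jitter-limited SNR; that conversion rule is an assumption of this sketch, not part of the text.

% Sketch: jitter-limited SNR for a sinusoid and an equivalent ADC resolution
f   = 1e6;                 % signal frequency (Hz)
tau = 1e-9;  sigma = 1e-9; % uniform jitter span and Gaussian jitter std (s)
snr_uniform  = -20*log10(f*tau)   - 5.17;    % Eq. 3.17
snr_gaussian = -20*log10(f*sigma) - 15.96;   % Eq. 3.18
bits = @(snr) ceil((snr - 1.8)/6);           % smallest n with SQNR >= snr
fprintf('Uniform jitter : %.1f dB (about a %d-bit ADC)\n', ...
        snr_uniform, bits(snr_uniform));
fprintf('Gaussian jitter: %.1f dB (about a %d-bit ADC)\n', ...
        snr_gaussian, bits(snr_gaussian));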


3.7 Sampling and Reconstruction of Random Signals

Consider a bandlimited (in the power spectrum sense) finite-power CT random process X(t). Sampling this at time intervals T results in a DT random process Y(kT). Assuming X(t) is WSS, its autocorrelation function is given by

RX(τ) = E{X(t + τ)X∗(t)}. (3.19)

The autocorrelation function of the sampled process can be written as

RY(mT, nT) = E{X(mT)X∗(nT)} = RX(kT), (3.20)

where k = m − n. Thus Y(kT) is also WSS, and its autocorrelation is just the sampled version of the original autocorrelation function. Therefore, the power spectrum SY(e^{jωT}) of the sampled process is simply the repetitions of the original power spectrum SX(ω) at intervals of Ω = 2π/T:

SY(e^{jωT}) = (1/T) Σn SX(ω − nΩ). (3.21)

From the reconstruction theory for deterministic signals, if SX(ω) is bandlimited to Ω/2 rad/s, we can recover the original power spectrum by filtering Y(kT) with the ideal brick-wall filter. This, however, does not imply that we actually recover the original random process X(t), as two identical power spectrums may correspond to entirely different processes. It is therefore enlightening to explore how well the reconstructed process approximates the original.

Toward this end, assume that the sampled process Y(kT) is passed through the ideal reconstruction filter H(ω) defined in Equation 3.4. It turns out that the (CT) reconstructed output of this filter, denoted here by Yr(t), is not WSS but merely cyclostationary. This poses a dilemma in comparing the original and reconstructed signals. A common technique employed to get around this problem is to construct a new process by randomizing the phase epoch. If we define a new random process Yr(t + θ), where θ is a random variable uniformly distributed in the interval [0, T] and independent of Y(kT), then it can be shown that Yr(t + θ) is indeed WSS. We compare this with a randomized phase version of the original process X(t + θ) and assess the mean square value of the reconstruction error defined as

E{|X(t + θ) − Yr(t + θ)|2}. (3.22)

Then, we can show that this mean square value is zero [1]. Thus, for WSS random signals, the reconstructed process approximates the original process in the mean square sense, which is not equivalent to the stronger result X(t) = Yr(t) applicable to bandlimited deterministic signals.


3.8 Summary

The focus of this chapter was to discuss a few specialized topics related to the sampling of deterministic and random signals. We touched upon the Nyquist sampling theorem and then discussed practical methods for recovering the original waveform from the sampled values. We also discussed the need for an antialiasing LPF to limit the total distortion, the effect of sampling jitter on the SNR, and the sampling and reconstruction of random signals.

EXERCISE PROBLEMS

3.1. The continuous-time Fourier transform of the sampled signal viewed as an impulse sequence was shown to be

Xs(ω) = (1/T) Σn X(ω − nΩ). (3.23)

On the other hand, the DTFT of the sampled signal viewed as the DT sequence x(nT) is defined as

X(e^{jωT}) = Σn x(nT) e^{−jωnT}. (3.24)

Show that these two expressions are equal.

3.2. When the input signal is not bandlimited to Ω/2 rad/s, show that the distortion due to aliasing is given by

∫_{Ω/2}^{∞} |X(ω)|² dω. (3.25)

Reference

1. Barry, J.R., E.A. Lee, and D.G. Messerschmitt, Digital Communication, 3rd edition, Kluwer Academic Publishers, Boston, 2004.

Bibliography

1. Cattermole, K.W., Principles of Pulse Code Modulation, Iliffe Books Ltd, London, 1969.

2. Jayant, N.S. and P. Noll, Digital Coding of Waveforms, Prentice Hall, Englewood Cliffs, NJ, 1984.


4 Waveform Coding and Quantization

4.1 Introduction

We know that a bandlimited analog signal can be recovered from its sampled values as long as the sampling rate satisfies the Nyquist criterion. For transmitting the information, however, the infinite-precision sampled values have to be rounded off to a finite number of bits. This process, referred to as quantization, is the principal source of signal degradation in waveform coders. This degradation, referred to as quantization noise, depends on the quantizer characteristics as well as the input signal statistics. We will derive analytical expressions for quantifying this noise and the achievable SNR for both uniform and nonuniform quantizers. However, the main goal of this chapter is to explain the theoretical basis for the μ-law and A-law logarithmic quantizers that have been standardized for speech coding. Finally, we touch upon the topics of optimal and adaptive quantizers, which are actually used in the differential coding standards discussed in the next chapter.

4.2 Quantization

The conventional method of deploying a quantizer in a digital communication system is shown in Figure 4.1. The bandlimited input waveform x(t) is first sampled to obtain the DT samples x(nT). These samples are coded and sent to a far-end receiver where they are decoded to get the quantized samples y(nT). An analog reconstruction filter acting on these samples yields an approximation z(t) to the original input waveform.

It is clear from Figure 4.1 that the quantizer in reality consists of the combination of a coder and a decoder. These are generally referred to as ADC and DAC, respectively. The coder analyzes the input sample and generates a code corresponding to the quantization region to which the sample belongs, and the decoder reconstructs the proper rounded-off value from the received code. The decoder output is the actual finite-precision representation of the input sample.


FIGURE 4.1 Quantizer deployment in a digital communication system: the input waveform x(t) is sampled, coded (ADC), transmitted, decoded (DAC) into quantized samples y(nT), and passed through a reconstruction filter to produce the output waveform z(t).

4.3 Quantizer Performance Evaluation

Although Figure 4.1 depicts the typical deployment scenario, it is more convenient to interchange the positions of the quantizer and the sampler blocks for analytical performance analysis, as shown in Figure 4.2a. While this is not a practical system, note that it yields the same exact values for the quantized samples, and hence the output waveforms will be identical in both systems.

The actual performance of interest, from a communications point of view, is the quantizing noise power in the final output waveform z(t). Under certain assumptions, generally satisfied by quantizers with a large number of output levels operating on random signals, we can model the quantization error as a broadband additive noise q(t) that is independent of the input signal x(t), as shown in Figure 4.2b. The contribution of this noise to the final output waveform can then be evaluated by taking into account the action of the sampler and the reconstruction filter.

FIGURE 4.2 (a) Modified quantizer deployment. (b) An additive noise model for performance analysis.


If an ideal brick-wall filter H(ω), bandlimited to the range [−Ω/2, +Ω/2], where Ω = 2π/T is the radian sampling frequency, is employed for reconstruction, we can show that the total noise power in the final output z(t) is identical to that in the quantized waveform y(t), which is simply the mean square value of the additive quantization noise q(t).

Toward this end, let SY(ω) and SZ(ω) represent the power spectrum of the CT signals y(t) and z(t), respectively, and let SY(e^{jωT}) be the power spectrum of the sampled signal y(nT). Then, referring to Figure 4.2b, the mean square value of the output signal is given by

E[z²(t)] = (1/2π) ∫_{−Ω/2}^{Ω/2} SZ(ω) dω = (1/2π) ∫_{−Ω/2}^{Ω/2} (1/T) SY(e^{jωT}) |H(ω)|² dω
= (1/2π) ∫_{−Ω/2}^{Ω/2} (1/T) [(1/T) Σm SY(ω − mΩ)] |H(ω)|² dω. (4.1)

Now, interchanging the summation and integration operations in the above equation, and noting that the ideal reconstruction filter H(ω) has a gain of T in the passband and zero outside of it, we can simplify Equation 4.1 to obtain

E[z²(t)] = (1/2π) ∫_{−∞}^{∞} SY(ω) dω = E[y²(t)] = E[x²(t)] + E[q²(t)], (4.2)

where we have used the fact that the signal and quantization noise processes are independent. It should be clear from Equation 4.2 that the noise power at the final output is exactly the same as the noise power caused by the quantization process at the input. Hence, in the following sections, we analyze the performance for just the quantizer shown in Figure 4.2a, and presume that it is also applicable to the overall communication system.

4.4 Quantizer Transfer Function

As explained earlier, conceptually, the quantizer can be defined as the mapping of an analog value that can take a continuum of levels into one of a finite set of levels. The input–output relationship of the quantizer can be represented by the staircase function shown in Figure 4.3. A sample x belonging to the interval (xk−1 < x ≤ xk), k = 2, . . . , N − 1, is represented by the code value Ik. This code is transmitted to the decoder, which transforms it into an output amplitude yk. The thresholds xk, k = 1, 2, . . . , N − 1, are called the decision levels, and the values yk, k = 1, 2, . . . , N, are called the reconstruction or output levels. The length of the decision interval (xk − xk−1) is the step size Δk.


FIGURE 4.3 Quantizer and error transfer functions (output y and error q versus input x, with decision levels x0, x1, . . . , xN, reconstruction levels y1, . . . , yN, and step size Δk).

For analytical convenience, two overload decision levels x0 and xN are added at the boundaries such that x1 − x0 = x2 − x1 and xN − xN−1 = xN−1 − xN−2. (Note that y1 is the reconstruction level when x < x1 and yN is the reconstruction level if x ≥ xN−1.) If the input signal is limited to the range x0 to xN, referred to as a no-overload situation, the resulting noise is generally known as granular noise, or often simply as quantization noise, whereas the term overload distortion denotes the noise due to the signal exceeding this range.

The error introduced by the quantization process is q = x − y. It is shown as a function of the input signal x in Figure 4.3. If x represents samples of a random signal, then we can model the errors introduced by the quantizer as a form of additive noise. The performance of the quantizer is then normally characterized by the mean square value of the noise.

4.5 Quantizer Performance under No-Overload Conditions

Let p(x) be the probability density function (PDF) of the input signal x. Then the mean square error (MSE) due to quantization is given by

σq² = ∫_{x0}^{xN} (x − y)² p(x) dx = Σ_{k=1}^{N} ∫_{xk−1}^{xk} (x − yk)² p(x) dx, (4.3)

where y denotes the quantized value of x.


If N is large and if the input density function p(x) is smooth, we can approximate p(x) in the region xk−1 < x ≤ xk as

p(x) ≈ p((xk−1 + xk)/2), xk−1 < x ≤ xk. (4.4)

Substituting for p(x) in Equation 4.3 yields

σq² = Σ_{k=1}^{N} p((xk−1 + xk)/2) ∫_{xk−1}^{xk} (x − yk)² dx. (4.5)

It can be verified that

∫_{xk−1}^{xk} (x − yk)² dx = Δk [(yk − (xk−1 + xk)/2)² + Δk²/12], (4.6)

where Δk = xk − xk−1 is the step size. To minimize the MSE, we must choose yk = (xk−1 + xk)/2 so that the first term on the right-hand side (RHS) vanishes. That is, the reconstruction levels should be midway between the decision levels. With this choice of yk, and substituting

pk = p((xk−1 + xk)/2) Δk ≈ Pr(xk−1 < x ≤ xk), (4.7)

we obtain

σq² = (1/12) Σ_{k=1}^{N} pk Δk². (4.8)

4.6 Uniform Quantizer

For a uniform quantizer, the step size is constant (Δk = Δ for all k). In this case

σq² = (Δ²/12) Σ_{k=1}^{N} pk = Δ²/12, since Σ_{k=1}^{N} pk = 1. (4.9)

Note that the quantization noise power σq² is independent of the signal distribution. The performance of a quantizer is normally expressed as a signal-to-(quantizing)-noise ratio (SNR) defined as

SNR = 10 log(σx²/σq²) = 10 log(12σx²/Δ²) dB. (4.10)


We will now derive expressions for the achievable SNR of uniform quantizers excited by sinusoidal and Gaussian signals.

Consider an n-bit uniform quantizer that has 2^n reconstruction levels. If the input is a sinusoidal signal of the form A sin ωt, we can choose the step size Δ such that the overload point 2^{n−1}Δ of the quantizer equals the peak amplitude A, which implies Δ = A/2^{n−1}. This choice guarantees that the quantizer operates in the granular noise region and prevents overload distortion. Now, the power of the sinusoidal signal is σx² = A²/2. Substituting for Δ and σx² in Equation 4.10, we obtain the result

SNR (sinusoidal) = 6n + 1.76 dB. (4.11)

For a signal with the Gaussian density function p(x) = (1/√(2π)σ) e^{−x²/(2σ²)}, the overload distortion can be made negligible by using the so-called 4σ loading factor. This means we pick the step size Δ such that the overload point 2^{n−1}Δ = 4σ, which implies Δ = σ/2^{n−3}. Noting that the mean square power of the input signal is simply σ², and substituting for Δ and σx² in Equation 4.10, we obtain the following expression for the SNR:

SNR (Gaussian) = 6n − 7.3 dB. (4.12)

Figure 4.4 is a typical SNR plot for the uniform quantizer. Note the 6n dependence of the SNR. This means that for every additional bit in the quantizer, the SNR improves by 6 dB. Once an n-bit uniform quantizer is fixed for a specific overload point, however, the amount of quantization noise is essentially constant, independent of the signal power. Thus the SNR of a uniform quantizer varies linearly with signal power. Deviations from the linear occur at large signals (overload distortion) and at very low signal levels, where the signal rarely excites more than one or two steps of the quantizer.

FIGURE 4.4 SNR for a uniform quantizer (SNR in dB versus signal power in dBm; the curves for n and n + 1 bits are separated by 6 dB).
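The 6n + 1.76 dB rule is easy to confirm numerically. The sketch below quantizes a full-scale sinusoid with an n-bit midrise uniform quantizer and measures the resulting SNR; the midrise characteristic and the test frequency are arbitrary choices for this illustration.

% Sketch: measured SNR of an n-bit uniform quantizer driven by a sinusoid
n  = 8;                         % number of bits
A  = 1;                         % peak amplitude = overload point
x  = A*sin(2*pi*(0:1e5-1)'/97); % full-scale test sinusoid (arbitrary frequency)
delta = A/2^(n-1);              % step size, overload at 2^(n-1)*delta = A
y  = delta*(floor(x/delta) + 0.5);           % midrise uniform quantizer
y  = min(max(y, -A + delta/2), A - delta/2); % clamp to the outermost levels
snr = 10*log10(mean(x.^2)/mean((x - y).^2));
fprintf('Measured SNR = %.2f dB, predicted 6n + 1.76 = %.2f dB\n', ...
        snr, 6*n + 1.76);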


Example 4.1

Consider a signal with the exponential PDF p(x) = k e^{−|x|}. Determine k and σx².

Assuming that the overload levels are at −4σx and +4σx and neglecting overload distortion, find the MSE and the SNR if this signal is passed through an 8-bit uniform quantizer.

SOLUTION

The value of k is determined by the fact that the area under p(x) should be 1. Hence,

2 ∫_{0}^{∞} k e^{−x} dx = 1 ⇒ k = 1/2. (4.13)

The variance σx² can be computed as

σx² = 2 ∫_{0}^{∞} x² k e^{−x} dx = 2. (4.14)

The step size Δ is determined by the overload points and the number of output levels:

Δ = 8σx/N = 8 × √2 / 2^8 = √2 / 2^5. (4.15)

The MSE and SNR are given by

σq² = Δ²/12 = 2^{−11}/3,
SNR = σx²/σq² = 3 × 2^{12} = 40.9 dB. (4.16)

4.7 Nonuniform Quantizer

From Figure 4.4 it is evident that the SNR of a uniform quantizer varies linearly with signal power. If such a quantizer is employed in a speech transmission system, loud talkers would enjoy higher SNR values, whereas soft talkers would be penalized with a poorer SNR. One method of providing substantially similar performance over a wide range of signal powers is to design the quantizer such that the step sizes are finer at lower signal amplitudes, whereas they are coarser near the overload levels. Such a quantizer is referred to as a nonuniform quantizer, as it has uneven step sizes. This has the added benefit of reducing the total quantization noise, since the PDF of a speech signal is typically concentrated about the origin.


FIGURE 4.5 4-bit nonuniform quantizer (step size Δ1 for |x| < 1/3 and Δ2 = 2Δ1 for 1/3 < |x| < 1, with overload levels at ±1).

Figure 4.5 is an example of a 4-bit nonuniform quantizer, where the assumed overload levels are +1 and −1. For signals with amplitude less than 1/3, the step size of this quantizer is Δ1 = Δ2/2, and consequently, the MSE would be less (and the SNR would be better) than that of a 4-bit uniform quantizer. Signals spanning the entire range would experience quantization errors corresponding to both step sizes and hence would incur a higher MSE compared with a uniform quantizer.

4.7.1 Nonuniform Quantizer Implementation Methods

Three methods for implementing an n-bit nonuniform quantizer are shown in Figure 4.6. In the first method, the coder analyzes the input sample using nonuniformly spaced decision thresholds and generates the appropriate code value, while the decoder outputs the corresponding reconstruction values that are also nonuniformly spaced.

FIGURE 4.6 Nonuniform quantizing methods: (a) an n-bit nonuniform coder and decoder; (b) an m-bit uniform coder and decoder (m > n) with linear-to-nonlinear and nonlinear-to-linear code converters; (c) an analog compressor F(x) and expandor F^{−1}(x) around n-bit uniform coding.


The second method employs higher resolution uniform quantizers and code converters between the linear and nonlinear codes that accomplish the necessary compression and expansion functions in the digital domain. Here, the number of bits required in the uniform quantizer is generally much larger than the transmitted nonlinear code. The converter at the transmitter performs digital compression while its counterpart at the receiver does the corresponding expansion function. Practical nonuniform quantizers generally employ one of these two methods.

Although not very practical, a third technique for realizing the nonuniform quantizer, shown in Figure 4.6c, is more convenient for deriving analytical performance results such as the MSE and SNR, and for specifying the step-size variations necessary to achieve particular performance objectives. The analog compressor and expandor shown in the figure are memoryless nonlinear functions. The compressor is designed to amplify lower signal amplitudes at the expense of attenuating higher level signals, whereas the expandor performs the inverse operations. Together, they are referred to as the compandor. As an example, the companding characteristics for the 4-bit quantizer of Figure 4.5 are depicted in Figure 4.7. Only the positive half of the compressor and expandor curves is shown; the negative halves are the mirror images.

4.7.2 Nonuniform Quantizer Performance

The MSE of a nonuniform quantizer is given by Equation 4.8, which we repeat here for convenience:

σq² = (1/12) Σ_{k=1}^{N} pk Δk², (4.17)

FIGURE 4.7 Compandor characteristics for the 4-bit quantizer of Figure 4.5 (the compressor maps input 1/3 to output 1/2; the expandor is its inverse).


FIGURE 4.8 Compressor characteristic F(x): the nonuniform step Δk = xk − xk−1 maps to the uniform step Δ = 2/N between F(xk−1) and F(xk).

where pk = Pr(xk−1 < x ≤ xk) and Δk = xk − xk−1 is the step size. Assume that there are N decision intervals, the input density function p(x) is zero mean and symmetric, and the input is restricted to the range [−1, +1].

Consider the compressor characteristic F(x) shown in Figure 4.8, where only the positive half of the curve is depicted and the negative half is assumed to be symmetric. To evaluate the MSE of the nonuniform quantizer, using the model of Figure 4.6c, we need to determine the step size Δk of the nonuniform coder as a function of F(x). Note that the decision levels xk−1 and xk corresponding to the nonuniform coder are mapped by the compressor to the uniformly spaced decision levels F(xk−1) and F(xk), respectively. Since there are N decision intervals in the range [−1, +1], the step size of the uniform coder is Δ = 2/N. Knowing Δ and the slope of F(x) in the interval [xk−1, xk], the step size Δk can be obtained as

Δk = Δ / F′(xk*) = 2 / (N F′(xk*)), xk−1 < xk* < xk. (4.18)

Substituting for Δk in the expression (Equation 4.17) for the MSE, we obtain

σq² = (1/3) Σ_{k=1}^{N} pk / (N² {F′(xk*)}²). (4.19)

The summation in Equation 4.19 can be approximated by an integral if the number of levels N is large. This yields the following expression for the MSE of the nonuniform quantizer with overload points at −1 and +1:

σq² = (1/(3N²)) ∫_{−1}^{+1} p(x)/{F′(x)}² dx = (2/(3N²)) ∫_{0}^{+1} p(x)/{F′(x)}² dx. (4.20)


If the signal overload levels are −V and +V, instead of −1 and +1, it can be shown that the MSE of the nonuniform quantizer is given by

σq² = (2V²/(3N²)) ∫_{0}^{V} p(x)/{F′(x)}² dx. (4.21)

Example 4.2

Consider a signal with the triangular PDF:

p(x) = 1 − |x| for |x| < 1, and 0 elsewhere. (4.22)

Find the SNR if this signal is quantized with a 4-bit uniform quantizer. What would be the SNR if this signal is quantized with a 4-bit nonuniform quantizer with the companding characteristics given in Figure 4.7?

SOLUTION

For the uniform quantizer, we can find the variance, step size, MSE, and SNR as follows:

σx² = 2 ∫_{0}^{1} x² p(x) dx = 2 ∫_{0}^{1} x²(1 − x) dx = 1/6,
Δ = 2/N = 2/2^4,
σq² = Δ²/12 = (2/16)²/12 = 1/768,
SNR = σx²/σq² = 768/6 = 128 = 21.1 dB. (4.23)

The MSE of the nonuniform quantizer is given by Equation 4.20. The slope F′(x) of the compressor curve equals 3/2 in the region (0, 1/3) and equals 3/4 in the region (1/3, 1). Hence,

σq² = (2/(3N²)) ∫_{0}^{1} p(x)/{F′(x)}² dx = (2/(3 × 16²)) [∫_{0}^{1/3} (1 − x)/(3/2)² dx + ∫_{1/3}^{1} (1 − x)/(3/4)² dx] = 7/5184. (4.24)

The corresponding SNR is

σ2q

= 1/67/5184

= 123.42 = 20.9 dB. (4.25)


4.8 Logarithmic Companding

In a telecommunications system, as discussed earlier, we desire a constant SNR irrespective of the signal distribution, and hence we wish to find the compressor curve that can achieve this. Mathematically, this problem can be solved as follows. The SNR of a general nonuniform quantizer in terms of the compressor characteristic F(x) is

SNR = σx²/σq² = [2 ∫_{0}^{1} x² p(x) dx] / [{2/(3N²)} ∫_{0}^{+1} p(x)/{F′(x)}² dx]. (4.26)

We can make the above expression a constant by selecting

1/F′(x) = kx or F′(x) = k−1/x. (4.27)

The actual compressor curve F(x) is obtained by integrating F′(x) and setting the constant of integration to the boundary condition F(1) = 1. This yields the logarithmic curve

F(x) = 1 + k−1 ln x, (4.28)

where k is a parameter yet to be specified. The corresponding constant SNR from Equation 4.26 is

SNR = 3N²/k². (4.29)

Note that this compressor characteristic defines only the positive half of the range. The other half is its symmetrical image. If the overload levels are at −V and +V, instead of −1 and +1, the corresponding logarithmic compressor characteristic can be shown to be

F(x) = V[1 + k−1 ln(x/V)]. (4.30)

4.8.1 Approximations to Logarithmic Companding

It turns out that the logarithmic compression law F(x) = 1 + k−1 ln x can theoretically achieve a constant SNR independent of the signal distribution. In such a nonuniform quantizer, the step size at a given signal amplitude is proportional to the amplitude itself. This proportionality, however, cannot be achieved near the origin, since ln(x) diverges as x → 0. It is easy to verify this fact by drawing the graph of F(x) = 1 + k−1 ln x, shown in Figure 4.9. Note that the logarithmic function F(x) does not pass through the origin. In fact, the function is zero at x = e^{−k}, and it is not defined in the range 0 ≤ x < e^{−k}.


FIGURE 4.9 Graph of F(x) = 1 + k−1 ln x (the curve crosses zero at x = e^{−k} and reaches 1 at x = 1).

A practical companding law cannot be discontinuous at the origin, as it must also specify the transfer function for low-level signals. Thus, two modifications to the logarithmic law have been proposed to alleviate this problem.

4.8.1.1 μ-Law (Continuous Version)

This approximation shifts the zero crossing occurring at x = e^{−k} to the origin. Hence it relates F(x) to ln(1 + μx) rather than ln(x). It is normally written in the form

F(x) = log(1 + μx) / log(1 + μ), 0 ≤ x ≤ 1, (4.31)

where the base of the logarithm is irrelevant.

As shown in Figure 4.10, the parameter μ controls the nonlinearity of the curve. The practical companding law used in the ITU G.711 PCM standard is a segmented approximation to the μ = 255 continuous law. When μ ≫ 1, this law approximates the logarithmic curve for large signal levels. To verify this, note that when x ≫ μ^{−1}, or μx ≫ 1,

F(x) = log(1 + μx)/log(1 + μ) ≈ log(μx)/log(μ) = 1 + log(x)/log(μ) = 1 + ln x/ln μ. (4.32)

Thus k = ln(μ), and the SNR for large signals is approximately 3N²/(ln μ)². Typical performance curves for the μ-law, with parameter μ > 100, show that the SNR is practically constant over a fairly large input dynamic range. Furthermore, these curves are quite insensitive to the signal distribution (provided proper overload points are chosen).


FIGURE 4.10 μ-law compressor curves for small and large values of μ (the μ = 0 curve is linear).

If the overload points are at +V and −V, the equation for the μ-law curve is given by

F(x) = V ln(1 + μx/V)/ln(1 + μ),   0 ≤ x ≤ V,      (4.33)

which, as before, specifies only the positive half of the compressor function.

4.8.1.2 A-Law (Continuous Version)

The compressor curve in this case is specified to be strictly linear at low signal levels and strictly logarithmic at high signal levels. The linear segment extension to the origin is obtained by drawing a tangent to the logarithmic curve F(x) = 1 + k⁻¹ ln x as shown in Figure 4.11.

Let m be the slope of the tangent and x = xₜ be the tangential point. Then, since y = mxₜ is a point on the logarithmic curve, we have

mxt = 1 + k−1 ln xt. (4.34)

Furthermore, the tangent condition implies that the slope,

m = dF(x)/dx |_{x = xₜ} = k⁻¹/xₜ.      (4.35)

Solving for m and xₜ from the above two equations, we obtain

xₜ = e^(1−k)   and   m = k⁻¹e^(k−1).      (4.36)


FIGURE 4.11 A-law compressor: a tangent of slope m is drawn from the origin to the logarithmic curve F(x) = 1 + k⁻¹ ln x, touching it at x = xₜ.

The A-law compressor curve in terms of the parameter k can therefore be written as

F(x) = { k⁻¹e^(k−1) x,   0 ≤ x ≤ e^(1−k),
       { 1 + k⁻¹ ln x,   e^(1−k) ≤ x ≤ 1.      (4.37)

However, this law is normally specified in terms of the related parameter A = e^(k−1), which implies

k = 1 + ln A. (4.38)

Then, the positive half of the compressor curve is defined by

F(x) = { Ax/(1 + ln A),            0 ≤ x ≤ 1/A,
       { (1 + ln Ax)/(1 + ln A),   1/A ≤ x ≤ 1.      (4.39)

The SNR for large signal levels for this law can be approximated by 3N²/(1 + ln A)². The parameter A controls the extent of the linear segment in the companding characteristic. Large values of A imply a small linear segment and hence a larger signal range over which the SNR is constant. For small signal levels, the A-law is clearly linear and thus behaves like a uniform quantizer, and hence the SNR decreases linearly with signal power.

Again, if the overload points are at +V and −V, the equation for the A-law is

F(x) = { Ax/(1 + ln A),                  0 ≤ x ≤ V/A,
       { (V + V ln(Ax/V))/(1 + ln A),    V/A ≤ x ≤ V.      (4.40)


4.8.2 Companding Advantage

As discussed earlier, practical nonuniform quantizers are designed to yield good SNR at small signal levels. Hence their step size near the origin will be smaller compared with the step size of a uniform quantizer having the same number of bits. Companding advantage (CA) is a measure to characterize this gain in SNR over a uniform quantizer at lower signal levels. For a nonuniform quantizer with the compressor characteristic F(x), we define the CA as F′(0), which represents the slope of F(x) at the origin. This is often expressed as a dB value 20 log₁₀ F′(0). For the μ-law compandor defined in Equation 4.31 the CA is μ/ln(1 + μ), whereas it is A/(1 + ln A) for the A-law curve (Equation 4.39).
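As a small numerical illustration (a Python sketch with hypothetical function names), the CA expressions above can be evaluated for the continuous laws; note that these continuous-law values differ from the CA of the segmented approximations quoted later in Table 4.3.

```python
import math

def companding_advantage(mu=None, A=None):
    """CA = F'(0) for the continuous mu-law or A-law compressor (Section 4.8.2)."""
    if mu is not None:
        ca = mu / math.log(1.0 + mu)        # mu-law: CA = mu / ln(1 + mu)
    else:
        ca = A / (1.0 + math.log(A))        # A-law: CA = A / (1 + ln A)
    return ca, 20.0 * math.log10(ca)        # ratio and its dB value

print(companding_advantage(mu=255.0))   # roughly (46.0, 33.3 dB)
print(companding_advantage(A=94.16))    # roughly (17.0, 24.6 dB)
```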

Example 4.3

Consider a signal with the density function,

p(x) = { (π/4) cos(πx/2),   |x| ≤ 1,
       { 0,                  otherwise.      (4.41)

Find the MSE and SNR, if this signal is passed through an 8-bit nonuniform quantizer having the continuous A-law compressor characteristic (Equation 4.39), with A = 94.16.

SOLUTION

The variance corresponding to the PDF is

σ²_x = 2 ∫₀¹ x² (π/4) cos(πx/2) dx = 1 − (8/π²) = 0.18943.      (4.42)

Since (d/dx) ln(Ax) = 1/x , we have

F′(x) = { A/(1 + ln A),        0 ≤ x ≤ 1/A,
        { (1/x)/(1 + ln A),    1/A ≤ x ≤ 1.      (4.43)

Substituting in Equation 4.20, we obtain

σ²_q = (2/(3N²)) [ ∫₀^{1/A} ((1 + ln A)²/A²)(π/4) cos(πx/2) dx + ∫_{1/A}^{1} (1 + ln A)² x² (π/4) cos(πx/2) dx ]


= (2(π/4)(1 + ln A)²/(3N²)) [ ∫₀^{1/A} (1/A²) cos(πx/2) dx + ∫_{1/A}^{1} x² cos(πx/2) dx ]

= (2(π/4)(1 + ln A)²/(3N²)) [ 2/π − 16/π³ − (8/(Aπ²)) cos(π/2A) + (16/π³) sin(π/2A) ].      (4.44)

For N = 2⁸ and A = 94.16, Equation 4.44 evaluates to

σ²_q = 2.962 × 10⁻⁵.      (4.45)

Thus,

SNR = σ²_x/σ²_q = 6394 = 38 dB.      (4.46)
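The closed-form result of Equation 4.44 is easy to check numerically; the short Python sketch below reproduces the values quoted in Equations 4.45 and 4.46 (the variable names are ours).

```python
import math

# Numerical check of Example 4.3: 8-bit continuous A-law quantizer with A = 94.16.
A, N = 94.16, 2 ** 8
sigma_x2 = 1.0 - 8.0 / math.pi ** 2                      # Equation 4.42

bracket = (2.0 / math.pi - 16.0 / math.pi ** 3           # bracketed term of Equation 4.44
           - 8.0 / (A * math.pi ** 2) * math.cos(math.pi / (2 * A))
           + 16.0 / math.pi ** 3 * math.sin(math.pi / (2 * A)))
sigma_q2 = 2.0 * (math.pi / 4) * (1.0 + math.log(A)) ** 2 / (3.0 * N ** 2) * bracket

snr = sigma_x2 / sigma_q2
print(sigma_q2)                          # about 2.96e-5, matching Equation 4.45
print(snr, 10.0 * math.log10(snr))       # about 6394, i.e. 38 dB, matching Equation 4.46
```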

4.9 Segmented Companding Laws

Referring to Equation 4.18, it is apparent that the step size of a nonuniform quantizer at a given signal amplitude is inversely proportional to the slope of the compression law curve F(x) at that point. Since the slope of F(x) varies with the signal amplitude, a large number of disparate step sizes are needed for the continuous μ-law and A-law quantizers, and thus it is difficult to implement these quantizers in practice. However, we can approximate the compressor curve F(x) using linear segments without seriously affecting the quantizer performance. The number of dissimilar step sizes will then be limited to the number of segments in the approximation, thereby yielding a simpler implementation.

The segmented approximations to the compressor curve F(x) can obviously be done in many different ways. But it is advantageous to restrict the choices to only those that yield what are called digitally linearizable companding laws. For such companding laws, the decision and output levels of the nonuniform quantizer form a subset of the corresponding levels of a uniform quantizer with resolution equal to the smallest step size. (Note that for a typical nonuniform quantizer, the smallest step size occurs at the origin.) This property enables us to translate between the nonlinear and linear codes, and hence linear signal processing functions, such as level control, filtering, conferencing, and echo cancellation, can be readily carried out on the transmitted codes.

To satisfy the digitally linearizable property, it is sufficient that the step size corresponding to any of the segments be an integer multiple of the step size at the origin. Since a logarithmic variation of the step size with signal amplitude is desired, however, we insist that the step size actually double from segment to segment, or equivalently the slopes should halve.

Furthermore, it is desirable to perform the segmented approximation so that it yields a simple representation of the nonlinear code. It is natural to partition


the n-bit nonlinear code into a polarity bit, an m-bit segment number, and an (n − m − 1)-bit step number within a segment. This partitioning implies that the total number of approximation segments and the number of quantization steps in each segment are both restricted to powers of 2.

The conditions discussed so far to obtain segmented approximations to the continuous μ-law and A-law curves are summarized below:

1. The n-bit nonlinear code consists of three partitions: a polarity bit, an m-bit segment number, and an (n − m − 1)-bit quantization step number.

2. The number of segments on each side of the origin is 2^m.

3. The number of quantization steps in each segment is 2^(n−m−1). Since there are an equal number of steps in each segment, referring to Figure 4.8, it can be verified that the segment end points y_r have to be uniformly spaced on the y-axis. Noting that the overload point is at +1, these end points are given by

y_r = r·2⁻ᵐ,   r = 0, 1, . . . , 2^m.      (4.47)

The step size doubles in successive segments. That is, if Δ_r is the step size in the rth segment, we require Δ_{r+1}/Δ_r = 2. This choice ensures the desired digital linearization and the logarithmic step-size variation features. (In the case of A-law, the first two segments are forced to be collinear, which implies Δ₁/Δ₀ = 1; from then on, the doubling rule applies.)

Approximating the continuous curves with segments under the conditions stated above restricts the allowed values for the μ and A parameters. In the following section, we derive the general form of the expressions these parameters have to satisfy and justify their chosen values in the ITU G.711 standards.

4.9.1 Segmented Approximation to the Continuous μ-Law and A-Law Curves

Consider Figure 4.12, which shows the 8-segment approximation to the continuous μ-law function,

F(x) = ln(1 + μx)/ln(1 + μ),   0 ≤ x ≤ 1.      (4.48)

In this example, there are eight segments on each side of the origin, which implies m = 3. Since we assume the same number of steps in each segment, the segment end points y_r are uniformly spaced on the y-axis: y_r = r/8, r = 0, 1, . . . , 8. Furthermore, the condition Δ_{r+1}/Δ_r = 2 implies that the


FIGURE 4.12 (a) Eight-segment approximation to the continuous μ-law. (b) Typical segment end points.

segment end points on the x-axis are given by

x₂ = x₁ + 2x₁ = 3x₁ = (2² − 1)x₁
x₃ = x₂ + 4x₁ = 7x₁ = (2³ − 1)x₁
  ⋮
x_r = (2^r − 1)x₁
  ⋮
x₈ = 1 = (2⁸ − 1)x₁  ⇒  x₁ = 1/(2⁸ − 1).      (4.49)

Figure 4.12b shows the x- and y-coordinates for the rth segment. Note that the coordinates of the lower end point of this segment are given by

x_r = (2^r − 1)x₁,   y_r = r/8.      (4.50)

Since this end point lies on the continuous μ-law function F(x) specified by Equation 4.48, we obtain

r/8 = ln(1 + μ(2^r − 1)x₁)/ln(1 + μ),   0 ≤ r ≤ 8,      (4.51)

which can be written as

(1 + μ(2^r − 1)x₁)⁸ = (1 + μ)^r,   0 ≤ r ≤ 8.      (4.52)


This equation is satisfied only when

μ = 1/x₁ = 2⁸ − 1.      (4.53)

The same argument can be applied for the general case of 2^m segments. It is easy to show that in such a case

x₁ = 1/(2^(2^m) − 1),   μ = 1/x₁ = 2^(2^m) − 1.      (4.54)

The step size in the rth segment is given by

Δ_r = (signal range)/(number of steps) = (x_{r+1} − x_r)/2^(n−m−1) = 2^r x₁/2^(n−m−1) = 2^r/(μ·2^(n−m−1)).      (4.55)

Table 4.1 shows the values of m and the corresponding values for μ. The choices μ = 3 or 15 do not provide a reasonably flat SNR over the dynamic range of signals encountered in practice. The option μ = 65,535 is much too large for practical implementation as it is a 16-segment approximation, and is not really needed. Hence the value μ = 255 is specified in the standards.
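The allowed μ values and the segment step sizes follow directly from Equations 4.54 and 4.55; a short Python sketch (assuming an n = 8-bit code, with names of our choosing) reproduces the entries of Table 4.1 and the G.711 choice.

```python
# Segmented mu-law parameters from Equations 4.54 and 4.55 (overload point at +1).
n = 8                                              # total bits in the nonlinear code
for m in (1, 2, 3, 4):
    mu = 2 ** (2 ** m) - 1                         # Equation 4.54
    steps_per_segment = 2 ** (n - m - 1)
    deltas = [2 ** r / (mu * steps_per_segment) for r in range(2 ** m)]   # Equation 4.55
    print(m, mu, steps_per_segment, deltas[0])
# m = 3 gives mu = 255 with 16 steps per segment, matching Table 4.1 and the G.711 choice.
```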

A similar procedure can be followed in the case of the segmented approximation to the continuous A-law. The only difference is that the first two segments are forced to have the same step size. From then on, the step size doubles in successive segments. Also, the tangential segment of the continuous A-law, shown as a dotted line in Figure 4.13, is completely ignored and all the segment end points, except the origin, must lie on the logarithmic curve F(x) = 1 + k⁻¹ ln x.

As before, the segment end points y_r, which are uniformly spaced on the y-axis, are given by

y_r = r·2⁻ᵐ,   r = 0, 1, . . . , 2^m.      (4.56)

TABLE 4.1

Variation of Parameter μ with the Number of Approximating Segments

Allocated bits for segment, m      1      2       3        4
Number of segments, 2^m            2      4       8        16
μ                                  3      15      255      65,535


FIGURE 4.13 Segmented approximation to the A-law (the tangent segment of the continuous curve is shown dotted).

Except for the first two segments, which are collinear, the x-coordinates of the remaining segment end points are obtained using the doubling rule:

x₂ = x₁ + x₁ = 2x₁
x₃ = x₂ + 2x₁ = 4x₁ = 2²x₁
  ⋮
x_r = 2^(r−1) x₁
  ⋮
x_{2^m} = 1 = 2^(2^m − 1) x₁  ⇒  x₁ = 1/2^(2^m − 1).      (4.57)

The segment end points with the x- and y-coordinates

x_{r+1} = 2^r x₁,   y_{r+1} = (r + 1)2⁻ᵐ,   r = 0, 1, . . . , 2^m − 1      (4.58)

must lie on the logarithmic curve F(x) = 1 + k⁻¹ ln x. Substituting these values and solving for k, we obtain

k = 2^m ln 2.      (4.59)

Table 4.2 shows the values for the parameters k and A as a function of m. For reasons similar to those given under the μ-law, the values m = 3, k = 5.545 (or A = 94.16) are chosen for the practical companding law. The step size in the rth segment is given by

Δ_r = 2^r/(2^(2^m) · 2^(n−m−1)),   r = 1, 2, . . . , 2^m − 1,
Δ₀ = Δ₁.      (4.60)


TABLE 4.2

Variation of Parameter A with the Number of Approximating Segments

Allocated bits for segment, m      1        2        3        4
Number of segments, 2^m            2        4        8        16
k                                  1.386    2.773    5.545    11.09
A                                  1.47     5.89     94.16    24,109

Note that the parameter value A = 94.16 derived above is different from the value A = 87.6 normally specified in the standards. We believe that the latter is (erroneously) based on the equation CA = A/(1 + ln A) for the CA of the continuous A-law curve (see Section 4.8.2). Assuming a CA of 16, which is the actual slope of the first segment in the approximation (see below), it is easy to see that A = 87.6 satisfies the above equation. But the tangent of the continuous A-law curve, whose slope is A/(1 + ln A), is not the first segment of the approximation. So, the value A = 87.6 cannot be justified. Since all the segment end points, except the origin, clearly lie on the A = 94.16 logarithmic curve, it would have been more appropriate to specify this particular value in the specification of the standard.

Table 4.3 delineates the differences between the segmented μ-law and A-law. Most of them are a direct result of the fact that the smallest step size of A-law is twice that of μ-law. This theoretically provides a 6 dB SNR advantage

TABLE 4.3

Comparison of μ-Law and A-Law

Parameter                        μ-law (μ = 255)                        A-law (A = 94.16)                    Comments
x₁ (end point of the first       1/(2⁸ − 1)                             1/2⁷
  segment on the x-axis)
Smallest step size               Δ₀ = 1/(μ·2⁴) ≈ 1/(2⁸·2⁴) = 1/2¹²      Δ₀ = Δ₁ = 2/(2⁸·2⁴) = 1/2¹¹          Smallest step size of A-law is twice that of μ-law
CA                               CA = (1/8)/x₁ ≈ 32                     CA = (1/8)/x₁ = 16                   μ-law is 6 dB better
SNR                              Better at lower signal levels          Better at higher signal levels       A-law is about 1 dB better at higher signal levels, while μ-law is about 6 dB better at lower signal levels
Digital linearization            Slightly complicated                   Simple


for the μ-law at very low signal levels. But the thermal noise floor, together with any offset in the ADC (which shifts the zero operating point of the quantizer and thereby changes the quantizing step size for idle channels), limits the actual gain achievable in practice.

4.10 ITU G.711 μ-Law and A-Law PCM Standards

The discussion in the previous section provides the theoretical foundation for the implementation of the segmented μ and A companding laws in practice. However, selecting the overload points at ±1 obviously leads to fractional values for the step sizes, decision levels, and reconstruction levels. It is clearly advantageous to use an integer representation for all these values. Another important aspect of the quantizer that is not considered so far is its behavior near the origin: whether it is mid-riser or mid-tread. This section will discuss these and other practical details of the ITU G.711 PCM standard.

The complete encoding and decoding tables for the μ-law and A-law are shown in Table 4.4. The first few segments are shown in Figure 4.14.

The following are some of the differences between the companding laws:

1. The μ-law has a mid-tread characteristic, whereas the A-law has a mid-riser characteristic. This implies that there are two zero output levels (one positive and one negative) defined for the μ-law and none for the A-law.

2. The ratio of the smallest step size to the overload point is 2/8159 ≈ 2⁻¹² for the μ-law and 2/4096 = 2⁻¹¹ for the A-law.

4.10.1 Conversion between Linear and Companded Codes

As noted earlier, the nonlinear μ-law and A-law codes specified by the G.711 standard are digitally linearizable. It is necessary to convert the nonlinear codes to linear codes to perform typical DSP algorithms such as level control, filtering, conferencing, and echo cancellation. After executing the algorithms, the linear codes have to be translated back to the nonlinear form for transmission and switching. Thus, there is a need to convert between the two code domains. We delineate these conversion algorithms in this section.

We assume that the 8-bit nonlinear code is partitioned as follows: The most significant bit is the polarity bit, the next three bits denote the segment number S, 0 ≤ S ≤ 7, and the last four bits represent the quantization step number within the segment, Q, 0 ≤ Q ≤ 15. The linear code uses a sign-magnitude format and the number of bits needed for the two laws is different. Since 8159 is the overload point of the μ-law, 14 bits are needed to represent


TABLE 4.4
Encoding/Decoding Tables for μ-Law and A-Law

Table for μ-Law
Input Amplitude Range x     Code Value In     Decoder Output y
0–1                         0                 0
1–3                         1                 2
3–5                         2                 4
...                         ...               ...
29–31                       15                30
31–35                       16                33
...                         ...               ...
91–95                       31                93
95–103                      32                99
...                         ...               ...
215–223                     47                219
223–239                     48                231
...                         ...               ...
463–479                     63                471
479–511                     64                495
...                         ...               ...
959–991                     79                975
991–1055                    80                1023
...                         ...               ...
1951–2015                   95                1983
2015–2143                   96                2079
...                         ...               ...
3935–4063                   111               3999
4063–4319                   112               4191
...                         ...               ...
7903–8159                   127               8031

Table for A-Law
Input Amplitude Range x     Code Value In     Decoder Output y
0–2                         0                 1
2–4                         1                 3
4–6                         2                 5
...                         ...               ...
30–32                       15                31
32–34                       16                33
...                         ...               ...
62–64                       31                63
64–68                       32                66
...                         ...               ...
124–128                     47                126
128–136                     48                132
...                         ...               ...
248–256                     63                252
256–272                     64                264
...                         ...               ...
496–512                     79                504
512–544                     80                528
...                         ...               ...
992–1024                    95                1008
1024–1088                   96                1056
...                         ...               ...
1984–2048                   111               2016
2048–2176                   112               2112
...                         ...               ...
3968–4096                   127               4032

the corresponding linear code, whereas 13 bits are sufficient for the A-law as its overload point is at 4096.

4.10.1.1 Linear to μ-Law Conversion

Given a linear 14-bit input sample x in sign-magnitude format, the 8-bit μ-law code is determined as follows:

1. The polarity bit of the μ-law code is the same as that of the linear code.

2. The segment number is determined by the smallest value of S, which satisfies the inequality

x < 64 × 2^S − 33,   S = 0, 1, . . . , 7.      (4.61)


FIGURE 4.14 (a) G.711 μ-law quantizer. (b) G.711 A-law quantizer.

3. To find the quantization step Q, first compute the residue

r = { x,                       S = 0,
    { x − (32 × 2^S − 33),     S = 1, 2, . . . , 7.      (4.62)

4. The quantization step within a segment is determined as the smallest value of Q that satisfies the inequalities

r < { 2Q + 1,           S = 0,                  Q = 0, 1, . . . , 15,
    { 2^(S+1)(Q + 1),   S = 1, 2, . . . , 7,    Q = 0, 1, . . . , 15.      (4.63)

Alternatively, we can use Table 4.5 to find the segment number S and the quantization step Q. We ignore the sign bit and express the biased linear code |x| + 33 as a 13-bit binary number as shown on the left-hand columns of the table. The segment number is then determined by the leading one of the biased linear code, and the four bits w, x, y, and z following this leading one represent the step number. The bits labeled a, b, . . . , h are ignored.

4.10.1.2 μ-Law to Linear Code Conversion

Given the 8-bit μ-law code with segment number S and quantization step number Q, the decoder output level y is given by

y = (2Q + 33)2^S − 33.      (4.64)


TABLE 4.5

Linear to μ-Law Conversion

13-Bit Magnitude of Biased Linear Code |x| + 33            μ-Law Code
(bits 12 11 10 9 8 7 6 5 4 3 2 1 0)                        (bits 6 5 4 3 2 1 0)

0 0 0 0 0 0 0 1 w x y z a                                  0 0 0 w x y z
0 0 0 0 0 0 1 w x y z a b                                  0 0 1 w x y z
0 0 0 0 0 1 w x y z a b c                                  0 1 0 w x y z
0 0 0 0 1 w x y z a b c d                                  0 1 1 w x y z
0 0 0 1 w x y z a b c d e                                  1 0 0 w x y z
0 0 1 w x y z a b c d e f                                  1 0 1 w x y z
0 1 w x y z a b c d e f g                                  1 1 0 w x y z
1 w x y z a b c d e f g h                                  1 1 1 w x y z

Example 4.4

Given a sample value of 400 mV for a μ-law coder capable of encoding a maximum level of 2 V, determine the compressed μ-law code and the decoder output level corresponding to the code.

SOLUTION

2 V corresponds to an input code of 8159. Hence 400 mV corresponds to an input code of 8159/5 = 1631.8, which is rounded to 1632. To convert the linear code 1632 to μ-law, add 33 to it and represent it as a binary number as shown in Table 4.6. Referring to this table, the leading one in position 10 yields segment S = 101 (decimal 5) and the four bits following the leading one yield the quantization step Q = 1010 (decimal 10). Thus the μ-law code is +90. The corresponding decoder output level = (2Q + 33)2^S − 33 = 1663.
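Steps 1-4 of the conversion are simple to code; the following Python sketch (ignoring the polarity bit, with hypothetical function names) implements Equations 4.61 through 4.64 and reproduces the result of Example 4.4.

```python
def linear_to_mulaw(x):
    """Linear magnitude (0..8158) to (segment S, step Q) using Equations 4.61-4.63."""
    S = next(s for s in range(8) if x < 64 * 2 ** s - 33)        # Equation 4.61
    r = x if S == 0 else x - (32 * 2 ** S - 33)                  # Equation 4.62
    if S == 0:
        Q = next(q for q in range(16) if r < 2 * q + 1)          # Equation 4.63, S = 0
    else:
        Q = next(q for q in range(16) if r < 2 ** (S + 1) * (q + 1))
    return S, Q

def mulaw_to_linear(S, Q):
    """Decoder output level of Equation 4.64."""
    return (2 * Q + 33) * 2 ** S - 33

S, Q = linear_to_mulaw(1632)            # the linear code of Example 4.4
print(S, Q, 16 * S + Q)                 # 5, 10 -> 7-bit magnitude code 90
print(mulaw_to_linear(S, Q))            # 1663, as in the example
```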

4.10.1.3 Linear to A-Law Conversion

Given a 13-bit input sample x in sign-magnitude format, the A-law code is determined as follows:

1. The polarity bit of the A-law code is the same as that of the linear code.

2. The segment number is determined as the smallest value of S that satisfies

x < 32 × 2^S,   S = 0, 1, . . . , 7.      (4.65)

TABLE 4.6

Linear to μ-Law Conversion Example

13-Bit Magnitude of Biased Code 1632 + 33 = 1665            μ-Law Code
(bits 12 11 10 9 8 7 6 5 4 3 2 1 0)                         (bits 6 5 4 3 2 1 0)

0 0 1 1 0 1 0 0 0 0 0 0 1                                   1 0 1 1 0 1 0


TABLE 4.7

Linear to A-Law Conversion

12-Bit Magnitude of Linear Code |x|                         A-Law Code
(bits 11 10 9 8 7 6 5 4 3 2 1 0)                            (bits 6 5 4 3 2 1 0)

0 0 0 0 0 0 0 w x y z a                                     0 0 0 w x y z
0 0 0 0 0 0 1 w x y z a                                     0 0 1 w x y z
0 0 0 0 0 1 w x y z a b                                     0 1 0 w x y z
0 0 0 0 1 w x y z a b c                                     0 1 1 w x y z
0 0 0 1 w x y z a b c d                                     1 0 0 w x y z
0 0 1 w x y z a b c d e                                     1 0 1 w x y z
0 1 w x y z a b c d e f                                     1 1 0 w x y z
1 w x y z a b c d e f g                                     1 1 1 w x y z

3. To find the quantization step Q, first compute the residue r as

r = { x,                S = 0,
    { x − 16 × 2^S,     S = 1, 2, . . . , 7.      (4.66)

4. The quantization step is determined as the smallest Q that satisfies

r < { 2(Q + 1),       S = 0,                  Q = 0, 1, . . . , 15,
    { 2^S(Q + 1),     S = 1, 2, . . . , 7,    Q = 0, 1, . . . , 15.      (4.67)

As in the μ-law case, we can use Table 4.7 for the conversion. In this case, it is not necessary to “add 33” to the input sample.

4.10.1.4 A-Law to Linear Conversion

Given the 8-bit A-law code with segment number S and quantization step number Q, the decoder output level y is given by

y = { 2Q + 1,              S = 0,
    { 2^(S−1)(2Q + 33),    S = 1, 2, . . . , 7.      (4.68)

4.11 Optimum Quantization

The logarithmic companding laws are advantageous from a practical point of view because their performance is essentially independent of signal statistics. However, for applications where a single density function describes adequately the distribution of input samples, it would be preferable to use


the optimum quantizer matched to that particular density function. This approach, when used along with adaptive range scaling, described in the next section, yields good performance when coarse quantization is mandated. The combination of optimum and adaptive quantization is in fact used in the ITU G.726 ADPCM speech coding standard described in the next chapter.

There are two approaches to solving the optimum quantization problem. The first method, due to Panter and Dite [1], assumes that the number of quantization levels is large, ignores the overload distortion, and leads to a closed form solution for the optimum companding characteristics. The second is a general iterative solution, known as the Lloyd–Max quantizer [2,3], for directly finding the optimum decision and reconstruction levels taking into account both the quantization noise and the overload distortion. This approach is quite popular and extensively used in practice.

4.11.1 Closed Form Solution for the Optimum Companding Characteristics

Assuming a signal range of −1 to +1, a smooth symmetric input density function p(x), a symmetrical compressor characteristic F(x) with the boundary conditions F(0) = 0 and F(1) = 1, and N output levels, the MSE of the nonuniform quantizer is given by Equation 4.20, repeated here for convenience:

σ²_q = (2/(3N²)) ∫₀¹ p(x) dx/{F′(x)}².      (4.69)

We can use Holder’s inequality to obtain the optimum F′(x) and integrate it to find the actual compressor function F(x). This inequality states that for any positive a and b with a⁻¹ + b⁻¹ = 1,

∫ u(x)v(x) dx ≤ [∫ u^a(x) dx]^(1/a) [∫ v^b(x) dx]^(1/b)      (4.70)

with equality if and only if u^a(x) ∝ v^b(x). Choosing

a = 3,   b = 3/2,   u(x) = [p(x)/{F′(x)}²]^(1/3),   v(x) = [F′(x)]^(2/3)      (4.71)

and applying Holder’s inequality, we obtain

∫₀¹ p^(1/3)(x) dx ≤ [∫₀¹ p(x) dx/{F′(x)}²]^(1/3) [∫₀¹ F′(x) dx]^(2/3).      (4.72)


Noting that ∫₀¹ F′(x) dx = F(1) − F(0) = 1 and using Equation 4.69, the above inequality can be written as

σ²_q ≥ (2/(3N²)) [∫₀¹ p^(1/3)(x) dx]³      (4.73)

with equality if and only if

p(x)/{F′(x)}² ∝ F′(x).      (4.74)

This relation can be equivalently expressed as

F′(x) = k p^(1/3)(x).      (4.75)

Integrating this expression from 0 to 1, and using the fact ∫₀¹ F′(x) dx = F(1) − F(0) = 1 again, yields

k = 1 / ∫₀¹ p^(1/3)(x) dx.      (4.76)

Hence, the optimal F′(x) is given by

F′(x) = p^(1/3)(x) / ∫₀¹ p^(1/3)(x) dx.      (4.77)

Integrating once again from 0 to x, we obtain the optimal compressor function,

F(x) = ∫₀ˣ p^(1/3)(x) dx / ∫₀¹ p^(1/3)(x) dx.      (4.78)

The corresponding MSE from Equation 4.73 is

σ²_q = (2/(3N²)) [∫₀¹ p^(1/3)(x) dx]³.      (4.79)

4.11.2 Lloyd–Max Quantizer

Assume that the density function p(x) is continuous and nonzero over the range x₀ < x < x_N. The MSE is given by Equation 4.3, repeated here for convenience:

σ²_q = Σ_{k=1}^{N} ∫_{x_{k−1}}^{x_k} (x − y_k)² p(x) dx,      (4.80)


where x₀, x₁, . . . , x_N are the decision levels and y₁, y₂, . . . , y_N are the output levels. Since the end points x₀ and x_N are given, we have to pick the N − 1 decision levels x₁, x₂, . . . , x_{N−1} and the N output levels y₁, y₂, . . . , y_N so that the MSE is minimized. The necessary conditions for the minimum are as follows:

∂(σ²_q)/∂x_k = 0,   k = 1, 2, . . . , N − 1,      (4.81)

∂(σ²_q)/∂y_k = 0,   k = 1, 2, . . . , N.      (4.82)

These conditions are also sufficient if log[p(x)] is concave (i.e., if (d²/dx²) log[p(x)] < 0 for all x). Using the Leibnitz rule for differentiating:

(∂/∂u) ∫_{α(u)}^{β(u)} H(x, u) dx = ∫_{α(u)}^{β(u)} (∂H(x, u)/∂u) dx + H(β, u)(∂β/∂u) − H(α, u)(∂α/∂u),      (4.83)

we obtain from Equation 4.81,

(x_k − y_k)² p(x_k) − (x_k − y_{k+1})² p(x_k) = 0,      (4.84)

which yields

x_k = (y_k + y_{k+1})/2,   k = 1, 2, . . . , N − 1.      (4.85)

This condition is known as the nearest neighbor rule. It states that the optimum decision levels must be midway between neighboring output levels. Similarly, from Equation 4.82, we obtain

y_k = ∫_{x_{k−1}}^{x_k} x p(x) dx / ∫_{x_{k−1}}^{x_k} p(x) dx,   k = 1, 2, . . . , N,      (4.86)

which is known as the centroid rule. It implies that the optimum y_k must be located at the centroid of the interval (x_{k−1}, x_k).

Since Equations 4.85 and 4.86 cannot be solved directly, iterative procedures are necessary to obtain the decision and output levels. The first procedure, known as the Lloyd–Max II algorithm, is described here. Since the end point x₀ is given, an initial guess of the output level y₁ is made, and x₁ is determined from Equation 4.86 by solving

y₁ = ∫_{x₀}^{x₁} x p(x) dx / ∫_{x₀}^{x₁} p(x) dx;      (4.87)


The variable y2 can then be obtained from Equation 4.85 as

y2 = 2x1 − y1. (4.88)

This procedure is repeated till the end point yN is obtained as

yN = 2xN−1 − yN−1. (4.89)

This value of y_N is compared with the value of y_N obtained directly from Equation 4.86 using the known end point x_N. If they are not close enough, the initial guess y₁ is perturbed in an appropriate direction and the process repeated till the desired accuracy is achieved.

The Lloyd–Max II algorithm finds the optimum quantizer by performing several passes through the set of quantization parameters {y₁, x₁, y₂, x₂, . . . , x_{N−1}, y_N}. However, it does not generalize to the design of vector quantizers. A second algorithm, known as Lloyd–Max I, has broader applicability, and can be used for designing vector quantizers as well as optimum quantizers for data with unknown density functions. The Lloyd–Max I algorithm is summarized below (a small numerical sketch follows the steps):

1. Choose an initial set of output levels: {y₁, y₂, . . . , y_N}.

2. Pick decision levels midway between output levels:

x_k = (y_k + y_{k+1})/2,   k = 1, 2, . . . , N − 1.      (4.90)

3. Pick output levels at centroids of decision regions:

y_k = ∫_{x_{k−1}}^{x_k} x p(x) dx / ∫_{x_{k−1}}^{x_k} p(x) dx,   k = 1, 2, . . . , N.      (4.91)

4. Repeat the last two steps until convergence.
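The following Python sketch runs the Lloyd–Max I iteration (steps 1 through 4 above) on a dense grid; it is a simple illustration with names of our choosing, not a production design routine. As an example density we use p(x) = (3/4)(1 − x²) on [−1, 1], the normalized form of the parabolic PDF that also appears in Problem 4.17.

```python
import numpy as np

def lloyd_max_1(p, x_lo, x_hi, N, iters=200, grid_pts=4001):
    """Lloyd-Max I iteration on a discretized density (steps 1-4 above)."""
    x = np.linspace(x_lo, x_hi, grid_pts)
    w = p(x)                                          # density samples used as weights
    y = np.linspace(x_lo, x_hi, N + 2)[1:-1]          # step 1: initial output levels
    for _ in range(iters):
        d = np.concatenate(([x_lo], (y[:-1] + y[1:]) / 2, [x_hi]))   # step 2: Equation 4.90
        idx = np.clip(np.searchsorted(d, x, side="right") - 1, 0, N - 1)
        for k in range(N):                            # step 3: centroid rule, Equation 4.91
            sel = idx == k
            if w[sel].sum() > 0:
                y[k] = (x[sel] * w[sel]).sum() / w[sel].sum()
    return d, y

d, y = lloyd_max_1(lambda x: 0.75 * (1 - x ** 2), -1.0, 1.0, 4)
print(np.round(d, 3), np.round(y, 3))   # decision levels and output levels of a 4-level design
```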

4.12 Adaptive Quantization

The optimum quantization scheme described in the previous section assumes that one knows the exact density function of the input signal. In practice, however, although the approximate shape of the density function is known, or can be assumed, the power of the signal can fluctuate slowly with time. Hence, to obtain optimum performance over a broad range of power levels, it is necessary to adapt the input signal level to these power fluctuations.

The block diagram of an adaptive quantizer based on the so-called backward estimation of the variance of the input signal is shown in Figure 4.15.


FIGURE 4.15 Adaptive quantizer: the input x(k) is divided by the scale factor Δ(k) before coding, and both coder and decoder estimate Δ(k) from the transmitted code I(k).

In this method, the variance normalizing scale factor Δ(k) is estimated based solely on the transmitted code as shown. Since both the coder and decoder have access to this code, explicit adaption information need not be transmitted to the far end.

A recursive algorithm to estimate the current scale factor from the transmitted code can be derived as follows. We first estimate the variance of the quantized signal x_q(k − 1), which depends on the previously transmitted code I(k − 1) and scale factor Δ(k − 1), and then evaluate its square root to obtain the current scale factor Δ(k). An LPF, with cutoff frequency defined by the parameter α, can be employed to estimate the variance of the quantized signal, as shown in Figure 4.16.

The square of the current scale factor Δ(k) at time k is given by

Δ²(k) = αΔ²(k − 1) + (1 − α)x²_q(k − 1).      (4.92)

The quantized signal xq(k − 1) can be written in general as

xq(k − 1) = Δ(k − 1)Q[I(k − 1)], (4.93)

where Q[·] is the (quantizer) decoder transfer function. Substituting in Equation 4.92, we obtain

Δ²(k) = {α + (1 − α)Q²[I(k − 1)]}Δ²(k − 1).      (4.94)

By taking the square root of both sides, the above expression can be rewritten as

Δ(k) = M[I(k − 1)]Δ(k − 1). (4.95)

FIGURE 4.16 Estimation of the square of the scale factor.


Thus, the scale factor at time k is obtained as the product of the previous scale factor and the function M[·], which depends only on the transmitted code. This function is normally chosen empirically. It should be greater than one for code values corresponding to large signal amplitudes and less than one for codes that result in small signal levels. Note that

Δ(k) = ∏_{j=0}^{k−1} M[I(j)] Δ(0),      (4.96)

where Δ(0) is the initial condition. If the initial condition at the decoder is different from that at the encoder, or if there are transmission errors, the scale factors at the two ends do not track. To minimize the effects of this mistracking, the adaption scheme can be made robust by modifying the recursion [4] with a leakage factor β as follows:

Δ(k) = M[I(k − 1)]{Δ(k − 1)}^β,   β = 1 − ε,      (4.97)

where ε is a small positive constant. In this case the effect of differing initial conditions is “forgotten” after some time due to the presence of the leak factor β. It is also necessary to impose upper and lower limits on the scale factor in order to prevent it from adapting to very low or high values.
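The robust recursion of Equation 4.97 amounts to a one-line update with limiting; the Python sketch below illustrates it with a made-up multiplier table and code sequence (the values are purely illustrative, not those of any standard).

```python
import numpy as np

def robust_scale_update(delta_prev, code, M, beta=0.98, d_min=1e-3, d_max=10.0):
    """One step of the leaky scale-factor recursion of Equation 4.97.
    M: empirically chosen multiplier table indexed by the transmitted code."""
    delta = M[code] * delta_prev ** beta            # Equation 4.97 with beta = 1 - eps
    return float(np.clip(delta, d_min, d_max))      # limit the scale factor as discussed above

M = {0: 0.9, 1: 0.9, 2: 1.25, 3: 1.75}              # >1 for outer codes, <1 for inner codes
delta = 1.0
for code in [3, 3, 2, 0, 0, 1]:                     # an illustrative code sequence
    delta = robust_scale_update(delta, code, M)
    print(round(delta, 4))
```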

4.13 Summary

The main focus of this chapter was to study the performance of memoryless scalar quantizers. We showed that their MSE under no-overload conditions can be expressed as a function of the PDF of the input signal and the quantizer step-size distribution. We then considered the SNR performance of uniform and nonuniform quantizers. The uniform quantizer is simple to implement and yields good SNR for higher-level signals, but performs poorly at lower signal powers. On the other hand, nonuniform quantizers can be designed to achieve specific performance objectives by the proper selection of the compandor. For speech communications, the quantizer should provide a fairly constant SNR over a wide dynamic range of input signals, irrespective of the signal distribution. This leads to the logarithmic companding law, which is unfortunately discontinuous at the origin and hence must be modified. The continuous μ-law and A-law curves are two possible modifications to the logarithmic companding law that provide the desired continuity. For practical implementation, however, these curves are approximated by linear segments to yield the G.711 PCM standards used in practice. In addition to the standardized PCM quantizers, we also discussed the theory of optimum and adaptive quantizers that are employed in the G.726 ADPCM speech coding standard described in the next chapter.


EXERCISE PROBLEMS

4.1. Consider a signal with the exponential PDF:

p(x) = { k e^(−|x|),   |x| ≤ 1,
       { 0,            otherwise.

a. Determine k and σ²_x.

b. Find the MSE and the SNR if this signal is passed through an 8-bit uniform quantizer.

c. Now consider the 8-bit nonuniform quantizer with the compressor characteristic shown in Figure 4.17. (Only the positive half is shown; the other half is symmetrical.) What is the new MSE and SNR for the nonuniform quantizer?

d. Suppose you fix the overload points at −1 and +1 for the uniform and nonuniform quantizers (as above) but lower the input signal power. Let the new density function be

p(x) = { k₁ e^(−|x|),   |x| ≤ 1/4,
       { 0,             otherwise.

Calculate the new SNR for both cases.

4.2. A sine wave of amplitude 1/64 V is quantized by an 8-bit uniform quantizer with overload level of 1 V. Find the SNR in dB. What would be the SNR if the same sine wave is quantized by an 8-bit segmented A-law quantizer with overload level equal to 1 V?

FIGURE 4.17 Compressor characteristic for Problem 4.1 (axis ticks at x = 1/8, 1/4, 1/2, 1 and F(x) = 1/4, 1/2, 3/4, 1).


4.3. Consider the uniform PDF

p(x) = { 1/2,   |x| < 1,
       { 0,     otherwise.

a. Find the signal power σ²_x.

b. Find the MSE and the SNR if this signal is quantized by an n-bit uniform quantizer.

4.4. This problem demonstrates the importance of selecting the proper overload levels in a quantizer.

Suppose a signal with the uniform density function

p(x) = { 1/(2a),   |x| ≤ a,
       { 0,        otherwise

is quantized by an N-level uniform quantizer with the overload points fixed at −1 and +1. Assuming a > 1, find an expression for the total MSE in terms of the number of levels N and the parameter a. If N = 256 and a = 1.1, find the total MSE of the quantizer and compare it with the MSE under no-overload conditions. (Note that the total MSE = σ²_q + σ²_ol, where σ²_q is the quantizing noise when the signal is within the overload limits −1 and +1, and σ²_ol is the overload noise when it is outside these limits.)

4.5. Consider an N-level nonuniform quantizer with the inverse hyperbolic compressor characteristic

F(x) = sinh⁻¹(cx)/sinh⁻¹(c),   0 ≤ x ≤ 1,

where c is a parameter that determines the nonlinearity of the curve. The characteristic for negative signal amplitudes is assumed to be symmetrical, and the input signal x is distributed in the range (−1, +1).

a. Show that the MSE of this quantizer is given by

σ²_q = (sinh⁻¹ c)²(1 + c²σ²_x)/(3N²c²).

b. Prove that the SNR depends only on the signal power σ²_x and not on the specific shape of the input density function p(x). Also show that the SNR is approximately constant for large signal powers, where c²σ²_x ≫ 1.

c. Find the value of c so that the CA of this quantizer is 32.


4.6. Consider an N-level nonuniform quantizer with the square-root compressor characteristic

F(x) = √x, 0 ≤ x ≤ 1.

The characteristic for negative signal amplitudes is assumed to be symmetrical:

a. Derive an expression for the SNR of this quantizer in terms of the number of quantization levels N, and the statistics of the input signal.

b. What is the CA of the quantizer?

4.7. If the signal overload levels of the nonuniform quantizer are −V and +V, instead of −1 and +1, derive the following results:

a. The MSE of the nonuniform quantizer is

σ²_q = (2V²/(3N²)) ∫₀^V p(x) dx/{F′(x)}².

b. The equation for the logarithmic curve that yields a constant SNR is

F(x) = V + k−1 ln(x/V).

c. The continuous μ-law curve compressor characteristic is

F(x) = V ln(1 + μx/V)/ln(1 + μ),   0 ≤ x ≤ V.

d. The continuous A-law compressor characteristic is

F(x) = { Ax/(1 + ln A),                  0 ≤ x ≤ V/A,
       { (V + V ln(Ax/V))/(1 + ln A),    V/A ≤ x ≤ V.

4.8. A signal with the PDF

p(x) = { (3/2)(1 − |x|)²,   |x| ≤ 1,
       { 0,                  otherwise

is quantized with an 8-bit nonuniform quantizer with the compressor characteristic

F(x) = √(2x − x²),   x ≤ 1.

(Only the positive half of the compressor is given; the other half is symmetrical.) Evaluate the MSE and the SNR in dB.


4.9. Consider an N-level nonuniform quantizer with the hyperbolic compressor characteristic

F(x) = (1 + m)x/(1 + mx),   0 ≤ x ≤ 1,

where m is a parameter that determines the nonlinearity of the curve. The characteristic for negative signal amplitudes is assumed to be symmetrical.

a. Derive an expression for the step size Δ(x) of this quantizer at signal amplitude x in terms of the parameter m, the number of quantization levels N, and the signal amplitude x.

b. Find the value of the nonzero signal amplitude x, in terms of the parameter m, at which the logarithmic step size Δ(x)/x achieves its minimum value. What is this value?

c. Find the value of m so that the CA of this quantizer is 32. What would be the SNR of an 8-bit quantizer for a signal uniformly distributed in the range [−1, +1] with this choice of m?

4.10. You are to obtain a table of SNR versus signal power for the continuous μ-law quantizer excited by sine wave input signals. Consider an input sinusoidal signal of the form:

x(t) = a sin(ωt + φ),   a ≤ 1.

If φ is uniformly distributed between 0 and 2π, then the density function of x is

p(x) = { 1/(π√(a² − x²)),   |x| < |a|,
       { 0,                  otherwise.

a. Find an expression for the SNR of the continuous μ-law quantizer, with N levels and overload points at −1 and +1, in terms of the amplitude a and the parameter μ, for sinusoidal inputs.

b. Tabulate the SNR (in dB) for an 8-bit, μ = 255 quantizer for signal powers of 0, −10, −20, −30, −40, and −50 dB, relative to the maximum signal power (a = 1). Assume the range of the quantizer to be −1 to +1. Note that, for example, −20 dB corresponds to a = 0.1.

c. Plot the results of part (b) and on the same graph show the results for an 8-bit uniform quantizer having the same range.

4.11. Consider a segmented approximation to the continuous μ-law compandor (defined for a signal range of −1 to +1) under the following conditions:

i. The number of segments on either side of the origin is S, numbered 0, 1, . . . , S − 1.

ii. The same number of quantization steps Q in each segment.

iii. If Δ_k is the step size in the kth segment, then (Δ_{k+1}/Δ_k) = R, k = 0, 1, . . . , S − 2.

Assuming S and R are arbitrary integers, find the value of the parameter μ (in terms of S and R) that ensures that all the segment end points lie on the


continuous μ-law compression characteristics. What is the CA of this quantizer? Find an expression for the step size in the kth segment.

4.12. Consider a segmented approximation to the logarithmic law compressor F(x) = 1 + k⁻¹ ln(x) that satisfies the following conditions:

i. The number of segments on either side of the origin is S, numbered 0, 1, . . . , S − 1.

ii. The number of quantization steps in each segment is the same, say Q.

iii. If Δ_j is the step size in the jth segment, then

Δ₁/Δ₀ = R − 1   and   Δ_{j+1}/Δ_j = R,   j = 1, 2, . . . , S − 2.

Assuming that S and R are arbitrary integers (S, R > 1), find the value of the parameter k (in terms of S and R) that ensures that all the segment end points, except the origin, lie on the logarithmic law compression characteristics. What is the CA of this quantizer? Find an expression for the step size in the jth segment. What is the approximate SNR (for high signal levels) of an 8-bit quantizer with R = 4 and S = 8?

4.13. In a μ-law coder an input voltage X produces the codeword 01000101 (decimal +69) and a second input voltage Y produces the codeword 10100110 (decimal −38). Find the codeword that would result if the input voltage to the coder is X + Y.

4.14. If a certain voltage level produces the codeword 0 010 1111 (decimal +47) when applied to an A-law encoder with overload levels of +/−1 V, what codeword would result if the same voltage is applied to a μ-law encoder with the same overload levels?

4.15. Suppose you have to conference three parties X, Y, and Z. Assume X and Y reside in the USA, which employs the μ-law PCM standard, while Z is in Europe, which employs the A-law standard. The following decimal codes are received at the conferencing equipment: +64 (μ-law coded) from X, −32 (μ-law coded) from Y, and +64 (A-law coded) from Z. What PCM codes should be sent out to X, Y, and Z? (Note that X should receive the sum of Y and Z, appropriately coded, etc.)

4.16. You are to determine an optimum uniform quantizer (i.e., an optimum quantizer constrained to have uniform steps) for a symmetric density function p(x) that is defined in the range (−∞, +∞) by taking into account both the quantization and overload distortions. Consider the five-level uniform quantizer shown in Figure 4.18 as an example:

a. Show that the optimum value of Δ that minimizes the MSE in the case of the five-level uniform quantizer satisfies the equation

∫_{Δ/2}^{3Δ/2} (x − Δ)p(x) dx + 2 ∫_{3Δ/2}^{∞} (x − 2Δ)p(x) dx = 0.


FIGURE 4.18 Five-level uniform quantizer for Problem 4.16.

b. Show that the general solution for N levels is given by

Σ_{k=1}^{(N−3)/2} ∫_{(2k−1)Δ/2}^{(2k+1)Δ/2} (x − kΔ)p(x) dx + ∫_{(N−2)Δ/2}^{∞} [(N − 1)/2][x − (N − 1)Δ/2] p(x) dx = 0.

4.17. Consider the three-level quantizer shown in Figure 4.19. If the input to the quantizer is a random signal with the density function

p(x) = { k(1 − x²),   −1 ≤ x ≤ 1,
       { 0,           otherwise,

FIGURE 4.19 Three-level uniform quantizer for Problem 4.17.


find the decision level δ and the reconstruction level Δ so that the MSE is minimized. What is the corresponding MSE and the SNR?

4.18. Assume that the random input signal x to an N-level optimum quantizer has a zero-mean continuous density function p(x). Let y denote the output signal, and let q = x − y denote the quantization error. If the decision and reconstruction levels are chosen to satisfy the optimality conditions, show that

a. The mean of the quantization error is zero; that is, E[q] = 0.

b. The quantization error is orthogonal to the output; that is, E[qy] = 0.

c. The input and the quantization error are correlated; the expectation E[qx] = σ²_q.

d. The variance of the output is less than that of the input; the variance σ²_y = σ²_x − σ²_q.

References

1. Panter, P.F. and W. Dite, Quantization distortion in pulse count modulation with nonuniform spacing of levels, Proceedings of the IRE, 39, 44–48, 1951.

2. Lloyd, S.P., Least squares quantization in PCM, IEEE Transactions on Information Theory, IT-28, 127–135, 1982.

3. Max, J., Quantizing for minimum distortion, IEEE Transactions on Information Theory, IT-6, 7–12, 1960.

4. Goodman, D. and R. Wilkinson, A robust adaptive quantizer, IEEE Transactions on Communications, 23, 1362–1365, 1975.

Bibliography

1. Cattermole, K.W., Principles of Pulse Code Modulation, Iliffe Books Ltd, London, 1969.

2. Jayant, N.S. and P. Noll, Digital Coding of Waveforms, Prentice Hall, Englewood Cliffs, NJ, 1984.

3. Gersho, A. and R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, 1991.

4. Messerschmitt, D.G., EECS 290I Digital Transmission Course Notes, University of California, Berkeley, Winter Quarter 1983.

5. Kaneko, H., A unified formulation of segmented companding laws, Bell Systems Technical Journal, 49, 1555–1588, 1970.

6. Montgomery, W.L., Digitally linearizable compandors with comments on project for digital telephone network, IEEE Transactions on Communication Technology, COM-18, 1–4, 1970.


7. ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of Voice Frequencies, International Telecommunication Union, Geneva, 1993.

8. ITU-T Recommendation G.726, Adaptive Differential Pulse Code Modulation (ADPCM) of Voice Frequencies, International Telecommunication Union, Geneva, 1990.


5  Differential Coding

5.1 Introduction

The G.711 PCM standard we discussed in the last chapter employs a memoryless quantizer that acts on the instantaneous values of the signal without taking advantage of the correlation that may exist between adjacent samples. Since neighboring speech samples are generally correlated, the variance of the difference between successive samples will be smaller than that of the original signal. It would therefore be advantageous to quantize the difference signal x(n) − x(n − 1) instead of the instantaneous sample values x(n). This is the basic idea behind differential coding.

First, consider the simple difference signal,

d(n) = x(n) − x(n − 1). (5.1)

Assuming x(n) is a zero-mean stationary process with variance σ²_x, the variance of d(n) is

σ²_d = E[d²(n)] = σ²_x {2(1 − r₁)},      (5.2)

where

r₁ = E{x(n)x(n − 1)}/E{x²(n)}      (5.3)

is the normalized correlation between adjacent samples. It is clear from Equation 5.2 that σ²_d < σ²_x whenever r₁ > 0.5. In such a case it would be better to quantize d(n) instead of x(n) because, for a given number of bits in the quantizer (uniform or nonuniform), the quantization noise power increases with the variance of the input signal. As an example, for Gaussian signals with standard deviation σ, the step size of an n-bit uniform coder is generally chosen as Δ = 4σ/2^(n−1), and the quantization noise power σ²_q = Δ²/12, which is directly proportional to the variance of the signal at the input of the quantizer. Alternatively, to achieve a given end-to-end SNR, differential coding permits a coarser quantizer, thereby reducing the transmission bit rate.


5.2 Closed-Loop Differential Quantizer

Figure 5.1 shows an open-loop differential coding scheme in which the above-mentioned difference signal d(n) = x(n) − x(n − 1) is coded and transmitted instead of the original signal x(n). At the receiver, these codes are first decoded to obtain the quantized difference signal dq(n), which is then accumulated to yield the reconstructed signal xr(n).

In such a configuration, however, the reconstruction error e(n) = x(n) − xr(n) in recovering x(n) is not only the present quantization error q(n) = d(n) − dq(n), but an accumulation of all the previous quantization errors as can be seen from the following:

e(n) = x(n) − xr(n)
     = {x(n − 1) + d(n)} − {xr(n − 1) + dq(n)}
     = {d(n) − dq(n)} + {x(n − 1) − xr(n − 1)}
     = q(n) + {x(n − 1) − xr(n − 1)},   q(n) = d(n) − dq(n)
     = q(n) + q(n − 1) + q(n − 2) + · · · .      (5.4)

Thus, although the mean of the reconstruction error is zero, its variance is (theoretically) unbounded due to the accumulation of all the quantization error terms. It is obviously essential to avoid such an accumulation in order to guarantee adequate SNR for the reconstructed signal. This can be achieved by using the closed-loop scheme shown in Figure 5.2.

In the new scheme, the difference sample d(n) is formed as x(n) − xr(n − 1), instead of x(n) − x(n − 1), where xr(n) is the sample recovered at the receiver. Note that the circuit for recovering xr(n) from dq(n) is duplicated at the transmitter. With this scheme, the reconstruction error is

e(n) = x(n) − xr(n) = x(n) − {xr(n − 1) + dq(n)} = d(n) − dq(n) = q(n).      (5.5)

Hence, the error in reconstructing x(n) is precisely equal to the present quantization error q(n) for the sample d(n), and not an accumulation of previous errors. Note that the variance of the new difference signal x(n) − xr(n − 1) is not the same as that of the simple difference signal x(n) − x(n − 1), which is given by Equation 5.2. However, if xr(n) is close enough to x(n), as is necessary

FIGURE 5.1 Open-loop differential quantizer.


FIGURE 5.2 Closed-loop differential quantizer.

in a good speech coding system, the discrepancy should be minimal. We will discuss the computation of the variance of the new difference signal later.
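The bounded-error property of Equation 5.5 is easy to demonstrate in simulation; the Python sketch below builds a first-order closed-loop differential quantizer (Figure 5.2 with a unit-delay predictor) around a simple mid-riser uniform quantizer of our choosing.

```python
import numpy as np

def closed_loop_dpcm(x, step):
    """First-order closed-loop differential quantizer; a sketch, not a standard codec."""
    xr_prev = 0.0
    xr = np.zeros_like(x)
    for n in range(len(x)):
        d = x[n] - xr_prev                              # difference against the decoded estimate
        dq = step * (np.floor(d / step) + 0.5)          # quantized difference (mid-riser)
        xr[n] = xr_prev + dq                            # reconstruction, duplicated at the decoder
        xr_prev = xr[n]
    return xr

rng = np.random.default_rng(0)
t = np.arange(800)
x = np.sin(2 * np.pi * t / 100) + 0.05 * rng.standard_normal(t.size)   # a correlated test signal
xr = closed_loop_dpcm(x, step=0.05)
print(np.max(np.abs(x - xr)))    # stays within step/2, as Equation 5.5 predicts
```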

5.3 Generalization to Predictive Coding

As a generalization to the simple differential quantizer, consider the modified difference signal,

d(n) = x(n) − a₁x(n − 1).      (5.6)

Here we subtract a₁x(n − 1), where a₁ is a coefficient to be determined, from the current sample instead of x(n − 1). The variance of this modified difference signal is

σ²_d = σ²_x(1 + a₁² − 2a₁r₁),      (5.7)

where r₁ is the first (normalized) autocorrelation coefficient, as defined in Equation 5.3. It is readily verified that the preceding variance is minimized if a₁ = r₁. Substituting for a₁ in Equation 5.7, the optimum variance of the modified difference signal is

σ²_d = σ²_x(1 − r₁²).      (5.8)

The variance reduction factor (1 − r₁²) in this case is always less than or equal to one, and thus this scheme is guaranteed to perform as well as or better than the memoryless quantizer. The inverse of this factor is known as


the prediction gain, and it is given by

G = σ²_x/σ²_d = (1 − r₁²)⁻¹.      (5.9)

The term a₁x(n − 1) in Equation 5.6 is known as the first-order predictor, and the resulting differential coding scheme is referred to as predictive coding. This first-order predictor can be generalized to a predictor of order p, where the estimate of the input signal x(n) is obtained as the linear combination of the past p speech samples:

x_e(n) = Σ_{k=1}^{p} a_k x(n − k).      (5.10)

It can be shown that the optimum set of prediction coefficients that minimizes the variance of the difference signal x(n) − x_e(n) is given by the following matrix equation:

⎡ a₁  ⎤   ⎡ r₀       r₁       · · ·   r_{p−1} ⎤⁻¹ ⎡ r₁  ⎤
⎢ a₂  ⎥ = ⎢ r₁       r₀       · · ·   r_{p−2} ⎥   ⎢ r₂  ⎥
⎢ ⋮   ⎥   ⎢ ⋮        ⋮        ⋱       ⋮       ⎥   ⎢ ⋮   ⎥
⎣ a_p ⎦   ⎣ r_{p−1}  r_{p−2}  · · ·   r₀      ⎦   ⎣ r_p ⎦      (5.11)

where r_k = E{x(n)x(n − k)}/E{x²(n)}, k = 0, 1, . . . , p, are the normalized autocorrelation values.
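Equation 5.11 is a small linear system; the Python sketch below (helper name ours) solves it directly and evaluates the open-loop prediction gain of Equation 5.13, given below, for an illustrative set of autocorrelation values.

```python
import numpy as np

def optimum_predictor(r):
    """Solve Equation 5.11 for the order-p predictor coefficients from normalized
    autocorrelations r = [r0, r1, ..., rp] (r0 = 1), and return the open-loop
    prediction gain of Equation 5.13."""
    p = len(r) - 1
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])   # Toeplitz matrix
    a = np.linalg.solve(R, np.asarray(r[1:]))                             # Equation 5.11
    G = 1.0 / (1.0 - np.dot(a, r[1:]))                                    # Equation 5.13
    return a, G

a, G = optimum_predictor([1.0, 0.9, 0.75])     # illustrative autocorrelation values
print(np.round(a, 3), round(G, 2))             # a1, a2 and the resulting prediction gain
```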

In order to obtain a closed-loop differential coding system, however, instead of utilizing Equation 5.10, the predictor output is generated using the past values of the reconstructed signal as x_e(n) = Σ_{k=1}^{p} a_k x_r(n − k), and the difference signal to be quantized is computed as x(n) − x_e(n). Figure 5.3 depicts this generalized predictive coding scheme.

Note that in this case also, the reconstruction error e(n) is just the quantization error q(n) as shown below:

Note that in this case also, the reconstruction error e(n) is just thequantization error q(n) as shown below:

e(n) = x(n) − xr(n) = x(n) − {xe(n) + dq(n)} = d(n) − dq(n) = q(n). (5.12)

In fact this result holds good even if an arbitrary signal is subtracted before quantizing as long as the same signal is added later to the quantizer output. By choosing the subtracted value to be the signal estimate x_e(n), however, the variance of the difference signal d(n) can be minimized, which is the primary objective in differential coding.

If the predictor coefficients are chosen to be the optimum values given by Equation 5.11, the prediction gain for the general (open-loop) predictive


FIGURE 5.3 Generalized predictive coding.

coding system can be shown to be

G = [1 − Σ_{k=1}^{p} a_k r_k]⁻¹.      (5.13)

But, as explained in the following section, the actual closed-loop prediction gain will be less.

5.3.1 Optimum Closed-Loop Predictor

We derived the optimum predictor coefficients and the corresponding prediction gain in the previous section assuming an open-loop system. But, as explained earlier, practical differential coding systems must use the closed-loop paradigm in order to prevent the accumulation of quantization errors. It is generally difficult to analyze such systems since the difference signal is contaminated by the quantization noise that is fed back via the predictor. Nevertheless, by using certain simplistic assumptions about the quantizer, we can get better estimates for the optimum predictor coefficients and the corresponding prediction gain in the closed-loop case.

We will illustrate this technique for the first-order DPCM coder depicted in Figure 5.4. In the open-loop case, the optimum first-order predictor coefficient is a₁ = r₁, and the corresponding prediction gain is G = (1 − r₁²)⁻¹, as shown above. These values will be different in the closed-loop case, as we will demonstrate below. We assume that the reconstructed signal xr(n) can be modeled as the input signal x(n) contaminated by zero-mean uncorrelated additive quantization noise q(n) = d(n) − dq(n), whose variance is σ²_q. Furthermore, we assume that the quantizer can be characterized by the noise-to-signal ratio ρ = σ²_q/σ²_d, which is the reciprocal of the SNR. For a multibit quantizer, ρ is a positive


FIGURE 5.4 DPCM coder with first-order predictor.

number that is generally much smaller than unity. Its value depends primarily on the number of bits and the loading factor of the quantizer.

Referring to Figure 5.4, the difference signal is given by

d(n) = x(n) − a1xr(n − 1) = x(n) − a1{x(n − 1) − q(n − 1)}, (5.14)

where we have used the fact that in the closed-loop differential coding system the reconstruction error equals the quantization error; that is, x(n − 1) − xr(n − 1) = q(n − 1). The variance of the difference signal is

σ²_d = σ²_x + a₁²(σ²_x + σ²_q) − 2a₁r₁σ²_x.      (5.15)

The optimum value of the predictor coefficient a₁ that minimizes σ²_d is obtained by setting the derivative (∂/∂a₁)σ²_d to zero. This yields

a₁ = r₁/(1 + (σ²_q/σ²_x)) = r₁/(1 + (ρ/G)),      (5.16)

where, as assumed above, ρ = σ²_q/σ²_d is the noise-to-signal ratio of the quantizer, and G = σ²_x/σ²_d is the prediction gain. It is clear from Equation 5.16 that the optimum closed-loop predictor coefficient is a scaled version of the open-loop value r₁. The scaling factor depends on the overall noise-to-signal ratio σ²_q/σ²_x = ρ/G. It will be close to unity if the number of levels in the quantizer is reasonably large.

In order to determine the closed-loop prediction gain G, substitute Equation 5.16 in Equation 5.15 to obtain the corresponding variance of the difference signal,

σ²_d = σ²_x [1 − r₁²/(1 + (ρ/G))].      (5.17)


Using the definition G = σ²_x/σ²_d and rearranging, Equation 5.17 can be written as a quadratic equation in G:

G²(1 − r₁²) − G(1 − ρ) − ρ = 0.      (5.18)

Solving this equation yields the optimum closed-loop prediction gain,

G =1 − ρ +

√(1 − ρ)2 + 4ρ

(1 − r2

1)

2(1 − r2

1) . (5.19)

Only the positive sign before the radical in Equation 5.19 is chosen since G isalways greater than zero. Note that Equation 5.19 simplifies to G = (

1 − r21)−1

if ρ ≈ 0.Since the reconstruction error in the closed-loop system is identical to the

quantization error, the overall SNR in such a system can be expressed as theproduct of the prediction gain and the SNR of the quantizer as follows:

SNRDPCM = σ2x

σ2q

=(

σ2x

σ2q

)(σ2

dσ2

q

)= G × SNRquantizer. (5.20)

Hence, the prediction gain determines the improvement in the overall SNRof the differential coding system compared with memoryless quantization.

5.3.2 Adaptive Prediction

A typical plot of the prediction gain G as a function of the predictor orderp would show that it saturates to a modest value at about p = 3 due to thenonstationary nature of speech signals. That is, higher gains are not achievablewith fixed predictors of higher order. Thus, in order to improve the systemperformance, the predictors have to be designed to adapt to the local statisticsof the input signal.

Since the predictor coefficients a1, a2, . . . , ap are functions of r0, r1, . . . , rp,one method for adapting them is to estimate the autocorrelation at frequentintervals and then solve for the optimum coefficients using Equation 5.11.However, in this scheme, known as forward adaption, these adapted valueshave to be communicated to the receiver as side information so that the pre-dictors at the two ends can track one another. Furthermore, the input has tobe buffered in order to estimate the autocorrelation values, thereby incurringcoding delays. These disadvantages can be overcome with a backward adaptionscheme in which the predictor is adapted solely on the basis of the transmittedcode as described below.

Instead of estimating the autocorrelation function and then solvingEquation 5.11 for the optimum predictor coefficients, an equivalent adaption

Page 135: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

118 Principles of Speech Coding

strategy is to adjust them so that the variance of the difference signal d(n) isminimized. But the receiver has no knowledge of d(n) either. Hence a sub-optimal scheme that minimizes the variance of the quantized difference signaldq(n), which is known to both the ends, is generally employed in practice.

The least mean squares (LMS) algorithm is one of the most popular methodsto accomplish this minimization. The minimum is achieved by iterativelyupdating the coefficients in a direction opposite to the gradient of the squareof the instantaneous quantized difference signal. This gradient is given by

∂ai

{d2

q(n)}

= 2dq(n)∂

∂ai

{dq(n)

}, i = 1, 2, . . . , p. (5.21)

To find (∂/∂ai)[dq(n)], note that

dq(n) = xr(n) − xe(n) = xr(n) −p∑

i=1

aixr(n − i). (5.22)

Hence,∂

∂ai{dq(n)} = −xr(n − i), i = 1, 2, . . . , p. (5.23)

Thus, the LMS update equations for updating the predictor coefficients canbe written as

ai(n + 1) = ai(n) + μdq(n)xr(n − i), i = 1, 2, . . . , p, (5.24)

where μ is the step size that determines the adaption speed. Polarity cross-correlation in the update term is often used to simplify the implementationas follows:

ai(n + 1) = ai(n) + μ sgn[dq(n)]sgn[xr(n − i)], i = 1, 2, . . . , p. (5.25)

5.4 ITU G.726 ADPCM Algorithm

Figure 5.5 is a simplified block diagram of the ITU G.726 ADPCM algorithm.The G.726 encoder assumes 8-bit PCM (μ-law or A-law) code at the inputand converts it to 2-, 3-, 4-, or 5-bit ADPCM code, which corresponds to 16,24, 32, or 40 kbps transmission rates, respectively. The decoder performs thereverse translation. Conversion between the PCM and linear codes is notshown in the figure. In addition, the ITU algorithm incorporates a “tone andtransition detector” function to improve the performance for frequency shiftkeyed (FSK) signals and a “synchronous coding adjustment” in the decoder

Page 136: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 119

Adaptivecoder

Adaptivedecoder

Adaptivedecoder

+

Adaptivepredictor

Adaptivepredictor

Signalestimate

se(k)

Inputsignals1(k)

Differencesignald(k)

Transmitted code I(k)

Reconstructedsignal sr(k)

Quantizeddiff. signal

dq(k)

Signalestimate

se(k)

Quantizeddiff. signal

dq(k)Reconstructedsignal sr(k)

Encoder Decoder

∑ ∑

FIGURE 5.5 Simplified block diagram of the G.726 ADPCM algorithm.

to reduce the accumulation of quantization errors during tandem code con-versions between PCM and ADPCM. These refinements are also not shown.Furthermore, we will restrict our discussion to just the 32 kbps rate in whatfollows.

Referring to the encoder in Figure 5.5, an estimated value se(k) of the input,obtained from the adaptive predictor, is subtracted from its actual value sl(k)

to produce a difference signal d(k). This difference signal is encoded with fourbits using an adaptive coder. The adaptive decoder generates the quantizedversion dq(k) of this difference signal, which is used to drive the adaptivepredictor that generates the signal estimate. The reconstructed signal sr(k)

is the sum of the signal estimate and the quantized difference signal. (Theadaptive coder and decoder are referred to as the adaptive quantizer andinverse quantizer, respectively, in the G.726 standard.) The functional blocksin the decoder are similar to their counterparts in the encoder.

5.4.1 Adaptive Quantizer

The 4-bit adaptive quantizer employed in the G.726 algorithm is an optimumquantizer that assumes that the input density function is Gaussian. Its range isadapted according to the variance of the difference signal. The adaption scalefactor is computed in the logarithmic domain in order to simplify the signalmultiplication and division operations. The rate of adaption is chosen to befast for speech-like signals that yield a difference signal with rapid powervariations, and slow for voice band data signals that produce a differencesignal with relatively constant power.

Page 137: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

120 Principles of Speech Coding

+ –

Log Optimum coder

Optimumdecoder

d(k)

al(k)

Scale factor

adaption

Speedcontrol

parameter

Sign bit

Signbit

Antilog

I(k)

dq(k)

y(k)log scale

factor

Adaptive coder

Adaptive decoder

FIGURE 5.6 Adaptive quantizer.

A block diagram of the adaptive quantizer is shown in Figure 5.6. Thedecision and reconstruction levels of the 4-bit quantizer are shown in Table 5.1.The table is delineated for positive values only and the negative half issymmetrical.

TABLE 5.1

Quantizer Normalized Input/Output Characteristic for 32 kbps Operation

Normalized Quantizer Absolute ADPCM Normalized Quantizer

Input Range log2 |d(k)| − y(k) Code |I(k)| Output log2 |d(q)| − y(k)

[3.12, +∞) 7 3.32[2.72, 3.12) 6 2.91[2.34, 2.72) 5 2.52[1.91, 2.34) 4 2.13[1.38, 1.91) 3 1.66[0.62, 1.38) 2 1.05[−0.98, 0.62) 1 0.031[−∞, −0.98) 0 −∞

Page 138: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 121

TABLE 5.2

The W(·) Function for 32 kbps Operation

|I(k)| 7 6 5 4 3 2 1 0W[I(k)] 70.13 22.19 12.38 7.00 4.00 2.56 1.13 −0.75

5.4.1.1 Quantizer Scale Factor Adaption

As stated earlier, the log scale factor, denoted by y(k), is adapted such thatit changes rapidly for difference signals with large power fluctuations butslowly for difference signals with small variations. It is formed as a linearcombination of a fast (unlocked) scale factor yu(k) and a slow (locked) scalefactor yl(k):

y(k) = al(k)yu(k − 1) + [1 − al(k)]yl(k − 1), (5.26)

where the speed control parameter a1(k) tends toward one for speech and zerofor voice band data. The fast scale factor yu(k) is updated using the robustadaption strategy:

yu(k) = (1 − 2−5)y(k) + 2−5W[I(k)], 1.06 ≤ yu(k) ≤ 10. (5.27)

The leak factor (1 − 2−5) introduces finite memory to aid encoder/decodertracking recovery following transmission errors. The function W(·) is shownin Table 5.2. Filtering yu(k) with a LPF yields the slow scale factor yl(k):

yl(k) = (1 − 2−6)yl(k − 1) + 2−6yu(k). (5.28)

5.4.1.2 Quantizer Adaption Speed Control

The speed control parameter al(k) should be such that it tends toward onefor difference signals with large power fluctuations and toward zero other-wise. As illustrated in Figure 5.7, it is estimated by comparing the short- andlong-term averages of a (nonlinear) measure of the rectified difference signal.These averages will essentially be the same for constant power signals such as

Short-term average

Long-term average

Compare

Measure of therectified

differencesignal Speed control

parameter1, for speech0, for data

FIGURE 5.7 Estimation of the speed control parameter.

Page 139: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

122 Principles of Speech Coding

voice band data, whereas they will be different for speech-like signals whosepower fluctuates considerably. Furthermore, the value of al(k) estimated atthe receiver must track its counterpart in the transmitter. The measure of therectified signal in Figure 5.7 is obtained by applying a nonlinear function F(.)to the transmitted ADPCM code, as it is the only common entity known toboth ends. This function is specified in Table 5.3.

The short-term average dms(k) and the long-term average dml(k) arecomputed using the following equations:

dms(k) = (1 − 2−5)dms(k − 1) + 2−5F[I(k)],

dml(k) = (1 − 2−7)dml(k − 1) + 2−7F[I(k)].

(5.29)

The comparison of these averages to obtain the speed control parameterinvolves many steps. First, a decision variable x(k) is computed that is forcedto one if these averages differ substantially or to zero otherwise. Additionally,x(k) is also set to one if the channel is idle, as indicated by a small scale factor,or for partial-band signals, which is signaled by the value of the predictorcoefficient a2. This unlocks the quantizer, thereby allowing rapid adaption,under such conditions. The computation of x(k) is delineated below:

x(k) =

⎧⎪⎪⎪⎨⎪⎪⎪⎩

1

⎧⎪⎨⎪⎩

|dms(k) − dml(k)| > 2−3dml(k)

y(k) < 3a2(k) < −0.71875

0 otherwise.

(5.30)

Next, the decision variable x(k) is filtered to yield a smoother estimate ap(k):

ap(k) = (1 − 2−4)ap(k − 1) + 2−3x(k). (5.31)

Since the gain of the above LPF is two, ap(k) tends toward two if x(k) has aconstant value equal to one, while it approaches zero if x(k) = 0. This is thenasymmetrically limited to get the actual speed control parameter,

al(k) ={

1 ap(k − 1) > 1,ap(k − 1) otherwise.

(5.32)

The asymmetrical limiting delays the start of a fast-to-slow transition untilthe ADPCM code I(k) remains constant for some time, resulting in betterspeech quality.

TABLE 5.3

The F(·) Function for 32 kbps Operation

|I(k)| 7 6 5 4 3 2 1 0F[I(k)] 7 3 1 1 1 0 0 0

Page 140: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 123

5.4.2 Predictor Structures and Adaption

With a standard predictor of the form P1(z) = ∑pi=1 aiz−i, the receiver transfer

function has only poles, as shown in Figure 5.8a. If there are more than twopoles, as is generally required, it is difficult to guarantee stability of the systemwhile adapting the predictor coefficients. On the other hand, it is possible tohave an all-zero receiver transfer function by driving the predictor directlywith the quantized difference signal, as illustrated in Figure 5.8b. Althoughstability is guaranteed, this arrangement does not yield good performancefor low-frequency signals.

dq(k)

dq(k)

dq(k)

sr(k)

sr(k)

sr(k)

se(k)

se(k)

se(k)

se(k)

se(k)

se(k)

sr(k)

sr(k)

dq(k)

dq(k)

d(k)

d(k)

sr(k)dq(k)d(k)

+ sl(k)

sl(k)

sl(k)

(a)

(b)

(c)

Coder

Coder

Coder

Decoder

Decoder

Decoder

Decoder

Decoder

Decoder

P1

P1

P2

+

P2

+

P2

P2

P1

P1

FIGURE 5.8 Predictor structures: (a) receiver transfer function has only poles; (b) receivertransfer function has only zeros; and (c) receiver transfer function has both poles and zeros.

Page 141: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

124 Principles of Speech Coding

In the ITU algorithm, a combination of the two predictors P1 and P2 is used,as shown in Figure 5.8c. P1 is restricted to second order so that stability can beeasily guaranteed, and six zeros are employed in P2 for achieving the desiredperformance. The overall receiver transfer function of the pole-zero predictorshown in Figure 5.8c is

Sr(z)Dq(z)

= 1 + P2(z)1 − P1(z)

, (5.33)

where P1(z) = ∑2i=1 aiz−i and P2(z) = ∑6

i=1 biz−i.The predictor coefficients are updated using a simplified gradient algo-

rithm. From Figure 5.8c, the signal estimate is given by

se(k) =2∑

i=1

ai(k − 1)sr(k − i) +6∑

i=1

bi(k − 1)dq(k − i), (5.34)

and the reconstructed signal is sr(k) = se(k) + dq(k). Note that the quantizeddifference signal is given by

dq(k) = sr(k) − se(k). (5.35)

The gradients of the square of the instantaneous quantized difference signalare

∂ai{d2

q(k)} = 2dq(k)∂

∂ai{dq(k)} = −2dq(k)sr(k − i), i = 1, 2,

∂bi{d2

q(k)} = 2dq(k)∂

∂bi{dq(k)} = −2dq(k)dq(k − i), i = 1, 2, . . . , 6.

(5.36)

A simplified polarity cross-correlation scheme is used for updating the bi:

bi(k) = (1 − 2−8)bi(k − 1) + 2−7 sgn[dq(k)]sgn[dq(k − i)], i = 1, 2, . . . , 6.

(5.37)

This updating restricts all the bi to the range ±2.The coefficients a1 and a2 corresponding to the poles are updated slightly

differently. It is based on the zero-based reconstruction signal

p(k) = dq(k) + sez(k), where sez(k) =6∑

i=1

bi(k − 1)dq(k − i) (5.38)

and the function

f (a) ={

4a, |a| ≤ 12 ,

2sgn(a), |a| > 12 .

(5.39)

Page 142: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 125

The update algorithm is

a1(k) = (1 − 2−8)a1(k − 1) + 3 × 2−8sgn[p(k)]sgn[p(k − 1)],a2(k) = (1 − 2−7)a2(k − 1) + 2−7

× {sgn[p(k)]sgn[p(k − 2)] − f [a1(k − 1)]sgn[p(k)]sgn[p(k − 1)]},(5.40)

with the stability constraints:

|a2(k)| ≤ 0.75, |a1(k)| ≤ 1 − 2−4 − a2(k). (5.41)

This modification yields improved encoder/decoder tracking performancefor dual tone multifrequency (DTMF) signals in the presence of transmissionerrors and also improves the narrowband SNR.

5.5 Linear Deltamodulation

Deltamodulation is the simplest form of differential coding in which thequantizer is restricted to just two levels. It is a 1-bit coding system, andtherefore the transmission rate is the same as the sampling rate. No cur-rently standardized speech coding algorithms are based on deltamodulation,however, as multibit differential coding generally yields better performance.Nevertheless, it is worthwhile discussing this scheme because of its uttersimplicity.

Figure 5.9 shows a block diagram of a linear deltamodulation (LDM) sys-tem. Here, the encoder generates a 1-bit code I(n), where the two possiblecodes, zero and one, are interpreted as the values +1 and −1, respectively.This is multiplied by the step size Δ to yield the quantized difference signal

x(n) d(n)

a1z–1

a1z – 1

I(n)+

a1xr(n – 1)

a1xr(n – 1)

dq(n)

xr(n)dq(n)

Encoder Decoder

xr(n)

X

X

Coder

LPF Output

∑ ∑

Δ Δ

1

–1

FIGURE 5.9 Linear deltamodulation.

Page 143: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

126 Principles of Speech Coding

dq(n). The combination of these two operations is equivalent to a two-levelquantizer with output levels ±Δ. A first-order predictor, with predictorcoefficient a1, yields the signal estimate.

Because coarse 1-bit coding is used in deltamodulation, the sampling fre-quency of the input signal has to be many times the Nyquist rate to provideadequate performance. Therefore, a decimating LPF is employed at the out-put of the decoder to attenuate the out-of-band quantization noise, therebyimproving the SNR.

5.5.1 Optimum 1-Bit Quantizer

Assume that the quantizer input d(n) has a zero-mean symmetric den-sity function p(d) and that the output dq(n) is restricted to the two levels±Δ. The variance σ2

q of the quantization noise q(n) = d(n) − dq(n) is thengiven by

σ2q = 2

∞∫

0

(d − Δ)2p(d)dd = σ2d − 2ΔE(|d|) + Δ2. (5.42)

The optimum step size that minimizes this function is

Δopt = E(|d|) = σd

F. (5.43)

Here, F = σd/E(|d|), the ratio of the rms to the rectified average of the signal,is known as the form factor of the density function. Its value is listed in Table 5.4for typical distributions. The corresponding quantization noise power fromEquation 5.42 is

σ2q = σ2

d − [E(|d|)]2 = σ2d(F2 − 1)

F2 = ρσ2d, (5.44)

where

ρ = σ2q

σ2d

= F2 − 1F2 (5.45)

denotes the noise-to-signal ratio of the optimum 1-bit quantizer. Its reciprocal,the optimum SNR (in dB), is listed in Table 5.4.

5.5.2 Optimum Step Size and SNR

Since the statistics of the difference signal d(n) are not known, we will expressit in terms of the statistics of the input signal x(n) and the characteristics of

Page 144: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 127

TABLE 5.4

Form Factor of Typical Distributions and the SNR of the Optimum 1-bit Quantizer

Optimum SNR (dB) of the 1-Bit

Density Function p(d) Form Factor F = σd/E(|d|) Quantizer 10 log10(1/ρ)

Uniform 2/√

3 6.02Gaussian

√π/2 4.4

Laplacian√

2 3.01Arcsine π/2

√2 7.23

the optimal 1-bit quantizer. The signal d(n) can be written as

d(n) = x(n) − a1xr(n − 1)

= x(n) − a1x(n − 1) + a1{x(n − 1) − xr(n − 1)}= x(n) − a1x(n − 1) + a1q(n − 1), (5.46)

where the last step follows from the fact that the reconstruction error in closed-loop differential coding is just the quantization error. Assuming that this erroris independent of the input signal, we can compute the variance of d(n) bysquaring both sides of the above equation and taking expectations. This yields

σ2d = (1 + a2

1 − 2a1r1)σ2x + a2

1σ2q, (5.47)

where r1 = E{x(n)x(n − 1)}/E{x2(n)}. Substituting σ2q = ρσ2

d, we can solvefor σ2

d:

σ2d = 1 + a2

1 − 2a1r1

1 − ρa21

σ2x. (5.48)

The optimum prediction gain of the LDM therefore is

G = σ2x

σ2d

= 1 − ρa21

1 + a21 − 2a1r1

. (5.49)

The corresponding SNR is

SNRDM = σ2x

σ2q

= σ2x

σ2d

σ2d

σ2q

= Gρ

= (1/ρ) − a21

1 + a21 − 2a1r1

. (5.50)

Using Equations 5.43 and 5.48, we can compute the optimum step size as

Δopt = σd

F= σx

F

√√√√1 + a21 − 2a1r1

1 − ρa21

. (5.51)

Page 145: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

128 Principles of Speech Coding

Thus, although the statistics of the difference signal d(n) are not known,we can still estimate the SNR of the LDM system provided that the shape ofp(d) is known. But this is generally not the case even when the input densityfunction p(x) is given. However, in the special case where the input signaland the quantization noise are both assumed to be Gaussian, the differencesignal will also be Gaussian.

5.5.2.1 Special Cases

Consider the case of perfect integration that implies a1 = 1. Here, the SNRand the optimum step-size relations (Equations 5.50 and 5.51) simplify to

SNRDM = 12(1 − r1)(F2 − 1)

, Δopt = √2(1 − r1)σx, (5.52)

where we have substituted for ρ from Equation 5.45.Next, we consider the case of the optimal open-loop predictor a1 = r1. Here,

the corresponding relations are

SNRDM = 1F2 − 1

(F2 + r2

1

1 − r21

), Δopt = σx√

F2 + [r21/(1 − r2

1)]. (5.53)

As mentioned earlier, the SNR at the output of the filter, denoted bySNRFDM, will be better as the LPF rejects the out-of-band noise. Assuming anideal LPF with a cutoff frequency fc, an input sampling frequency fs, and thatthe spectrum of the quantization noise is uniformly distributed in the range0 − fs/2, we can write

SNRFDM = SNRDM(fs/2)

fc. (5.54)

5.5.2.2 SNR for Sinusoidal Inputs with Perfect Integration

Let the input signal be x(t) = A sin(2πft + φ), which is characterized bythe arcsine density function. Its autocorrelation function can be shown tobe R(τ) = (A2/2) cos 2πf τ. If the input sampling frequency is fs, the firstnormalized autocorrelation coefficient is given by

r1 = R(1/fs)A2/2

= cos 2πffs

≈ 1 − 2π2f 2

f 2s

, (5.55)

where the indicated approximation is valid if fs � f . Since σx = A/√

2, fromEquation 5.52, the optimum step size is

Δopt = 2πffs

A√2

. (5.56)

Page 146: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 129

If we assume that the density function of the difference signal is arcsine(which is not quite accurate, as the sum of arcsine and Gaussian densityfunctions does not yield an arcsine density function), the filtered SNR, usingEquations 5.52 and 5.54, is given by

SNRFDM = 0.0542f 3s

f 2fc. (5.57)

Note that, for a given input frequency, SNRFDM varies as the cube of fs.This implies that the filtered SNR, for deltamodulation, increases by 9 dB ifthe sampling frequency is doubled, whereas it increases by only 6 dB for PCM.

5.6 Adaptive Deltamodulation

Since the step size Δ in an LDM system is fixed, it has to be chosen as a com-promise between limiting slope overload distortion and excessive granularnoise. If it is too small, the SNR is primarily determined by slope overload,whereas granular noise dominates if it is too large. Thus, it would be desir-able to vary the step size so that the combined distortion due to these twofactors is minimized. Specifically, Δ should be increased when a steep slopein the input signal is encountered, and decreased if the amplitude variationsare small. Since the receiver must also be aware of the manner in which thestep size is altered as a function of time, the adaption algorithm is normallybased on just the transmitted code sequence. Figure 5.10 shows the encoderhalf of such an adaptive deltamodulation (ADM) system.

A simple adaption algorithm, due to Jayant [1], sets the next step size to amultiple of the current value based on the current and previous code words

X Adaptionalgorithm

a1xr(n–1)

x(n) +

d(n) 1

–1

I(n)

Coder

a1z–1

dq(n)

xr(n)

Δ(n)

Encoder

FIGURE 5.10 Adaptive deltamodulation.

Page 147: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

130 Principles of Speech Coding

as follows:

Δ(n + 1) ={

max[M1Δ(n), Δmax] if I(n) = I(n − 1),min[M2Δ(n), Δmin] if I(n) �= I(n − 1),

(5.58)

where M1 > 1 and M2 < 1. The generation of identical consecutive codewords signals a positive or negative slope overload condition. In such situa-tions, the step size should be increased to follow the rapidly changing input,whereas it should be decreased otherwise. Typical values for the multiplica-tion factors are M1 = 1.2 and M2 = 0.7. As delineated in the above equation,the step size should be bounded by the limiting values Δmax and Δmin. A20 dB span is generally allowed on the step size variations, which impliesΔmax/Δmin = 10.

The continuously variable slope deltamodulation (CVSD) is an improvedversion of the simple adaption scheme described above. In this case, the stepsize is adapted as follows:

Δ(n + 1) ={

βΔ(n) + D1 if I(n) = I(n − 1) = I(n − 2),βΔ(n) + D2 otherwise,

(5.59)

where D1 � D2 > 0 and β < 1. Typical values for these parameters are β =0.95 and D1/D2 = 10. Note that the generation of three (or more) identical codewords is used to indicate slope overload here. This indication in conjunctionwith the current step size determines the next value Δ(n + 1). Explicit limitingof the step size is not needed, as they are automatically bounded by the valuesΔmin = D2/(1 − β) and Δmax = D1/(1 − β).

5.7 Summary

The central idea behind differential coding is to quantize a difference signalobtained by subtracting the speech signal from its predicted value. We showedthat, for correlated inputs, the variance of this difference signal is smaller.Hence it is advantageous to code and transmit it instead of the original speech.Such a scheme improves the overall end-to-end SNR by the variance reductionfactor, which is known as the prediction gain. We also demonstrated that it isimportant to adhere to the closed-loop paradigm in differential coding sys-tems; otherwise, the quantization error accumulates and seriously degradesthe system performance.

Although fixed prediction yields modest performance improvements, itis essential to adapt the predictor coefficients to obtain better prediction

Page 148: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 131

gains. In order to avoid transmitting side information, the adaption algo-rithm should be based only on the quantizer output code so that the receivercan track the predictor states at the transmitter. The ITU G.726 ADPCM stan-dard is a differential coding system based on adaptive prediction. In addition,it also employs an adaptive optimal quantizer to improve the overall SNRperformance for both speech and voice-band data signals.

Deltamodulation is a particular differential coding system that uses just a1-bit quantizer. While it is not specified in any of the speech coding standards,it is informative to study this simple scheme as it is easy to implement andanalyze. Besides speech, it has possible applications in digitizing other typesof signals.

EXERCISE PROBLEMS

5.1. Consider a DPCM system for digitizing a first-order Markov source in whichthe normalized autocorrelation values are related as r2 = r2

1. Suppose you havea choice between two predictors:

x(n) = x(n − 1) and x(n) = 2x(n − 1) − x(n − 2).

Explain which one you would choose and the reason for your choice. (Hint:Calculate the variance of the prediction error for both the cases and compare.)

5.2. This problem is about computing the SNR advantage (prediction gain) of aDPCM system for quantizing sine wave signals. Consider a sinusoidal inputsignal x(t) = sin(2πft), where f = 1000 Hz. It can be shown that the normalizedautocorrelation function of the sine wave at lag τ is given by

r(τ) = E[x(t)x(t + τ)]E[x2(t)] = cos(2πf τ).

If this sine wave is sampled at 8000 Hz and differentially quantized, find theprediction gain (i.e., the ratio of the variance of the original signal to that of thedifference signal) for the following two cases of the difference signals: (a) d(n) =x(n) − x(n − 1) and (b) d(n) = x(n) − r1x(n − 1). Note that at the sampling rateof 8000 Hz, r1 = r(τ = 1/8000).

5.3. Consider the estimator

x(n) = a1x(n − 1) + a2x(n + 1)

for estimating the value of x(n) from x(n − 1) and x(n + 1). Assuming that thesignal x(n) is stationary with zero mean, find a1 and a2 in terms of the normalizedautocorrelation values of x(n) to minimize the mean square estimation error:

σ2d = E[{x(n) − x(n)}2].

5.4. A discrete time stationary zero-mean process has normalized autocorrelationvalues r1 = 0.75 and r2 = 0.5. An optimal second-order predictor is used in a

Page 149: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

132 Principles of Speech Coding

DPCM coder for the input process. The difference signal is quantized by a 6-bit uniform quantizer with a loading factor of 4. (The loading factor V/σ of aquantizer is the ratio of the overload level to the rms value of the input to thequantizer.) Find the optimal predictor coefficients, the prediction gain, and theapproximate overall SNR of the DPCM system assuming negligible overloading.

5.5. Derive the matrix Equation 5.11 for the optimum predictor of order p and thecorresponding predictor gain expression (Equation 5.13).

5.6. In this problem you are to find the prediction gain of a DPCM system thatemploys a first-order predictor, taking into account the quantization noisethat is fed back. Consider the DPCM coder shown in Figure 5.4. The inputx(n) has zero mean and variance σ2

x and first normalized autocorrelationcoefficient r1. Assume that the reconstructed signal xr(n) can be modeledas x(n) contaminated by zero-mean uncorrelated additive noise of varianceσ2

q. Also assume that the quantizer can be characterized by the noise-to-signal ratio ρ (that depends on the input PDF and the type of the quantizer,and is normally much less than 1), which is defined by ρ = σ2

q/σ2d, where

q(n) = d(n) − dq(n).The prediction gain of a open-loop differential coder with the optimum first-

order predictor is G = 1/(1 − r21). However, the closed-loop gain will be lower

due to quantization noise feedback. If a1 = r1, show that the actual predictiongain is given by

G = σ2x

σ2d

= 1 − ρr21

1 − r21

.

5.7. This problem is concerned with the optimum pth-order predictor in a DPCMsystem, taking into account the quantization noise that is fed back. Consider thegeneralized predictor system shown in Figure 5.3. The input x(n) has zero mean,variance σ2

x, and normalized autocorrelation coefficients r1, r2, . . . , rp. Assumethat the quantization error q(n) is a zero-mean stationary white noise sequenceof variance σ2

q, and it is uncorrelated with x(n); that is, E[x(n − j)q(n − k)] = 0,for all n, j, and k. If the predictor coefficients a1, a2, . . . , ap are chosen to minimizethe variance σ2

d of the difference signal d(n), show that

a. The difference signal is orthogonal to the signal estimate xe(n), that is,E[d(n)xe(n)] = 0. (Hint: First show that E[d(n)xr(n − k)] = 0, k = 1, 2, . . . , p.)

b. The optimal predictor coefficients satisfy the following equations:

⎡⎢⎢⎢⎢⎢⎣

1 + (ρ/G) r1 · · · rp−1

r1 1 + (ρ/G) · · · ......

......

rp−1 rp−2 · · · 1 + (ρ/G)

⎤⎥⎥⎥⎥⎥⎦

⎡⎢⎢⎢⎢⎢⎣

a1

a2...

ap

⎤⎥⎥⎥⎥⎥⎦ =

⎡⎢⎢⎢⎢⎢⎣

r1

r2...

rp

⎤⎥⎥⎥⎥⎥⎦

where the overall noise-to-signal ratio is ρ/G = σ2q/σ2

x.

5.8. Prove that the second-order (pole) predictor used in the G.726 algorithm is stableif and only if

|a2| < 1 and |a1| < 1 − a2.

Page 150: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Differential Coding 133

(Note that you have to prove that the roots of the polynomial z2 − a1z − a2will be within the unit circle under the stated conditions.)

5.9. Derive the expressions (Equations 5.52 and 5.53) for the optimum SNR and thecorresponding step size of the linear deltamodulator.

5.10. Consider the optimal open-loop predictor a1 = r1 in the linear deltamodulator.Assuming that the input signal has a Gaussian density function with zero mean,variance σ2

x, and first normalized autocorrelation coefficient r1 = 0.5, show thatthe optimum step size is Δopt ≈ 0.725σx. What is the corresponding SNRDM?

Reference

1. Jayant, N.S. and P. Noll, Digital Coding of Waveforms, Prentice Hall, EnglewoodCliffs, NJ, 1984.

Bibliography

1. Gersho, A. and R.M. Gray, Vector Quantization and Signal Compression, KluwerAcademic Publishers, Boston, 1991.

2. ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of Voice Frequencies,International Telecommunication Union, Geneva, 1993.

3. ITU-T Recommendation G.726, Adaptive Differential Pulse Code Modula-tion (ADPCM) of Voice Frequencies, International Telecommunication Union,Geneva, 1990.

Page 151: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

6Linear Prediction

6.1 Introduction

Linear prediction is a very powerful and widely used technique in the fieldof signal processing. It is the basis of many of the speech coding algorithmsthat we will be presenting. Linear prediction assumes that we can model theprediction of a signal as a linear combination of past samples of the signal.The error in the prediction is known as a prediction error, residual, or inno-vation. It signifies the new information contained in the signal. In that sense,linear prediction can be seen as a method that removes the old and redundantinformation from a signal.

There are two types of linear prediction: forward linear prediction (FLP) andbackward linear prediction (BLP) [1,2]. There are relationships between FLPand BLP especially for a stationary input signal. The solution of the problemof finding the filter weights to minimize the forward or backward prediction-error powers (PEPs) leads to the Levinson–Durbin (L-D) algorithm. It alsoforms the basis of the lattice filter structure. The coefficients of the latticefilter structure are known as the partial correlation (PARCOR) (or reflection)coefficients which are used extensively in linear predictive analysis of speech.The reflection coefficients are transformed to other coefficients such as linespectrum frequency (LSF) pairs for use in speech coding algorithms.

6.1.1 Linear Prediction Theory and Wiener Filters

Linear prediction is a very important concept used in many different areas ofsignal processing. Linear estimation is related to linear prediction. First wepresent the classical Wiener filter, which is used for optimum linear estimation(or prediction) of a desired signal sequence d(n) from a related input signalsequence x(n) as shown in Figure 6.1.

Assuming a linear transversal (FIR) filter for the Wiener filter, the estimatey(n) can be written as y(n) = x(n | Xn−1) for FLP and y(n) = x(n − M | Xn)

for BLP.We also assume that we know the statistical parameters of the input signal

and the desired response. The Wiener filter solves the problem of designing

135

Page 152: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

136 Principles of Speech Coding

Estimationerror e(n)

Desiredresponse

d(n)Estimate

y(n) Inputx(n) Wiener

filter +

FIGURE 6.1 The Wiener filter for the estimation of the desired response d(n).

a linear filter with the noisy data as input and the requirement of minimizingthe effect of the noise at the filter output according to some statistical criterion.An error signal is generated by subtracting the desired signal from the outputsignal of the linear filter.

e(n) = d(n) − y(n). (6.1)

The statistical criterion often used is the MSE

ξ = E[|e(n)|2]. (6.2)

A desirable choice of this performance criterion (or cost function) must lead totractable mathematics and lead to a single optimum solution. For the choicesof the MSE criterion and FIR filter, we obtain a performance function that isquadratic, which means the optimum point is single.

Let W = [w0, w1, w2, . . . , wN−2, wN−1]T be the coefficients (or weights) of theweight vector of the Wiener filter. For a linear transversal (FIR) filter as theWiener filter

W(n) = [w0(n), w1(n), w2(n), . . . , wN−2(n), wN−1(n)]T.

Let X(n) = [x0(n), x1(n), x2(n), . . . , xN−2(n), xN−1(n)]T, which is a set of Ninputs. Alternatively for an FIR filter, we can define the input vector as aset of N delayed inputs.

X(n) = [x(n), x(n − 1), x(n − 2), x(n − 3), . . . , x(n − N + 2), x(n − N + 1)]T.

The output (or estimate) is

y(n) = WT(n)X(n). (6.3)

Therefore, the performance function is

ξ = E[|e(n)|2] = E[(d(n) − WHX(n))(d(n) − XH(n)W)]= E[(d2(n)] − WHE[X(n)d(n)] − E[XH(n)d(n)]W + WHE[X(n)XH(n)]W)].

(6.4)

Page 153: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 137

We seek an optimum weight vector Wopt that minimizes the performancefunction ξ.

Define the N×1 cross-correlation vector

P = E[X(n)d(n)]= [p0, p1, p2, . . . , pN−2, pN−1]T.

When d(n) = x(n), then

P = E[X(n)x(n)]= [r0,0, r1,0, r2,0, . . . , rN−2,0, rN−1,0]T. (6.5)

The N × N autocorrelation matrix,

R = E[X(n)XH(n)]

=

⎡⎢⎢⎢⎣

r00 r01 · · · r0,N−1r10 r11 · · · r1,N−1...

.... . .

...rN−1,0 rN−1,1 · · · rN−1,N−1

⎤⎥⎥⎥⎦ , (6.6)

where ri,j is the autocorrelation function given by ri,j = E[x(n − i)(n − j)].For a stationary discrete-time input signal and a lag of k, r(k) = E[x(n)x∗(n −

k)] and r∗(k) = E[x∗(n)x(n − k)] are the autocorrelation functions. Therefore, if

X(n) = [x(n), x(n − 1), x(n − 2), x(n − 3), . . . , x(n − N + 2), x(n − N + 1)]T,

then

R = E[X(n)XH(n)]

=

⎡⎢⎢⎢⎣

r(0) r(1) · · · r(N − 1)

r∗(1) r(0) · · · r(N − 2)...

.... . .

...r∗(N − 1) r∗(N − 2) · · · r(0)

⎤⎥⎥⎥⎦ , (6.7)

and if d(n) = x(n), then

P = E[X(n)x(n)]= [r(0), r(1), . . . , r(N − 3), r(N − 2), r(N − 1)]T. (6.8)

The matrix R has some very interesting properties.

Page 154: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

138 Principles of Speech Coding

6.2 Properties of the Autocorrelation Matrix, R

1. If the input x(n) is a stationary, discrete-time stochastic process, thenR is Hermitian. This means that RH = R. For such a stationary input,this means that the autocorrelation function r(k) = r∗(−k), wherer(k) = E[x(n)x∗(n − k)] and r(−k) = E[x(n)x∗(n + k)].

2. If the input x(n) is a stationary, discrete-time stochastic process, thenR is Toeplitz. This means that the diagonal elements of R are the same.

3. If the input x(n) is a stationary, discrete-time stochastic process, thenR is nonnegative definite and, most likely, will be positive definite.This means that all the eigenvalues of the R matrix are greater than orequal to zero.

4. If the input x(n) is a wide-sense stationary, discrete-time stochasticprocess, then R is nonsingular. This also implies that the inverse of Rexists.

5. If the input x(n) is a stationary, discrete-time stochastic process,then E[XR(n)XRH(n)] = RT, where X(n) = [x(n), x(n − 1), . . . , x(n −N + 2), x(n − N + 1)]T is the input data vector and XR(n) = [x(n −N + 1), x(n − N + 2), . . . , x(n − 1), x(n)]T is the reversed, transposedinput data vector.

RT = E[XR(n)XRH(n)]

=

⎡⎢⎢⎢⎣

r(0) r∗(1) · · · r∗(N − 1)

r(1) r(0) · · · r∗(N − 2)...

.... . .

...r(N − 1) r(N − 2) · · · r(0)

⎤⎥⎥⎥⎦ .

This is a result of the stationary property of the input. Note also thatbecause the R matrix is Hermitian, the R matrix of the reversed inputvector, XR(n), is the same as that of the original input vector, X(n).

6. If the input x(n) is a stationary, discrete-time stochastic process, thenR can be partitioned as follows:

RN+1 =[

r(0) rH

r RN

]

or as

RN+1 =[

RN rR∗

rRT r(0)

].

Page 155: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 139

Here, r(0) is the autocorrelation function of the input x(n) for zero lag:

rH = E[x(n)XH(n)

]= [r(0), r(1), . . . , r(N − 2), r(N − 1)]

and

rRT = E[x(n)XRH(n)]= [r(−N), r(−N + 1), . . . , r(−2), r(−1)].

There are other properties but these are the ones that are importantfor our purposes here.

6.3 Forward Linear Prediction

Forward linear prediction (FLP) uses the previous M samples x(n − 1), x(n −2), . . . , x(n − M + 1), x(n − M) to make a prediction of the current sample x(n).This is a one-step FLP, measured with respect to time n−1.

Aforward predictor consists of a linear transversal filter with M tap weightswf,1, wf,2, . . . , wf,M and tap inputs x(n − 1), x(n − 2), . . . , x(n − M + 1), x(n −M), respectively (Figure 6.2). The predicted value is x(n | Xn−1), where Xn−1 isthe vector-space spanned by the M-dimensional input vector x(n − 1), x(n −2), . . . , x(n − M + 1), x(n − M). We assume that these tap inputs are from awide-sense stationary stochastic process of zero mean.

The combination of the linear predictor and the error generation is known asthe prediction-error filter. See Figures 6.3 and 6.4 for the forward prediction-error filter (FPEF).

Recall that an autoregressive (AR) process can be generated by passingwhite noise through an all-pole filter as mentioned in Chapter 2. An AR pro-cess can be analyzed by passing the AR signal through a linear predictor

x(n) x(n – 1)

w *f, 1 w *f, 2 w *f, M–1 w *f, M

x(n – 2) x(n – M + 1) x(n – M)

x(n | Xn–1)

z–1 z–1 z–1

∑ ∑ ∑

FIGURE 6.2 The FIR-based forward linear predictor.

Page 156: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

140 Principles of Speech Coding

f M(n)

+ –

x(n)

z–1 Predictor

of order M

x(n – 1)x(n |Xn – 1)

d(n)

Σ

FIGURE 6.3 The predictor-error filter formed from a forward linear predictor.

which is a FIR (or all-zero) filter. The output of the AR process analyzer is theprediction error and this signal will be like white noise.

From Figure 6.2, the predicted value is

x(n | Xn−1) =M∑

k=1

w∗f,kx(n − k) (6.9)

and the desired response is given by

d(n) = x(n). (6.10)

The forward prediction error for an Mth order filter, fM(n), is defined as

fM(n) = x(n) − x(n | Xn−1). (6.11)

Substituting Equation 6.9 into Equation 6.11, we may express the forwardprediction error as

fM(n) = x(n) −M∑

k=1

w∗f,kx(n − k). (6.12)

x(n)

1

z–1 z–1 z–1x(n – 1) x(n – 2) x(n – M + 1) x(n – M)

–w *f, 1 –w *f, 2

f M(n)

–w *f, M–1 –w *f, M

∑ ∑∑ ∑

FIGURE 6.4 Details of the predictor-error filter formed from a forward linear predictor.

Page 157: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 141

Let aM,k , k = 0, 1, . . . , M, denote the tap weights of a new transversal filter,which are related to the tap weights of the forward predictor as follows:

aM,k ={

1−wf,k .

(6.13)

Then we may rewrite Equation 6.12 into a single summation

fM(n) =M∑

k=0

a∗M,kx(n − k). (6.14)

This input–output relation represents FPEF. This is shown in Figure 6.3.The minimum prediction error power (PEP) is

PM = r(0) − rHwf for all n. (6.15)

If the input x(n) has zero mean, then the forward prediction error fM(n) willhave zero mean and PM will also equal the variance of the forward predictionerror.

We can also write

PM = E[∣∣ fM(n)

∣∣2] for all n, (6.16)

where PM is the expectation of the square of the FLP error of order M.∗The M×1 optimum weight vector of the forward predictor is

wf = [wf,1, wf,2 . . . , wf,M

]T . (6.17)

Given the M × M correlation matrix of the input vector X(n) = [x(n − 1),x(n − 2), . . . , x(n − M + 1), x(n − M)]H, the M×1 cross-correlation vectorbetween these tap inputs and the desired response d(n) = x(n) and thevariance of x(n), we can solve the Wiener–Hopf equations for the weightvector wf .

In the case of linear forward prediction,

1. Define the M×1 tap-input vector as

X(n) = [x(n − 1), x(n − 2), . . . , x(n − M + 1), x(n − M)]H . (6.18)

∗ It is also possible to define PM = ∑Mi=1 | fi(n)|2 for all n, as the sum of the squares of all the

forward prediction errors up to order M.

Page 158: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

142 Principles of Speech Coding

Hence, the correlation matrix of the tap inputs equals

R = E[X(n)XH(n)

]

=

⎛⎜⎜⎜⎝

r(0) r(1) · · · r(M − 1)

r∗(1) r(0) · · · r(M − 2)...

.... . .

...r∗(M − 1) r∗(M − 2) · · · r(0)

⎞⎟⎟⎟⎠ . (6.19)

2. The cross-correlation vector between the input vector X(n) = [x(n −1), x(n − 2), . . . , x(n − M + 1), x(n − M)]H and the desired responsed(n) = x(n) is

P = E[X(n)x∗(n)

]

=

⎛⎜⎜⎜⎝

r∗(1)

r∗(2)...

r∗(M)

⎞⎟⎟⎟⎠ =

⎛⎜⎜⎜⎝

r(−1)

r(−2)...

r(−M)

⎞⎟⎟⎟⎠ . (6.20)

3. The variance of x(n) equals E [x(n)x∗(n)] = r(0), since x(n) has zeromean.

6.4 Relation between Linear Prediction and ARModeling

The equation that defines the forward prediction error and the differenceequation defining the AR model have the same mathematical form. Therefore,the Wiener–Hopf equations for linear prediction are similar to the Yule–Walker equations for the AR models discussed in Chapter 2. However, whenthe process is not AR, the use of a predictor provides an approximation to theprocess.

6.5 Augmented Wiener–Hopf Equations for ForwardPrediction

We can combine both Wiener–Hopf equations and the equation of the forwardPEP PM into a single matrix relation(

r(0) rH

r R

)(1

−wf

)=

(PM0

), (6.21)

Page 159: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 143

where 0 is the M×1 null vector and R is the M × M input correlation matrix.This is the augmented Wiener–Hopf equations of a FPEF of order M. The(M + 1) × 1 coefficient vector equals the FPEF vector,

aM =(

1−wf

). (6.22)

We may also express this matrix relation as the following system of (M + 1)

simultaneous equations:

M∑i=0

aM,lr(l − i) ={

PM, i = 0,0, i = 1, 2, . . . , M.

(6.23)

6.6 Backward Linear Prediction

Backward linear prediction (BLP) uses the current sample and some lessprevious samples x(n), x(n − 1), . . . , x(n − M + 2), x(n − M + 1) to make aprediction of the most previous sample x(n − M). This is a one-step BLP,measured with respect to time n − M + 1.

A backward predictor consists of a linear transversal filter with M tapweights wb,1, wb,2, . . . , wb,M and tap inputs x(n), x(n − 1), . . . , x(n − M + 2),x(n − M + 1), respectively (Figures 6.5 and 6.6). The predicted value isx(n | Xn), where Xn is the vector-space spanned by the M-dimensionalinput vector x(n), x(n − 1), . . . , x(n − M + 2), x(n − M + 1). We assume thatthese input samples are from a wide-sense stationary stochastic process ofzero mean.

The combination of the linear predictor and the error generation is knownas the prediction-error filter. See Figures 6.6 through 6.8 for the backwardprediction-error filter (BPEF).

From Figure 6.6, the predicted value is

x(n − M | Xn) =M−1∑k=0

w∗b,kx(n − k) (6.24)

x(n) bM(n)Predictor

of order M

x(n – M + 1) x(n – M)

z–1 ∑

x(n – M|Xn)

– +

FIGURE 6.5 The predictor-error filter formed from a backward linear predictor.

Page 160: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

144 Principles of Speech Coding

x(n) x(n – 1)

w*b,1 w*b,2 w*b,M

x(n – M)x(n – M + 1)

x(n – M|Xn)

z–1

∑ ∑

z–1

FIGURE 6.6 The FIR-based backward linear predictor.

and the desired response is given by

d(n) = x(n − M). (6.25)

The backward prediction error for an Mth order filter, bM(n) is defined as

bM(n) = x(n − M) − x(n − M | Xn), (6.26)

where Xn is the M-dimensional space spanned by x(n), x(n − 1), . . . , x(n −M + 2), x(n − M + 1), the input samples used in making the backwardprediction.

x(n)

bM(n)

x(n – 1)

–w*b, 1 –w*b, 2 –w*b, M

x(n – M)x(n – M + 1)z–1

∑ ∑ ∑

z–1

1

FIGURE 6.7 Details of the predictor-error filter formed from a backward linear predictor.

x(n) z–1 z–1

x(n – 1) x(n – M + 1) x(n – M)

a*M,M a*M,M–1 a*M,1 a*M,0

bM(n)∑ ∑ ∑

FIGURE 6.8 Predictor-error filter formed from a backward linear predictor.

Page 161: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 145

The minimum PEP is

PM = r(0) − rHwb for all n. (6.27)

If the input x(n) has zero mean, then the backward prediction error bM(n)

will have zero mean and PM will also equal the variance of the backwardprediction error.

We can also write

PM = E[|bM(n)|2

]for all n, (6.28)

where PM is the expectation of the square of the BLP error of order M.∗Let wb denote the M×1 optimum tap-weight vector of the backward

predictor in Figure 5.6. In the expanded form,

wb = [wb,1, wb,2, . . . , wb,M]T. (6.29)

Given the M × M correlation matrix of the input vector X(n) =[x(n), x(n − 1), . . . , x(n − M + 2), x(n − M + 1)]H, the M×1 cross-correlationvector between these tap inputs and the desired response d(n) = x(n − M)

and the variance of x(n − M), we can solve the Wiener–Hopf equations forthe weight vector wb.

In the case of backward linear prediction,

1. Define the M×1 tap-input vector as

X(n) = [x(n), x(n − 1), . . . , x(n − M + 2), x(n − M + 1)]H. (6.30)

Hence, the correlation matrix of the tap inputs equals

R = E[X(n)XH(n)]

=

⎛⎜⎜⎜⎜⎝

r(0) r(1) · · · r(M − 1)

r∗(1) r(0) · · · r(M − 2)

......

. . ....

r∗(M − 1) r∗(M − 2) · · · r(0)

⎞⎟⎟⎟⎟⎠ . (6.31)

2. The cross-correlation vector between the input vector X(n) = [x(n),x(n − 1), . . . , x(n − M + 2), x(n − M + 1)]H and the desired response

∗ It is also possible to define PM = ∑Mi=1 |bi(n)|2 for all n, as the sum of the squares of all the

backward prediction errors up to order M.

Page 162: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

146 Principles of Speech Coding

d(n) = x(n − M) is

P = E[X(n)x∗(n − M)]

=

⎛⎜⎜⎜⎝

r∗(M)

r∗(M − 1)...

r∗(1)

⎞⎟⎟⎟⎠ =

⎛⎜⎜⎜⎝

r(−M)

r(−M + 1)...

r(−1)

⎞⎟⎟⎟⎠ . (6.32)

3. The variance of x(n − M) equals E[x(n − M)x∗(n − M)] = r(0), sincex(n − M) has zero mean.

6.7 Backward Prediction-Error Filter

The input sample vector to the BPEF is

X(n) = [x(n), x(n − 1), . . . , x(n − M + 2), x(n − M + 1)]H.

Define the backward prediction error bM(n) as

bM(n) = x(n − M) −M∑

k=1

w∗bkx(n − k + 1). (6.33)

Then we can write the weight vector of the BPEF in terms of the correspond-ing backward predictor as follows:

aM,k ={

−wb,k+1, k = 0, 1, . . . , M − 1,1, k = M.

(6.34)

Therefore,

bM(n) =M∑

k=0

aM,kx(n − k). (6.35)

The two forms of a prediction-error filter for stationary inputs are uniquelyrelated to each other. In particular, by reversing the input sequence andcomplex-conjugating a forward predictor-error filter, we get the correspond-ing BPEF. Note that in both filters the respective input vectors have the samevalues.

The relationship between the forward predictor coefficients and the back-ward predictor coefficients is

w∗b,M−k+1 = wf,k , k = 1, 2, . . . , M (6.36)

Page 163: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 147

or, equivalently,

wb,k = w∗f,M−k+1, k = 1, 2, . . . , M. (6.37)

Therefore the weight vector of the BPEF is

aM,k = a∗M,M−k , k = 0, 1, . . . , M. (6.38)

Then we can write the equivalent form,

bM(n) =M∑

k=0

aM,M−kx(n − k). (6.39)

6.8 Augmented Wiener–Hopf Equations for BackwardPrediction

Similar to the forward prediction case, the augmented Wiener–Hopf equationfor the BLP is (

R rB∗

rBT r(0)

)(−wb1

)=

(0

PM

), (6.40)

where 0 is the M × 1 null vector. The M × M matrix R is the correlation matrixof the M×1 tap-input vector x(n). This is the augmented Wiener–Hopf equa-tion of a BPEF of order M. The (M + 1) × 1 coefficient vector equals the FPEFvector,

aM =(−wb

1

).

We may also express the matrix relation of Equation 6.40 as a system of(M + 1) simultaneous equations:

M∑i=0

a∗M,M−lr(l − i) =

{0, i = 0, . . . , M − 1,PM, i = M.

(6.41)

6.9 Relation between Backward and Forward Predictors

Assume that we have a wide-sense stationary input signal. The Wiener–Hopfequation for BLP can be written as follows by arranging the elements back-ward, and then complex-conjugating. RT is the transpose of the correlation

Page 164: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

148 Principles of Speech Coding

matrix R and wBb is the backward rearrangement of the tap-weight vector wb.

Therefore,

RTwBb = r∗ (6.42)

Take complex-conjugates of both sides to obtain

RHwB∗b = r∗.

Since the correlation matrix R is Hermitian (i.e., RH = R), we may reformulatethe Wiener–Hopf equations for backward prediction as

RwB∗b = r. (6.43)

Then we can state the relationship between the tap-weight vectors of a back-ward predictor and the corresponding forward predictor as

wB∗b = wf . (6.44)

This equation means that we may convert a backward predictor into a forwardpredictor by reversing the order of the tap weights and taking the complexconjugates of each of them.

We note that

rBTb wb = rTwB

b.

Therefore, we can write

PM = r(0) − rTwBb. (6.45)

Taking the complex conjugate on both sides, we obtain

PM = r(0) − rHwB∗b . (6.46)

Therefore, comparing this result with that of FLP, we find that the backwardPEP has exactly the same value as the forward PEP for a wide-sense stationaryprocess input.

Example 6.1

For the case of a prediction-error filter of order M = 1, Equation 6.21 yields a pairof simultaneous equations described by

(r(0) r(1)

r∗(1) r(0)

)(a1,0a1,1

)=

(P10

).

Page 165: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 149

Solving for a1,0 and a1,1, we obtain

a1,0 = P1Δr

r(0)

and

a1,1 = − P1Δr

r∗(1),

where

Δr =∣∣∣∣∣(

r(0) r(1)

r∗(1) r(0)

)∣∣∣∣∣= r2(0) − |r(1)|2

is the determinant of the correlation matrix. But a1,0 equals 1. Therefore,

P1 = Δr

r(0)

and

a1,1 = − r∗(1)

r(0).

Consider next the case of a prediction-error filter of order M = 2. Equation 6.21yields a system of three simultaneous equations

⎛⎜⎝

r(0) r(1) r(2)

r∗(1) r(0) r(1)

r∗(2) r∗(1) r(0)

⎞⎟⎠

⎛⎜⎝

a2,0

a2,1

a2,2

⎞⎟⎠ =

⎛⎜⎝

P2

0

0

⎞⎟⎠ .

Solving for a2,0, a2,1, and a2,2, we obtain

a2,0 = P2Δr

[r2(0) − |r(1)|2

],

a2,1 = − P2Δr

[r∗(1)r(0) − r(1)r∗(2)

],

a2,2 = P2Δr

[(r∗(1))2 − r(0)r∗(2)

],

which is the determinant of the correlation matrix. The coefficient a2,0 equals 1;accordingly, we may express the PEP as

P2 = Δr

r2(0) − |r(1)|2

Page 166: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

150 Principles of Speech Coding

and the prediction-error filter coefficients as

a2,1 = r∗(1)r(0) − r(1)r∗(2)

r2(0) − |r(1)|2 ,

a2,2 = (r∗(1))2 − r(0)r∗(2)

r2(0) − |r(1)|2 .

6.10 Levinson–Durbin Recursion

The solution to the Normal equations (or the Wiener–Hopf equations) involvesinversion of the autocorrelation matrix. The Normal equations can be solvedby using some computationally efficient methods. One example of this isthe Levinson–Durbin (L-D) algorithm, named so because the algorithm wasderived by Levinson (1947) and later independently by Durbin (1960).

However, the L-D algorithm avoids matrix inversion and uses the proper-ties of the matrix to invert the matrix efficiently using a few iterations. In theprocess, LPC coefficients are generated.

Consider the augmented Normal equations

⎡⎢⎢⎢⎢⎢⎣

r(0) r(1) · · · r(N)

r∗(1) r(0) · · · r(N − 1)...

.... . .

...r∗(N − 1) r∗(N − 2) · · · r(1)

r∗(N) r∗(N − 1) · · · r(0)

⎤⎥⎥⎥⎥⎥⎦

⎡⎢⎢⎢⎢⎢⎣

1a1...

aN−1aN

⎤⎥⎥⎥⎥⎥⎦ =

⎡⎢⎢⎢⎣

J0...0

⎤⎥⎥⎥⎦ . (6.47)

This matrix, R has some very interesting properties. It is Toeplitz. The aug-mented R matrix can be divided into a smaller matrix, two vectors and ascalar.

6.10.1 L-D Algorithm

This algorithm is a direct, recursive method for computing the prediction-error filter coefficients and the PEP by solving the augmented Normalequations.

The algorithm exploits the Toeplitz structure of the autocorrelation matrixR of the tap inputs of the filter.

The algorithm is iterative (recursive) in that it uses the solution of theaugmented Normal equation for a prediction-error filter of order (m − 1) tocompute the corresponding solution for a prediction-error filter of order (m).Usually M is the final order.

The advantage of using the L-D algorithm (sometimes just called theLevinson algorithm) for solving the Normal equation for each order m

Page 167: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 151

(compared with using other methods like the Gaussian elimination) is com-putational efficiency in the number of multiplications, divisions, additions,and memory usage.

Recall the augmented Normal equations derived for FLP and for BLP.

6.10.2 Forward Linear Prediction

aB∗m =

(0

aB∗m−1

)+ Γm

(am−1

0

).

(r(0) rH

r R

)(1

−wf,opt

)=

(PM0

).

aM =(

1−wf,opt

).

6.10.3 Backward Linear Prediction

(R rR∗

rRT r(0)

)(−wb,opt

1

)=

(0

PM

).

aM =(

−wb,opt

1

).

For stationary signals,

wR∗b,opt = wf,opt.

∴ (m + 1) × 1 Vector am = tap-weight vector of a FPEF of order m,

aR∗m = tap-weight vector of a BPEF of order m,

Superscript R ⇒ reverse or backward arrangement, and superscript ∗ ⇒complex conjugate.

The L-D recursion can be stated in any one of the following two equivalentways.

1. The tap-weight vector of a FPEF may be order updated as follows

am =(

am−1

0

)+ Γm

(0

aR∗m−1

), (6.48)

where Γm is a constant that satisfies certain conditions.

Page 168: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

152 Principles of Speech Coding

In scalar notation,

am,k = am−1,k + Γma∗m−1,m−k , k = 0, 1, 2, . . . , m,

where am−1,k is the kth tap weight of a FPEF of order (m − 1), am,k isthe kth tap weight of a FPEP of order m, and a∗

m−1,m−k is the kth tapweight of a BPEF of order (m − 1).

Note that

am−1,0 = 1,

am−1,m = 0.

equivalently.2. The tap-weight vector of a BPEF may be order updated as follows

aB∗m =

(0

aB∗m−1

)+ Γm

(am−1

0

), (6.49)

where Γm is a constant that satisfies certain conditions.In scalar notation,

a∗m,m−k = a∗

m−1,m−k + Γma∗m−1,k , k = 0, 1, 2, . . . , m,

where a∗m,m−k is the kth tap weight of a BPEF of order m.

Other quantities are previously defined.We will show the validity of the L-D recursion by establishing the

conditions on Γm for both statements of the algorithm to be true. Todo this, we proceed in four stages.

Stage 1

Premultiply Equation 6.48 by Rm+1, where

Rm+1 =(

r(0) rHm

rm Rm

)=

(rmaB∗

m−1

RmaB∗m−1

)= (m + 1)x(m + 1) = E[X(n)XH(n)],

Page 169: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Linear Prediction 153

where

X(n) =

⎛⎜⎜⎜⎝

x(n)

x(n − 1)...

x(n − m)

⎞⎟⎟⎟⎠

↑m + 1,↓

Rm+1am =(

Pm0m

),

Rm+1

(am−1

0

)+ ΓmRm+1

(0

aR∗m−1

)=

(Pm

0m

).

Stage 2

Recall that

Rm+1 =(

Rm rR∗m

rRTm r(0)

),

where

rR∗m = E

[X(n)X∗(n − m)

],

X(n) =

⎛⎜⎜⎜⎝

x(n)

x(n − 1)...

x(n − m)

⎞⎟⎟⎟⎠ ,

∴ Rm+1am = Rm+1

(am−1

0

)=

(RmrB∗

m

rBTm r(0)

)(am−1

0

)=

(Rmam−1

rRTm am−1

).

However, Rmam−1 =(

Pm−1

0m−1

), since Rm+1am =

(Pm0m

).

Pm−1 is the PEP for this filter of order (m − 1).Also define

Δm−1Δ= rRT

m am−1 = [r(−m)r(−m + 1) . . . r(−1)]

⎛⎜⎜⎜⎝

am−1,0am−1,1

...am−1,m−1

⎞⎟⎟⎟⎠

Page 170: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

154 Principles of Speech Coding

=m−1∑k=0

am−1,kr(k − m).

∴ Rm+1

(am−1

0

)=

⎛⎝Pm−1

0m−1Δm−1

⎞⎠ .

Stage 3

Recall that

    R_{m+1} = ( r(0)   r_m^H )
              ( r_m     R_m  ),

where R_m is here the m × m autocorrelation matrix of the delayed tap-input vector

    X_m(n − 1) = [x(n − 1), x(n − 2), . . . , x(n − m)]^T

and r_m = E[X_m(n − 1) x*(n)].

Therefore,

    R_{m+1} (       0      )  =  ( r(0)   r_m^H ) (       0      )  =  ( r_m^H a_{m−1}^{B*} )
            ( a_{m−1}^{B*} )     ( r_m     R_m  ) ( a_{m−1}^{B*} )     (  R_m a_{m−1}^{B*}  ).

Note that

    r_m^H a_{m−1}^{B*} = [r(1), r(2), . . . , r(m)] [a*_{m−1,m−1}, a*_{m−1,m−2}, . . . , a*_{m−1,0}]^T
                       = Σ_{l=1}^{m} r(l) a*_{m−1,m−l} = Δ*_{m−1}.

Also,

    R_m a_{m−1}^{B*} = ( 0_{m−1} )
                       ( P_{m−1} ).

Combining the results of Stages 1–3,

    ( P_m )     ( P_{m−1} )          ( Δ*_{m−1} )
    ( 0_m )  =  ( 0_{m−1} )  +  Γ_m  ( 0_{m−1}  )
                ( Δ_{m−1} )          ( P_{m−1}  ).

This equation is a direct consequence (equivalence) of the L-D recursion.


Consider the first row of this vector equation:

    P_m = P_{m−1} + Γ_m Δ*_{m−1}.    (6.50)

Consider the last row of this vector equation:

    0 = Δ_{m−1} + Γ_m P_{m−1},    (6.51)

which gives

    Γ_m = −Δ_{m−1} / P_{m−1}.

Solving Equations 6.50 and 6.51 together yields

    P_m = P_{m−1} (1 − |Γ_m|²),

the update formula for the PEP. Note that

    0 ≤ P_m ≤ P_{m−1},   m ≥ 1,

and P_0 = r(0). Therefore, for a prediction-error filter of order M,

    P_M = P_0 ∏_{m=1}^{M} (1 − |Γ_m|²).

Interpretations of the Parameters Γ_m and Δ_{m−1}

For a prediction-error filter of order M, the parameters Γ_m, 1 ≤ m ≤ M, are called the reflection coefficients. Note that

    0 ≤ P_m ≤ P_{m−1},   m ≥ 1,

implies that |Γ_m| ≤ 1 for all m.

Stage 4

For a prediction-error filter of order m, the mth reflection coefficient is

    Γ_m = a_{m,m}    (the last tap weight of the filter).

Δ_{m−1} may be interpreted as a cross-correlation between the forward prediction error f_{m−1}(n) and the delayed backward prediction error b_{m−1}(n − 1):

    Δ_{m−1} = E[b_{m−1}(n − 1) f*_{m−1}(n)],

where f_{m−1}(n) is the output of an FPEF of order (m − 1) in response to the tap inputs

    x(n), x(n − 1), . . . , x(n − m + 1)

and b_{m−1}(n − 1) is the output of a BPEF of order (m − 1) in response to the tap inputs

    x(n − 1), x(n − 2), . . . , x(n − m).

To show this relationship, recall that

    Δ_{m−1} = Σ_{k=0}^{m−1} a_{m−1,k} r(k − m),     r(k − m) = E[x(n − m) x*(n − k)].

Therefore,

    Δ_{m−1} = Σ_{k=0}^{m−1} a_{m−1,k} E[x(n − m) x*(n − k)]
            = E[ x(n − m) Σ_{k=0}^{m−1} a_{m−1,k} x*(n − k) ].

Recall that by definition

    f_{m−1}(n) = Σ_{k=0}^{m−1} a*_{m−1,k} x(n − k),

so that

    Δ_{m−1} = E[x(n − m) f*_{m−1}(n)].

We also recall that

    x(n − m) = x̂(n − m | ζ_{n−1}) + b_{m−1}(n − 1),

where ζ_{n−1} is the space spanned by x(n − 1), x(n − 2), . . . , x(n − m + 1) and b_{m−1}(n − 1) is the backward prediction error produced by a predictor of order (m − 1).

The estimate x̂(n − m | ζ_{n−1}) is a linear one, that is,

    x̂(n − m | ζ_{n−1}) = Σ_{k=1}^{m−1} w*_{b,k} x(n − k).

Therefore,

    Δ_{m−1} = E[x(n − m) f*_{m−1}(n)]
            = E[b_{m−1}(n − 1) f*_{m−1}(n)] + E[ Σ_{k=1}^{m−1} w*_{b,k} x(n − k) f*_{m−1}(n) ].


However, from the principle of orthogonality,

    E[f_{m−1}(n) x*(n − k)] = 0,   1 ≤ k ≤ m − 1,

so the second term vanishes and

    Δ_{m−1} = E[b_{m−1}(n − 1) f*_{m−1}(n)].

Note that f_0(n) = b_0(n) = x(n), where x(n) is the prediction-error filter input at time n, so that

    Δ_0 = E[b_0(n − 1) f*_0(n)] = E[x(n − 1) x*(n)] = r*(1),

where r(1) is the autocorrelation of the input for a lag of 1.

Another interpretation of Γ_m follows from

    Γ_m = −Δ_{m−1} / P_{m−1}.

However, P_{m−1}, the forward PEP of order (m − 1), is given by

    P_{m−1} = E[|f_{m−1}(n)|²].

Therefore,

    Γ_m = − E[b_{m−1}(n − 1) f*_{m−1}(n)] / E[|f_{m−1}(n)|²].

Because this ratio is a normalized partial correlation between the forward and delayed backward prediction errors, the Γ_m are sometimes called PARCOR coefficients:

    reflection coefficients = negative of the PARCOR coefficients.

Example 6.2

Application of the L-D recursion

There are two ways of applying the L-D recursion to compute a_{M,k}, k = 0, 1, . . . , M, and P_M (the PEF coefficients and the PEP for a final prediction order M).

One stage of the resulting lattice filter is shown in Figure 6.9. The two equations describing a stage are

    f_m(n) = f_{m−1}(n) + Γ_m b_{m−1}(n − 1)

and

    b_m(n) = b_{m−1}(n − 1) + Γ*_m f_{m−1}(n).


FIGURE 6.9 One stage of the lattice filter.
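As an illustration of how these two stage equations are used, the following Python sketch (our own helper; real-valued samples and reflection coefficients are assumed, so the conjugates drop out) runs an input sequence through M lattice stages of the form shown in Figure 6.9:

    import numpy as np

    def lattice_analysis(x, gammas):
        # Run x(n) through len(gammas) lattice stages (Figure 6.9), real-valued case.
        # Returns the final forward and backward prediction errors f_M(n), b_M(n).
        f = np.asarray(x, dtype=float).copy()             # f_0(n) = x(n)
        b = f.copy()                                      # b_0(n) = x(n)
        for g in gammas:
            b_delayed = np.concatenate(([0.0], b[:-1]))   # b_{m-1}(n-1)
            f_new = f + g * b_delayed          # f_m(n) = f_{m-1}(n) + Gamma_m b_{m-1}(n-1)
            b_new = b_delayed + g * f          # b_m(n) = b_{m-1}(n-1) + Gamma_m f_{m-1}(n)
            f, b = f_new, b_new
        return f, b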

Example 6.3

L-D Algorithm Method-A

Given the input data u(n), n = 1, 2, . . . , N (or the autocorrelation estimates r(k), k = 0, 1, . . . , M, obtained from it), we can compute the reflection coefficients Γ_1, Γ_2, . . . , Γ_M, the tap-weight coefficients a_{m,k}, and P_M (the PEP for final order M).

SOLUTION

1. Estimate the autocorrelation function for different lags from the input data, for example

       r̂(k) = (1/N) Σ_{n=1+k}^{N} u(n) u*(n − k),   k = 0, 1, . . . , M,

   where N is the total length of the input data and N ≫ M.

2. Then, for m = 1, 2, . . . , M, compute

       Δ_{m−1} = r_m^{BT} a_{m−1} = Σ_{k=0}^{m−1} a_{m−1,k} r(k − m),
       Γ_m = −Δ_{m−1} / P_{m−1},
       a_{m,k} = a_{m−1,k} + Γ_m a*_{m−1,m−k},   k = 0, 1, . . . , m,
       P_m = P_{m−1} (1 − |Γ_m|²),

   until m = M. Initially,

       P_0 = r(0)   and   Δ_0 = r*(1).

Note that

    a_{m,0} = 1 for all m,   and   a_{m,k} = 0 for all k > m.

The resultant coefficients and PEP are known as the Yule–Walker estimates.
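A compact Python/NumPy sketch of Method-A may make the recursion concrete. It is only an illustration: the function name is ours, the signal is assumed real valued, and the autocorrelation values r(0), . . . , r(M) are assumed to have been estimated as in step 1 above.

    import numpy as np

    def levinson_durbin(r, M):
        # Method-A sketch: r = [r(0), r(1), ..., r(M)] (real-valued estimates).
        # Returns the order-M PEF tap weights a_{M,k}, the reflection
        # coefficients Gamma_1..Gamma_M, and the PEPs P_0..P_M.
        a = np.zeros(M + 1); a[0] = 1.0            # a_{0,0} = 1
        P = np.zeros(M + 1); P[0] = r[0]           # P_0 = r(0)
        gammas = np.zeros(M)
        for m in range(1, M + 1):
            # Delta_{m-1} = sum_{k=0}^{m-1} a_{m-1,k} r(k-m); real data: r(k-m) = r(m-k)
            delta = np.dot(a[:m], r[m:0:-1])
            g = -delta / P[m - 1]                  # Gamma_m
            a[:m + 1] = a[:m + 1] + g * a[m::-1]   # a_{m,k} = a_{m-1,k} + Gamma_m a_{m-1,m-k}
            P[m] = P[m - 1] * (1.0 - g ** 2)
            gammas[m - 1] = g
        return a, gammas, P

Note that A(z) = Σ_k a_{M,k} z^{−k} with a_{M,0} = 1, so the prediction coefficients themselves are the negatives of a_{M,k} for k ≥ 1.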

L-D Algorithm Method-B

Given Γ_1, Γ_2, . . . , Γ_M and r(0), we can compute a_{m,k}, k = 0, 1, . . . , m (the tap-weight coefficients up to the final order M), and P_M (the PEP for final order M).

Here we need to use

    a_{m,k} = a_{m−1,k} + Γ_m a*_{m−1,m−k},   k = 0, 1, . . . , m,
    P_m = P_{m−1} (1 − |Γ_m|²),

starting at m = 0 (the initialization) and stopping at m = M.

Example 6.4

Given Γ_1, Γ_2, . . . , Γ_M and P_0, use Method-B to determine a_{3,1}, a_{3,2}, a_{3,3}, and P_3 for a prediction-error filter of final order 3.

SOLUTION

Apply the L-D recursion

    a_{m,k} = a_{m−1,k} + Γ_m a*_{m−1,m−k},   k = 0, 1, . . . , m,

and

    P_m = P_{m−1} (1 − |Γ_m|²),   with P_0 = r(0).

PEF of order 0 (m = 0):

    a_{0,0} = 1,
    P_0 = r(0).

PEF of order 1 (m = 1):

    a_{1,0} = 1,
    a_{1,1} = Γ_1,
    P_1 = P_0 (1 − |Γ_1|²).

PEF of order 2 (m = 2):

    a_{2,0} = 1,
    a_{2,1} = Γ_1 + Γ_2 Γ*_1,
    a_{2,2} = Γ_2,
    P_2 = P_1 (1 − |Γ_2|²),

where P_1 is defined above.

PEF of order 3 (m = 3):

    a_{3,0} = 1,
    a_{3,1} = a_{2,1} + Γ_3 a*_{2,2} = a_{2,1} + Γ_3 Γ*_2,
    a_{3,2} = a_{2,2} + Γ_3 a*_{2,1} = Γ_2 + Γ_3 a*_{2,1},
    a_{3,3} = Γ_3,
    P_3 = P_2 (1 − |Γ_3|²),

where a_{2,1} and P_2 are defined above.

Note that the L-D algorithm yields all the tap weights and prediction-error powers for each order m = 1, 2, 3 up to the final order 3 (that is, it includes the intermediate orders).
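Method-B can be sketched in the same style. Again this is only an illustration (the function name is ours and real-valued reflection coefficients are assumed); for M = 3 it reproduces the order-by-order expressions of Example 6.4.

    import numpy as np

    def reflection_to_lpc(gammas, P0):
        # Method-B sketch: given Gamma_1..Gamma_M (real) and P_0 = r(0),
        # return the order-M PEF tap weights a_{M,k} and P_M.
        M = len(gammas)
        a = np.zeros(M + 1); a[0] = 1.0
        P = float(P0)
        for m, g in enumerate(gammas, start=1):
            a[:m + 1] = a[:m + 1] + g * a[m::-1]   # a_{m,k} = a_{m-1,k} + Gamma_m a_{m-1,m-k}
            P *= (1.0 - g ** 2)                    # P_m = P_{m-1}(1 - Gamma_m^2)
        return a, P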

6.10.4 Inverse L-D Algorithm

The L-D algorithm solves the mapping

    {Γ_1, Γ_2, . . . , Γ_M, r(0)} → {a_{M,1}, a_{M,2}, . . . , a_{M,M}, P_M}.

Of course, a_{M,0} = 1.

The inverse L-D algorithm solves the inverse mapping

    {a_{M,1}, a_{M,2}, . . . , a_{M,M}} → {Γ_1, Γ_2, . . . , Γ_M}.

To determine the inverse recursion, recall that for the FPEF

    a_{m,k} = a_{m−1,k} + Γ_m a*_{m−1,m−k},   k = 0, 1, . . . , m,

and for the BPEF

    a*_{m,m−k} = a*_{m−1,m−k} + Γ*_m a_{m−1,k},   k = 0, 1, . . . , m.

Combining these two equations,

    ( a_{m,k}    )     ( 1     Γ_m ) ( a_{m−1,k}    )
    ( a*_{m,m−k} )  =  ( Γ*_m   1  ) ( a*_{m−1,m−k} ),   k = 0, 1, . . . , m,

where the order m = 1, 2, . . . , M. Solving this equation for a_{m−1,k} yields

    a_{m−1,k} = (a_{m,k} − a_{m,m} a*_{m,m−k}) / (1 − |a_{m,m}|²),   k = 0, 1, . . . , m,

using the fact that Γ_m = a_{m,m}.

Start with {a_{M,k}} and use the last equation (with m = M, M − 1, M − 2, . . . , 2) to compute the tap weights of the corresponding prediction-error filters of order M − 1, M − 2, . . . , 1, respectively.

Then use Γ_m = a_{m,m}, m = M, M − 1, M − 2, . . . , 1, to determine Γ_M, Γ_{M−1}, . . . , Γ_1.

Example 6.5

Given a_{3,1}, a_{3,2}, a_{3,3} of a PEF of order 3, determine the corresponding reflection coefficients Γ_1, Γ_2, Γ_3 using the inverse Levinson algorithm.

SOLUTION

PEF of order 2 (corresponding to m = 3):

    a_{2,1} = (a_{3,1} − a_{3,3} a*_{3,2}) / (1 − |a_{3,3}|²),
    a_{2,2} = (a_{3,2} − a_{3,3} a*_{3,1}) / (1 − |a_{3,3}|²).

PEF of order 1 (corresponding to m = 2):

    a_{1,1} = (a_{2,1} − a_{2,2} a*_{2,1}) / (1 − |a_{2,2}|²),

where a_{2,1} and a_{2,2} are defined above.

Therefore,

    Γ_3 = a_{3,3},   Γ_2 = a_{2,2},   Γ_1 = a_{1,1}.
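The inverse recursion of Example 6.5 generalizes directly to any order. The following Python sketch (our own helper; real-valued coefficients are assumed, and it is valid only when |a_{m,m}| < 1 at every step) steps the order down from M to 1:

    def lpc_to_reflection(aM):
        # Inverse L-D sketch: aM = [1, a_{M,1}, ..., a_{M,M}] (real-valued).
        # Returns [Gamma_1, ..., Gamma_M].
        a = [float(v) for v in aM]
        M = len(a) - 1
        gammas = [0.0] * M
        for m in range(M, 0, -1):
            gammas[m - 1] = a[m]            # Gamma_m = a_{m,m}
            denom = 1.0 - a[m] ** 2
            # step down one order:
            # a_{m-1,k} = (a_{m,k} - a_{m,m} a_{m,m-k}) / (1 - a_{m,m}^2), k = 0..m-1
            a = [(a[k] - a[m] * a[m - k]) / denom for k in range(m)]
        return gammas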

There are three equivalent representations of an AR process, as shown in Figure 6.10. They are (i) the autocorrelation sequence, (ii) the AR parameter sequence, and (iii) the reflection coefficient sequence.


FIGURE 6.10 The three equivalent representations of an AR process.


6.10.5 Summary of L-D Recursion

Relationships to use:

1. Levinson's recursion:

       ( a_{m,k}    )     ( 1     Γ_m ) ( a_{m−1,k}    )
       ( a*_{m,m−k} )  =  ( Γ*_m   1  ) ( a*_{m−1,m−k} ),   k = 0, 1, . . . , m.

2. Inverse Levinson's recursion:

       a_{m−1,k} = (a_{m,k} − a_{m,m} a*_{m,m−k}) / (1 − |a_{m,m}|²),   k = 0, 1, . . . , m,
       Γ_m = a_{m,m}.

3. Recovering the autocorrelation sequence:

       r(m) = −Γ*_m P_{m−1} − Σ_{k=1}^{m−1} a*_{m−1,k} r(m − k).

6.11 Summary

In this chapter, we introduced the important topic of linear prediction. We used Wiener filter theory as the basis for linear prediction, which results in the Wiener–Hopf equation. We then discussed FLP and BLP. The two predictors are related, and these relationships were examined especially for stationary and nonstationary input signals. The methods used to solve the linear prediction problem were discussed, including the autocorrelation and covariance methods. Finally, we discussed the L-D method, which leads to the lattice filter used in LPC. The relationships between reflection coefficients, log area ratios, and LSPs were also discussed.

EXERCISE PROBLEMS

6.1. What are the differences between FLP and BLP for (i) stationary input signals and (ii) nonstationary input signals? Explain.

6.2. A stationary input signal has the following autocorrelation function values:

r(0) = 1.0,

r(1) = 0.9,

r(2) = 0.7,

r(3) = 0.5.

Using the L-D algorithm, compute the reflection coefficients. Draw a three-stage lattice filter for the input signal using these values of reflection coefficients. Show that the PEP at each stage of the lattice decreases with lattice stage.


6.3. Recall Problem 1.5. Compute the autocorrelation values for 240 samples of the speech signal. The autocorrelation is computed on the speech s(n) (without windowing) by using the formula

         r(k) = Σ_{n=k}^{239} s(n) s(n − k),   k = 0, 1, 2, . . . , 10.

     Using these results, we now apply the L-D algorithm to find the reflection coefficients. The algorithm is

         P^[0] = r(0)
         for i = 1 to 10
             a_0^[i−1] = 1
             k_i = −[ Σ_{j=0}^{i−1} a_j^[i−1] r(i − j) ] / P^[i−1]
             a_i^[i] = k_i
             for j = 1 to i − 1
                 a_j^[i] = a_j^[i−1] + k_i a_{i−j}^[i−1]
             end
             P^[i] = (1 − k_i²) P^[i−1]
         end

     The final solution is a_j^[10], j = 0, 1, 2, . . . , 10, with a_0 = 1.0; these are the 10th-order LPC coefficients for the 240 samples of speech, and k_i, i = 1, 2, . . . , 10, are the corresponding reflection coefficients.

6.4. Repeat Problem 6.3 but using windowed speech s(n), with the following asymmetric Hamming window:

         w_LP(n) = { 0.54 − 0.46 cos(2πn/399),     n = 0, 1, 2, . . . , 199,
                   { cos(2π(n − 200)/159),         n = 200, 201, . . . , 239.

     The windowed autocorrelation formula is

         r(k) = Σ_{n=k}^{239} w_LP(n) s(n) w_LP(n − k) s(n − k),   k = 0, 1, 2, . . . , 10.

Compare your results with that of Problem 6.3. Explain any differences.




7
Linear Predictive Coding

7.1 Introduction

In the previous chapter, we explained the principles behind the theory of linear prediction. It is a very powerful and widely used technique in the field of signal processing. It is the basis of many of the speech coding algorithms that are popular today.

In this chapter, we present the LPC model of speech generation. This model is based on the acoustic model of speech generation using the vocal tract excitation discussed in the first chapter. We also briefly introduce the concept of CELP speech coders, which is the basis of many of the modern speech coders. Further discussion of CELP speech coders is continued in Chapter 9.

7.2 Linear Predictive Coding

The basic speech generation model underlying the LPC is that of an all-pole filter driven by a suitable excitation source, as shown in Figure 7.1. The synthesis filter consists of a predictor and an adder/subtractor.

From the theory of linear prediction, we know that the excitation sequence e(n) can be recovered at the encoder by inverse filtering the speech signal as in Figure 7.2. Combining the acoustic model of speech production in Chapter 1 and Figure 7.1, we get Figure 7.3, which is the well-known LPC model of speech generation. Here there are two excitation sources and a voiced/unvoiced switch to choose between the two sources. During the voiced part of speech, the excitation is the periodic (or quasiperiodic) pulses. During the unvoiced part of the speech, the excitation is white noise.

The pitch period is used to control the period of the impulse train excitation. The gain is used to control the amplitude of the excitation to the synthesis filter, and the filter coefficients are used in the synthesis filter. The synthesis filter represents the combination of the spectra of the glottis, vocal tract, and the radiation of the lips combined into a single time-varying filter.


FIGURE 7.1 Basic speech generation model.

All these parameters (pitch period, gain, voicing decision, and filter coefficients) are determined from one frame of the speech signal during the speech analysis. The filter coefficients are fixed for one frame of speech (typically 10–20 ms) during which the speech signal is assumed stationary. A new set of coefficients has to be computed and used for a new frame of speech.

7.2.1 Excitation Source Models

The excitation source can be a white noise generator or a set of periodic pulses. The excitation source can be recovered by using the analysis model in Figure 7.2. The challenge now is to model this excitation sequence adequately and transmit it to the receiver where the original speech signal can be synthesized. The early LPC systems modeled it as consisting of either pitch pulses (for voiced sounds) or random noise (for unvoiced sounds). The voiced–unvoiced decision, the pitch period, and gain estimates are sent to the decoder at the frame intervals where an approximation to the original speech signal is synthesized. However, accurate pitch estimation is one of the more difficult problems of speech analysis and a speech frame cannot always be classified as strictly voiced or strictly unvoiced. As a result, the quality of speech produced by this simplified model is considered to be "synthetic" at best.

Modern LPC systems model the speech production system as shown in Figure 7.3. The parameters of this model (i.e., the characteristics of the excitation source and the predictor coefficients) are assumed to remain stationary for a short frame interval of 10–20 ms. These parameters are evaluated by analyzing the actual speech signal at the source using LPA, and transmitted to a far-end receiver (at the frame intervals) where an approximation to the original speech signal is synthesized using the LPC model shown in Figure 7.3. The bit rate required with this scheme is significantly less than what would be necessary to transmit the speech samples directly or to encode the speech waveform directly.

FIGURE 7.2 Basic speech analysis model.

FIGURE 7.3 The LPC model of speech production.

The LPC model was quite successful. It formed the basis of the first commercial LPC-based toy product developed by Texas Instruments called "Speak and Spell." It also formed the basis of the U.S. Government standard for 2.4 kbps secure voice communication (FS 1015). Even though LPC-based speech coders produced intelligible speech at low bit rates (2.4 kbps), their speech quality was not very good, especially for use on POTS phone lines.

There are many limitations of the LPC model of speech generation. This model assumes that speech can strictly be divided into voiced/unvoiced segments. It also requires that we can accurately determine the period of the voiced speech segments. Both of these are strictly not true. These problems result in speech generated by LPC not sounding natural and having low quality.

In order to improve speech quality, other types of excitation have been developed for the LPC model. As an example, Atal proposed the multipulse excitation LPC (MPE-LPC) model [1,2]. This model uses a sufficiently large set of pulses at the input of an all-pole filter to produce better speech quality over a speech frame. All classes of speech sounds can be generated by a sequence of pulses with appropriate amplitudes and locations. This effectively eliminates the difficult task of classifying the speech segments as voiced/unvoiced and of determining an accurate period for the voiced segments. Figure 7.4 shows a block diagram illustrating the MPE idea.

In Figure 7.4, the pulse sequence is determined so that a weighted error is minimized, and is used to excite a speech synthesizer (an all-pole filter) to produce a natural-sounding speech signal.

The MPE-LPC algorithm determines the location and amplitude of the pulses one pulse at a time. The speech generated by an MPE-LPC speech synthesizer is at a higher rate of 9.6 kbps but is natural-sounding speech without any background noise.


FIGURE 7.4 Basic structure of MPE-LPC.

The mixed-excitation linear prediction (MELP) coder is another type of coder that tries to solve or alleviate the two limitations of the LPC model mentioned above, which lead to voicing errors, lower quality, and unintelligibility. The method used by MELP coders is to develop a single mixed excitation. Two examples are shown in Figure 7.5a and b. In Figure 7.5a, an LPF with a cutoff frequency of fc and a high-pass filter (HPF) with the same cutoff frequency fc are used to generate a mixed excitation. The impulse train excites the low-pass region of the synthesis filter and the noise excites the high-pass region of the synthesis filter. The filter coefficients are chosen such that the mixed excitation has a flat spectrum. In Figure 7.5b, the excitation is shaped using two first-order FIR filters [H1(z) and H2(z)] with time-varying parameters. A pulse position jitter is selectively used for the synthesis of weakly periodic or nonperiodic voiced speech signals. A spectral enhancer (which is an adaptive pole-zero filter) is used to boost the formant frequencies. Then a dispersion filter is placed after the LPC synthesis filter to improve the matching of natural and synthetic speech away from the formants. The MELP speech coders are reported to lead to a reduction of buzziness and raspiness in synthetic speech signals.

The residual-excited linear prediction (RELP) coder speech generation model was also developed to combat the limitations of the LPC model. The prediction residual is supposed to have a relatively flat power spectrum. Residual encoding in RELP is based on spectral matching, not waveform matching as in waveform coders such as ADPCM, ADM, and so on. RELP uses the residual (error) as the excitation to the all-pole synthesis filter, assuming that the low-frequency components are perceptually important.

A block diagram depicting this method is shown in Figure 7.6a and b. In the RELP vocoder, the input speech at the transmitter is first analyzed using LPA. Then an LPF bandlimits it to about 800 Hz. The residual is downsampled and coded using the ADM method. The resultant baseband of the residual is coded at about 5 kbps. At the same time, the LP coefficients are quantized and both are transmitted to the receiver. At the receiver, the data are demultiplexed and then adaptive delta demodulated. The result is then low-pass filtered, interpolated, upsampled, and spectrally flattened before the synthesis filter.

FIGURE 7.5 Mixed excitation LP (MELP) models.

FIGURE 7.6 The RELP vocoder: (a) transmitter and (b) receiver.

In the RELP fast Fourier transform (FFT)-based vocoder, the input speech at the transmitter is first analyzed using LPA. Then the FFT of the prediction residual is taken and the magnitudes and phases of the frequency components within the baseband (usually below 1 kHz) are encoded, multiplexed, and transmitted. At the same time, the LPC coefficients are quantized and both are transmitted to the receiver.

A block diagram depicting this method is shown in Figure 7.7a and b. At the receiver, a pitch-dependent copy-up procedure is used to generate the high-frequency residual. Then an inverse FFT is taken before the synthesis filter. At the same time, the LPC coefficients are de-multiplexed and inverse-quantized and then used in the synthesis filter.

The speech quality of RELP coders is higher than that of LPC coders mainly because of the emphasis on coding of perceptually important residual components. This quality is also limited by the loss of information due to the baseband filtering. Typically, RELP coder rates are between 6 and 9.6 kbps.

Other possible excitation methods include the regular-pulse excitation LPC (RPE-LPC) model and the CELP coder. The RPE-LPC uses an excitation sequence that consists of multiple pulses that are uniformly spaced, unlike the MPE-LPC, which sometimes uses nonuniformly spaced pulses.

The position of the pulses in RPE-LPC is determined by specifying that of the first pulse in a frame and the spacing between nonzero pulses. An RPE-LTP coder (a modified RPE-LPC) is used in the GSM cellular phone speech coding standard.

FIGURE 7.7 The FFT-based RELP vocoder: (a) transmitter and (b) receiver.

7.3 LPC-10 Federal Standard

In this section, we discuss the Federal Standard FS-1015 coder as an example of a successful LPC-based speech coder. The newer FS-1016 coder is discussed in Chapter 9 as an example of a CELP-based speech coder.

The detailed block diagrams of the general LPC encoder and decoder are shown in Figures 7.8 and 7.9, respectively.

7.3.1 Encoder

In the encoder, the input speech is first sampled at 8 kHz to get the PCM speech that is 16 bits/sample. Then the speech is segmented into frames without overlapping. Each frame can be about 10–30 ms. Then a pre-emphasis filter is used on the speech to adjust its spectrum. The output is sent to an LP analysis block, to the prediction error filter, and also to the voicing detector that outputs 1 bit as either voiced or unvoiced. The output of the voicing detector is sent to both the power computation block and the pitch estimation block. The output of the prediction error filter is also used for pitch period estimation, voicing detection, and power computation. The power is encoded in the power encoder. The pitch period is encoded in the pitch period encoder only if the frame is voiced. The LPC decoder is also used in the encoder. The LP coefficients are encoded in the LPC encoder. All the encoded bits (for LP coefficients, power, pitch period, and voicing) are packed and transmitted to the receiver in an LPC bit stream.

FIGURE 7.8 Block diagram of the LPC encoder.

FIGURE 7.9 Block diagram of the LPC decoder.

7.3.2 LPC Decoder

The LPC decoder is shown in Figure 7.9. The LPC encoded bit stream is first unpacked. The pitch period index is sent to the pitch period decoder. The voicing bit controls the voiced/unvoiced decision switch. The power index bits are sent to the power decoder. The LPC index bits are sent to the LPC decoder to be decoded, and the output is used to determine the coefficients of the synthesis filter. The excitation from the voiced/unvoiced switch is multiplied by the output of the gain computation block and used to excite the synthesis filter. The output of the synthesis filter is de-emphasized by the de-emphasis filter to get the synthetic speech signal.

7.3.3 FS-1015 Speech Coder

FIGURE 7.10 The Federal Standard FS1015: (a) LPC-10 encoder/transmitter and (b) LPC-10 decoder/receiver.

The block diagram of the LPC-10 FS1015 encoder and decoder is shown in Figure 7.10a and b, respectively. The sampled PCM speech signal is first pre-emphasized by a first-order filter given by

H(z) = 1 − 0.9375z−1.
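As a small illustration (our own sketch, not code taken from the standard), the pre-emphasis filter above and the corresponding de-emphasis filter used at the decoder can be written in Python as:

    import numpy as np

    ALPHA = 0.9375   # pre-emphasis coefficient of the FS1015 filter above

    def pre_emphasis(x, alpha=ALPHA):
        # H(z) = 1 - alpha z^-1:  y(n) = x(n) - alpha x(n-1)
        x = np.asarray(x, dtype=float)
        y = x.copy()
        y[1:] -= alpha * x[:-1]
        return y

    def de_emphasis(y, alpha=ALPHA):
        # Decoder-side inverse filter 1 / (1 - alpha z^-1)
        y = np.asarray(y, dtype=float)
        x = np.zeros_like(y)
        prev = 0.0
        for n, v in enumerate(y):
            prev = v + alpha * prev
            x[n] = prev
        return x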

The next step is the frame segmentation before the LPA. The FS1015 standard specifies use of the covariance method for LPA instead of the more popular autocorrelation method. The pitch period estimation is done after low-pass filtering the input speech to bandlimit it and subsequently inverse filtering by a second-order filter. It considers 60 pitch period values corresponding to the pitch frequency range 50–400 Hz. A voicing detector is used to detect voiced/unvoiced after the LPF. The pitch determination and voicing detector are based on methods described in Chapter 2. A single voiced frame between two unvoiced frames can lead to incorrect voicing decisions and thus annoying artifacts in the synthesized speech. A correction is made to the current frame's voicing decision based on using information from the neighboring frames. Finally, the LP coefficients, pitch, and voiced/unvoiced bits are coded and transmitted to the FS 1015 receiver.

TABLE 7.1 Bit Allocation for the FS1015 LPC Coder

                                      Resolution (bits)
    Parameter                         Voiced    Unvoiced
    LPC                                 41         20
    Pitch period/voiced speech           7          7
    Power                                5          5
    Synchronization                      1          1
    Error protection                     —         21
    Total                               54         54

At the receiver, the coded speech is decoded to yield all the parameters: LP coefficients, pitch, and voiced/unvoiced bits. The LP lattice filter coefficients are converted to direct filter coefficients and later used in the synthesis filter to produce synthetic speech. The voiced/unvoiced switch controls the source of excitation to the synthesis filter. The gain controls the power of the synthesized speech, which is de-emphasized before finally being sent out.

The bit allocation for the FS 1015 LPC encoder is shown in Table 7.1. The pitch period values are encoded using 7 bits. The power is encoded using 5 bits. The LPC coefficients are encoded as described in Chapters 2 and 6. This results in 41 bits for voiced and 20 bits for unvoiced segments. Unvoiced frame bits are protected using error protection costing 21 bits/frame. The frame length is 22.5 ms. For a total of 54 bits/frame, this results in a bit rate of 2.4 kbps for the FS 1015 LPC.

7.4 Introduction to CELP-Based Coders

In this section, we briefly introduce the CELP methods for low-bit-rate speech coding.

First, we discuss the limitations of LPC as a way of motivating the need for CELP coders.

The quality of speech generated by the LPC model depends on the accuracy of the model.

The LPC model is quite simplistic in assuming that each frame of speech can be classified as voiced or unvoiced. In reality, there are some brief regions of transition between voiced and unvoiced (and vice versa) that the LPC model incorrectly classifies. This can lead to artifacts in the generated speech, which can be annoying.

The fixed choice of two excitations, namely white noise or periodic impulses, is not truly representative of the real speech generation models, especially for voiced speech. In addition, the nature of the excitation signals for voiced speech is not truly periodic, nor are they truly impulses. This leads to synthetic speech that is not truly natural-sounding.

Naturalness can be added to the synthetic speech by preserving some of the phase information that is not typically preserved during the LPC process, especially for voiced frames. Unvoiced frame phase information can be neglected. This is important even though the human ear is relatively insensitive to phase information.

The spectrum generated by exciting a synthesis filter with periodic impulses (as required for LPC modeling of the generation of voiced frames) is one that is distorted. This is due to a violation of the requirement discussed in Chapter 2 that the AR model be excited by a flat-spectrum excitation (which is true of white noise). The use of a periodic impulse train for excitation, however, leads to a distorted spectrum. This is more noticeable for low-pitch-period voiced speech like that of women and children. For such speech, LPC-based synthetic speech is not very good.

In order to alleviate some of these problems with LPC, other coders such as CELP and MELP have been developed. CELP uses a long-term and a short-term synthesis filter to avoid the voiced/unvoiced frame decision. It also uses phase information.

The MELP coder uses a mixed excitation signal (not just periodic pulses and white noise excitation) to produce synthetic speech that sounds more natural. The rate of 4.8 kbps is an important data rate because it can be transmitted over most local telephone lines in the United States. A version of CELP operating at 4.8 kbps has been chosen as a U.S. standard for secure voice communication known as the FS1016 standard. The other such standard, FS1015, which uses an LPC vocoder operating at 2.4 kbps, was described earlier. The LPC vocoder produces intelligible speech but the speech quality is not natural.

Most parametric speech coders that operate at rates lower than 16 kbps achieve high speech quality by using more complex adaptive prediction, such as LPC and pitch prediction, and by exploiting auditory masking and the underlying perceptual limitations of the ear. Important examples of such coders are MPE-LPC, RPE-LPC, and CELP coders.

The CELP algorithm combines the high-quality potential of waveform coding with the compression efficiency of parametric model-based vocoders. At present, the CELP technique is the technology of choice for coding speech at bit rates of 16 kbps and lower. At 16 kbps, a low-delay CELP (LD-CELP) algorithm provides both high quality, close to PCM, and low communication delay, and it has been accepted as an international standard for transmission of speech over telephone networks.

Page 192: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

176 Principles of Speech Coding

The bit rate of 8 kbps was chosen for first-generation digital cellular telephony in North America. This speech quality is good, although significantly lower than that of the 64 kbps PCM speech. Both the North American and Japanese first-generation digital standards are based on the CELP technique. The first European digital cellular standard is based on the RPE-LPC algorithm at about 13.2 kbps.

Vector quantization methods are discussed in Chapter 8. However, it suffices to say here that vector quantization provides many advantages over scalar quantization: higher compression ratio, better performance, and so on.

A vector of random noise is used as an excitation instead of a white noise/impulse train. The MPE-LPC idea has been extended to vector excitation. However, the excitation vectors are stored at both the transmitter and the receiver. Only the index of the excitation is transmitted. This leads to a large reduction in bits transmitted. Vector quantization is required for vector excitation. CELP speech coders are a result of this.

CELP speech coders exhibit good performance at data rates as low as 4.8 kbps. One of the drawbacks of CELP-type coders is their large computational requirements. The vector sum excited linear prediction (VSELP) speech coder utilizes a codebook with a structure that allows for a very efficient search procedure. The VSELP coder requires less than 8 kbps. The VSELP coder was standardized by the Telecommunications Industry Association (TIA) as the standard for use in the North American digital cellular telephone systems IS-54 and IS-136 and also in the European GSM half-rate coder. The coder employs two VSELP excitation codebooks, a gain quantizer that is robust to channel errors, and an adaptive pre-/postfilter arrangement. A major drawback of VSELP is its limited ability to encode nonspeech sounds. Therefore, it performs poorly when encoding speech in the presence of background noise.

CELP is based on applying the concept of vector quantization to encode the parameters of the speech generation model (such as LP coefficients) and possibly other parameters as well. The codebook contains the list of possible vectors determined from minimization of an overall distortion measure such as the weighted mean square of the quantization error. Adaptive codebooks are discussed later. For now, we consider fixed codebooks, which are also called stochastic codebooks. Figure 7.11 shows the excitation source model for most LPC-based systems, which uses a stochastic codebook.

The basic block diagrams of a CELP-based encoder and decoder are shown in Figures 7.12 and 7.13, respectively. Note that there are two parts of the synthesis filter: the long-term prediction (LTP) pitch synthesis filter and the STP formant synthesis filter. Also, the error in the synthesized speech is perceptually weighted and then minimized. Then it is used to determine the proper index for getting the excitation from the fixed codebook. The codebook contains a vector of possible excitations for the synthesis filters. A vector excitation is better than having to make a hard choice between an impulse train and white noise as the possible excitations in the LPC model.


FIGURE 7.11 A simplified excitation source model for most LPC-based systems.


FIGURE 7.12 Basic structure of encoder of CELP coders.

At the decoder, the index is used to determine the proper excitation needed from the fixed codebook. The excitation is multiplied by the gain and then used to excite the LTP pitch synthesis filter, and the output is then used to excite the STP formant synthesis filter. An adaptive postfilter is used to smooth the output speech.

The STP formant synthesis filter is the normal LPC filter that models the envelope of the speech spectrum. Its transfer function is given by

    1/A(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^{−k}).


FIGURE 7.13 Basic structure of decoder of CELP coders.


The STP formant synthesis filter coefficients are obtained once per frame using the autocorrelation equations (using overlapped windows that typically extend over three frames) and transformed to line spectral frequencies (LSF) for quantization and transmission.

The LTP models the fine structure (pitch) of the speech spectrum. Its transfer function can be written as

    1/P(z) = 1 / (1 − β z^{−L}),

where β denotes the pitch gain and L denotes the pitch lag. These parameters are determined at subframe intervals using a combination of open- and closed-loop techniques.
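To make the cascade of the two synthesis filters concrete, here is a short, hedged Python sketch (our own, with illustrative names; it assumes the parameters β, L, and a_1, . . . , a_p have already been decoded) of the decoder-side filtering of Figure 7.13:

    import numpy as np

    def pitch_synthesis(excitation, beta, L):
        # LTP pitch synthesis filter 1 / (1 - beta z^-L)
        y = np.asarray(excitation, dtype=float).copy()
        for n in range(L, len(y)):
            y[n] += beta * y[n - L]
        return y

    def formant_synthesis(x, a):
        # STP formant synthesis filter 1 / A(z), A(z) = 1 - sum_k a_k z^-k, a = [a_1..a_p]
        y = np.zeros(len(x))
        p = len(a)
        for n in range(len(x)):
            y[n] = x[n] + sum(a[k] * y[n - 1 - k] for k in range(min(p, n)))
        return y

    # Decoder-side cascade (cf. Figure 7.13), with gain, codevector, beta, L, a assumed given:
    # synthetic = formant_synthesis(pitch_synthesis(gain * codevector, beta, L), a)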

The fixed codebook generates the excitation vector that is filtered by the LTP and STP to generate the synthesized speech signal. The index of the codebook and the gain are obtained so that a perceptually weighted error between the original and synthesized speech signals is minimized.

The decoder consists of the codebook, LTP, and STP blocks. Additionally, it has an adaptive postfilter that is needed to improve the speech quality.

A somewhat detailed version of Figure 7.12 is shown in Figure 7.14. The excitation source consists of a stochastic codebook, which generates random code vectors, and a pitch generation filter, as shown in Figure 7.14. The explicit pitch detection problem is avoided by modeling the pitch generation process as an LTP as shown. The parameters of this model (i.e., the codebook index, the gain, the pitch gain β, and the pitch lag L) are determined at the transmitter to minimize a perceptually weighted MSE between the original speech signal and the synthesized speech with an exhaustive search technique (the so-called AbS method). The AbS method can be used to determine the parameters of the pitch synthesis filter conceptually as described here, but it is not practical, so the adaptive codebook is used.

First, the formant predictor coefficients a_k are obtained by analyzing a frame (typically 20 ms) of the speech signal using the autocorrelation method.


FIGURE 7.14 Excitation source model for most LPC-based systems.


Next, the pitch parameters β and L are determined to minimize the perceptual error. During this process, the contribution to the excitation from the stochastic codebook is assumed to be zero and the past memory of the pitch predictor excites the formant predictor. (For certain values of the lag L, an estimated value of the future memory is also needed.) The pitch parameters are updated at subframe intervals of 5 or 10 ms. Finally, the codebook index and gain are determined by repeating the exhaustive search procedure. These parameters are also updated at possibly even shorter subframe intervals (5 or 10 ms).

The optimum parameters are sent (once per frame) to the receiver, where the synthesized speech is obtained using the above model. Additionally, an adaptive postfilter is employed to improve the quality of the synthesized speech.

7.4.1 Perceptual Error Weighting

The LTP and codebook parameters are selected to minimize the mean square of the perceptually weighted error sequence. The perceptual weighting filter is given by

    W(z) = A(z) / A(z/γ) = (1 − Σ_{k=1}^{p} a_k z^{−k}) / (1 − Σ_{k=1}^{p} a_k γ^k z^{−k}),   0 < γ < 1.

The weighting filter de-emphasizes the frequency regions corresponding to the formants as determined by the STP filter, thereby allocating more noise to the formant regions, where it is masked by the speech energy, and less noise to the subjectively disturbing regions close to the frequency nulls. The amount of de-emphasis is controlled by the parameter γ (γ = 0.75 in the ITU-T G.729A standard). A more general weighting filter of the form

    W(z) = A(z/γ_1) / A(z/γ_2)

is employed in certain CELP coders (e.g., the ITU-T G.729 standard).

For computational reasons, the weighting filter W(z) is moved to the two branches before the summer, as shown in Figure 7.15. The STP is modified to account for the cascading with W(z) to yield the weighted STP as follows:

    H(z) = (1/A(z)) W(z) = 1 / A(z/γ) = 1 / (1 − Σ_{k=1}^{p} a_k γ^k z^{−k}),   0 < γ < 1.


FIGURE 7.15 Translating the effect of weighting filter W(z).
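The bandwidth-expansion operation behind W(z) and H(z) simply scales the kth LPC coefficient by γ^k. The following Python sketch (our own helper names; a plain direct-form implementation is assumed rather than the structure used in any particular standard) illustrates applying W(z) = A(z)/A(z/γ) to a frame of speech:

    import numpy as np

    def bandwidth_expand(a, gamma):
        # A(z) -> A(z/gamma): replace a_k by gamma**k * a_k, with a = [a_1, ..., a_p]
        return np.array([(gamma ** (k + 1)) * ak for k, ak in enumerate(a)])

    def perceptual_weighting(s, a, gamma=0.75):
        # Apply W(z) = A(z) / A(z/gamma) to a speech frame s (direct form).
        s = np.asarray(s, dtype=float)
        ag = bandwidth_expand(a, gamma)
        p = len(a)
        e = np.zeros(len(s))   # numerator A(z):  e(n) = s(n) - sum_k a_k s(n-k)
        w = np.zeros(len(s))   # denominator:     w(n) = e(n) + sum_k (a_k gamma^k) w(n-k)
        for n in range(len(s)):
            e[n] = s[n] - sum(a[k] * s[n - 1 - k] for k in range(min(p, n)))
            w[n] = e[n] + sum(ag[k] * w[n - 1 - k] for k in range(min(p, n)))
        return w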

7.4.2 Pitch Estimation

Determining the accurate pitch period of the speech signal is a difficult task. This task is often broken up into two stages to reduce computational complexity. First an open-loop search is performed over the whole range of possible values of the pitch period to obtain a coarse estimate. The estimate is then refined using a closed-loop (AbS) technique. Fractional pitch delay estimates are generally required to synthesize good-quality speech.

The open-loop pitch analysis is normally done once per frame. The method of coarse pitch estimation basically consists of calculating the autocorrelation function of the weighted speech signal s_w(n) and choosing the delay L that maximizes it. One problem with this approach is that multiple pitch periods within the range of values of L might occur if the pitch period is small. In this case there is a possibility that the first peak in the autocorrelation is missed and a multiple of the pitch is chosen, thereby generating a lower-pitched voice signal. To avoid this pitch-multiples problem, the peak of the autocorrelation is estimated in several lag ranges (three ranges in the ITU-T G.729 and G.729A standards), and smaller pitch periods are favored in the selection process with proper weighting of the normalized autocorrelation values.
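A minimal sketch of such a coarse open-loop search is given below (our own code; practical coders split the lag range and weight the normalized correlations as described above, which is omitted here). It assumes 8 kHz sampling, so lags of 20–143 samples cover roughly 56–400 Hz:

    import numpy as np

    def open_loop_pitch(sw, lag_min=20, lag_max=143):
        # Coarse open-loop pitch: the lag L maximizing the normalized
        # autocorrelation of the weighted speech sw(n).
        sw = np.asarray(sw, dtype=float)
        best_L, best_score = lag_min, -np.inf
        for L in range(lag_min, lag_max + 1):
            num = np.dot(sw[L:], sw[:-L])
            den = np.sqrt(np.dot(sw[:-L], sw[:-L])) + 1e-12
            score = num / den                      # normalized correlation at lag L
            if score > best_score:
                best_L, best_score = L, score
        return best_L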

7.4.3 Closed-Loop Pitch Search (Adaptive Codebook Search)

Significant improvement in voice quality can be achieved when the LTP parameters are optimized inside the AbS loop. Consider the block diagram shown in Figure 7.16.


FIGURE 7.16 LTP parameters are optimized inside the AbS closed loop.


FIGURE 7.17 Block diagram of the closed-loop LTP analysis.


We first assume that the codebook output is zero. The pitch delay L is then selected as the delay (in the neighborhood of the open-loop estimate) that minimizes the mean square of the perceptually weighted error. The optimum pitch gain value is usually obtained by computation.

If the pitch period L is greater than the length of the subframe over which the codebook search is performed, the contribution to the output signal of the pitch synthesis filter for this subframe is only a function of the excitation sequence that was used in the last subframe, which is stored in the LTP buffer, and is not a function of the current choice of the fixed codebook excitation sequence. With this interpretation, the pitch synthesis filter can be viewed as an adaptive codebook that is in parallel with the fixed codebook, as shown in the diagram in Figure 7.17.

More details of the CELP coders FS-1016 and ITU-T G.729A are presented in Chapter 9.

7.5 Summary

In this chapter, we presented an introduction to LPC. We presented the FS 1015 LPC standard as an example of LPC speech coders. We covered the limitations of the LPC model of speech generation and introduced the modifications that have been made to overcome these limitations. This led to a discussion of MELP, RELP, RPE-LTP, and related coders. Finally, we introduced the concept of CELP speech coders. AbS methods of speech coding, to which CELP coders belong, are discussed in Chapter 9. Meanwhile, in Chapter 8, we discuss the concept of vector quantization, which underlies many of the modern CELP-based speech coders.

EXERCISE PROBLEMS

7.1 What are the possible types of excitation used for the LPC model of speech production?

7.2 What are the advantages and disadvantages of the LPC model of speech production compared with the waveform coding methods of Chapters 4 and 5? What are the limitations of the LPC model?

7.3 Explain the rationale for the introduction of CELP speech coders.

7.4 Explain how RELP speech coders work.

7.5 Why is the speech quality of RELP coders higher than that of LPC coders?

7.6 How is pitch estimated in CELP-based speech coders?

7.7 Explain the differences in operation of the adaptive codebook and the fixed codebook in CELP speech coders.


References

1. Atal, B.S., The history of linear prediction, IEEE Signal Processing Magazine, 23(2), 154–161, March 2006.

2. Atal, B.S. and J.R. Remde, A new model of LPC excitation for producing natural-sounding speech at low bit rates, Proceedings of the IEEE ICASSP Conference, pp. 614–617, 1982.


8
Vector Quantization for Speech Coding Applications

8.1 Introduction

The purpose of this chapter is to introduce the technique of vector quantization (VQ), which has found wide applications in speech, audio, image, and video compression. We first recall scalar quantization, which was discussed in Chapter 3.

In scalar quantization, each individual data sample or parameter is quantized. In VQ, a vector or a group of data (parameters) is quantized. In speech coding, for example, VQ is used to quantize and code parameters such as vocal tract parameters, the excitation signal, and so on.

The algorithm for creating the codebook is the major part of VQ systems. We will discuss some of the common algorithms used in speech applications.

An optimal VQ search is computationally very expensive. Algorithms for performing suboptimal searches are less so and can be easily utilized in current speech coding standards. Such algorithms include

• Multistage VQ (MSVQ)
• Split VQ
• Conjugate VQ

Just like in scalar quantization methods such as differential pulse code modulation (DPCM) and ADPCM, it is also possible to introduce prediction into the VQ methods. This results in better performance and less implementation cost. Examples are predictive VQ (PVQ) and PVQ with MA prediction (PVQ-MA).

VQ is a powerful method of mapping a sequence of discrete (or continuous) vectors into a digital sequence suitable for storage or transmission in a digital channel in order to minimize storage requirements or channel capacity. VQ helps us to compress the information so that it is more suitable for storage or transmission.


FIGURE 8.1 Block diagram for a vector quantizer.

8.2 Review of Scalar Quantization

Scalar quantization has been discussed in Chapter 3. In this chapter, we extend that treatment to the case of VQ.

Vector quantizers can be divided into two main categories: memoryless and those with memory. We will discuss the memoryless ones first.

Recall from Chapter 3 that waveform (scalar) quantization is the process of mapping an analog sample that can take a continuum of values into one of a finite set of levels. It was represented conceptually as the combination of an encoder and a decoder, as shown in Figure 8.1. The coder analyzes the input sample and generates a code corresponding to the quantization region to which this particular sample belongs, while the decoder produces the proper rounded-off value. The decoder output is the actual finite-precision representation of the input sample. Note that in Figure 8.1, only the code index, I, is transmitted from the encoder to the decoder.

8.3 Vector Quantization

The block diagram for a simple vector quantizer is shown in Figure 8.2. It is also based on a simple encoder–decoder system like the scalar quantizer. In fact, it is a natural generalization of scalar quantization to many dimensions. Alternatively, scalar quantization is VQ of unit dimension. VQ is used in speech coding (and image coding) to realize substantial savings in compression.

FIGURE 8.2 A simple vector quantizer (encoder and decoder).

The input vector x is compared with each codeword vector stored in the Encoder Codebook. The index of the closest matching codeword, i, is transmitted to the receiver. At the receiver, the index, i, is used to read the Decoder Codebook to decide which codeword to output using a Table Lookup. The codeword output, x̂, is the output vector that is expected to be close to the input vector x.

Note that in Figure 8.2 only the index is transmitted from the encoder to the decoder, thus increasing compression efficiency. Therefore, the decoder must have the same "codebook" as the encoder. The codebook is precomputed and stored at both the encoder and the decoder. When a speech (sample or parameter) vector is to be encoded, the input vector is compared with each vector in the precomputed codebook using a distance metric. The index of the codebook vector that most closely matches the input vector is stored or transmitted to the decoder. At the decoder, the index is used to output the corresponding codebook vector with the closest match (nearest neighbor). In this way, by transmitting only the index rather than the vector itself, significant compression savings are achieved.

Let the size of the input vector x be N. Thus,

    x = [x_1, x_2, . . . , x_N].    (8.1)

The elements of the vector, {x_k, 1 ≤ k ≤ N}, are real-valued, continuous random variables from the input process X. Let M be the number of training vectors.

The size of the output vector x̂ is also N. Thus,

    x̂ = [x̂_1, x̂_2, . . . , x̂_N].    (8.2)

However, x̂ is chosen from a finite set (database) Y consisting of L vectors, each of dimension N, called codewords:

    Y = [y_1, y_2, . . . , y_L].    (8.3)

Each vector y_i is N dimensional, that is,

    y_i = [y_{i1}, y_{i2}, . . . , y_{iN}].    (8.4)

The elements of the vectors {y_i, 1 ≤ i ≤ L} are real-valued, continuous random variables. VQ is therefore a mapping,

    Q(x) ⇒ x̂ = y_i,   for some 1 ≤ i ≤ L,    (8.5)

where x̂ is chosen from an L-level codebook Y. There are L such codewords stored in Y, and we have to make a choice of the codeword that is a nearest neighbor to the input vector. The making of this choice involves the design and use of an algorithm. The algorithm for creating the precomputed codebook is the major part of VQ systems. We will discuss these later in this chapter.

A set of training vectors is used to partition the vector space into L partitions (cells), with each codeword a vector of dimension N. Inside each partition (or cell) is a centroid to which all input vectors falling inside that partition are mapped. The centroid corresponds to the mean of the training vectors for each partition (cell).

The number of bits required to address all the partitions (cells) is b = log₂ L bits, where L is the number of codewords.

For transmission, the vector y_i is encoded into a binary codeword c_i of length B_i bits. In general, different codewords will have different lengths. Therefore, the transmission rate is given by

    T = B F_c bits/s,

where

    B = lim_{M→∞} (1/M) Σ_{n=1}^{M} B(n)   bits/vector

is the average length of a codeword. B(n) is the number of bits used to code the vector x(n) at time n, F_c is the number of codewords transmitted per second, and M is the number of training vectors.

The average number of bits per parameter, or bits per dimension, is

    R = B/N   bits/dimension.

For a codebook of size L, the maximum number of bits needed to code each vector is

    B_max = log₂ L bits.

In VQ, our goal is to design the quantizer such that the distortion in the output is minimized for a given transmission rate.

This quantization is lossy because the output vector y is not always exactly the same as the input vector x. After the quantization, we can define a quantization error that is measured by a distortion measure between the input vector x and the output vector y.

8.3.1 The Overall Distortion Measure

The overall distortion measure is

    D = lim_{M→∞} (1/M) Σ_{n=1}^{M} d[x(n), y(n)],    (8.6)

where M is the number of vectors in the database and d[x(n), y(n)] is the distortion due to each vector y(n) in the database. The distortion is non-negative, that is,

    d[x(n), y(n)] ≥ 0.    (8.7)

For a stationary and ergodic input process, this sample average, the overall distortion measure D, tends to the expectation,

D = E[d(x, y)] = ∑_{i=1}^{L} Pr(x ∈ Ci) E[d(x, yi) | x ∈ Ci] (8.8)
  = ∑_{i=1}^{L} Pr(x ∈ Ci) ∫_{x∈Ci} d(x, yi) p(x) dx,

where Pr(x ∈ Ci) is the discrete probability that vector x is in the cell Ci, p(x) is the multidimensional PDF of vector x, and we integrate over all the components of the vector x in cell Ci.

Example

A 1-dimensional vector (scalar) quantizer is shown in Figure 8.3. All input signals x ∈ [xi, xi+1] in cell Ci will be quantized as the output vector yi, which is located at the statistical centroid (middle) of the (range) interval [xi, xi+1]. In the case of uniform scalar quantization, these intervals are equally spaced, but in the case of nonuniform quantization, they are not.

Example

A 2-dimensional vector quantizer is shown in Figure 8.4. Note that the shapes of the various cells can be very different. All input vectors Xi = [x1i, x2i] in cell Ci will be quantized as the output vector yi, which is located at the centroid of the cell. The codebook is the set of all output vectors located at the centroids of each cell Ci. Here there are 16 cells and 16 corresponding centroids. Therefore this is a 2-dimensional, 4-bit vector quantizer that gives a rate of 2 bits/dimension. The input vectors are not shown in Figure 8.4.

FIGURE 8.3 A 1-dimensional space (N = 1) partitioned into L = 10 intervals (cells) Ci.

FIGURE 8.4 A 2-dimensional space (N = 2) partitioned into L = 16 cells, each with a centroid Ci.

8.3.2 Distortion Measures

There are various types of distortion measures we can use:

i. MSE: The MSE is defined as

d2(x, y) = (1/N)‖x − y‖² = (1/N)(x − y)ᵀ(x − y) = (1/N) ∑_{k=1}^{N} (xk − yk)². (8.9)

A generalized version of this measure is that based on the Lr norm,

dr(x, y) = (1/N) ∑_{k=1}^{N} |xk − yk|^r. (8.10)

A choice of r = 2 gives the MSE distortion measure. A choice of r = 1 gives the average absolute error distortion measure. A choice of r = ∞ gives an error distortion measure that tends toward the maximum error:

lim_{r→∞} [dr(x, y)]^{1/r} = max(|xk − yk|, 1 ≤ k ≤ N).

(These measures are illustrated in a short code sketch following this list.)

ii. Weighted Mean Square Error (WMSE): This distortion measure is similar to the MSE except that it weights the different parameters unequally using a positive-definite weighting matrix W:

dW(x, y) = (x − y)ᵀ W (x − y).

When W = N⁻¹I, the WMSE becomes the MSE distortion criterion, that is, dW = d2. A common choice is W = R⁻¹, where R is the estimate of the autocovariance matrix of the input vector x.

For a symmetric weighting matrix that can be factored as W = PᵀP, the distortion measure is equivalent to

dW(x, y) = (x − y)ᵀPᵀP(x − y) = (Px − Py)ᵀ(Px − Py) = (x̃ − ỹ)ᵀ(x̃ − ỹ) = d2(x̃, ỹ).

Therefore the WMSE is equivalent to performing an MSE on the transformed vectors x̃ = Px and ỹ = Py.

For real inputs, the centroid realized using this measure is given by

yi = (1/Ni) ∑_{xk∈Ri} xk.

iii. Hamming Distortion Measure: The Hamming distortion measure is defined as

dH(x, y) = 0 if x = y, and dH(x, y) = 1 if x ≠ y.

iv. Linear Prediction Distortion Measures: From Chapter 6 (on LPC), the solution of the Yule–Walker equations is

∑_{k=1}^{N} a(k) r(i − k) = −r(i), 1 ≤ i ≤ N,

where r(i) are the short-term autocorrelation coefficients of the speech signal over a single frame.

To prevent instability, these coefficients are later transformed into another set of coefficients known as reflection coefficients or PARCOR coefficients. For sensitivity reasons, these PARCOR coefficients are also transformed into another set of coefficients (such as log area ratios, LARs) which exhibit lower spectral sensitivity. An example of another alternative linear prediction distortion measure is the modified Itakura–Saito distortion measure, which is defined as

dI(x, y) = (x − y)ᵀ Rx (x − y),

where Rx = {r(i − k)/r(0)}, 0 ≤ i, k ≤ N − 1, is the normalized autocorrelation matrix of the input. The autocorrelation function (taken over a frame of speech signal) is defined as

r(i − k) = E[x(i)x(k)].

Note that Rx is a time-varying weight matrix, unlike the fixed W in the WMSE. This means that this distortion measure is not symmetric, that is, dI(x, y) ≠ dI(y, x). However, the WMSE distortion measure is a symmetric distance and metric.

v. Perceptual Distortion Measures: Distortion measures should correlate well with the human perception of speech. However, as bit rates decrease and distortion increases, the distortion measures discussed so far may not correlate well with human perception of speech.

Examples of distortion measures that are perceptually based can be found in References [1,2].
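The following Python/NumPy sketch (an editorial illustration, not from the original text) evaluates the MSE, the average absolute error, the maximum-error limit, and the WMSE defined in items (i) and (ii) above; the weighting matrix chosen simply demonstrates the stated equivalence with the MSE:

```python
import numpy as np

def d_r(x, y, r):
    """Lr-norm distortion: (1/N) * sum |x_k - y_k|**r  (r = 2 gives the MSE)."""
    return np.mean(np.abs(x - y) ** r)

def d_max(x, y):
    """Limiting case r -> infinity: the maximum absolute component error."""
    return np.max(np.abs(x - y))

def d_w(x, y, W):
    """Weighted MSE: (x - y)^T W (x - y) for a positive-definite W."""
    e = x - y
    return float(e @ W @ e)

rng = np.random.default_rng(1)
N = 10
x, y = rng.standard_normal(N), rng.standard_normal(N)

mse = d_r(x, y, 2)          # MSE distortion
abs_err = d_r(x, y, 1)      # average absolute error
max_err = d_max(x, y)       # maximum error

# With W = (1/N) I the WMSE reduces to the MSE, as stated in the text.
W = np.eye(N) / N
assert np.isclose(d_w(x, y, W), mse)
```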

8.3.3 Codebook Design

Recall that the overall distortion measure is

D = lim_{M→∞} (1/M) ∑_{n=1}^{M} d[x(n), y(n)].

To design an L-level codebook, we partition the N-dimensional space spanned by the input vector x into L cells, which are {Ci, 1 ≤ i ≤ L}. Then we associate each cell Ci with a vector yi.

Next, we present one of the most popular algorithms used for this design.

8.4 Lloyd’s Algorithm for Vector Quantizer Design

Lloyd’s algorithm is also sometimes referred to as the K-means algorithm. It is an iterative partitioning algorithm for codebook design. It divides the set of training vectors into K partitions such that the two necessary optimality conditions are satisfied. These conditions are

i. The quantizer is realized using a nearest-neighbor or minimum-distortion selection method, that is,

Q(x) = yi iff d(x, yi) ≤ d(x, yj), j ≠ i, 1 ≤ j ≤ L.

ii. Each codevector yi is selected to minimize the average distortion in its partition {Ci, 1 ≤ i ≤ L} (or region). The selected vector is called the “centroid” of the partition, that is, yi is selected to minimize

Di = E[d(x, y) | x ∈ Ci] = ∫_{x∈Ci} d(x, y) p(x) dx.

The “centroid” is defined as

cent(Ci) = {yo : E[d(x, yo) | x ∈ Ci] ≤ E[d(x, yj) | x ∈ Ci]} for all yj.

The centroid condition for optimality is that for a given cell (or partition), the optimal codevectors should satisfy

yi = cent(Ci).

Boundary sets are defined as

B = {x : d(x, yi) = d(x, yj)} for all i ≠ j and 1 ≤ i, j ≤ L.

The boundary set must be empty for a given codebook to be optimal; that is, no input vector should be equidistant from two codevectors. A nonempty boundary set means that there is at least one vector x that is equidistant from both codevectors yi and yj. Mapping x to yi or yj will give two encoding schemes with the same average distortion, meaning that the codebook is not optimal for the given partition.

If the PDF of the input signal source is unknown, and assuming the source is ergodic, we can use a finite number of samples as “training” data for the quantizer design, minimizing the average distortion over the whole data set. In other words, when the probability distribution of the signal to be quantized is unknown, we can design an optimal quantizer based on training data using Lloyd’s algorithm.

The “centroid” thus determined depends on the definition of the distortion measure used. For example, for the MSE distortion criterion, given a set of training vectors, {x(n), 1 ≤ n ≤ M}, a subset Mi of the vectors will be inside cell Ci. The average distortion is given by

Di = (1/Mi) ∑_{x∈Ci} d(x, yi).

The choice of yi that minimizes this distortion for either the MSE or the WMSE distortion measure is given by

yi = (1/Mi) ∑_{x(n)∈Ci} x(n),

which is simply the sample mean of all the training vectors contained in cell Ci.

The choice of yi that minimizes this distortion for the Itakura–Saito distortion measure is given by averaging the normalized autocorrelation functions corresponding to the input training vectors contained in cell Ci:

r_{yi}(k) = (1/Mi) ∑_{x∈Ci} r_x(k), 0 ≤ k ≤ N.

Lloyd’s algorithm divides the set of training vectors {x(n), 1 ≤ n ≤ M} into L clusters Ci(m) such that the two necessary conditions for optimality mentioned above are satisfied.

Lloyd’s algorithm for optimal VQ based on training data can be summarized as follows. Use m as the iteration index; Ci(m) is the ith cluster at iteration m and yi(m) is its centroid.

1. Initialization: Set m = 0. Choose a set of initial codevectors yi(0), 1 ≤ i ≤ L (e.g., by using the splitting method discussed below).

2. Classification: Classify the set of training vectors {x(n), 1 ≤ n ≤ M} into the clusters Ci(m) by the nearest-neighbor rule, that is,

x ∈ Ci(m) iff d(x, yi(m)) ≤ d(x, yj(m)) for all j ≠ i, 1 ≤ i, j ≤ L.

3. Codevector updating: m → m + 1. Update the codevector of every cluster by computing the centroid of the training vectors in each cluster:

yi(m) = cent(Ci(m)), 1 ≤ i ≤ L.

4. Termination test: Check whether the decrease in the overall distortion D(m) at iteration m relative to D(m − 1) at iteration m − 1 is below a certain threshold ε. If D(m − 1) − D(m) < ε, stop; otherwise go to Step 2.


The centroid here can be computed using the formulas developed for MSE or WMSE.

Lloyd’s algorithm reduces the distortion by updating the codebook. It is not guaranteed to converge to the global optimum. Depending on the initialization point, it may converge to the nearest local optimum. This may not be acceptable. To increase the possibility of finding the global optimum, we may repeat this algorithm for several initial points and pick the codebook generated having the minimum overall distortion measure.
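A minimal Python/NumPy sketch of the steps above (a generic K-means-style loop under the MSE criterion, added for illustration; it is not the book's reference implementation) is given below. It assumes an initial codebook is supplied, for example by the splitting method of the next subsection, and that no cluster becomes empty:

```python
import numpy as np

def lloyd(training, codebook, eps=1e-4, max_iter=100):
    """One run of Lloyd's (K-means) algorithm for VQ design under the MSE criterion.

    training : (M, N) array of training vectors
    codebook : (L, N) array of initial codevectors
    """
    prev_D = np.inf
    for _ in range(max_iter):
        # Step 2: nearest-neighbor classification of every training vector
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        D = d[np.arange(len(training)), labels].mean()
        # Step 4: stop when the relative decrease in distortion is small
        if (prev_D - D) / D < eps:
            break
        prev_D = D
        # Step 3: centroid (sample-mean) update of every nonempty cluster
        for i in range(len(codebook)):
            members = training[labels == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
    return codebook, D
```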

8.4.1 Splitting Method

In the splitting method of initialization, an initial codevector is set as the average of the M vectors in the entire training set. This initial codevector is then split into two, and the algorithm is run, resulting in two codevectors. These two new codevectors are then split again into four, the four are split into eight, and so on, until we reach the desired number of codevectors.

Splitting is also used in another way, to avoid empty cells. It is undesirable to have an empty cell after this procedure because an empty cell leads to a division by zero in finding the centroid and increases the overall final distortion measure. One way to avoid this is by splitting the biggest cells into two cells by adding and/or subtracting small random numbers with low variance.
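A sketch of the splitting initialization in the same illustrative style (the perturbation size delta is a free design choice, not a value from the text; the lloyd() function is the one from the previous sketch):

```python
import numpy as np

def split_codebook(codebook, delta=0.01):
    """Double the codebook size by perturbing every codevector up and down slightly."""
    return np.vstack([codebook * (1 + delta), codebook * (1 - delta)])

def design_by_splitting(training, B, eps=1e-4):
    """Grow a codebook from 1 to 2**B codevectors, refining with Lloyd's algorithm after each split."""
    codebook = training.mean(axis=0, keepdims=True)    # initial codevector: mean of training set
    for _ in range(B):
        codebook = split_codebook(codebook)
        codebook, D = lloyd(training, codebook, eps)   # refine with Lloyd's algorithm above
    return codebook
```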

8.5 The Linde–Buzo–Gray Algorithm

Lloyd’s algorithm is based on using the PDF of the input signal source, X. This is because the “centroid” as defined earlier, which is

cent(Ci) = {yo : E[d(x, yo) | x ∈ Ci] ≤ E[d(x, yj) | x ∈ Ci]} for all yj,

is based on the knowledge of the expectation of a conditional probability, that is, a PDF.

A generalized Lloyd algorithm for optimal VQ based on training data, also known as the Linde–Buzo–Gray (LBG) algorithm, is very much like the K-means algorithm.

1. Initialization: Set m = 0. Begin with the set of training vectors {x(n), 1 ≤ n ≤ M}. Choose a set of initial codevectors yi(0), 1 ≤ i ≤ L, by using the splitting method. Select the threshold ε. Set D(0) = 0.

2. Classification: Classify the set of training vectors {x(n), 1 ≤ n ≤ M} into the clusters Ci(m) by the nearest-neighbor rule, that is,

x ∈ Ci(m) iff d(x, yi(m)) ≤ d(x, yj(m)) for all j ≠ i, 1 ≤ i, j ≤ L.

This assumes that none of the quantization regions is empty.

3. Codevector updating: Update the codevector of every cluster by computing the centroid of the training vectors in each cluster:

yi(m) = cent(Ci(m)), 1 ≤ i ≤ L.

4. Compute the average distortion D(m) between the training vectors and the updated codevectors.

5. Termination test: Check whether the relative decrease in the overall distortion is below the threshold. If [D(m − 1) − D(m)]/D(m) < ε, stop; otherwise continue.

6. m → m + 1. Find new codevectors yi(m), 1 ≤ i ≤ L, as the average value of the elements of each of the quantization regions (clusters) Ci(m). Go to Step 2.

The LBG algorithm is also not guaranteed to converge to the global optimum. To increase the possibility of finding the global optimum, we may repeat this algorithm for several initial points and pick the codebook generated having the minimum overall distortion measure.

Stochastic relaxation methods can also be used to avoid local minimum points. These methods are discussed in References [3,4].

8.6 Popular Search Algorithms for VQ Quantizer Design

8.6.1 Full Search VQ

Full search VQ is a method of exhaustively searching for the optimal codeword. It involves searching the space of unconstrained VQ for the optimal codevector index. This method is expensive in terms of computational complexity and memory.

To reduce the costs of computation and memory, the VQ can also be constrained by a certain structure. Examples are multistage or tree search (e.g., binary) structures, which will be discussed later. The full search can also be used for the multistage structure, but in that case the optimal index will be a vector of optimal indices of the stages.

If each codevector in the codebook is represented by B = RN bits, where R is the average number of bits per parameter or bits per dimension, then the number of vectors in the codebook is L = 2^B. The computational complexity of the full search is N·2^B = N·2^{RN} multiply-adds per input vector, where N is the vector dimension in the codebook.

The memory storage required for full search is N·2^B = N·2^{RN} locations. Both grow exponentially with the number of bits in the codeword.


8.6.2 Binary Search VQ

Binary search is the simplest case of a class of so-called tree-searched VQ methods. Binary search VQ is a method of reducing the computational complexity of the search for the optimal codeword. It divides the N-dimensional space to be searched into two regions (the K-means algorithm with K = 2), then subdivides the resulting regions into two regions, and so on until the space is divided into L = 2^B regions, where L is a power of 2 and B is an integer. Each region from the subdivisions has an associated centroid.
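The tree traversal can be sketched as follows (an illustrative Python/NumPy example under the squared-error distance; the array layout of the tree and the random centroids are assumptions, not taken from the standardized coders):

```python
import numpy as np

def tsvq_encode(x, levels):
    """Binary tree-searched VQ.

    levels : list of arrays; levels[d] has shape (2**(d+1), N) and holds the two
             candidate centroids under each node of depth d (children of node k
             are rows 2*k and 2*k + 1).
    Returns the index of the selected leaf codevector (0 .. L-1).
    """
    node = 0
    for centroids in levels:
        left, right = centroids[2 * node], centroids[2 * node + 1]
        # descend toward the child centroid with the smaller distortion
        go_right = np.sum((x - right) ** 2) < np.sum((x - left) ** 2)
        node = 2 * node + (1 if go_right else 0)
    return node

# Toy tree for L = 8 (three binary subdivisions): random centroids, illustration only
rng = np.random.default_rng(2)
N = 4
levels = [rng.standard_normal((2 ** (d + 1), N)) for d in range(3)]
leaf_index = tsvq_encode(rng.standard_normal(N), levels)
```

Note that each level requires exactly two distortion computations, so the total is 2 log2 L, matching the complexity figure given below.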

Example

An example for L = 8 is shown in Figure 8.5. An input vector X is quantized by traversing the binary tree along the path that gives the minimum distortion at each node in the path. At the first subdivision, the centroids are V1 and V2. Then at the second subdivision (stage), the centroids are V3 through V6. The centroids of the regions after the third binary division are the codevectors yi, i = 1, 2, . . . , 8.

The computational complexity of the binary tree search is 2N log2 L, where N is the number of multiply-adds for each distortion computation and 2 log2 L is the number of distortion computations. In Reference [5], the authors compare the performance of the Binary Search (Pair-Wise) algorithm to the LBG algorithm for image processing applications. They concluded that the LBG algorithm has slightly better SNR and is more efficient than the Pair-Wise algorithm. A fast Binary Search algorithm introduced in Reference [6], however, is better than the Binary Search algorithm.

The memory storage required for binary search is 2N(L − 2). Here is an example of memory and computation requirements for an application using both full search and binary search. It is obvious that binary search reduces computational and memory costs.

FIGURE 8.5 Block diagram of a uniform binary search vector quantizer for L = 8.


Example

Use full search: N = 10, R = 1. The number of codebook vectors is L = 2^{NR} = 1024. The number of multiply-adds is N·2^{NR} = 10 × 1024 = 10,240. The storage required is N·2^{NR} = 10 × 1024 = 10,240 locations. This is for each input vector. For a sampling frequency of 8000 Hz, the number of vectors per second is 8000/10 = 800.

Use binary search: N = 10, R = 1. The number of codebook vectors is L = 2^{NR} = 1024. The number of multiply-adds is 2N log2(2^{NR}) = 2 × 10 × 10 × 1 = 200. The storage required is 2N(2NR − 2) = 2 × 10 × (10 − 2) = 160. This is for each input vector. For a sampling frequency of 8000 Hz, the number of vectors per second is 8000/10 = 800.

8.7 Other Suboptimal Algorithms for VQ Quantizer Design

Optimal search of VQ is computationally very expensive. Algorithms for suboptimal search are less so and can be easily utilized in current speech coding standards. Three examples of such algorithms are

• MSVQ
• Split VQ
• Conjugate VQ

In addition, just as in scalar quantization methods such as DPCM, it is also possible to introduce prediction into the VQ methods. This results in better performance and lower implementation cost. Examples are PVQ and PVQ-MA.

Next, we discuss some of these algorithms.

8.7.1 Multistage VQ

Both full search and binary search VQ methods can be used on unconstrained VQ structures. The memory and computational requirements for a typical speech coding application using full search or binary search can be very high.

MSVQ was introduced in Reference [7] to constrain the VQ to a certain structure, thereby reducing memory and computation costs. MSVQs are also known as residual vector quantizers.

In other words, in order to reduce implementation costs such as storage (memory) and the computational processing time of unconstrained VQ, the MSVQ method constrains the VQ to be implemented with a certain structure (or constraint).


FIGURE 8.6 Block diagram of a simple multistage vector quantizer.

The disadvantages of the MSVQ scheme include a drop in its performance (speed of search) compared to an unconstrained VQ method.

The block diagram of an MSVQ is shown in Figure 8.6. In Figure 8.6, a first-stage vector quantizer with N1 codewords (patterns) quantizes x into a particular codevector zi. The residual vector e = x − zi is then used as an input to a second-stage quantizer with N2 codewords. Before this, a rotation matrix operation is optionally applied such that ui = Ai e. W1 and W2 represent the codebooks of N1 and N2 patterns (codewords), respectively.

Note that the quantized value of x is given as the sum of the two codevectors, as in

X̂ = Q(x) = y = zi + Ai⁻¹ wi.
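A two-stage residual (MSVQ) quantizer without the optional rotation (that is, with Ai = I), encoded by a simple sequential search, can be sketched as follows in Python/NumPy (an editorial illustration; codebooks and sizes are placeholders):

```python
import numpy as np

def nearest(x, codebook):
    """Index of the codevector closest to x in the squared-error sense."""
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))

def msvq_encode(x, cb1, cb2):
    """Two-stage MSVQ: quantize x with stage 1, then quantize the residual with stage 2."""
    i1 = nearest(x, cb1)
    e = x - cb1[i1]              # residual vector e = x - z_i
    i2 = nearest(e, cb2)
    return i1, i2

def msvq_decode(i1, i2, cb1, cb2):
    """Quantized output is the sum of the two selected codevectors."""
    return cb1[i1] + cb2[i2]

# Example: a 10-bit quantizer split as a (5, 5) two-stage MSVQ of dimension 10
rng = np.random.default_rng(3)
cb1 = rng.standard_normal((32, 10))
cb2 = 0.3 * rng.standard_normal((32, 10))   # residuals are typically smaller
x = rng.standard_normal(10)
x_hat = msvq_decode(*msvq_encode(x, cb1, cb2), cb1, cb2)
```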

The block diagram of a single-stage vector quantizer equivalent to the multistage vector quantizer of Figure 8.6 is shown in Figure 8.7.

The details of a K-stage encoder for a multistage vector quantizer are shown in Figure 8.8. Here, the input vector x is compared with the output vector x̂ = y(1)_{i1} + y(2)_{i2} + · · · + y(k)_{ik}, where y(l)_{il} is the il-th codevector from the lth stage. All the codevectors have the same dimension as the input vector. The decoders D1 to Dk of the different stages decode their inputs using the input indices i1, i2, . . . , ik.

Using these different indices, the encoder minimizes the distance between the input x and x̂. The code index set {i1, i2, . . . , ik} that minimizes this distance is transmitted to the MSVQ decoder. At the decoder, codevectors from the different stages are added together to form the quantized version of the input.

FIGURE 8.7 Block diagram of an equivalent single-stage vector quantizer.

FIGURE 8.8 MSVQ encoder details.

The details of a K-stage decoder for a multistage vector quantizer are shown in Figure 8.9.

For a K-stage MSVQ, the given codebook sizes are N1, N2, . . . , Nk. Therefore the number of bits required for each stage is r1 = log2(N1), r2 = log2(N2), . . . , rk = log2(Nk).

The total number of bits is

r = ∑_{j=1}^{k} rj = ∑_{j=1}^{k} log2(Nj) = log2( ∏_{j=1}^{k} Nj ).

FIGURE 8.9 MSVQ decoder details.


The amount of memory required for the K-stage MSVQ is

M = ∑_{j=1}^{k} Nj = ∑_{j=1}^{k} 2^{rj}.

The choice of the breakdown of a B-bit VQ into an MSVQ can be done in a number of different ways. For example, a 6-bit VQ can be decomposed into either a 2-stage (3,3) or a 3-stage (2,2,2) MSVQ. Similarly, a 10-bit VQ can be decomposed into either a 2-stage (5,5), a 2-stage (3,7), a 2-stage (1,9), or a 3-stage (3,3,4) MSVQ, and so on. Memory costs for the MSVQ are reduced compared to the unconstrained VQ for each possible MSVQ implementation.
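To make the savings concrete with the memory formula above: a 10-bit unconstrained VQ stores 2^10 = 1024 codevectors, whereas the 2-stage (5,5) MSVQ stores only 2^5 + 2^5 = 64 and the 3-stage (3,3,4) MSVQ only 2^3 + 2^3 + 2^4 = 32, all codevectors being of the same dimension N.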

MSVQ systems can be searched using one of the following search methods: full search, sequential search, and tree search. Each method yields a different computational cost, memory requirement, and number of bits.

For MSVQ design algorithms, it is possible to use either the Sequential Codebook Design Algorithm or the Joint Codebook Design Algorithm. For details of these and other similar algorithms, see [3,7–9].

8.7.2 Split VQ

The idea of split VQ was first introduced in References [10,11] in order to reduce computational complexity and storage.

The input vector is divided into subvectors, and each of the subvectors is vector quantized independently. When the input vector is of large dimension, the method can drastically reduce the computational complexity. It can be shown that the MSE distortion criterion is minimized by using either full VQ or split VQ algorithms.
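A sketch of the idea, splitting a 10-dimensional vector (for instance, an LSF-like vector) into two 5-dimensional subvectors that are quantized independently, is given below; the codebooks and split sizes are illustrative placeholders, not values from any standard:

```python
import numpy as np

def split_vq_encode(x, codebooks, split_sizes):
    """Quantize each subvector of x independently with its own codebook."""
    indices, start = [], 0
    for cb, size in zip(codebooks, split_sizes):
        sub = x[start:start + size]
        indices.append(int(np.argmin(np.sum((cb - sub) ** 2, axis=1))))
        start += size
    return indices

def split_vq_decode(indices, codebooks):
    """Concatenate the selected sub-codevectors."""
    return np.concatenate([cb[i] for cb, i in zip(codebooks, indices)])

rng = np.random.default_rng(4)
split_sizes = [5, 5]                                                # a 10-dimensional vector split into two halves
codebooks = [rng.standard_normal((128, s)) for s in split_sizes]    # 7 bits per subvector (example)
x = rng.standard_normal(10)
x_hat = split_vq_decode(split_vq_encode(x, codebooks, split_sizes), codebooks)
```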

8.7.3 Conjugate VQ

The idea of conjugate VQ was introduced by Moriya in [12]. An output vector is generated by summing the outputs of two codevectors, each stored in a different codebook.

The authors in [12] used two codebooks Csub1 and Csub2.

Ci = α1Csub1 + α2Csub2.

The conjugate VQ is also used for coding the gain, and so on. The conjugate VQ idea has the following advantages:

i. Reduces memory requirements
ii. Reduces complexity of random codebook searches
iii. Improves robustness
iv. Allows a large trained random codebook to be produced


8.7.4 Predictive VQ

The idea of PVQ was introduced in References [13,14]. PVQ is a vector quantizer with memory that exhibits good MSE performance, smaller block size, and smaller computational load compared with memoryless VQ methods.

A memoryless VQ encodes consecutive vectors independently. However, when these vectors are not statistically independent, the performance of the VQ can be improved by introducing memory. In PVQ, a multiple codebook quantizer structure is employed where past outputs are used to determine which of the codebooks to use for the present input. More details can be found in References [13,14].
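The text above describes a multiple-codebook (switched) form of PVQ; a simpler residual form conveys the same idea of exploiting interframe memory. In the editorial sketch below (illustrative only; the fixed predictor matrix A and the toy data are assumptions), each input vector is predicted from the previous quantized output and only the prediction error is vector quantized:

```python
import numpy as np

def nearest(x, codebook):
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))

def pvq_encode(frames, codebook, A):
    """Predictive VQ of a vector sequence: quantize e(n) = x(n) - A @ y(n-1)."""
    y_prev = np.zeros(frames.shape[1])
    indices, decoded = [], []
    for x in frames:
        prediction = A @ y_prev
        i = nearest(x - prediction, codebook)     # quantize the prediction residual
        y = prediction + codebook[i]              # decoder state (also tracked in the encoder)
        indices.append(i)
        decoded.append(y)
        y_prev = y
    return indices, np.array(decoded)

rng = np.random.default_rng(5)
N = 10
codebook = rng.standard_normal((64, N))
A = 0.8 * np.eye(N)                                           # simple first-order predictor (assumption)
frames = rng.standard_normal((20, N)).cumsum(axis=0) * 0.2    # correlated toy sequence
indices, decoded = pvq_encode(frames, codebook, A)
```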

8.7.5 Adaptive VQ

All the VQ methods considered so far utilize fixed codebooks. In many cases, it is desirable to have the codebook track the variations in the statistical characteristics of the input vectors. As in adaptive scalar quantizers, adaptation of vector codebooks can be done using either forward or backward methods.

In many applications such as speech coding, a combination of fixed and adaptive codebooks is used in VQ. A fixed codebook provides the initial codevectors to the adaptive codebook and also helps speed up the convergence of the adaptive codebook when the input signal vector varies significantly. Typically the adaptive codebook is used in a two-stage cascaded vector quantizer as shown in Figure 8.10.

FIGURE 8.10 Adaptive VQ (encoder and decoder).


8.8 Applications in Standards

It is well known that VQ has better performance than scalar quantization. The most common application of VQ is in the VQ of LPC coefficients (actually the LSFs). Rather than quantizing the coefficients one-by-one (scalar), they are quantized as a vector to get the enhanced performance of VQ.

Here we review how VQ has been applied in speech coding standards. Many of the recent speech coding standards use some sort of VQ. It is known that LSF vectors (which are derived from LPC coefficients obtained from consecutive frames) exhibit high levels of correlation.

For example, memoryless-type VQ is used to represent the LSF coefficients of a frame of speech. PVQ can be used to take advantage of the interframe correlation.

PVQ with a split structure is used to code the LSF vector of the ITU-T G.723.1 standardized speech coder.

The ITU-T G.729 standardized speech coder uses a combination of MSVQ, split VQ, and PVQ with a switched predictor to encode the LSF vector in 18 bits/frame.

A two-stage MSVQ is used. The first stage is a 10-dimensional memoryless quantizer, while the second stage is a split VQ with 5-dimensional codebooks. The 18 bits in a frame are allocated between the two stages. One of two different predictors is selected during encoding.

The switched predictor enables the encoder to better adapt to different signal statistics. This is very useful because LSFs exhibit high correlations during stationary speech segments, but mild correlations during transitions between speech signal types.

Another example of VQ application in standardized speech coders is in the ETSI GSM EFR ACELP coder. This coder uses PVQ with a split structure to quantize the LSF vector.

8.9 Summary

In this chapter, we introduced the technique of VQ, which has found wide uses in speech, audio, image, and video compression. VQ is an extension of scalar quantization to vectors.

We also presented one of the main algorithms for creating the codebook: Lloyd’s algorithm (or K-means algorithm). The LBG algorithm is a generalization of Lloyd’s algorithm.

Full search VQ requires exponentially large memory and computations. Other suboptimal algorithms can be used to reduce computational complexity and memory. Such algorithms include MSVQ, split VQ, conjugate VQ, PVQ, and adaptive VQ.


In speech coding, for example, VQ is used to quantize and code parameters such as vocal tract parameters, the excitation signal, and so on.

EXERCISE PROBLEMS

8.1. Explain the main differences between scalar quantization and VQ. Mention the components that vector quantizers are required to have.

8.2. Review Lloyd’s algorithm for optimal scalar quantization based on input signal statistics and based on training data. Compare it with Lloyd’s algorithm for VQ based on input signal statistics and based on training data.

8.3. What are the two main conditions for optimality of vector quantizers? How will you design a vector quantizer if one or more of these conditions are not met?

8.4. Run the VQ simulation and animation tool found at the website http://www.data-compression.com/vqanim.shtml. The source for this tool is a memoryless Gaussian source with zero mean and unit variance. There are 4096 training vectors. The LBG design algorithm is run with ε = 0.0001. The algorithm guarantees a locally optimal solution. The size of the training sequence should be sufficiently large; it is recommended that the number of training vectors be 1000 times the number of codevectors. Experiment with this simulation model. Try different values of input noise for the simulation.

8.5. Using MATLAB®’s Vector Quantization Design Tool (vqdtool), autogenerate the initial codebook using Gaussian random numbers. Use different numbers of levels and both squared error and weighted squared error distortion measures. Experiment with this simulation model. Try different stopping criteria, algorithmic details, and values of input noise for the simulation. Design and plot the performance curves and entropy.

8.6. How is VQ applied in the ITU-T G.729 speech coder? How is VQ applied in the ITU-T G.723 speech coder?

References

1. Schroeder, M. and B. Atal, Predictive coding of speech and subjective error criteria, IEEE Transactions on Acoustics, Speech and Signal Processing, 27, 247–254, 1979.

2. Viswanathan, V., et al., Objective speech quality evaluation of medium band and narrow band real-time speech coders, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 543–546, 1983.

3. Chu, W.C., Speech Coding Algorithms: Foundation and Evolution of Standardized Speech Coders, Wiley-InterScience Publishers, New York, 2003.

4. Kondoz, A.M., Digital Speech: Coding for Low Bit-Rate Communication Systems, 2nd edition, Wiley, New York, 2004.

5. Shanbehzadeh, J. and P. Ogunbona, On the computational complexity of the LBG and PNN algorithms, IEEE Transactions on Image Processing, 6(4), 614–616, 1997.

6. Equitz, W., A new vector quantization clustering algorithm, IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 1568–1575, 1989.


7. Juang, B.H. and A.H. Gray, Multiple-stage vector quantization for speech coding, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 597–600, 1982.

8. LeBlanc, W.P., B. Bhattacharya, S. Mahmoud, and V. Cuperman, Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding, IEEE Transactions on Speech and Audio Processing, 1(4), 373–385, 1993.

9. LeBlanc, W.P., Speech Coding at Low to Medium Bit Rates, PhD dissertation, Carleton University, Canada, 1992.

10. Paliwal, K. and B. Atal, Efficient vector quantization of LPC parameters at 24 bits/frame, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 661–664, 1991.

11. Paliwal, K. and B. Atal, Efficient vector quantization of LPC parameters at 24 bits/frame, IEEE Transactions on Speech and Audio Processing, 1(1), 3–14, 1993.

12. Moriya, T., Two-channel conjugate vector quantizer for noisy channel speech coding, IEEE Journal on Selected Areas in Communications, 10(5), 866–874, 1992.

13. Haoui, A. and D. Messerschmitt, Predictive vector quantization, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 9(Pt. 1), 420–423, 1984.

14. Rizvi, S.A. and N.M. Nasrabadi, Predictive vector quantizer using constrained optimization, IEEE Signal Processing Letters, 1(1), 15–18, 1994.

Bibliography

1. Childers, D., et al., The past, present and future of speech processing, IEEE Signal Processing Magazine, (May), 24–48, 1998.

2. Atal, B.S. and L.R. Rabiner, Speech research directions, AT&T Technical Journal, 65(5), 75–88, 1986.

3. Makhoul, J., S. Roucos, and H. Gish, Vector quantization in speech coding, Proceedings of the IEEE, 73(11), 1551–1588, 1985.

4. Buzo, A., A.H. Gray, and J.D. Markel, Speech coding based upon vector quantization, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(5), 562–574, 1980.

5. A Vector Quantization animation website. http://www.data-compression.com/vqanim.shtml

6. Painter, E. and A. Spanias, A MATLAB software tool for the introduction of speech coding fundamentals in a DSP course, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2, 1133–1136, 1996.

7. Chan, W.Y., S. Gupta and A. Gersho, Enhanced multistage vector quantization by joint codebook design, IEEE Transactions on Communications, 40(11), 1693–1697, 1992.

8. Chang, P.-C. and R.M. Gray, Gradient algorithms for designing predictive vector quantizers, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-34(4), 679–690, 1986.


9. Zeger, K., Corrections to gradient algorithms for designing predictive vector quantizers, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-39(3), 764–765, 1991.

10. Atal, B.S., V. Cuperman, and A. Gersho, eds, Advances in Speech Coding, Kluwer Academic Publishers, Norwell, MA, 1991.

11. Gersho, A. and R.M. Gray, Vector Quantization and Signal Compression, 4th printing, Kluwer Academic Publishers, Norwell, MA, 1995.

12. Atal, B.S., V. Cuperman, and A. Gersho, eds, Speech and Audio Coding for Wireless and Network Applications, Kluwer Academic Publishers, Norwell, MA, 1993.

13. Mathworks, Inc., A demo of the vector quantization, MATLAB Signal Processing Toolbox, version R2007b.

9 Analysis-by-Synthesis Coding of Speech

9.1 Introduction

In this chapter, we present the powerful analysis-by-synthesis (AbS) method of speech coding. It is a method of ensuring that the best possible excitation is chosen for a segment of speech, and many speech coders are based on it. The AbS method is a closed-loop method. The speech coders discussed in Chapter 7, for example, MPE-LPC, RELP, RPE-LPC, and modified RPE-LPC, are examples of AbS speech coders.

They analyze the speech input and extract parameters such as LPC coefficients, gain, pitch, etc., which are then quantized and transmitted or stored for synthesis later. In Figure 9.1, a basic block diagram of an AbS encoder is shown. A decoder is embedded in the encoder, so this is a closed-loop system. The parameters are extracted by the encoding and then decoded and used to synthesize the speech. The synthetic speech is compared with the original speech and the error is minimized (in a closed loop) to further choose the best parameters in the encoding. The minimization measures include the MSE, among others.

Typically only a few (not all) of the parameters are used for closed-loop optimization. An example of an AbS speech encoder is CELP, which was also briefly introduced in Chapter 7. The five major components of a CELP encoder are (i) excitation (stochastic) codebook, (ii) gain calculation, (iii) synthesis filter, (iv) error minimization, and (v) spectral analysis, as shown in Figure 9.2. A perceptual weighting filter can be added as needed. Also, it can incorporate a synthesis filter for both pitch and formant. Figure 9.3 shows an AbS closed loop of a CELP encoder with perceptual weighting and separate pitch and formant synthesis filters.

We note that the perceptual weighting filter is linear and can be moved to outside the feedback loop by using signal flow graph manipulation techniques as shown in Figure 9.4a and b. This may be done to reduce complexity. The resulting filter, called the modified formant synthesis filter (m.f.s.f.), is a merger of the formant synthesis filter and the perceptual weighting filter. In Figure 9.4b, this is 1/A(z/γ).


FIGURE 9.1 Block diagram of an encoder based on the AbS principles (incorporates a decoder).

9.2 CELP AbS Structure

The CELP coder is the most common and one of the most successful AbS speech coders. In Figure 9.5, at the encoder a segment of speech s(n) is synthesized using the linear prediction model 1/(1 − A(z)) along with a long-term redundancy predictor 1/(1 − P(z)) for all possible excitations from a codebook. For each excitation, an error signal is calculated and passed through a perceptual weighting filter W(z). This perceptually weighted error e(n) is minimized, and the excitation that produces the minimum perceptually weighted coding error is selected for use at the decoder. It is clear that the best excitation from all possible excitations for a given segment of speech is selected by synthesizing all possible representations at the encoder. That explains the name of the analysis-by-synthesis method. The predictor parameters and the excitation codeword index are sent to the receiver to decode the speech. The decoder is shown in Figure 9.6.

FIGURE 9.2 Major components of a CELP encoder.

FIGURE 9.3 AbS closed loop of a CELP encoder with perceptual weighting.

FIGURE 9.4 Equivalent realizations of the CELP encoder with perceptual weighting: (a) original realization and (b) realization using modified formant synthesis filter (m.f.s.f.).

FIGURE 9.5 A CELP encoder based on the AbS principles (incorporates a decoder).
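The closed-loop search just described can be sketched in a greatly simplified form: for each candidate excitation, synthesize the segment through the synthesis filter, weight the error, and keep the codevector and gain with the smallest weighted squared error. The Python/SciPy sketch below is a bare-bones editorial illustration (no long-term predictor, identity weighting filter by default, per-codevector optimal gain, and filter memory between segments is ignored); it is not the structure of any particular standard:

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(s, codebook, a_lpc, w_num=(1.0,), w_den=(1.0,)):
    """Select the excitation codevector and gain minimizing the weighted error.

    s        : speech segment (length = subframe length)
    codebook : (L, len(s)) candidate excitation vectors
    a_lpc    : LPC coefficients a_1..a_M of the formant synthesis filter 1/(1 - A(z))
    w_num, w_den : perceptual weighting filter W(z) numerator/denominator (identity by default)
    """
    syn_den = np.concatenate(([1.0], -np.asarray(a_lpc)))   # 1/(1 - A(z)) as an all-pole filter
    sw = lfilter(w_num, w_den, s)                           # weighted target signal
    best = (np.inf, 0, 0.0)
    for i, c in enumerate(codebook):
        y = lfilter([1.0], syn_den, c)                      # synthetic speech for this excitation
        yw = lfilter(w_num, w_den, y)
        g = np.dot(sw, yw) / (np.dot(yw, yw) + 1e-12)       # optimal gain for this codevector
        err = np.sum((sw - g * yw) ** 2)                    # weighted squared error
        if err < best[0]:
            best = (err, i, g)
    return best[1], best[2]                                 # index and gain to transmit
```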

9.3 Case Study Example: FS 1016 CELP Coder

An example of a CELP-based speech coder is the FS1016 speech coder. The coder operates at 4.8 kbps, was standardized in 1991, and was jointly developed by AT&T Bell Labs and the U.S. Department of Defense (DoD). The coder produces reasonably good speech quality and is robust to channel errors, acoustic noise, and other network impairments such as tandeming. The basic structure of the FS1016 CELP speech coder is discussed here. Its encoder is shown in Figure 9.7, whereas the decoder is shown in Figure 9.8.

For the encoder, the PCM speech is input to the frame/subframe segmentation block, where the speech is segmented into 30 ms frames and four 7.5 ms subframes. Short-term LPA is performed on each frame obtained from the pre-emphasized input speech. The resulting 10th-order LPC coefficients are encoded, yielding the LPC index. Adaptive codebook is used for this purpose.

FIGURE 9.6 A CELP decoder based on the AbS principles.

The length of the excitation codevector is the same as that of the subframe. The excitation sequence is determined by searching the excitation codebook once every subframe. The search procedure includes the generation of many filtered excitation sequences and their corresponding gains. Then the weighted MSE is computed for each sequence, and the codevector and gain associated with the sequence with the lowest error are selected.

FIGURE 9.7 Block diagram of the FS1016 CELP encoder.

The stochastic codebook gain index is encoded in the stochastic codebook output. The pitch period index is encoded in the adaptive codebook output. A total response is fed back to the parameter update and codebooks.

The output of the encoder comprises the parameters: the stochastic codebook index, the stochastic codebook gain index, the LPC index, the pitch period index, and the adaptive codebook gain index. These are then transmitted to the receiver side.

A standard excitation codebook is nonoverlapping and consists of L N-dimensional codevectors, giving an L × N memory matrix. Therefore, the number of memory words required is LN (Figure 9.9a). Most of the elements of an overlapping codebook (of two consecutive codewords) are common. Therefore, for an overlapping codebook with a shift value of S, the amount of memory required is S(L − 1) + N (Figure 9.9b). Also, it is possible to use recursive convolution methods, which leads to reduced computational complexity.

The FS1016 speech coder uses an overlapping codebook with ternary-valued samples [1,2]. This reduces computation because it allows the use of the recursive convolution methods [2,3]. The FS1016 coder uses 9 bits for indexing the stochastic codebook, which means it uses L = 2^9 = 512 codevectors. An overlap of S = 2 is used. The dimension of the codebook is N = 60. Therefore the memory required is S(L − 1) + N = 2 × 511 + 60 = 1082 locations instead of 30,720 locations for a nonoverlapping codebook.

FIGURE 9.8 Block diagram of the FS1016 CELP decoder.

At the decoder, the CELP bit stream is divided into the stochastic codebook index, the stochastic codebook gain index, the LPC index, the pitch period index, and the adaptive codebook gain index. Here also, there are two codebooks: the stochastic codebook and the adaptive codebook. The stochastic codebook gain index is decoded and used to multiply the stochastic codebook output. The LPC index is decoded and interpolated. The pitch period index is also decoded and used for the adaptive codebook.

FIGURE 9.9 (a) Nonoverlapping and (b) overlapping codebooks.

The perceptual weighting filter reduces noise components in speech spectral valleys and increases noise components at speech spectral peaks. A postfilter is used to further enhance the speech quality. This postfilter increases the subjective quality of speech by lowering the components in the speech spectral valleys. It consists of an HPF, which is a first-order filter like the pre-emphasis filter. The resulting overall postfilter is given by

H(z) = (1 − μz⁻¹) · [1 + ∑_{i=1}^{M} a_i β^i z^{−i}] / [1 + ∑_{i=1}^{M} a_i α^i z^{−i}],

where the values of α and β are determined based on subjective listening test results for each speech coding standard. The other parameters a_i are the LPC coefficients, and μ is the pre-emphasis filter coefficient.
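As an illustration of this formula (an editorial sketch, not FS1016 reference code; the α, β, and μ values below are arbitrary placeholders), the postfilter can be applied to a decoded signal as follows:

```python
import numpy as np
from scipy.signal import lfilter

def postfilter(x, lpc, alpha=0.8, beta=0.5, mu=0.5):
    """Apply H(z) = (1 - mu z^-1) * [1 + sum a_i beta^i z^-i] / [1 + sum a_i alpha^i z^-i]."""
    a = np.asarray(lpc)                            # a_1 .. a_M, the LPC coefficients
    i = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], a * beta ** i))   # numerator 1 + sum a_i beta^i z^-i
    den = np.concatenate(([1.0], a * alpha ** i))  # denominator 1 + sum a_i alpha^i z^-i
    y = lfilter(num, den, x)                       # short-term (pole-zero) postfilter
    return lfilter([1.0, -mu], [1.0], y)           # first-order HPF (1 - mu z^-1)
```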

The FS1016 CELP standard tries to improve the long-term predictor (pitch synthesis filter) by the closed-loop AbS method to get better coding efficiency.


The parameters of the LTP are

i. Pitch period
ii. Long-term gain

We could perform long-term LPA on the STP error. The energy of the overall prediction error is minimized. The resulting parameters reflect the statistical property of the particular signal subframe. This is not the goal of CELP. The goal of CELP is the minimization of the perceptually weighted difference between the input speech and the synthetic speech. Therefore, we need to directly minimize this weighted difference for each subframe. We need to obtain the optimal parameters: pitch period, long-term gain, excitation (stochastic) vector, and excitation (stochastic) gain.

This is done in two steps:

a. Assume zero-input gain and then compute the predictor parameters to minimize the error.

b. The LTP is held constant and the optimal excitation plus gain is determined.

The resulting method requires a large number of computations, which makes it impractical [1,2,4]. This happens especially when the pitch period is less than the length of the subframe. The efforts to reduce the computational complexity lead to the concept of the “adaptive” codebook. This is done by still minimizing the same weighted difference through the closed-loop AbS method. It is achieved by using a periodic extension of the history of the pitch synthesis filter output for the case when the pitch period is less than the length of the subframe. The limitation of this method is that the pitch pulses in a subframe have the same amplitude from one pulse to another when the pitch period is less than the length of the subframe. This has little or no effect on the subjective quality of the speech.

The optimal pitch period is determined from the codebook with overlapping codevectors. The codebook is changed from subframe to subframe and is therefore called “adaptive.”

The bit allocation of the FS1016 CELP coder is shown in Table 9.1. From this table, we see that we require 144 bits per 30 ms frame. This means the rate for the FS1016 is 4.8 kbps. This is twice the 2.4 kbps rate for the FS1015 coder. The FS1016 standard specifies the actual stochastic codebook contents of the 1082 locations [2]. It is derived from a zero-mean unit-variance Gaussian signal that is center-clipped at 1.2 and quantized into the ternary levels {−1, 0, +1}.

TABLE 9.1 Bit Allocation for the FS1016 CELP Coder

Parameter | Number per Frame | Resolution (bits) | Total Bits per Frame
LPC | 10 | 3,4,4,4,4,3,3,3,3,3 | 34
Pitch period (adaptive-codebook index) | 4 | 8,6,8,6 | 28
Adaptive-codebook gain | 4 | 5 | 20
Stochastic-codebook index | 4 | 9 | 36
Stochastic-codebook gain | 4 | 5 | 20
Synchronization | 1 | 1 | 1
Error correction | 4 | 1 | 4
Future expansion | 1 | 1 | 1
Total | | | 144

The FS1016 searches the adaptive codebook range between the indices Tmin = 20 and Tmax = 147. This requires that we allocate 7 bits to this index. It also introduces a fractional delay in the search of the adaptive codebook [5]. For speech signals sampled at 8000 Hz, integer pitch periods have a resolution of 1/8000 s = 0.125 ms. In order to improve the resolution of the pitch period, fractional pitch periods are additionally estimated using interpolation methods [5].

The problem of pitch multiplication is also addressed by the addition of a fractional part to the integer pitch periods. Pitch multiplication occurs when the pitch period is small, so that multiples of the pitch period also fall within the search range. In this case, it is possible that the first peak in the autocorrelation is missed and a multiple of the pitch period is chosen, thereby generating a lower-pitched voice signal. It has been shown that using fractional delay we can improve the average prediction gain by about 1.5–2.5 dB [6].

FS1016 uses 256 pitch periods in the interval {20, 147}. This requires an 8-bit table look-up for both the integer and fractional parts of the pitch period. The resolutions used are as follows:

i. 1/3 for {20, 25 2/3} and for {34, 79 2/3}
ii. 1/4 for {26, 33 3/4}
iii. 1 for {80, 147}

Note that the smallest fractions are used for the lowest pitch periods (e.g., for female speakers). Note also that 8 bits are used for the first and third subframes, while the second and fourth subframes use only 6 bits for a relative shift with respect to the pitch period of the prior subframe, to reduce computational requirements and improve efficiency of the codebook search. This is possible because pitch periods of adjacent subframes are not significantly different.

FIGURE 9.10 Automatic gain control incorporated into the postfilter.

The gain of the adaptive codebook is encoded with 5 bits. The automatic gain control for the postfilter of the FS1016 is shown in Figure 9.10. There are two gain estimators that estimate the power of the noisy input speech s(n) and the filtered signal x(n) using first-order smoothing filters:

σ_s²(n) = ζ σ_s²(n − 1) + (1 − ζ) s²(n) and

σ_x²(n) = ζ σ_x²(n − 1) + (1 − ζ) x²(n).

A choice of ζ = 0.99 gives good enough adaptation for speech sampled at 8 kHz without risk of distortion. The gain ρ(n) is computed after each sample using the formula:

ρ(n) = sqrt( σ_s²(n) / σ_x²(n) ).

The enhanced speech output is then given by

y(n) = ρ(n)x(n).

The square root operation can be eliminated by using the following set of equations instead:

σ_s(n) = ζ σ_s(n − 1) + (1 − ζ)|s(n)|,
σ_x(n) = ζ σ_x(n − 1) + (1 − ζ)|x(n)|, and
ρ(n) = σ_s(n)/σ_x(n).
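A direct Python rendering of this square-root-free gain control (ζ = 0.99 as in the text; the zero-guard and the rest of the scaffolding are editorial assumptions) is:

```python
import numpy as np

def agc(s, x, zeta=0.99):
    """Automatic gain control: scale the postfiltered signal x to track the level of s."""
    sigma_s = sigma_x = 0.0
    y = np.zeros(len(x))
    for n in range(len(x)):
        sigma_s = zeta * sigma_s + (1 - zeta) * abs(s[n])   # gain estimate for s(n)
        sigma_x = zeta * sigma_x + (1 - zeta) * abs(x[n])   # gain estimate for x(n)
        rho = sigma_s / sigma_x if sigma_x > 0 else 1.0     # gain factor (guard against zero)
        y[n] = rho * x[n]                                   # enhanced speech sample
    return y
```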

9.4 Case Study Example: ITU-T G.729/729A Speech Coder

The ITU G.729A is a popular, robust speech codec designed for low-bit-rate applications [3,7]. It uses many of the advanced speech coding techniques discussed in this and earlier chapters. Now we present this standard in detail as a case study of a modern speech coder that is widely deployed.

The G.729 speech coder is an 8 kbps conjugate-structure algebraic-code-excited linear prediction (CS-ACELP) speech compression algorithm approved by ITU-T. The G.729 offers high-quality, robust speech performance at the price of complexity. It requires 10 ms input frames and generates frames of 80 bits in length. With the G.729 coder processing signals in 10 ms frames and a 5 ms lookahead, the total algorithmic delay is 15 ms.

Each 80-bit frame produced contains linear prediction coefficients, excitation (stochastic) codebook indices, and gain parameters that are used by the decoder in order to reproduce speech. The input to this algorithm is 16-bit linear PCM samples converted from the original speech. The output from this algorithm is an 8 kbps compressed data stream.

Another ITU-T standard, the G.729/A (Annex A), specifies a reduced-complexity G.729 speech coder with several simplifications, involving the codebook search routines and the decoder postfilter, among others. These modifications may result in slightly lower voice quality in certain circumstances. Both speech coders can operate interchangeably at the same rate.

G.729 Annex B defines a G.729/B speech coder that uses discontinuous transmission (DTX), voice activity detection (VAD), and comfort noise generation (CNG) to reduce bandwidth usage by preventing the transmission of any nonvoice frames during periods of silence. Other related standards are G.729 Annex D, which uses 6.4 kbps CS-ACELP compression, and G.729 Annex E, which uses 11.8 kbps CS-ACELP compression.

G.729 is designed and optimized to work in conjunction with Recommendation V.70. Recommendation V.70 mandates the use of Annex A/G.729 (G.729A) speech coding methods. However, when necessary, the full version of Recommendation G.729 can also be used to slightly improve the quality of the speech.

In Figure 9.11, we show the general block diagram of a CS-ACELP coder. It uses an excitation generator to produce the excitation for a synthesis filter that consists of both long-term and short-term synthesis filters. In Figure 9.12, we show a fixed and an adaptive codebook for generation of the excitation to the synthesis filter.

The postprocessing block in Figure 9.12 is shown in more detail in Figure 9.13. It uses the long-term postfilter, the short-term postfilter, and an automatic gain controller to output an enhanced speech signal from a noisy speech input. More details will be discussed later.

9.4.1 The ITU G.729/G.729A Speech Encoder

The initial goal of the ITU Study Group 15 (SG 15) was to standardize a speech coder algorithm at 8 kbps capable of the same or better speech quality than the ADPCM ITU standard G.726, which runs at 32 kbps, for many operating conditions. These conditions include noisy speech, multiple encodings, audio (and other nonspeech) inputs, transmission errors such as random or bursty errors, and even lost frames. This means a reduction by a factor of 4 in bandwidth for the same quality of speech.

FIGURE 9.11 Block diagram of the conceptual CELP synthesis model. (From ITU-T Recommendation G.729, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I. With permission.)

+1

1–A(z)Post-

processing

Adaptivecodebook

Feedback for pitchanalysis

Fixedcodebook

Synthesisfilter

Adaptivecodeboook

gain

Fixedcodeboook

gain

FIGURE 9.12 General block diagram of CS-ACELP coder with adaptive and fixed codebooks.


FIGURE 9.13 General structure of the postprocessing filter.

The original applications include personal communication systems,satellite communication systems, and secure networks. The final standard,ratified in November 1995, was based on a combination of desired featuresfrom two competing proposals. It was based on the CELP (discussed inChapter 7) and is called the CS-ACELP coder.

The input speech signal is assumed to be sampled at 8000 Hz at16 bits/sample as in linear PCM. This results in a 128-kbps input. However,the coder operates on a frame of 10 ms (i.e., 80 samples). It uses a lookaheadof 5 ms (i.e., 40 samples) for LPA. This results in an overall low delay of 15 ms,which is acceptable and not very annoying.

An HPF with a cutoff of 140 Hz is used to preprocess the 16-bit PCMsamples. Then a 10th-order LPA is performed, and the LPC parame-ters are converted into line spectral frequency (LSF) and quantized using18 bits/frame.

Recall that speech is a nonstationary signal. Therefore, subframes are usedin the processing to enable better tracking of pitch and gain parameters.Another reason is to reduce the complexity of the coder, especially in theVQ codebook search algorithm. The 10 ms input frame is subdivided intotwo subframes of 5 ms each.

Excitation (stochastic) codebook and adaptive-codebook search (the mostcomputational intensive of all arithmetic operations) of the ITU-T G.729 isdone using combinations of open and closed-loop searches. This reduces theoverall computational cost. The principles of the CS-ACELP encoder and

Page 236: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

220 Principles of Speech Coding

Inputspeech Preprocessing

LP analysisquantizationinterpolation

Fixedcodebook

Adaptivecodebook

Pitchanalysis

LPC information

Perceptualweighting

Fixed CBsearch

Gainquantization

Parameterencoding

Transmittedbitstream

LPC information G.729(07)_F02

LPC information

Synthesisfilter

++GC

GP

FIGURE 9.14 Principles of the CS-ACELP encoder. (From ITU-T Recommendation G.729, Cod-ing of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction(CS-ACELP), January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I. With permission.)

decoder are shown in Figures 9.14 and 9.15, respectively. Now we describethe major blocks of the G.729 and 729/A standard encoder and decoder.

9.4.1.1 The ITU G.729 Encoder Details

The detailed signal flow for the CS-ACELP encoder is shown in Figure 9.16. Itis evident which functions are performed by frame and which functions are


[Figure 9.15: the adaptive and fixed codebook vectors, scaled by the gains GP and GC, are summed and passed through the short-term (synthesis) filter and a postprocessing stage.]

FIGURE 9.15 Principles of the CS-ACELP decoder. (From ITU-T Recommendation G.729, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I. With permission.)

performed per subframe. The sublabels on each box correspond to the subsections of the encoder description in the ITU document for the G.729 and G.729A standards.

The 80 bits generated by the G.729 encoder for each 10 ms frame are allocated as shown in Table 9.2. For some of the parameters in this table, the number of bits refers to a 5 ms subframe instead.

The excitation is represented by an adaptive-codebook and a fixed (stochastic)-codebook contribution. The adaptive- and fixed-codebook parameters are transmitted per subframe. The adaptive-codebook component represents the periodicity in the excitation signal using a fractional pitch lag with 1/3 sample resolution. The adaptive codebook is searched using a two-step procedure. An open-loop pitch lag is estimated once per frame based on the perceptually weighted speech signal. The adaptive-codebook index and gain are found by a closed-loop search around the open-loop pitch lag. The signal to be matched, referred to as the target signal, is computed by filtering the LP residual through the weighted synthesis filter.

The adaptive-codebook index is encoded with 8 bits in the first subframe and differentially encoded with 5 bits in the second subframe. The target signal is updated by removing the adaptive-codebook contribution, and this new target is used in the fixed-codebook search. The fixed codebook is a 17-bit algebraic codebook [3,4,7]. The gains of the adaptive and fixed codebooks are vector quantized with 7 bits using a conjugate-structure codebook [3] (with moving-average (MA) prediction applied to the fixed-codebook gain). The bit allocation for a 10 ms frame is shown in Table 9.2.

The decoder decodes the transmitted parameters (LP parameters, adaptive-codebook vector, fixed-codebook vector, and gains) and performs synthesis to obtain the reconstructed speech, followed by a postprocessing stage consisting of an adaptive postfilter and a fixed HPF.

[Figure 9.16: per-frame operations (preprocessing by high-pass filtering and downscaling; windowing and autocorrelation; Levinson-Durbin recursion; A(z) to LSP conversion, LSP quantization, and interpolation back to A(z) and Â(z); open-loop pitch search) and per-subframe operations (computation of the weighted speech, impulse response h(n), and target signals; closed-loop pitch search for the adaptive codebook; algebraic fixed-codebook search for the codeword Ck maximizing (d^T Ck)^2 / (Ck^T Φ Ck); conjugate-structure gain VQ with MA code-gain prediction; excitation computation and filter-state update), producing the indices L0-L3, P0-P2, C1, S1, C2, S2, GA1, GB1, GA2, and GB2.]

FIGURE 9.16 Details of the signal flow at the CS-ACELP encoder. (From ITU-T Recommendation G.729, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I. With permission.)


TABLE 9.2

Bit Allocation of the 8 kbps G.729 CS-ACELP Coder (10 ms Frame)

Parameter                   Codeword          Subframe 1   Subframe 2   Total per Frame
LSPs                        L0, L1, L2, L3    -            -            18
Adaptive-codebook delay     P1, P2            8            5            13
Pitch-delay parity          P0                1            -            1
Fixed-codebook index        C1, C2            13           13           26
Fixed-codebook sign         S1, S2            4            4            8
Codebook gains (stage 1)    GA1, GA2          3            3            6
Codebook gains (stage 2)    GB1, GB2          4            4            8
Total number of bits                                                    80

9.4.1.1.1 Preprocessing

The input signal s(n) to the encoder is assumed to be a 16-bit PCM signal. The preprocessing consists of two functions: (i) signal scaling and (ii) high-pass filtering. The PCM signal is scaled by dividing it by a factor of 2 in order to reduce the possibility of overflow in fixed-point implementations. The HPF is a second-order pole-zero filter with a cutoff frequency of 140 Hz. The combination of these two functions results in the preprocessing filter,

H_{h1}(z) = \frac{0.46363718 - 0.92724705\,z^{-1} + 0.46363718\,z^{-2}}{1 - 1.9059465\,z^{-1} + 0.9114024\,z^{-2}}.
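As a quick illustration, the sketch below (MATLAB, with a placeholder input signal) applies this combined preprocessing filter to one frame of samples; the coefficient values are taken directly from the equation above.

```matlab
% Hedged sketch: apply the G.729 preprocessing filter Hh1(z) (input scaling
% by 1/2 folded together with the 140 Hz high-pass) to one frame of samples.
b = [0.46363718, -0.92724705, 0.46363718];   % numerator of Hh1(z)
a = [1, -1.9059465, 0.9114024];              % denominator of Hh1(z)
s = randn(1, 80);            % placeholder 10 ms frame (80 samples at 8 kHz)
s_pre = filter(b, a, s);     % preprocessed samples passed on to the LP analysis
```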

9.4.1.1.2 LPA and Quantization

The LP synthesis filter is

\frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{10} \hat{a}_i z^{-i}},

where âi, i = 1, 2, . . . , 10, are the quantized LP coefficients. LPA is performed once per speech frame using the autocorrelation method with a 30-ms asymmetric window. Every 80 samples (10 ms), the autocorrelation coefficients of the windowed speech are computed and converted to the LP coefficients using the LD algorithm (see Chapter 6). Then the LP coefficients are transformed to the LSP domain for quantization and interpolation purposes. The interpolated quantized and unquantized filters are converted back to LP filter coefficients (to construct the synthesis and weighting filters for each subframe). We now give more details of these operations.

9.4.1.1.2.1 Windowing and Autocorrelation Computation The window used in G.729 is asymmetric and consists of two parts: a half-Hamming window and a quarter cycle of a cosine function.

w_{LP}(n) = \begin{cases} 0.54 - 0.46\cos\!\left(\dfrac{2\pi n}{399}\right), & n = 0, 1, \ldots, 199, \\[4pt] \cos\!\left(\dfrac{2\pi (n-200)}{159}\right), & n = 200, 201, \ldots, 239. \end{cases}

Each LP window is 30 ms (240 samples) long. Each subframe is 5 ms (40 samples) long. Each speech frame is 10 ms long and is divided into two subframes, Sf0 and Sf1 (Figure 9.17). The LP window is applied to 80 samples from the current frame, 120 samples from the previous frame, and 40 samples from the future frame (for a total of 240 samples, or 30 ms). Using a 30 ms window yields a smoother LP filter estimate and therefore better speech quality. The autocorrelation is computed on the windowed speech using the formula,

r(k) = \sum_{n=k}^{239} w_{LP}(n)\,s(n)\,w_{LP}(n-k)\,s(n-k), \quad k = 0, 1, \ldots, 10.

Note that r(0) is lower-bounded by 1.0. A 60-Hz bandwidth expansion is applied to the autocorrelation coefficients through the factor,

w_{lag}(k) = \exp\!\left[-\frac{1}{2}\left(\frac{2\pi f_0 k}{f_s}\right)^{2}\right], \quad k = 1, 2, \ldots, 10,

where f0 = 60 Hz is the bandwidth expansion and fs = 8000 Hz is the sampling frequency. This reduces the possibility of ill-conditioning in the Levinson algorithm and also reduces underestimation of the formant bandwidths, which could create undesirably sharp resonances.

In addition, r(0) is multiplied by a white-noise correction factor of 1.0001, which is equivalent to adding a noise floor at −40 dB:

r'(0) = 1.0001\, r(0), \qquad r'(k) = w_{lag}(k)\, r(k), \quad k = 1, 2, \ldots, 10.
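A minimal MATLAB sketch of these steps is given below; the speech buffer sbuf is a placeholder for the 240 samples spanned by the LP window, and everything else follows the window, lag-window, and white-noise-correction formulas above.

```matlab
% Hedged sketch: windowed autocorrelation with bandwidth expansion and
% white-noise correction, as described above.
sbuf = randn(1, 240);                          % placeholder 30 ms analysis buffer
n1 = 0:199;  n2 = 200:239;
wlp = [0.54 - 0.46*cos(2*pi*n1/399), cos(2*pi*(n2 - 200)/159)];  % asymmetric LP window
swin = wlp .* sbuf;                            % windowed speech
r = zeros(1, 11);
for k = 0:10
    r(k+1) = sum(swin(k+1:240) .* swin(1:240-k));   % r(k)
end
r(1) = max(r(1), 1.0);                         % r(0) lower-bounded by 1.0
f0 = 60;  fs = 8000;
wlag = exp(-0.5 * (2*pi*f0*(1:10)/fs).^2);     % 60 Hz bandwidth-expansion factors
rp = [1.0001*r(1), wlag .* r(2:11)];           % r'(0..10), input to the LD algorithm
```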

[Figure 9.17: the previous, current, and next 30 ms LP analysis windows shown overlapping the sequence of 5 ms subframes.]

FIGURE 9.17 Positions of frame and LPA windows for the G.729 coder.


These modified autocorrelation coefficients are used to obtain the LP filter coefficients ai, i = 1, 2, . . . , 10.

9.4.1.1.2.2 LD Algorithm The modified autocorrelation coefficients given above are used to compute the LP coefficients by solving the set of equations,

\sum_{i=1}^{10} a_i\, r'(|i-k|) = -r'(k), \quad k = 1, 2, \ldots, 10,

using the LD algorithm (see Chapter 6). This leads to the following recursion:

E^{(0)} = r'(0)
for i = 1 to 10
    a_0^{(i-1)} = 1
    k_i = -\left[\sum_{j=0}^{i-1} a_j^{(i-1)} r'(i-j)\right] / E^{(i-1)}
    a_i^{(i)} = k_i
    for j = 1 to i - 1
        a_j^{(i)} = a_j^{(i-1)} + k_i\, a_{i-j}^{(i-1)}
    end
    E^{(i)} = (1 - k_i^2)\, E^{(i-1)}
end

The final solution is a_j = a_j^{(10)}, j = 0, 1, \ldots, 10, with a_0 = 1.0.
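The recursion translates directly into code. The MATLAB sketch below computes the autocorrelation of a placeholder windowed signal and then runs the LD recursion above; only the recursion itself is meant to be illustrative.

```matlab
% Hedged sketch of the Levinson-Durbin recursion described above.
s = sin(2*pi*0.06*(0:239)) + 0.01*randn(1, 240);   % placeholder windowed speech
rp = zeros(1, 11);
for k = 0:10, rp(k+1) = sum(s(k+1:end) .* s(1:end-k)); end   % r'(0..10)
p = 10;  a = [1, zeros(1, p)];  E = rp(1);         % a_0 = 1, E(0) = r'(0)
for i = 1:p
    acc = 0;
    for j = 0:i-1
        acc = acc + a(j+1) * rp(i-j+1);            % sum_j a_j^(i-1) r'(i-j)
    end
    k = -acc / E;                                  % reflection coefficient k_i
    aprev = a;
    a(i+1) = k;                                    % a_i^(i) = k_i
    for j = 1:i-1
        a(j+1) = aprev(j+1) + k*aprev(i-j+1);      % a_j^(i) = a_j^(i-1) + k_i a_{i-j}^(i-1)
    end
    E = (1 - k^2) * E;                             % prediction error E(i)
end
% a(2:11) now holds a_1..a_10 of A(z) = 1 + sum_i a_i z^-i
```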

9.4.1.1.2.3 LPC to LSP Coefficients The LPC coefficients of G.729 are converted into LSP coefficients for quantization and interpolation purposes. The advantages of the LSP representation are the following. (i) The LSP represents the LPC coefficients in the frequency domain, so that scalar quantization can provide optimal results. (ii) The LSP makes it easier to check whether the filter is stable. (iii) The LSP coefficients are bounded between 0 and π, which makes quantization convenient.

The LSP coefficients are the roots of the sum and difference polynomials given by

F_1'(z) = A(z) + z^{-(M+1)} A(z^{-1}),
F_2'(z) = A(z) - z^{-(M+1)} A(z^{-1}),

where F_1'(z) and F_2'(z) are symmetric and antisymmetric, respectively. Typically M = 10 is used for a 10th-order A(z) filter. When A(z) is minimum phase, the zeros of F_1'(z) and F_2'(z) lie on the unit circle and are interlaced. F_1'(z) has a root at z = -1 (ω = π) and F_2'(z) has a root at z = 1 (ω = 0). We eliminate these roots by defining two new polynomials:

F_1(z) = \frac{F_1'(z)}{1 + z^{-1}}, \qquad F_2(z) = \frac{F_2'(z)}{1 - z^{-1}},

each of which has five conjugate pairs of roots on the unit circle. Therefore, we can write

F_1(z) = \prod_{i=1,3,\ldots,9} \left(1 - 2 q_i z^{-1} + z^{-2}\right), \qquad F_2(z) = \prod_{i=2,4,\ldots,10} \left(1 - 2 q_i z^{-1} + z^{-2}\right),

where q_i = cos(ω_i). The coefficients ω_i are the LSFs, and they satisfy the ordering 0 < ω_1 < ω_2 < · · · < ω_10 < π. The coefficients q_i are called the LSP coefficients in the cosine domain. The coefficients of the polynomials F_1(z) and F_2(z) can be computed recursively.
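Under the assumption (stated above) that A(z) is minimum phase, the LSFs can also be found numerically by building F_1'(z) and F_2'(z), dividing out the fixed roots, and taking the angles of the remaining unit-circle roots. The MATLAB sketch below does this for an example coefficient set; it is an illustration, not the recursive polynomial evaluation used in the standard.

```matlab
% Hedged sketch: LPC -> LSF conversion via the sum/difference polynomials.
a   = [-1.286 1.138 -1.047 0.691 -0.304 0.373 -0.071 0.012 0.048 0.064];  % example a_1..a_10
A   = [1, a];                            % A(z), ascending powers of z^-1
F1p = [A, 0] + [0, fliplr(A)];           % F1'(z) = A(z) + z^-(M+1) A(z^-1)
F2p = [A, 0] - [0, fliplr(A)];           % F2'(z) = A(z) - z^-(M+1) A(z^-1)
F1  = deconv(F1p, [1  1]);               % remove the root at z = -1
F2  = deconv(F2p, [1 -1]);               % remove the root at z = +1
r1  = roots(F1);  r2 = roots(F2);        % five conjugate pairs each (on the unit circle)
w   = sort([angle(r1(imag(r1) > 0)); angle(r2(imag(r2) > 0))]);  % LSFs in (0, pi)
q   = cos(w);                            % LSP coefficients in the cosine domain
```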

For G.729A, the number of points at which the polynomials F_1(z) and F_2(z) are evaluated is reduced to 50 (instead of 60), and the sign-change interval is divided 2 times instead of 4 times when tracking a root of the polynomial.

9.4.1.1.2.4 Quantization of LSP Coefficients The LSP coefficients of G.729, q_i, are then converted to the normalized frequency domain using the LSF representation, ω_i ∈ [0, π], by using the formula,

ω_i = \arccos(q_i), \quad i = 1, 2, \ldots, 10.

We then predict the LSF coefficients of the current frame using a switched fourth-order MA predictor whose coefficients are p_{i,k}, k = 1, 2, . . . , 4. The difference between the computed and predicted coefficients is quantized using a two-stage vector quantizer (as described in Chapter 8). The first stage is a 10-dimensional VQ using codebook L1 with 128 entries (7 bits). The second stage uses 10 bits and is implemented as a split VQ using two five-dimensional codebooks, L2 and L3, each containing 32 entries (5 bits each).

Therefore, each coefficient is obtained from the two codebooks by using the formula,

l_i = \begin{cases} L1_i(L1) + L2_i(L2), & i = 1, 2, \ldots, 5, \\ L1_i(L1) + L3_{i-5}(L3), & i = 6, 7, \ldots, 10, \end{cases}


where L1, L2, and L3 are the indices of the respective codebooks. The coefficients l_i are rearranged to guarantee a minimum distance of J using the following algorithm:

for i = 2 to 10
    if (l_{i-1} > l_i - J)
        l_{i-1} = (l_i + l_{i-1} - J)/2
        l_i = (l_i + l_{i-1} + J)/2
    end
end

Using the two different values J = 0.0012 and J = 0.0006, the rearrangement is done twice. Then the quantized LSF coefficients ω_i^{(m)} for the current frame m are obtained from the weighted sum of the previous quantizer outputs l_i^{(m-k)} and the current quantizer output l_i^{(m)} using the formula,

\omega_i^{(m)} = \left(1 - \sum_{k=1}^{4} p_{i,k}\right) l_i^{(m)} + \sum_{k=1}^{4} p_{i,k}\, l_i^{(m-k)}, \quad i = 1, 2, \ldots, 10.

The choice of which MA predictor to use is determined by bit L0. At the algorithm's start-up, the values of l_i^{(k)} are initialized to l_i = iπ/11 for all k < 0. The values of the quantized LSF coefficients ω_i are checked for stability as follows:

i. Order the quantized LSF coefficients ω_i in increasing order.
ii. If ω_1 < 0.005, then ω_1 = 0.005.
iii. If ω_{i+1} − ω_i < 0.0391, then ω_{i+1} = ω_i + 0.0391, i = 1, 2, . . . , 9.
iv. If ω_10 > 3.135, then ω_10 = 3.135.

The LSF parameters are then encoded. Details of the method are specified in Reference [7].

9.4.1.1.2.5 Interpolation of LSP Coefficients The quantized (and unquantized) LP coefficients are used for the second subframe. For the first subframe, the quantized (and unquantized) LP coefficients are obtained by linear interpolation of the corresponding parameters in the adjacent subframes. The interpolation is done on the LSP coefficients in the cosine domain.


For each of the two subframes, we use the algorithm,

Subframe 1: q_i^{(1)} = 0.5\, q_i^{(previous)} + 0.5\, q_i^{(current)}, \quad i = 1, 2, \ldots, 10,
Subframe 2: q_i^{(2)} = q_i^{(current)}, \quad i = 1, 2, \ldots, 10,

where q_i^{(current)} are the LSP coefficients for the current 10 ms frame and q_i^{(previous)} are the LSP coefficients for the previous 10 ms frame. For G.729A, the procedure is similar except that only the quantized LP coefficients are interpolated, since the weighting filter uses the quantized parameters for simplicity.

9.4.1.1.2.6 LSP to LPC Coefficients After the LSP coefficients are quantized and interpolated, they are converted back to LPC coefficients using recursive algorithms. Details of the method are specified in Reference [7].

9.4.1.1.3 Perceptual Weighting

For the G.729 standard, the perceptual weighting filter is based on the unquantized LP filter coefficients a_i and is given by

W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 + \sum_{i=1}^{10} \gamma_1^{\,i} a_i z^{-i}}{1 + \sum_{i=1}^{10} \gamma_2^{\,i} a_i z^{-i}}.

The values of γ1 and γ2 are adapted to give the weighting filter an acceptable spectral shape. This is done once per 10 ms frame, and an interpolation procedure is applied once per 5 ms subframe to smooth the adaptation.

Details of the method used to determine the values of γ1 and γ2 are specified in Reference [7].

However, for G.729A, the perceptual weighting filter is based on the quantized LP filter coefficients a_i and is given by

W(z) = \frac{A(z)}{A(z/\gamma)},

with γ = 0.75. This simplifies the combination of the synthesis and weighting filters to W(z)/A(z) = 1/A(z/γ), which reduces the number of filtering operations needed to compute the impulse response and the target signal and to update the filter states. The value of γ is fixed at 0.75, and the adaptation of the perceptual-weighting factors described for G.729 is not used in this reduced-complexity version.
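A small MATLAB sketch of this simplification is shown below: the coefficients of A(z/γ) are obtained by scaling each a_i by γ^i, and one filtering operation through 1/A(z/γ) then plays the role of the combined weighting and synthesis filters. The coefficient vector and input are placeholders.

```matlab
% Hedged sketch: build A(z/gamma) with gamma = 0.75 and apply 1/A(z/gamma).
a     = [-1.286 1.138 -1.047 0.691 -0.304 0.373 -0.071 0.012 0.048 0.064];  % example a_i
gamma = 0.75;
Aw    = [1, a .* (gamma .^ (1:10))];     % coefficients of A(z/gamma)
x     = randn(1, 40);                    % placeholder subframe
y     = filter(1, Aw, x);                % combined weighted-synthesis filtering 1/A(z/gamma)
```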

The weighted speech signal is not used for computing the target signal, since an alternative approach is used (described later). However, the weighted speech signal (low-pass filtered) is used to compute the open-loop pitch estimate. The low-pass filtered weighted speech is found by filtering the speech signal s(n) through the filter A(z)/[A(z/γ)(1 − 0.7z^{-1})]. First the coefficients of the filter A'(z) = A(z/γ)(1 − 0.7z^{-1}) are computed, and then the low-pass filtered weighted speech in a subframe (40 samples) is computed by

s_w(n) = r(n) - \sum_{i=1}^{10} a_i'\, s_w(n-i), \quad n = 0, \ldots, 39,

where r(n) is the LP residual signal given by

r(n) = s(n) + \sum_{i=1}^{10} a_i\, s(n-i), \quad n = 0, \ldots, 39.

The signal s_w(n) is used to find an estimate of the pitch delay in the speech frame.
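The two equations above translate directly into two filtering operations, sketched below in MATLAB with placeholder LP coefficients and speech; A'(z) is formed by convolving the coefficients of A(z/γ) with (1 − 0.7z^{-1}).

```matlab
% Hedged sketch: LP residual and low-pass filtered weighted speech for one subframe.
a     = [-1.286 1.138 -1.047 0.691 -0.304 0.373 -0.071 0.012 0.048 0.064];  % example a_i
gamma = 0.75;
s     = randn(1, 40);                          % placeholder speech subframe
r     = filter([1, a], 1, s);                  % r(n) = s(n) + sum_i a_i s(n-i)
Ap    = conv([1, a .* (gamma .^ (1:10))], [1, -0.7]);   % A'(z) = A(z/gamma)(1 - 0.7 z^-1)
sw    = filter(1, Ap, r);                      % low-pass filtered weighted speech sw(n)
```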

9.4.1.1.4 Open-Loop Pitch-Lag (Delay) Estimation

Recall from Chapter 7 that determining an accurate pitch period for the speech signal is a difficult task. In G.729/G.729A, this task is divided into two stages to reduce computational complexity.

First, an open-loop search is performed over the whole range of possible pitch-period values to obtain a coarse estimate. The estimate is then refined using a closed-loop (AbS) technique. Fractional pitch-delay estimates are generally required to synthesize good-quality speech.

To reduce the complexity of the search for the best adaptive-codebook delay, the search range is limited around a candidate delay Top obtained from an open-loop pitch analysis. The coarse pitch estimation basically consists of calculating the autocorrelation function of the weighted speech signal s_w(n) and choosing the delay L that maximizes it.

One problem with the coarse pitch estimation approach is that multiple pitch periods may fall within the range of values of L if the pitch period is small. In this case, there is a possibility that the first peak of the autocorrelation is missed and a multiple of the pitch is chosen, thereby generating a lower-pitched voice signal.

This open-loop pitch analysis is done once per frame (10 ms). The open-loop pitch estimation uses the weighted speech signal s_w(n) and is done as follows for G.729/G.729A:

In the first step, three maxima of the correlation,

R(k) = \sum_{n=0}^{39} s_w(2n)\, s_w(2n-k),


are found in the following three ranges:

i = 1: 80, . . . , 143
i = 2: 40, . . . , 79
i = 3: 20, . . . , 39

The retained maxima R(ti), i = 1, . . . , 3, are normalized using

R'(t_i) = \frac{R(t_i)}{\sqrt{\sum_{n=0}^{39} s_w^2(2n - t_i)}}, \quad i = 1, \ldots, 3.

The winner among the three normalized correlations is selected by favoring the delays with values in the lower ranges. This is done by allowing the maximum from a lower delay range to override that of a higher range whenever its normalized correlation is at least 85% of the current best, as follows:

Top = t1
R'(Top) = R'(t1)
if R'(t2) ≥ 0.85 R'(Top)
    R'(Top) = R'(t2)
    Top = t2
end
if R'(t3) ≥ 0.85 R'(Top)
    R'(Top) = R'(t3)
    Top = t3
end

To avoid the pitch-multiples problem, the peak of the autocorrelation is estimated in several lag ranges (three ranges in the ITU-T G.729 and G.729A standards), and smaller pitch periods are favored in the selection process through proper weighting of the normalized autocorrelation values.

Note that in computing the correlations, only the even samples are used. Further, in the delay region [80, 143] only the correlations at even delays are computed in the first pass, and then the delays at ±1 of the selected even delay are tested.
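The MATLAB sketch below, operating on a placeholder weighted-speech buffer, follows the open-loop procedure just described: raw correlations over even samples in the three delay ranges, normalization of the retained maxima, and the 0.85 selection rule that favors shorter delays.

```matlab
% Hedged sketch of the open-loop pitch search described above.
sw  = randn(1, 400);                     % placeholder weighted speech (with history)
cur = 240;                               % index of the first sample of the current frame
ranges = {80:143, 40:79, 20:39};         % delay ranges i = 1, 2, 3 as above
t = zeros(1, 3);  Rn = zeros(1, 3);
for i = 1:3
    best = -inf;
    for k = ranges{i}                    % maximize R(k) over even samples
        Rk = 0;
        for n = 0:39
            Rk = Rk + sw(cur + 2*n) * sw(cur + 2*n - k);
        end
        if Rk > best, best = Rk; t(i) = k; end
    end
    den = 0;                             % normalize the retained maximum R(t_i)
    for n = 0:39
        den = den + sw(cur + 2*n - t(i))^2;
    end
    Rn(i) = best / sqrt(den + eps);
end
Top = t(1);  R0 = Rn(1);                 % favor shorter delays with the 0.85 rule
if Rn(2) >= 0.85*R0, Top = t(2); R0 = Rn(2); end
if Rn(3) >= 0.85*R0, Top = t(3); end
```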

9.4.1.1.5 Computation of the Impulse Response

The impulse response h(n) of the weighted synthesis filter W(z)/A(z) is needed for the search of the adaptive and fixed codebooks. For the G.729A standard, h(n) is computed for each subframe by filtering a unit sample extended by zeros through the filter 1/A(z/γ). For G.729, h(n) is computed for each subframe by filtering a signal consisting of the coefficients of the filter A(z/γ1), extended by zeros, through the two filters 1/A(z) and 1/A(z/γ2).

9.4.1.1.6 Computation of the Target Signal

For G.729A, the target signal x(n) for the adaptive-codebook search is computed by filtering the LP residual signal r(n) through the weighted synthesis filter 1/A(z/γ). After the excitation for the subframe is determined, the initial states of this filter are updated.

The residual signal r(n), which is needed for finding the target vector, is also used in the adaptive-codebook search to extend the past excitation buffer. The computation of the LP residual is given by

r(n) = s(n) + \sum_{i=1}^{10} a_i\, s(n-i), \quad n = 0, \ldots, 39.

9.4.1.1.7 Adaptive-Codebook Structure

As explained in Chapter 8 on VQ, the size of the codebook can grow and lead to huge memory requirements for the search. We first assume that the fixed-codebook output is zero. The pitch delay L is then selected as the delay (in the neighborhood of the open-loop estimate) that minimizes the mean square of the perceptually weighted error. The optimum pitch-gain value is then calculated accordingly.

9.4.1.1.8 Adaptive-Codebook Search

The adaptive-codebook structure is the same for both G.729 and G.729A. In the first subframe, a fractional pitch delay T1 is used with a resolution of 1/3 in the range [19 1/3, 84 2/3] and with integer resolution only in the range [85, 143]. For the second subframe, a delay T2 with a resolution of 1/3 is always used in the range [int(T1) − 5 2/3, int(T1) + 4 2/3], where int(T1) is the integer part of the fractional pitch delay T1 of the first subframe. This range is adapted for the cases where T1 straddles the boundaries of the delay range. The search boundaries tmin and tmax for both subframes are determined in the same way for both G.729 and G.729A [4,7].

For each subframe, the optimal delay is found by using a closed-loop analysis that minimizes the weighted mean-squared error. In the first subframe, the delay T1 is found by searching a small range (six samples) of delay values around the open-loop delay Top. The search boundaries tmin and tmax are defined by

tmin = Top − 3
if tmin < 20 then tmin = 20
tmax = tmin + 6
if tmax > 143 then
    tmax = 143
    tmin = tmax − 6
end

For the second subframe, the closed-loop pitch analysis is done around the pitch selected in the first subframe to find the optimal delay T2. The search boundaries tmin − 2/3 and tmax + 2/3 are defined by

tmin = int(T1) − 5
if tmin < 20 then tmin = 20
tmax = tmin + 9
if tmax > 143 then
    tmax = 143
    tmin = tmax − 9
end

The closed-loop pitch search is usually performed by maximizing the term,

R(k) = \frac{\sum_{n=0}^{39} x(n)\, y_k(n)}{\sqrt{\sum_{n=0}^{39} y_k(n)\, y_k(n)}},

where x(n) is the target signal and y_k(n) is the past filtered excitation at delay k [the past excitation convolved with h(n)]. In order to simplify the search in G.729A, the reduced-complexity version, only the numerator of this equation is maximized. That is, the term,

R_N(k) = \sum_{n=0}^{39} x(n)\, y_k(n) = \sum_{n=0}^{39} x_b(n)\, u_k(n),

is maximized, where x_b(n) is the backward-filtered target signal (the correlation between x(n) and the impulse response h(n)) and u_k(n) is the past excitation at delay k, that is, u(n − k). Note that the search range is limited around a preselected value, which is the open-loop pitch Top for the first subframe and T1 for the second subframe. Note also that in the search stage the samples u(n), n = 0, . . . , 39, are not known, yet they are needed for pitch delays less than 40. To simplify the search, the LP residual is copied to u(n).

For the determination of T2, and of T1 if the optimum integer delay is less than 85, the fractions around the optimum integer delay have to be tested. For G.729/G.729A, the fractional pitch search is done by interpolating the past excitation at the fractions −1/3, 0, and 1/3, and selecting the fraction that maximizes the correlation R_N(k) above. The interpolation of the past excitation is performed using the same FIR filter, b30, which is defined in References [4,7] for G.729. The interpolated past excitation at a given integer delay k and fraction t is given by

u_{kt}(n) = \sum_{i=0}^{9} u(n-k+i)\, b_{30}(t+3i) + \sum_{i=0}^{9} u(n-k+1+i)\, b_{30}(3-t+3i), \quad n = 0, \ldots, 39,\ t = 0, 1, 2.

9.4.1.1.9 Algebraic (Fixed) Codebook Structure

In the ITU G.729 standard, an algebraic codebook is employed to realize computational savings and to completely eliminate storage of the codebook. In this scheme, each codebook sequence is made of exactly four impulses, each of which can be weighted as +1 or −1. There are 40 samples per subframe. These 40 positions are separated into five tracks. If the samples of the sequence for a given subframe are numbered from 0 to 39, then the possible impulse positions in a given track are those whose position index mod 5 is equal to the track number. For example, track 0 consists of the positions numbered {0, 5, 10, 15, 20, 25, 30, 35}, track 1 consists of the positions numbered {1, 6, 11, 16, 21, 26, 31, 36}, track 2 consists of the positions numbered {2, 7, 12, 17, 22, 27, 32, 37}, and so on. For each codebook sequence, one impulse is placed somewhere in each of the first three tracks, and the fourth impulse can be placed in either the fourth or fifth track. So the codebook index is 17 bits long (3 bits for position and 1 sign bit for each of the first three impulses, and 4 bits for position and 1 sign bit for the last impulse).
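To make the track layout concrete, the MATLAB sketch below builds one 40-sample codevector from four example pulse positions and signs laid out on the tracks just described; the packing of these choices into the 17-bit codeword is omitted.

```matlab
% Hedged sketch: one algebraic codevector from four signed pulses on the tracks above.
tracks = {0:5:35, 1:5:36, 2:5:37, sort([3:5:38, 4:5:39])};  % tracks 0-2 and combined track 3/4
pos = [10, 21, 37, 9];            % example position picked from each track
sgn = [+1, -1, +1, -1];           % example pulse signs
c   = zeros(1, 40);
for p = 1:4
    assert(ismember(pos(p), tracks{p}));   % each position must lie on its track
    c(pos(p) + 1) = sgn(p);                % +1 converts to MATLAB's 1-based indexing
end
```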

The fixed codebook is a 17-bit algebraic codebook. Each fixed-codebook vector contains four nonzero pulses. The sign, the possible positions, and the number of bits for each pulse are given in Table 9.4, which details the structure of the fixed codebook in G.729; the corresponding transmitted indices are listed in Table 9.3.

9.4.1.1.10 Algebraic (Fixed) Codebook Search

Once the parameters of the STP and LTP have been determined, all that remains is to select the best excitation sequence from the fixed codebook and to determine the gain associated with it. This is again accomplished using the AbS technique.

Some simplifications can be made to reduce the computational complexity of the codebook search. Since the contribution of the adaptive codebook is already determined, we can calculate it once and subtract it from the old target signal to determine a new target signal that we are trying to match with our choice of codevector.

This structure of the algebraic codebook can be exploited by performing an intelligent search that only checks a subset of the possible codes (the ones


TABLE 9.3

Description of Transmitted Parameters Indices

Symbol   Description                                          Bits
L0       Switched MA predictor of LSP quantizer               1
L1       First-stage vector of quantizer                      7
L2       Second-stage lower vector of LSP quantizer           5
L3       Second-stage higher vector of LSP quantizer          5
P1       Pitch delay, first subframe                          8
P0       Parity bit for pitch delay                           1
C1       Fixed codebook, first subframe                       13
S1       Signs of fixed-codebook pulses, first subframe       4
GA1      Gain codebook (stage 1), first subframe              3
GB1      Gain codebook (stage 2), first subframe              4
P2       Pitch delay, second subframe                         5
C2       Fixed codebook, second subframe                      13
S2       Signs of fixed-codebook pulses, second subframe      4
GA2      Gain codebook (stage 1), second subframe             3
GB2      Gain codebook (stage 2), second subframe             4

Note: The bit stream ordering is reflected by the order in the table. For each parameter, the most significant bit (MSB) is transmitted first.

that are most promising) before choosing the code that will be used for the subframe. In ITU-T G.729A, the reduced-complexity version of ITU-T G.729, the codes are organized in a tree structure, where each level corresponds to a choice of the position of an impulse within its track. This tree is then searched by choosing the best position of one pulse in isolation, then the best position for the second pulse given the first, and finally computing the MSE for all the leaves that are descendants of the chosen second-level node. The procedure is then repeated with the second-best choice of the second-level node. Finally, the tree is reorganized so that impulses are placed in different tracks first, that is, the levels of the tree correspond to different track numbers, and the whole procedure is repeated.

TABLE 9.4

Structure of the 17-bit Fixed Codebook in G.729

Pulse   Sign/Amplitude   Positions                                                    Bits
T0      s0: ±1           m0: 0, 5, 10, 15, 20, 25, 30, 35                             1 + 3
T1      s1: ±1           m1: 1, 6, 11, 16, 21, 26, 31, 36                             1 + 3
T2      s2: ±1           m2: 2, 7, 12, 17, 22, 27, 32, 37                             1 + 3
T3      s3: ±1           m3: 3, 8, 13, 18, 23, 28, 33, 38, 4, 9, 14, 19, 24, 29, 34, 39   1 + 4


While this procedure is suboptimal, in practice it causes only a small degradation in the quality of the synthesized voice and yields a significant reduction in computational complexity. This fixed codebook enables a fast search because the codevector Ck contains only four nonzero pulses of unit amplitude (±1). More details are given in References [4,7].

9.4.1.2 Quantization of the Gains

The adaptive-codebook gain (pitch gain) and the fixed-codebook gain are vector quantized using 7 bits. The VQ is done using a two-stage conjugate-structured codebook: the first stage is a 3-bit two-dimensional codebook and the second stage is a 4-bit two-dimensional codebook. More details are given in References [4,7].

9.4.2 The ITU G.729/G.729A Speech Decoder

9.4.2.1 The ITU G.729 Decoder Details

The detailed signal flow for the CS-ACELP decoder is shown in Figure 9.18. It is evident which functions are performed per frame and which functions are performed per subframe. The sublabels on each box correspond to the subsections of the decoder description in the ITU document for the G.729 and G.729A standards.

9.4.2.1.1 Parameter Decoding

The 80 bits generated by the G.729 encoder for each 10 ms frame are allocated as shown in Table 9.2. For some of the parameters in this table, the number of bits refers to a 5 ms subframe instead.

The decoder performs the following steps for each subframe:

i. Decoding of the adaptive-codebook vector
ii. Decoding of the fixed (stochastic) codebook vector
iii. Decoding of the adaptive- and fixed-codebook gains
iv. Computation of the reconstructed speech

The received L0, L1, L2, and L3 indices of the LSP quantizer are used to reconstruct the quantized LSP coefficients using a procedure similar to that described for the encoder; the same applies to the interpolation of the LSP coefficients. The interpolated LSP coefficients are then converted into LPC filter coefficients, which are used for synthesizing the reconstructed speech in each subframe.

The parity bit is computed and checked for any errors during transmission. If no parity error occurred, the adaptive-codebook vector is decoded as follows:


The received adaptive-codebook index P1 is used to find the integer and fractional parts of the pitch delay T1. The integer part int(T1) and the fractional part frac of T1 are computed as follows:

if P1 < 197
    int(T1) = (P1 + 2)/3 + 19
    frac = P1 − 3 int(T1) + 58
else
    int(T1) = P1 − 112
    frac = 0
end

[Figure 9.18: per-frame decoding of the LSP indices L0-L3 with interpolation and conversion LSF to Â(z); per-subframe decoding of the pitch delay (P0, P1, P2), the fixed codevector (C1, S1, C2, S2) with the pitch prefilter P(z), and the gains (GA1, GB1, GA2, GB2) with MA code-gain prediction; construction of the excitation u(n); LP synthesis filtering; and postfiltering followed by high-pass filtering and upscaling.]

FIGURE 9.18 Details of the signal flow at the CS-ACELP decoder. (From ITU-T Recommendation G.729/A, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), Annex A: Reduced Complexity 8 kbps CS-ACELP Speech Codec, January 2007; ITU-T Recommendation G.729, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I. With permission.)


The integer and fractional parts of T2 are obtained from P2 and tmin, where tmin is derived from T1 as follows:

tmin = int(T1) − 5
if tmin < 20 then tmin = 20
tmax = tmin + 9
if tmax > 143 then
    tmax = 143
    tmin = tmax − 9
end

Then T2 is decoded using

int(T2) = (P2 + 2)/3 − 1 + tmin,
frac = P2 − 2 − 3((P2 + 2)/3 − 1).

Then the fixed-codebook vector is decoded. The adaptive- and fixed-codebook gains are also decoded. The excitation u(n) is input to the LP synthesis filter to yield the reconstructed speech given by

s(n) = u(n) - \sum_{i=1}^{10} a_i\, s(n-i), \quad n = 0, 1, \ldots, 39,

where a_i are the interpolated LP filter coefficients for the current subframe.
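In code, this is one IIR filtering operation per subframe, with the filter state carried over between subframes. A minimal MATLAB sketch with placeholder coefficients and excitation:

```matlab
% Hedged sketch: LP synthesis filtering of the decoded excitation.
a  = [-1.286 1.138 -1.047 0.691 -0.304 0.373 -0.071 0.012 0.048 0.064];  % example a_i
u  = randn(1, 40);                       % placeholder decoded excitation subframe
zi = zeros(1, 10);                       % synthesis-filter state from the previous subframe
[s_rec, zf] = filter(1, [1, a], u, zi);  % reconstructed speech; zf is the updated state
```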

9.4.2.1.2 Decoder Postprocessing

Three main functions are performed in the postprocessing stage: adaptive postfiltering, high-pass filtering, and signal upscaling. The adaptive postfilter itself consists of a long-term postfilter, a short-term postfilter, a tilt-compensation postfilter, and an adaptive gain procedure, as shown in Figures 9.13 and 9.18. The coefficients of the adaptive postfilter are updated every 5 ms.

9.4.2.2 Long-Term Postfilter

The long-term postfilter is given by

H_p(z) = \frac{1}{1 + \gamma_p g_1}\left(1 + \gamma_p g_1 z^{-T}\right),

where T is the pitch delay and the gain coefficient g_1 is bounded by 1 and set to zero if the LTP gain is less than 3 dB. The factor γ_p = 0.5 controls the amount of long-term postfiltering. The long-term delay T and the gain g_1 are computed from a residual signal obtained by filtering the reconstructed speech through the numerator of the short-term postfilter, A(z/γ_n). More details can be found in References [4,7].

9.4.2.3 Short-Term Postfilter

The short-term postfilter is given by

H_f(z) = \frac{1}{g_f}\,\frac{A(z/\gamma_n)}{A(z/\gamma_d)} = \frac{1}{g_f}\cdot\frac{1 + \sum_{i=1}^{10} \gamma_n^{\,i} a_i z^{-i}}{1 + \sum_{i=1}^{10} \gamma_d^{\,i} a_i z^{-i}},

where A(z) is the received quantized LP inverse filter and the factors γ_n = 0.55 and γ_d = 0.7 control the amount of short-term postfiltering. LPA is not performed at the decoder. The gain term g_f is calculated from the truncated impulse response h_f(n) of the filter A(z/γ_n)/A(z/γ_d) and is given by

g_f = \sum_{n=0}^{19} |h_f(n)|.

Note that for G.729A, the gain term gf is eliminated.
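The short-term postfilter and its gain term can be formed directly from the received LP coefficients, as in the MATLAB sketch below (example coefficients; γn = 0.55 and γd = 0.7 as stated above).

```matlab
% Hedged sketch: short-term postfilter A(z/gn)/A(z/gd) and its gain term gf.
a  = [-1.286 1.138 -1.047 0.691 -0.304 0.373 -0.071 0.012 0.048 0.064];  % example a_i
gn = 0.55;  gd = 0.7;
num = [1, a .* (gn .^ (1:10))];                  % A(z/gn)
den = [1, a .* (gd .^ (1:10))];                  % A(z/gd)
hf  = filter(num, den, [1, zeros(1, 19)]);       % first 20 samples of the impulse response
gf  = sum(abs(hf));                              % gain term normalizing the postfilter
s_in = randn(1, 40);                             % placeholder reconstructed speech subframe
s_pf = filter(num, den, s_in) / gf;              % short-term postfiltered output
```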

9.4.2.4 High-Pass Filtering and Upscaling

The postprocessing also performs the following two functions: (i) signal upscaling and (ii) high-pass filtering. A second-order pole-zero HPF with a cutoff frequency of 100 Hz is applied to the reconstructed postfiltered speech signal. The filter output is upscaled by a factor of 2 to restore the signal to the level it had at the input of the preprocessing stage.

The combination of these two functions results in the following postprocessing filter:

H_{h2}(z) = \frac{0.93980581 - 1.8795834\,z^{-1} + 0.93980581\,z^{-2}}{1 - 1.9330735\,z^{-1} + 0.93589199\,z^{-2}}.

9.4.2.5 Tilt Compensation

The tilt-compensation filter compensates for the tilt of the short-term postfilter. It is given by

H_t(z) = \frac{1}{g_t}\left(1 + \gamma_t k_1' z^{-1}\right),


where γ_t k_1' is the tilt factor and k_1' is the first reflection coefficient calculated from the truncated impulse response h_f(n) as

k_1' = -\frac{r_h(1)}{r_h(0)}, \qquad r_h(i) = \sum_{j=0}^{19-i} h_f(j)\, h_f(j+i).

The gain g_t = 1 − |γ_t k_1'| compensates for the decreasing effect of g_f in H_f(z) above, so that the product H_f(z)H_t(z) has approximately unity gain. Typically, two values are used for γ_t. When k_1' is negative, γ_t = 0.9, and when k_1' is positive (which corresponds to spectra with a lot of high-frequency energy), γ_t = 0.2 to avoid excessive low-pass filtering of the spectrum. For G.729A, the gain g_t is eliminated; when k_1' is negative, γ_t = 0.8, and when k_1' is positive, γ_t = 0.0.

9.4.2.6 Adaptive Gain Control

The automatic gain control of G.729/G.729A is used to compensate for the gain difference between the reconstructed speech signal s(n) and the postfiltered signal s_f(n), as shown in Figures 9.13 and 9.18. The gain scaling factor G for the current subframe is computed by

G = \frac{\sum_{n=0}^{39} |s(n)|}{\sum_{n=0}^{39} |s_f(n)|}.

The gain-scaled postfiltered signal s_f'(n) is given by

s_f'(n) = g(n)\, s_f(n), \quad n = 0, 1, \ldots, 39,

where g(n) is updated for each sample by using

g(n) = 0.85\, g(n-1) + 0.15\, G, \quad n = 0, 1, \ldots, 39.

The initial value g(−1) = 1.0 is used. For each new subframe, g(−1) is set equal to g(39) of the previous subframe. The initializations used and the bit-exact descriptions of the G.729 and G.729A coders can be found in References [4,7].
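The AGC amounts to one gain computation per subframe and a first-order smoothing of the per-sample gain, as in the following MATLAB sketch (placeholder signals).

```matlab
% Hedged sketch of the adaptive gain control described above.
s_rec = randn(1, 40);                    % reconstructed speech subframe
sf    = 0.7 * s_rec;                     % placeholder postfiltered subframe
G     = sum(abs(s_rec)) / (sum(abs(sf)) + eps);  % subframe gain scaling factor
g     = 1.0;                             % g(-1), carried over from the previous subframe
sf_sc = zeros(1, 40);
for n = 1:40
    g        = 0.85*g + 0.15*G;          % g(n) = 0.85 g(n-1) + 0.15 G
    sf_sc(n) = g * sf(n);                % gain-scaled postfiltered sample
end
```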

9.5 Summary

In this chapter, we presented AbS methods of speech coding. We used CELP, the most common method, to illustrate the typical AbS coder. We briefly examined the FS 1016 as a first example of a standardized CELP speech coder. We then further exemplified this concept with another standardized speech coder, the ITU-T G.729 CS-ACELP speech coder. This coder, though complex, is still a CELP coder with several enhancements. It uses a CS-ACELP structure.

Examples of other standardized ACELP speech coders are the ITU-T G.723.1 multipulse maximum likelihood quantization (MP-MLQ)/ACELP coder, TIA IS-641 ACELP, ETSI GSM enhanced full rate (EFR) ACELP, and ETSI adaptive multirate (AMR) ACELP. The bit-exact fixed-point "C" code implementations of the G.729 and G.729A coders and their associated test sequences are available from the ITU [7].

EXERCISE PROBLEMS

9.1. Using the 10th-order LPC coefficients obtained from your speech captured in Chapter 1, Problem 1.5, plot the magnitude spectra of the synthesis filter and the modified synthesis filter (m.f.s.f.)

H_f(z) = \frac{1}{A(z/\gamma)} = \frac{1}{1 + \sum_{i=1}^{M} a_i \gamma^{i} z^{-i}}

using different values of γ = 0.1, 0.4, 0.7, and 0.9.
Hint: You may use any other 10th-order LPC coefficients as an example:

a_i, i = 1, 2, . . . , 10:
a = [−1.286, 1.138, −1.047, 0.691, −0.304, 0.373, −0.071, 0.012, 0.048, 0.064]

9.2. The postfilter for the FS 1016 CELP coder is given by

H(z) = (1 - \mu z^{-1})\,\frac{1 + \sum_{i=1}^{M} a_i \beta^{i} z^{-i}}{1 + \sum_{i=1}^{M} a_i \alpha^{i} z^{-i}},

where the values of α and β are determined based on subjective listening test results, the parameters a_i are the LPC coefficients, and μ is the pre-emphasis filter coefficient.

Using the same 10th-order LPC coefficients as in Problem 9.1, plot the spectral magnitude responses for different postfilters using the following parameters:

i. α = 0.8, β = 0.5, and μ = 0.5

ii. α = 0.5, β = 0.8, and μ = 0.8

Which postfilter is more desirable? Why?
9.3. Compare the performance of an FS1016 speech coder with that of the ITU-T G.729/G.729A speech coders. What are the main differences in structure? What are the main differences in algorithms? Which of these two coders would you choose for a communication application, and why?


References

1. Chu, W.C., Speech Coding Algorithms: Foundation and Evolution of Standardized Speech Coders, Wiley-Interscience, New York, 2003.
2. National Communications System, Details to Assist in Implementation of Federal Standard CELP 1016 Coders, Arlington, VA, 1992.
3. Salami, R., C. Laflamme, J.-P. Adoul, A. Kataoka, S. Hayashi, T. Moriya, C. Lamblin, et al., Design and description of CS-ACELP: A toll quality 8 kb/s speech coder, IEEE Transactions on Speech and Audio Processing, 6(2), 116–130, 1998.
4. ITU-T Recommendation G.729/A, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), Annex A: Reduced Complexity 8 kbps CS-ACELP Speech Codec, January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I.
5. Medan, Y., E. Yair, and D. Chazan, Super resolution pitch determination of speech signals, IEEE Transactions on Signal Processing, 39(1), 40–48, 1991.
6. Kroon, P. and B.S. Atal, On improving the performance of pitch predictors in speech coding systems, in Advances in Speech Coding, B.S. Atal, V. Cuperman, and A. Gersho (eds), pp. 321–327, Kluwer Academic Publishers, Boston, 1991.
7. ITU-T Recommendation G.729, Coding of Speech at 8 kbps using Conjugate-Structure Algebraic Code-Excited Linear Prediction (CS-ACELP), January 2007. http://www.itu.int/rec/T-REC-G.729-200701-I.

Bibliography

1. Tremain, T.E., The government standard linear predictive coding algorithm: LPC-10, Speech Technology, 1, 40–49, 1982.
2. Campbell, J.P. and T.E. Tremain, Voiced/unvoiced classification of speech with applications to the US government LPC-10e algorithm, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 473–476, 1986.
3. Campbell, J.P., V.C. Welch, and T.E. Tremain, The DoD 4.8 kbps standard (proposed Federal Standard 1016), in Advances in Speech Coding, B.S. Atal, V. Cuperman, and A. Gersho (eds), pp. 121–133, Kluwer Academic Publishers, Boston, 1991.


10 Internet Low-Bit-Rate Coder

10.1 Introduction

The Internet low-bit-rate codec (iLBC) is a speech codec designed for robust voice communication over the Internet Protocol (IP). The codec was developed by the company Global IP Sound (GIPS) and is designed for narrowband speech. The IP environment can lead to degradation in speech quality due to lost frames, which occur in connection with lost or delayed IP packets.

The iLBC coding algorithm is for speech signals sampled at 8 kHz. The algorithm uses a block-independent LPC algorithm and supports two basic frame/block lengths: 20 ms at 15.2 kbps and 30 ms at 13.33 kbps. When the codec operates at block lengths of 20 ms, it produces 304 bits per block, which are packetized as in Reference [1] (or fit in 38 bytes [2,3]).

Similarly, for block lengths of 30 ms it produces 400 bits per block, which are packetized as in Reference [1] (or fit in 50 bytes [2,3]). The two modes for the different frame sizes operate in a very similar way.
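These bit rates follow directly from the block sizes; the short MATLAB check below makes the arithmetic explicit.

```matlab
% Quick check of the two iLBC bit rates quoted above.
bits  = [304, 400];   Tblk = [0.020, 0.030];   % bits per block and block lengths (s)
rates = bits ./ Tblk;                          % = [15200, 13333.3] bps (15.2 and 13.33 kbps)
```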

The algorithm results in a speech coding system with a controlled response to packet losses similar to what is known from PCM with packet loss concealment (PLC), such as the ITU-T G.711 standard [4], which operates at a fixed bit rate of 64 kbps. At the same time, the algorithm enables fixed-bit-rate coding with a quality-versus-bit-rate tradeoff better than most other algorithms. A suitable real-time transport protocol (RTP) payload format for iLBC is specified in Reference [1].

The iLBC is suitable for real-time communications such as telephony and videoconferencing, streaming audio, archival, and messaging. It is commonly used for VoIP applications such as Skype, Yahoo Messenger, and Google Talk, among others. Cable Television Laboratories (CableLabs) has adopted iLBC as a mandatory PacketCable audio codec standard for VoIP-over-Cable applications [5].

The iLBC was developed by Global IP Solutions (GIPS), formerly Global IP Sound, and its structure is defined in RFC 3951 [2]. A companion document, RFC 3952 [1], explains the details of the RTP payload format for the iLBC speech network protocol. The ideas behind the iLBC were also presented at the IEEE Speech Coding Workshop in 2002 [6,7].



This codec overcomes the dependency of CELP codecs (e.g., G.729, G.723.1, GSM-EFR, and 3GPP-AMR) on previous samples. For packet-based networks using any of these CELP codecs, packet loss will affect the quality of the reconstructed signal, as part of the historical information may be lost, making it difficult to recover the original signal.

The computational complexity of the iLBC is in the same range as that of the ITU-T G.729/A codec presented in Chapter 9. It has the same quality as ITU-T G.729E in clean-channel (no packet loss) conditions [3], but its quality exceeds that of other narrowband codecs, including G.723.1 and G.728, under packet-loss conditions.

Some of the features of the iLBC include gain-shape waveform matching forward in time, gain-shape waveform matching backward in time, start state encoding, and pitch enhancement. These are discussed in later sections of this chapter. Also, perceptual speech quality measures (PSQM/PSQM+ values), which are used for quality measurement, are covered in the final sections of the chapter.

10.2 Internet Low-Bit-Rate Codec

The continuously increasing demand for carrying voice traffic over the Internet has forced developers to look at different speech coding types to find the one that best fits the characteristics of packet-switched networks. Due to the bursty nature of packet-based traffic and the possibility of packet loss from network congestion, existing CELP codecs will not offer the same performance when previous samples are not available, and this causes signal distortion at the far end.

GIPS used its expertise and knowledge of VoIP technology and market requirements to design a new type of codec focused on packet communication, which has become increasingly popular with the deployment of VoIP. This new codec is called iLBC and was designed to operate at low bit rates and provide good-quality speech during low-traffic and network-congestion situations. iLBC eliminates the dependency on previous frames because each packet is treated independently, achieving a better response to packet loss, delay, or packet jitter.

10.2.1 Structure

The input to the encoder is 16-bit uniform PCM sampled at 8 kHz. iLBC supports two frame lengths: 30 ms (240 samples) for a bit rate of 13.33 kbps and 20 ms (160 samples) for a bit rate of 15.2 kbps. Most low-bit-rate codecs limit the voice bandwidth to 50–3400 Hz, whereas iLBC utilizes the full 4 kHz bandwidth, producing higher-quality reconstructed voice (see Figure 10.1 for the frame structure).


[Figure 10.1: (a) a 20 ms block of 160 samples divided into four 40-sample subframes; (b) a 30 ms block of 240 samples divided into six 40-sample subframes.]

FIGURE 10.1 Encoder input for iLBC (broken into subframes) for both 20 and 30 ms frames.

10.2.2 Advantages

• As opposed to CELP codecs, which require previous data to estimate the pitch gain and lag, iLBC estimates the pitch of the signal within the same frame, eliminating the dependency on previous samples and lookahead delays. That is why iLBC offers better performance under packet-loss conditions.

• It also ensures high voice quality during normal network operation with very low packet-loss rates.

• Incorporates PLC techniques.

10.2.3 Algorithm

iLBC uses the adaptive codebook forward and backward in time as follows:

• Determine the start state vector inside the speech frame. It contains the highest residual energy (dominant pitch).
• Transmit the encoded location and waveform of the start state per frame.
• The adaptive codebook is populated with segments of the decoded start state vector.
• Long-term predictive encoding is used from the end of the start state until the end of the speech frame (forward in time). The adaptive codebook is updated continuously with the most recent decoded signal.
• Then, the adaptive codebook is populated again with segments of the decoded start state and the first encoded signal segment.
• In the other direction, long-term predictive coding is performed backward in time, starting from the beginning of the start state until the beginning of the speech frame.


• Therefore, each packet will contain the start state, gain, and lag information that will be used at the far end for proper decoding without depending on previous speech frames.

10.2.4 CELP Coders versus iLBC

The common practice in CELP coders is to populate the adaptive codebook with excitation signals from earlier in time. This approach results in the following problems:

i. If the past signal is lost or contaminated by errors during transmission, the adaptive codebook in the decoder will differ from the one in the encoder. This leads to poor decoded signal quality.
ii. At the onset of a voiced speech segment, the adaptive codebook is insufficient to properly describe the pitch cycle. This leads to a slow voicing onset in the decoded signal.
iii. The desired fast build-up of high periodicity in voiced regions and the undesired feedback of coding noise into the adaptive codebook are conflicting performance goals in CELP. In practice, this means that CELP encoding results in a decoded signal with a noisy character in voiced regions. This noisy character is perceived especially for high-pitched voices.

iLBC applies the adaptive codebook both forward and backward in time, starting from the start state. The start state is a segment of samples with the highest residual energy, which will typically capture at least one dominating pitch pulse when the segment contains voiced speech. The location and waveform of the start state are encoded and transmitted for each frame. In this way, a data packet can be made to contain start state information, along with lag and gain information, sufficient for the correct decoding of a complete signal frame.

This method implies the following features and leads to the following advantages over common CELP coders:

i. The adaptive codebook in the decoder will be independent of the reception or loss of a previous packet.
ii. If a voicing onset occurs within the frame, then at least one significant pitch pulse will be contained in the adaptive codebook as a starting point, and thereby the first pitch cycle in the frame can be accurately encoded.

There is no inherent compromise between the fast build-up of high periodicity and the feedback of coding noise into the adaptive codebook, because the start state will typically contain a fully built-up pitch pulse.


10.3 iLBC’s Encoding Process

The iLBC speech encoding process is shown in Figure 10.2. Next, we discuss the details of each block:

1. Speech preprocessing: This is needed when the speech signal contains a DC level and/or 50/60 Hz background noise. It is done using an HPF with a cutoff frequency of about 90 Hz.

2. LPC analysis and quantization: The LPC analysis calculates one set of 10 LPC filter coefficients for 20 ms frames and two sets for 30 ms frames using the autocorrelation method and the Levinson–Durbin recursion. These coefficients are converted into the line spectrum frequency (LSF) representation. There is a 10 ms look-back in the LP analysis for 20 ms frames, meaning that 80 samples are needed from the previous frame, and a 7.5 ms look-back (60 samples) for 30 ms frames. No look-ahead into the future frame is used.
a. Autocorrelation coefficients: Autocorrelation coefficients are calculated using windowed speech samples. In the case of a 30 ms frame, a 240-sample-long standard symmetric Hanning window is applied, as shown in Figure 10.3, for the first set of coefficients. The first window is defined as

w(n) = \begin{cases} 0.5\left(1 - \cos\!\left(\dfrac{2\pi (n+1)}{241}\right)\right), & 0 \le n \le 119, \\[4pt] w(239 - n), & 120 \le n \le 239. \end{cases}

[Figure 10.2: the input speech (160 samples per 20 ms frame or 240 samples per 30 ms frame) passes through preprocessing (HP filter); LPC calculation, quantization, and interpolation; analysis filtering and residual calculation; start state identification; scalar quantization of the start state; a codebook search repeated over subframes 0–2 (20 ms) or 0–4 (30 ms); and packetization into the payload.]

FIGURE 10.2 iLBC encoder block diagram.


[Figure 10.3: the 5 ms subframes (Sf0–Sf5) of the past and current frames, with the positions of the two LP analysis windows over them.]

FIGURE 10.3 Positions of frame and LP analysis windows for 30 ms frames.

For the second set of coefficients, an asymmetric window is applied to the entire current frame, as shown in Figure 10.3. The asymmetric window is defined as

w(n) = \begin{cases} \left(\sin\!\left(\dfrac{\pi (n+1)}{441}\right)\right)^{2}, & 0 \le n \le 219, \\[4pt] \cos\!\left(\dfrac{\pi (n-220)}{40}\right), & 220 \le n \le 239. \end{cases}

This asymmetric window is also applied for 20 ms frames. The next step is spectral smoothing and the addition of a white-noise floor with a factor of 1.0001. These steps are implemented by multiplying the autocorrelation coefficients by the following window:

c(n) = \begin{cases} \exp\!\left(-0.5\left(\dfrac{2\pi\, 60\, n}{F_s}\right)^{2}\right), & 1 \le n \le 10, \\[4pt] 1.0001, & n = 0, \end{cases}

where Fs = 8000 Hz is the sampling frequency. Spectral smoothing and a white-noise floor are applied to the autocorrelation function to improve the analysis and avoid numerical precision problems (a MATLAB sketch of these windowing steps is given after this list).

b. LPC coefficients: The LPC coefficients are calculated using the Levinson–Durbin recursion. Once the coefficients are calculated, the bandwidth is expanded to smooth spectral peaks.

c. LSF coefficients: The LPC coefficients are used to calculate LSF coefficients, which are more appropriate for quantization and interpolation.

d. LSF coefficient quantization: The LSF coefficients are quantized using 3-split vector quantization (VQ). The length of the LSF vectors is 10; hence they are split into three subvectors containing 3, 3, and 4 values, respectively. Each subvector (1–3, 4–6, 7–10) is quantized with a VQ codebook of size 64, 128, and 128, respectively.

e. LSF coefficient stability: If the LSF coefficients are not ordered correctly at the split boundaries, the LP filter will not be stable. This is why a stability check of the LSF coefficients is carried out.


f. LSF coefficient interpolation: The LSF coefficients (original and quantized) are interpolated in the LSF domain and converted into LPC coefficients for each sub-block. The quantized and unquantized LPC coefficients then form two sets of filters.

3. Residual computation: The residual frame is the result of filtering the speech samples using the quantized and interpolated LPC filters. The residual is divided into six subframes of 40 samples each. The signal at the output of each LP filter becomes the residual signal for the corresponding sub-block.

4. Perceptual weighting filter: The perceptual weighting filter for sub-block k is recommended to be

W_k(z) = \frac{1}{A_k(z/0.4222)},

where A_k(z) is the filter obtained for sub-block k from the unquantized but interpolated LSF coefficients.

5. Start state identification: The two consecutive subframes with the highest residual energy are identified. The first or last 57 samples are chosen as the start state so as to maximize energy for 20 ms frames, and the first or last 58 samples for 30 ms frames.

6. Start state quantization: The block of residual samples in the start state is first filtered by an all-pass filter and searched for its largest-magnitude sample. The base-10 logarithm of this magnitude is quantized with a 6-bit quantizer. The quantized value is used to normalize the all-pass filtered residual samples.

The start state encoder uses a DPCM scalar quantizer, as shown in Figure 10.4, to quantize the normalized samples in the perceptually weighted speech domain using 3 bits per sample. Each sample in the block is filtered by the weighting filter W_k(z) specified above to form a weighted speech sample x(n). The prediction filter P_k(z) is given by P_k(z) = 1 − 1/W_k(z). The coded state sample u(n) is obtained by quantizing the target sample d(n) with a 3-bit quantizer whose quantization table is specified in RFC 3951.

FIGURE 10.4 Quantization of start state samples by DPCM in the weighted speech domain.

7. Codebook search: The codebook search for the dynamic codebook used in the encoder is shown in Figure 10.5. The encoding is based on an adaptive codebook built from a codebook memory that contains decoded LPC excitation samples from the already encoded part of the block.

Here is what happens inside the blocks of Figure 10.5:
1. The decode block decodes the part of the residual that has been encoded so far, using the codebook without perceptual weighting.
2. The memory setup block sets up the memory by taking data from the decoded residual. This memory is used to construct codebooks. For blocks preceding the start state, both the decoded residual and the target are time-reversed.
3. The perceptual weighting filter block filters the memory and target with the perceptual weighting filter.
4. The search block searches for the best match between the target and the codebook vectors, computes the optimal gain for this match, and quantizes that gain.
5. Update the perceptually weighted target by subtracting the contribution of the selected codebook vector (quantized gain times selected vector) from the perceptually weighted memory. Repeat steps 4 and 5 for the two additional stages.
6. Calculate the energy loss due to encoding of the residual. If needed, compensate for this loss by upscaling and requantizing the gain of the first stage.

[Figure 10.5: decode, memory setup, perceptual weighting filter, search, target update, and recalculation of the first-stage gain G(0), repeated over subframes 0–2 (20 ms) or 0–4 (30 ms), producing the gains and codebook indices.]

FIGURE 10.5 Flowchart of the codebook search in the iLBC encoder.


TABLE 10.1a
Bit Allocation for the iLBC 30 ms Frame Coder

Parameter                               Bits
LSF pair coefficients                     40
Position of start state                    4
Scale factor for start state               6
Scalar quantization of start state       174
Shapes, short frame                       21
Shapes, subframe 1                        22
Shapes, subframe 2                        24
Shapes, subframe 3                        24
Shapes, subframe 4                        24
Gains                                     60
Empty frame indicator                      1
Total bits                               400

TABLE 10.1b
Bit Allocation for the iLBC 20 ms Frame Coder

Parameter                               Bits
LSF pair coefficients                     20
Position of start state                    3
Scale factor for start state               6
Scalar quantization of start state       171
Shapes                                    67
Gains                                     36
Empty frame indicator                      1
Total bits                               304

An in-depth description of the blocks in Figure 10.5, including a reference implementation, is in Reference [2].

8. Packetization: The encoded bits are packetized into the payload. The bit allocation tables for 30 ms frames and 20 ms frames are specified in Tables 10.1a and b, respectively.
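As a quick sanity check, the bit totals in Tables 10.1a and b reproduce the two iLBC bit rates:

% Bit rates implied by the bit allocation tables.
bits_30ms = 400;  frame_30ms = 0.030;        % 30 ms mode
bits_20ms = 304;  frame_20ms = 0.020;        % 20 ms mode
rate_30ms = bits_30ms / frame_30ms / 1e3     % = 13.33 kbps
rate_20ms = bits_20ms / frame_20ms / 1e3     % = 15.2 kbps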

10.4 iLBC’s Decoding Process

A block diagram of the iLBC decoder is shown in Figure 10.6. Next, we discuss the details of each block. If a frame was lost, then Steps 1–5 below should be replaced by a PLC algorithm.


FIGURE 10.6 iLBC decoder block diagram.

1. iLBC decoding starts with the extraction of the parameters from the bit stream.

2. Linear predictor coefficients must be decoded and interpolated.

3. Start state must be constructed.

4. Set up the memory by using data from the decoded residual. This memory is used for codebook construction. For blocks preceding the start state, both the decoded residual and the target are time-reversed. Subframes are decoded in the same order as they were encoded.

5. Construct the residual of this subframe (gain[0]∗cbvec[0] + gain[1]∗cbvec[1] + gain[2]∗cbvec[2]); a minimal sketch of this step follows the list. Repeat 4 and 5 until the residual of all sub-blocks has been constructed.

6. Next, pitch postfiltering must be performed to enhance the residual and retain the fundamental pitch.

7. Once this is done, the decoder must apply a synthesis filter to reconstruct the speech signal.

8. Postprocess with an HPF, if desired.
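A minimal MATLAB sketch of the residual construction in step 5, with placeholder gains and codebook vectors, is:

% Rebuild one 40-sample sub-block of the residual from the three decoded
% codebook vectors and gains (values here are placeholders).
gain  = [0.9, 0.3, 0.1];                     % decoded stage gains
cbvec = randn(40, 3);                        % decoded codebook vectors, one per stage
res_subblock = cbvec * gain(:);              % gain(1)*cbvec(:,1) + gain(2)*cbvec(:,2) + gain(3)*cbvec(:,3)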

Here are the details:

1. LPC filter reconstruction: The decoding of the LPF parameters is very straightforward. For a set of three indices in the 20 ms mode and six indices in the 30 ms mode, the corresponding LSF vector(s) are found by simple table lookup. For each of the LSF vectors, the three split vectors are concatenated to obtain the two quantized LSF subvectors. In the 20 ms mode only one LSF vector is constructed. The next step is the stability check followed by the interpolation scheme, both described previously.

The only difference is that only the quantized LSFs are known at the decoder, and hence the unquantized LSFs are not processed. A reference implementation of the LPC filter reconstruction is given in Reference [2].

2. Start state reconstruction: The scalar encoded state samples are reconstructed by the following steps:

a. Form a set of samples (by table lookup) from the index stream.

b. Scale by multiplying the set with the reciprocal of the scaling factor, 1/scal = (10^qmax)/4.5, where qmax is the quantized value used for normalization.

c. Time-reverse the samples (57 in the case of 20 ms and 58 in the case of 30 ms).

d. Filter the time-reversed block with the dispersion (all-pass) filter used in the encoder (described earlier); this compensates for the phase distortion of the earlier filter operation.

e. Reverse the 57 or 58 samples from the previous step.

The remaining 23/22 samples in the state are reconstructed by the same adaptive-codebook technique described below in the decoder section. The location bit determines whether these are the first or the last 23/22 samples of the 80-sample state vector. If the remaining 23/22 samples are the first samples, then the scalar-encoded 57/58 state samples are time-reversed before initialization of the adaptive-codebook memory vector.

A reference software implementation of the start state reconstruction is given in Reference [2].

3. Excitation decoding loop: The LPC excitation vector is decoded in the same order in which the residual was encoded at the encoder. That is, after the decoding of the entire 80-sample state vector, the forward sub-blocks (corresponding to samples occurring after the state vector samples) are decoded, and then the backward sub-blocks (corresponding to samples occurring before the state vector) are decoded, resulting in a fully decoded block of excitation signal samples.

In particular, each sub-block is decoded by using the multistage adaptive-codebook decoding module described above. This module relies upon an adaptive-codebook memory constructed before each run of the adaptive-codebook decoding. The construction of the adaptive-codebook memory in the decoder is identical to the method outlined above, except that it is done on the codebook memory without perceptual weighting.

For the initial forward sub-block, the last 80 samples of the length-147 adaptive-codebook memory are filled with the samples of the state vector. For subsequent forward sub-blocks, the first 40 samples of the adaptive-codebook memory are discarded, the remaining samples are shifted by 40 samples toward the beginning of the vector, and the newly decoded 40 samples are placed at the end of the adaptive-codebook memory. For backward sub-blocks, the construction is similar, except that every vector of samples involved is first time-reversed.

A reference software implementation of the excitation decoding loop is in Reference [2].
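The following MATLAB fragment sketches the memory handling described above for forward sub-blocks (initial fill with the state vector, then shift-by-40 updates); the signal values are placeholders.

% Sketch of the adaptive-codebook memory update for forward sub-blocks.
CB_MEML = 147;                               % codebook memory length from the text
cb_mem = zeros(CB_MEML, 1);
state_vec = randn(80, 1);                    % decoded 80-sample state vector
cb_mem(end-79:end) = state_vec;              % initial fill: state vector in the last 80 samples

new_sub = randn(40, 1);                      % newly decoded 40-sample sub-block
cb_mem = [cb_mem(41:end); new_sub];          % discard oldest 40 samples, append the new ones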

4. Multistage adaptive-codebook decoding: The multistage adaptive-codebook decoding module is used at both the sender (encoder) and the receiver (decoder) ends to produce a synthetic signal in the residual domain that is eventually used to produce synthetic speech. The module takes the index values used to construct vectors that are scaled and summed together to produce a synthetic signal that is the output of the module.

5. Construction of the decoded excitation signal: The unpacked index values provided at the input to the module are references to extended codebooks, which are constructed as described in the section on encoder codebook creation in Reference [2], except that they are based on the codebook memory without the perceptual weighting. The three unpacked indices are used to look up three codebook vectors. The three unpacked gain indices are used to decode the corresponding three gains. In this decoding, the successive rescaling, as described above, is applied.

A reference software implementation of the adaptive-codebook decoding is in Reference [2].

10.5 iLBC’s PLC Techniques

If packet loss occurs, the decoder receives a signal saying that the information regarding a block is lost. The decoder will apply PLC techniques when packets are lost or do not arrive in time for playback. A PLC unit is designed for the decoder to recognize when a packet has been lost and to mask the effect of losing a packet or having a considerable delay in its arrival.

Next, we give an example of a PLC unit that can be used with the iLBC. The PLC unit is used only at the decoder; therefore the PLC unit does not affect interoperability between implementations. Other PLC implementations may therefore be used. The example PLC unit addresses the following cases:

1. Current and previous frames are received correctly. The iLBC decoder saves the state information (LPF coefficients) for each subframe of the current frame and the entire decoded excitation signal in case the following block is lost.

2. Current frame is not received. If the block is not received, the block substitution is based on a pitch-synchronous repetition of the excitation signal, which is filtered by the last LPF of the previous block. The previous block's information is stored in the decoder state structure.

The decoder uses the stored information from the previous block to perform a correlation analysis and determine the pitch periodicity and voicing level (the degree to which the previous block's excitation was a voiced or roughly periodic signal). The excitation signal from the previous block is used to create a new excitation signal for the current block to maintain the pitch from the previous block.

For a better sounding substituted block, a random excitation is mixed with the new pitch-periodic excitation, and the relative use of the two components is computed from the correlation measure (voicing level). Next, the signal goes through an LPF to produce a speech output that makes up for the lost packet/frame.

For several consecutive lost blocks, the PLC continues in a similar manner. The correlation measure of the last block received is still used along with the same pitch value. The LPFs of the last block received are also used again. The energy of the substituted excitation for consecutive lost blocks is decreased, leading to a dampened excitation, and therefore to dampened speech.

3. Current frame is received, but previous frame is lost. In this case, the current frame is not used for the actual output of the decoder. In order to avoid an audible discontinuity between the current frame and the frame that was generated to compensate for the previous packet loss, the decoder will perform a correlation analysis between the excitation signals of both frames (current and previous) to detect the best phase match. Then a simple overlap-add procedure is performed to merge the previous excitation smoothly into the current frame's excitation.

The exact implementation of the PLC does not influence interoperability of the codec. A suggested reference software implementation of the PLC is in Reference [2]. Exact compliance with this suggested algorithm is not needed for a reference implementation to be fully compatible with the overall codec specification.

10.6 iLBC’s Enhancement Techniques

The decoder will apply enhancement techniques to increase the perceptual quality of the reconstructed signal by reducing the speech-correlated noise in the voiced speech segments.

It operates on the reconstructed excitation signal. Compared to traditional postfilters, the enhancer has an advantage in that it can only modify the excitation signal slightly. This means that there is no risk of over-enhancement.


The enhancer works very similarly for both the 20 ms frame size mode and the 30 ms frame size mode [2,7].

For the mode with 20 ms frame size, the enhancer uses a memory of six 80-sample excitation blocks prior in time plus the two new 80-sample excitation blocks. For each block of 160 new unenhanced excitation samples, 160 enhanced excitation samples are produced. The enhanced excitation is delayed by 40 samples compared to the unenhanced excitation, as the enhancer algorithm uses lookahead.

For the mode with 30 ms frame size, the enhancer uses a memory of five 80-sample excitation blocks prior in time plus the three new 80-sample excitation blocks. For each block of 240 new unenhanced excitation samples, 240 enhanced excitation samples are produced. The enhanced excitation is delayed by 80 samples compared to the unenhanced excitation, as the enhancer algorithm uses lookahead.

10.6.1 Outline of Enhancer

The speech enhancement unit operates on sub-blocks of 80 samples, which means that there are two (three for the 30 ms frame) 80-sample sub-blocks per frame. Each of these two/three sub-blocks is enhanced separately, but in an analogous manner. Figure 10.7 shows the block diagram of the enhancer.

The main idea of the enhancer is to find three 80-sample blocks before and three 80-sample blocks after the analyzed unenhanced sub-block and to use these to improve the quality of the excitation in that sub-block. The six blocks are chosen so that they have the highest possible correlation with the unenhanced sub-block that is being enhanced. In other words, the six blocks are pitch-period-synchronous sequences (PSSQ) to the unenhanced sub-block.

FIGURE 10.7 Block diagram of iLBC enhancer.


A linear combination of the six PSSQs that approximates the sub-block is calculated. If the squared error between the approximation and the unenhanced sub-block is small enough, the enhanced residual is set equal to this approximation.

For the cases when the squared error criterion is not fulfilled, a linear combination of the approximation and the unenhanced residual forms the enhanced residual.

The steps taken by the enhancer are enumerated below:

1. Perform pitch estimation of each of the two (three for 30 ms mode) new 80-sample blocks.

Pitch estimates are needed to determine the locations of the PSSQs in a complexity-efficient way. For each of the new two/three sub-blocks, a pitch estimate is calculated by finding the maximum correlation in the range from lag 20 to lag 120.

These pitch estimates are later used to narrow down the search for the best possible PSSQs.

2. Find the PSSQ n (for block k) by a search around the estimated pitch value. Do this for n = 1, 2, 3, −1, −2, −3 to get six PSSQs.

Here are more details of this procedure: Upon receiving the pitch estimates from the prior step, the enhancer analyzes and enhances one 80-sample sub-block at a time. The PSSQs pssq(n) can be viewed as vectors of length 80 samples, each shifted n*lag samples from the current sub-block. The six PSSQs, pssq(−3) to pssq(−1) and pssq(1) to pssq(3), are found one at a time. More details are in Reference [2].

3. Calculate the smoothed residual generated by the six PSSQs from the prior step.

A linear combination of the six pssq(n) (n ≠ 0) forms a smoothed approximation, z, of pssq(0). Most of the weight is put on the sequences that are close to pssq(0), as these are likely to be most similar to pssq(0). The smoothed vector is also rescaled so that the energy of z is the same as the energy of pssq(0).

y = Σ over i = −3, −2, −1, 1, 2, 3 of pssq(i) ∗ pssq_weight(i),

pssq_weight(i) = 0.5 ∗ (1 − cos(2π(i + 4)/(2 ∗ 3 + 2))),

z = C ∗ y, where C = ‖pssq(0)‖/‖y‖.
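In MATLAB, this smoothing step can be sketched as follows, assuming the six sequences and pssq(0) are stored as columns of a matrix (the data here are placeholders):

% Sketch of the smoothing step; columns of pssq hold pssq(-3)...pssq(3),
% with column 4 being pssq(0).
pssq = randn(80, 7);                         % placeholder pitch-synchronous sequences
idx  = [-3 -2 -1 1 2 3];
w    = 0.5 * (1 - cos(2*pi*(idx + 4) / (2*3 + 2)));   % pssq_weight(i)
y    = pssq(:, idx + 4) * w(:);              % weighted sum of the six sequences
C    = norm(pssq(:, 4)) / norm(y);           % rescale to the energy of pssq(0)
z    = C * y;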

4. Check if the smoothed residual satisfies the criterion.

The criterion of the enhancer is that the enhanced excitation is not allowed to differ much from the unenhanced excitation. This criterion is checked for each 80-sample sub-block.

e < b ∗ ‖pssq(0)‖², where b = 0.05 (Constraint 1),

e = (pssq(0) − z) ∗ (pssq(0) − z), where ∗ denotes the dot product.

5. Use the constraint to calculate the mixing factor and mix the smoothed signal with the unenhanced residual pssq(0).

From the criterion in the previous section, it is clear that the excitation is not allowed to change much. The purpose of this constraint is to prevent the creation of an enhanced signal significantly different from the original signal. This also means that the constraint limits the numerical size of the errors that the enhancement procedure can make. That is especially important in unvoiced segments and background noise segments, for which increased periodicity could lead to lower perceived quality.

When the constraint in the prior section is not met, the enhanced residual is instead calculated through a constrained optimization by using the Lagrange multiplier technique. The new constraint is that

e = b ∗ ‖pssq(0)‖², where b = 0.05 (Constraint 2).

There are thus two solution regions for the optimization: (1) the region where the first constraint is fulfilled and (2) the region where the first constraint is not fulfilled and the second constraint must be used.

In the first case, where the second constraint is not needed, the optimized re-estimated vector is simply z, the energy-scaled version of y.

In the second case, where the second constraint is activated and becomes an equality constraint, we have

z = A ∗ y + B ∗ pssq(0), where

A = sqrt[(b − b²/4) ∗ (w00 ∗ w00)/(w11 ∗ w00 + w10 ∗ w10)],

w11 = pssq(0) ∗ pssq(0), w00 = y ∗ y, w10 = y ∗ pssq(0) (∗ denotes the dot product), and

B = 1 − b/2 − (A ∗ w10)/w00.

Reference [2] contains a listing of a reference software implementation for the enhancement method.
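A compact MATLAB sketch of the criterion check and the constrained mix, using the notation above with placeholder vectors standing in for pssq(0) and the smoothed combination y, is:

% Sketch of the criterion check and the constrained mix (Constraints 1 and 2).
p0 = randn(80, 1);                           % stands in for pssq(0)
y  = randn(80, 1);                           % stands in for the weighted sum of the six PSSQs
z  = (norm(p0) / norm(y)) * y;               % energy-scaled smoothed vector
b  = 0.05;
e  = (p0 - z)' * (p0 - z);                   % squared error (dot product)
if e < b * (p0' * p0)                        % Constraint 1 fulfilled
    z_enh = z;
else                                         % otherwise enforce Constraint 2 (equality)
    w00 = y' * y;  w11 = p0' * p0;  w10 = y' * p0;
    A = sqrt((b - b^2/4) * (w00 * w00) / (w11 * w00 + w10 * w10));
    B = 1 - b/2 - A * w10 / w00;
    z_enh = A * y + B * p0;                  % mix of smoothed and unenhanced residual
end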


10.7 iLBC’s Synthesis and Postfiltering

After decoding or PLC of the LP excitation block, the decoded speech block is obtained by running the decoded LP synthesis filter over the block. The synthesis filters have to be shifted to compensate for the delay in the enhancer. For the 20 ms frame size mode, they have to be shifted by one 40-sample sub-block, whereas for the 30 ms frame size mode, they have to be shifted by two 40-sample sub-blocks.

The LP coefficients are changed at the first sample of every sub-block while keeping the filter state. For PLC blocks, one solution is to apply the last LP coefficients of the last decoded speech block for all sub-blocks. The reference software implementation for the synthesis filtering can be found in Reference [2]. The decoded speech block can be postfiltered by an HPF to remove the low frequencies of the decoded signal. A reference implementation of this, with cutoff at 65 Hz, is shown in [2].
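The per-sub-block synthesis filtering with a carried-over filter state can be sketched in MATLAB as below; the excitation and the second-order LP coefficients are placeholders for the decoded values.

% Run the LP synthesis filter 1/A(z) sub-block by sub-block while keeping the
% filter state; the coefficients change at the first sample of every sub-block.
exc = randn(240, 1);                         % decoded/enhanced excitation for one 30 ms frame
lpc_set = repmat([1 -0.9 0.2], 6, 1);        % hypothetical A(z) per sub-block (order 2 here)
speech = zeros(size(exc));
zi = zeros(size(lpc_set, 2) - 1, 1);         % filter state carried across sub-blocks
for k = 1:6
    n = (k-1)*40 + (1:40);
    [speech(n), zi] = filter(1, lpc_set(k, :), exc(n), zi);
end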

Finally, the iLBC speech coding algorithm is not subject to any known security consideration; however, its RTP payload format [1] is subject to several considerations, which are addressed there. The resulting media streams can be encrypted using external mechanisms, such as SRTP [8].

Any iLBC implementation can be evaluated by utilizing the methodology and tools available at http://www.ilbcfreeware.org/evaluation.html [2].

10.8 MATLAB® Signal Processing Blockset iLBC Demo Model

MATLAB Signal Processing Blockset version 6.6 and higher versions include an iLBC demo [9] that can be used to test the performance of this codec under different packet loss rates. We used MATLAB release R2007b.

This demo model basically loads a .wav speech file and passes it to the iLBC encoder block to convert it to an iLBC packet. Next, the packet is sent to the iLBC decoder block to be converted back into a speech signal, which is then played on a speaker (see Figure 10.8).

Double clicking on the iLBC encoder and decoder blocks brings up the block parameter dialogs (Figure 10.9), where it is possible to change the data transmission rate to one of the two iLBC modes [13.33 kbps (240 samples) or 15.20 kbps (160 samples)]. To run this demo, the encoder and decoder transmission rates must be identical for the speech signal to be properly reconstructed after the encoding/decoding process.

Double clicking on the Lossy Channel block allows the user to enter a packet loss rate in percentage from 0 to 100. Setting this parameter to different values will show how this codec's robustness offers an acceptable performance for high packet loss rates (see Figure 10.10).


FIGURE 10.8 MATLAB’s iLBC demo model block diagram.

The encoder and decoder blocks can be represented with different blocks that are used to execute different functions within the codec, such as encoding/decoding the start state, computing LPC/LSF coefficients, decoding indices, and so on.

These blocks are displayed in Figure 10.11 with a brief description of their functionality. Each sub-block can also be subdivided into smaller blocks, which are not mentioned here. A general overview of the decoder is sufficient to describe how this codec works and why its performance is still acceptable even with the high packet loss rates that occur within congested packet networks.

FIGURE 10.9 MATLAB’s iLBC demo model encoder and decoder blocks.


FIGURE 10.10 MATLAB's iLBC demo model Lossy Channel setup.

FIGURE 10.11 MATLAB’s iLBC demo model library.


FIGURE 10.12 Original speech signal for MATLAB’s iLBC demo model library.

The iLBC demo model was run with several packet loss rates (3%, 10%, 50%, and 80%). MATLAB's demo model generates a sample speech signal for this demonstration. The original signal is plotted in Figure 10.12.

The iLBC demo model simulation is executed first with a 3% packet loss rate and the signal output (yout) is plotted below. This signal is also converted to a .wav file using the following MATLAB command:

�wavwrite(yout,’yout_3_PLR’);

The output signal is plotted in Figure 10.13. The next step is to simulate a packet loss rate of 10% to evaluate the output signal and convert it into an audio file. The new PLR value is entered into the Lossy Channel block of the iLBC main diagram. The output signal with a packet loss rate of 10% is plotted in Figure 10.14. This signal is also converted to a .wav file using the following MATLAB command:

�wavwrite(yout,’yout_10_PLR’);

Next, for a 50% packet loss rate network, the signal output is plotted in Figure 10.15. And for an 80% packet loss rate, the output signal is plotted in Figure 10.16.

The figures displayed above show that the iLBC keeps the shape of the input signal in spite of the high packet loss rate, thanks to the packet independence (the use of LSF/LPC interpolation) and PLC techniques, making this coder more appropriate for packet-switched networks.


FIGURE 10.13 Output speech signal (with 3% packet loss).

FIGURE 10.14 Output speech signal (with 10% packet loss).


FIGURE 10.15 Output speech signal (with 50% packet loss).

FIGURE 10.16 Output speech signal (with 80% packet loss).


The decoded packets were still intelligible even for high packet loss rates of 50% and more. The iLBC was designed to preserve the spectral content of the signal even during the heavy packet loss conditions that occur in congested networks. The spectra of the input signal and of the output signal for different packet loss rates are displayed in Figures 10.17 through 10.19.

Here, y is the input signal generated by the iLBC demo, with a sampling frequency fs of 8 kHz, and yout is the output signal. The following commands are used to generate the spectrum of the input signal as well as the output signal for packet loss rates of 10% and 50%:

>> y1 = double(y);
>> yout1 = double(yout);
>> n = 0:1:39921;
>> freq = 2*pi*(n-19960)/39922;   % frequency axis in rad/sample, centered near 0
>> y2 = fft(y1, 39922);
>> y3 = fftshift(y2);             % shift the zero-frequency bin to the center
>> plot(freq, abs(y3)); grid on;
>> yout2 = fft(yout1, 39922);
>> yout3 = fftshift(yout2);
>> plot(freq, abs(yout3)); grid on;

FIGURE 10.17 Spectrum of original input speech signal.


FIGURE 10.18 Spectrum of output speech signal (with 10% packet loss).

FIGURE 10.19 Spectrum of output speech signal (with 50% packet loss).


The spectrum of the output signal for different packet loss rates (10% and 50%) is plotted in Figures 10.18 and 10.19. From these spectral figures, we see that the iLBC preserves the spectral content of the signal even during the heavy packet loss conditions that occur in congested networks.

10.9 PESQ

PESQ has been defined by ITU-T Recommendation P.862 as a method that predicts the subjective quality of speech codecs for IP telephony, wireless, and other narrowband speech coding applications.

In brief, PESQ compares a speech signal that enters a communications system (PSTN, IP, and wireless networks) with the output signal, which is a degraded version of the input signal. The degradation is caused by many factors such as packet loss, delays, packet jitter, signal distortion, quantization errors, and so on. The PESQ algorithm predicts in an objective way the subjective quality of the output signal as determined by a group of subjects evaluating the output signal of the communications system (Figure 10.20).

10.10 Evolution from PSQM/PSQM+ to PESQ

Speech analysis extracts from the original speech the physical characteristics that define it. PSQM represents an evolution from general audio quality measurement such as perceptual audio quality measurement (PAQM). The difference between these systems resides in the fact that PSQM focuses more on speech signals, based on data collected from different tests that show how the human brain tends to recognize speech signals more accurately than general audio signals, including music.

FIGURE 10.20 PESQ representation.

In brief, this algorithm converts the input and output signals into psychophysical representations that match the way our human brain represents them inside our heads. A cognitive model block estimates the quality level of the output signal based on the difference between the internal representations of both the original and the coded speech.

10.10.1 PSQM+

This algorithm overcomes some limitations of PSQM (ITU-T Recommendation P.861) and improves the correlation between the quality level predicted by this method and the results of the subjective listening tests.

The major improvements of this algorithm are as follows:

1. PSQM+ adds a new time alignment algorithm that enhances the measurements on noisy lines. The drawback of PSQM is that its standard time alignment starts with the detection of the first 0.5 ms during which the input and coded signals exceed a certain energy level. If noise is detected during that time window, the process will generate errors and the quality level would not be within the expected range.

2. PSQM+ tries to closely match the human ear's asymmetric processing of distortions, whereas PSQM weights them much more strongly.

This takes into account the distortion caused by packet loss or dropouts.

10.11 PESQ Algorithm

PESQ addresses the fact that the delay between the reference and the test signal is no longer constant. The PSQM/PSQM+ algorithms were primarily designed for systems where the delay is constant (i.e., GSM). The deployment of VoIP demonstrated that this delay is no longer constant, as a consequence of congested networks.

PESQ improves over PSQM by including a time alignment process, by handling handovers on mobile networks and offering a slightly lower score during these conditions, and also by enhancing the cognitive model to obtain better correlation results between input and output signals.

This algorithm starts by calculating delays between the input signal and the degraded output when the interval is significantly different from the previous one. Next, start and stop points are calculated that will be used to keep the alignment, especially when the delay is not constant (VoIP).


The next step is to convert the input and output signals into psychophysical representations in the human ear by following the steps listed below:

• Time alignment
• Level alignment
• Time-frequency mapping
• Frequency warping
• Loudness scaling (not implemented in PSQM/PSQM+)

This internal representation takes into account the effects of local gain variations and linear filtering to compensate for minor differences between the original input speech signal and the degraded output speech signal.

The cognitive model computes two errors that will be used to determine what the MOS level is for the output signal. This process of predicting the subjective evaluation of the quality of the output signal of a communication system is represented in Figure 10.21. This block diagram can be expanded into different sub-blocks that will perform the conversion to the psychophysical acoustic domain and the cognitive modeling that will predict the quality level output in an objective way (Figure 10.22).

The functions performed on this block are described as follows:

1. Time alignment ensures that the signals to be compared are exactly aligned in time. This compensates for delays that are common in packet networks.

2. Mimic ear resolution processes the speech signal in the frequency domain the same way the human ear does.

3. Remove filter influence from circuit-switched or packet-switched networks that can impact the PESQ score.

FIGURE 10.21 PESQ block diagram.


FIGURE 10.22 Conversion to psychophysical representation.

4. Remove gain variation to make sure automatic gain control (AGC) would not impact the quality of the PESQ score.

5. Brain loudness perception: Internal transformation of intensity into loudness perception.

The cognitive modeling consists of the following sub-blocks (Figure 10.23):

1. Perceptual subtraction generates a disturbance density signal that is based on the difference between the perceived loudness of the input and output signals.

2. Bad intervals identification forces time alignment and other functions to be redone. If the resulting disturbance signal is better, then the cognitive model will use it to compute the PESQ score.

3. Aggregates disturbances: This block sums the disturbance signals in the frequency domain to represent how distorted the speech is at certain time intervals, called split-second disturbances. Averaging split-second disturbances and split-second asymmetric disturbances will produce a PESQ_MOS score that will be the starting point to compute the final PESQ score.

FIGURE 10.23 Cognitive modeling.

4. Transforms to MOS-LQO: Converts the PESQ-MOS score into MOS-LQO conforming to ITU-T P.862. The MOS-LQO score varies from 1.0 (worst possible score) to 4.5 (highest).

10.12 PESQ Applications

ITU-T Recommendation P.862 lists some applications where PESQ has acceptable accuracy, such as:

• Codec evaluation and selection
• Live network testing (after ensuring that the physical layer is working properly) and prototype and emulated network testing
• Coding technologies:
  • Waveform codecs: G.711, G.726, and G.727
  • CELP and hybrid codecs: G.728, G.729, and G.723.1
  • Other codecs employed in mobile networks (GSM, CDMA, and TDMA (time-division multiple access))

There are of course some applications where PESQ's prediction is not accurate and where the ITU does not recommend using this method:

• In-service nonintrusive measurement equipment where the physical layer level is too low to make a good prediction of the quality
• Evaluating the performance of bidirectional communications
• Coding technologies when more than 25% of the speech signal is replaced by silence

10.13 Summary

In this chapter, we have discussed the iLBC, a speech codec designed for robust voice communication over IP networks.

For narrowband speech under an IP environment (which can lead to degradation in speech quality due to lost frames, which occur in connection with lost or delayed IP packets), the iLBC helps to reduce errors and improve intelligibility of the transmitted speech.


The iLBC supports two basic frame lengths: 20 ms at 15.2 kbps and 30 ms at 13.33 kbps. We presented details of the encoding and decoding processes for the two frame lengths.

The critical PLC techniques were discussed. This helps to lower error rates even in high error environments. Extra features of the iLBC include gain-shape waveform matching forward in time, gain-shape waveform matching backward in time, start state encoding, and pitch enhancement.

A real-time implementation of the iLBC on the TI C6416 DSP (based on the MATLAB Simulink model) has been completed and is available from Reference [10].

A MATLAB’s signal processing blockset iLBC demo model was pre-sented and experimental results were shown that demonstrated the robustperformance of iLBC.

Also, the PSQM/PSQM+ quality measures were reviewed. The need for perceptual quality assessment and the PESQ standard, algorithm, and applications were discussed to end the chapter.

EXERCISE PROBLEMS

10.1. Why does the iLBC perform better than ITU-T G.729A when used in a packet network?

10.2. Describe how burst network errors are handled by the iLBC.
10.3. Describe the process of encoding in the iLBC.
10.4. Describe the process of decoding in the iLBC.
10.5. Describe the process of PLC in the iLBC.
10.6. List the advantages of the iLBC over typical CELP coders (Hint: Think of the differences in terms of an adaptive codebook).
10.7. What information occupies the most bits in the bit allocation table within the encoded bits and why (see Tables 10.1a and b)?

References

1. Duric, A. and S. Andersen, Real-time transport protocol (RTP) payload format for Internet low bit rate codec (iLBC) speech, RFC 3952, December 2004.

2. Andersen, S., et al., Internet low bit rate codec (iLBC), RFC 3951, http://tools.ietf.org/html/rfc3951, IETF, December 2004.

3. Global IP Sound, iLBC—designed for the future, White Paper, http://www.gipscorp.com/files/english/white_papers/iLBC.WP.pdf, October 15, 2004.

4. ITU-T Recommendation G.711, available online from the ITU bookstore at http://www.itu.int.

5. PacketCable(TM) Audio/Video Codecs Specification, Cable Television Laboratories, Inc.

6. Andersen, S.V., W.B. Kleijn, R. Hagen, J. Linden, M.N. Murthi, and J. Skoglund, iLBC—a linear predictive coder with robustness to packet losses, Proceedings of the IEEE Speech Coding Workshop, 2002.


7. Kleijn, W.B., Enhancement of coded speech by constrained optimization, Proceedings of the IEEE Speech Coding Workshop, 2002.

8. Baugher, M., D. McGrew, M. Naslund, E. Carrara, and K. Norman, The secure real-time transport protocol (SRTP), RFC 3711, March 2004.

9. Mathworks, iLBC demo, Mathworks MATLAB Central, 2008.
10. Koji Seto, Fixed-point implementation of iLBC speech codecs on C6416 DSP, Research Report, Santa Clara University, 2009.

Bibliography

1. Bradner, S., Key words for use in RFCs to indicate requirement levels, BCP 14, RFC 2119, March 1997.

2. ITU-T, Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, ITU-T P.862, 01/2001.

3. Ericsson, AQM in TEMS Automatic-PESQ, Technical Paper, 40/19817-AOMR 305001 Rev C.

4. OPTICOM GmbH, Germany, State of the art voice quality testing, White Paper.
5. Juan Marsmela, Study of speech coding and iLBC, Independent Study Report, Santa Clara University, 2007.
6. Gibson, J., Speech coding methods, standards and applications, IEEE Circuits and Systems Magazine, 30–40, Fourth quarter, 2005.
7. Childers, D., et al., The past, present and future of speech processing, IEEE Signal Processing Magazine, (May), 24–48, 1998.
8. Yang, W., M. Benbouchta, and R. Yantorno, Performance of the modified bark spectral distortion as an objective speech quality measure, Proceedings of the IEEE IACSP Conference, pp. 541–544, 1998.
9. Wang, S., A. Sekey, and A. Gersho, An objective measure for predicting subjective quality of speech coders, IEEE Journal on Selected Areas in Communications, JSAC-10, 819–829, 1992.


11
Signal Processing in VoIP Systems

11.1 Introduction

Speech communication using VoIP is rapidly replacing the ubiquitous circuit-switched telephone service. The packetization of speech and its transmission through packet-switched networks, however, introduces numerous impairments, including delay, jitter, packet loss, and decoder clock offset, which degrade the quality of the speech. Advanced signal processing algorithms can combat these impairments and render the perceived quality of a VoIP conversation to be as good as that of the existing telephone system.

We begin this chapter with a brief introduction to PSTN and VoIP networks. We then address the characterization of the impairments and discuss signal processing algorithms such as echo cancelation, adaptive jitter buffering, PLC, and decoder clock synchronization to mitigate their effects.

11.2 PSTN and VoIP Networks

A typical voice call using the PSTN proceeds as follows: analog speech from the near-end handset is first encoded at the originating exchange using the 64 kbps G.711 PCM standard; it is then transported through a series of 64 kbps time-division multiplexed (TDM) circuit switches to the terminating exchange, where it is decoded back to the original analog waveform and sent to the far-end handset. Since the TDM switches in the voice path have small frame buffers, and are all synchronized to a common reference clock by an overlay synchronization network, there are virtually no impairments to the switched voice samples. Thus the PSTN is ideally suited for voice communications and the resulting speech quality is considered to be outstanding. But it is not flexible for switching traffic with rates other than 64 kbps, and is also not efficient for transmitting bursty traffic. Moreover, it requires two separate networks: a circuit-switched TDM network for the voice path and a packet-switched signaling system number 7 (SS7) network for setting up and tearing down the connections.


TABLE 11.1
VoIP Impairments and Signal Processing Algorithms

Impairment            Typical Values   Signal Processing Algorithm
End-to-end delay       100–200 ms      Echo canceler (ECAN)
Packet jitter          20–100 ms       Adaptive jitter buffer (AJB)
Packet loss            1–5%            Packet-loss concealment (PLC)
Receiver clock skew    100–500 ppm     Speech time-scale modification (TSM)

On the other hand, the packet-switched IP network natively supports variable bandwidth connections and uses the same network for both media and signaling communications. In a VoIP call, speech is first encoded at the transmitter using one of the voice encoding standards, such as G.711, G.726, or G.729. The encoded speech is then packetized using the RTP. After appending additional headers to complete the protocol stack, the packets are routed through the IP network. They are depacketized and decoded back to analog voice at the receiver.

The packetization and routing processes used in a VoIP system necessitate speech buffers. Thus, the voice packets transported in such a network incur larger delays compared to the PSTN. They also arrive at the receiver unevenly spaced out in time due to the variation of the buffering delay in the path routers. This delay variation, known as jitter, must be smoothed out with a jitter buffer. Furthermore, in a typical IP network, the intermediate routers drop packets when they are overloaded due to network congestion. The ensuing gaps in the received packets have to be bridged with PLC algorithms. A further consequence of transporting voice through an IP network is that it is difficult to reproduce the original digitizing clock at the receiving end, and hence the frequency of the (independent) playout clock generally differs from that of the sampling clock. This leads to underflow or overflow situations at the receive buffer, as the voice samples are written into it at the original voice digitizing rate but read out at a different rate. Such a clock skew problem has to be corrected using silence interval manipulation or speech waveform compression/expansion techniques.

Satisfactory communication of voice using the packet IP network therefore demands that the effect of the above-mentioned impairments be mitigated using proper signal processing algorithms. Table 11.1 delineates the typical values for these impairments and the relevant signal processing algorithms to alleviate their effects.

11.3 Effect of Delay on the Perceived Speech Quality

The perceived quality of speech in a telephone network depends on the transmission delay and the amplitude of the reflected signal of the talker speech, commonly referred to as echo.


TABLE 11.2
ITU Recommendation G.114 Delay Guidelines

One-Way Delay    ITU Guidelines
>25 ms           Echo control required
<150 ms          Mostly acceptable
150–400 ms       Acceptable (maybe)
>400 ms          Unacceptable (in general)

Almost all telephone conversations are contaminated by some form of echo: either the electrical line echo generated at the two-wire to four-wire translators (hybrids) in switching exchanges or the acoustic echo due to the coupling between the speaker and the microphone in hands-free telephone sets. A modest amount of low-delay echo, known as side tone, is actually desirable to preserve the naturalness of speech. This represents the acoustic coupling between the mouth and the ear during normal conversations and is intentionally inserted in the telephone handset by feeding back a portion of the transmitted speech to the earpiece. If the round-trip delay of the network echo is small, it is simply perceived as a change in the level of the side tone, and is therefore harmless. But it gets more annoying as the delay increases. In fact, the ITU recommends the deployment of external echo control equipment, such as echo suppressors or cancelers, when the one-way delay exceeds 25 ms. According to ITU Recommendation G.114, even if echo is adequately controlled, the one-way delay has to be limited to 150 ms for acceptable quality of the conversation. For delays in the range 150–400 ms, the conversation may be acceptable. But if the delay exceeds 400 ms, humans will find it difficult to carry on a normal two-way conversation, as they tend to cut each other off. Such delays are clearly unacceptable. These ITU guidelines are summarized in Table 11.2.

The overall delay in a typical VoIP conversation consists of speech codec and speech packetization delays at the transmitter, propagation and queuing delays in the transmission system, and decoding and jitter buffer delays at the receiver. The combined one-way delay, tabulated in Table 11.3, is typically in the range 100–200 ms, although it can be as much as 400 ms under worst-case situations. These values should be doubled for estimating the round-trip delay.

TABLE 11.3
One-Way VoIP Delay Breakup

Speech codec        0.2–68 ms
RTP packetization   5–30 ms
Network             25–150 ms
Jitter buffer       50–100 ms
Total               ∼80–400 ms
Typical             100–200 ms



The increased round-trip delay in VoIP connections normally mandates the deployment of echo control equipment. As mentioned earlier, there are two types of echo mechanisms at play in the telephone network: line echo and acoustic echo. We will describe their characteristics and the methods to alleviate their effects in the following sections.

11.4 Line Echo Canceler

Telephone line echo arises due to an impedance mismatch at the hybrid that converts the subscriber's two-wire line to a four-wire line. Hybrids are typically employed at the switching exchanges. Figure 11.1 illustrates how echoes are generated at these two-wire to four-wire junction points. The speech originating at the near end, for example, is reflected by the hybrid at the far-end switch, and the resulting echo reaches the speaker after a time lag equal to the round-trip transmission delay. The amplitude of the echo naturally depends on the return loss of the hybrid, which is typically in the 10–20 dB range. It turns out that the actual annoyance due to the echo is a function of both the amplitude and the transit delay; it escalates as the values of these parameters increase. Of course, the far-end speaker also suffers from the same echo problem due to reflection from the near-end hybrid. Employing split ECANs, as shown in Figure 11.2, can minimize the irritation caused by the echoes. It should be noted that the ECAN at the near end actually provides protection for the far-end talker and vice versa.

Figure 11.3 illustrates the block diagram of a typical network line ECAN. The adaptive filter (AF) is the main functional element of the canceler. It is excited by the near-end receive signal that is responsible for generating the actual echo via the hybrid. As shown in the figure, a replica echo signal is synthesized by this filter. This replica is then subtracted from the actual echo, thereby providing the required echo canceling function. The filter coefficients are generally adapted using the residual signal so that the filter response tracks the time-varying hybrid echo path.

FIGURE 11.1 Line echo generation in the PSTN.

FIGURE 11.2 Deployment of split ECANs in the PSTN.

Many other functional elements are essential to the proper operation of the canceler. These include the double-talk detector (DTK det), whose function is to prevent the adjustment of the filter coefficients during double-talk conditions, as adaption during such a condition corrupts the coefficients, and the nonlinear processor (NLP), which is required to suppress the low-level residual that is not canceled by the subtraction process. An optional comfort noise generator may be included as part of the NLP to insert background noise in place of silence when the NLP kicks in.

The basic principle of operation of line ECANs in the VoIP network is similar to that in the PSTN. However, they have to be designed to operate under larger signal delays and more severe signal distortions. They are typically integrated into media gateway equipment that bridges PSTN and IP networks.

11.4.1 Adaptive Filter

The most common adaptive filter is an FIR filter. Figure 11.4 shows a block diagram of this filter. For such a filter, the output y(n) is given by

y(n) = h(n)^T x(n) = x(n)^T h(n),     (11.1)

FIGURE 11.3 Functional block diagram of a line ECAN.


FIGURE 11.4 FIR adaptive filter (N is the filter length).

where n denotes the sample time index (and also the iteration number), and the coefficient and signal vectors are given by

h(n)^T = [h0(n), h1(n), ..., hN−1(n)],     (11.2)

x(n)^T = [x(n), x(n − 1), ..., x(n − N + 1)].     (11.3)

The coefficients are generally updated according to the LMS algorithm:

h(n + 1) = h(n) + μe(n)x(n), (11.4)

where μ is the adaption gain, and the error signal e(n) is the difference between the desired signal d(n) and the filter output y(n):

e(n) = d(n) − y(n). (11.5)

The adaption gain μ determines the speed of convergence of the algorithm. It has to be chosen large enough to provide a fast convergence time but small enough to ensure stability of the update algorithm. Further, it turns out that the convergence time is a function of the input signal power with a fixed adaption gain. This can be avoided by normalizing it with the signal power. The resulting update rule is known as the normalized least mean square (NLMS) algorithm:

h(n + 1) = h(n) + μ e(n) x(n)/(γ + x^T(n) x(n)).     (11.6)


Now the effective adaption gain is

μ/(γ + x^T(n) x(n)).     (11.7)

A small protection term γ is included in the denominator to ensure that the gain value does not become excessively large when the signal power is small.
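A minimal MATLAB sketch of an NLMS echo canceler loop based on Equation 11.6 follows; the echo path, filter length, and adaption constants are illustrative choices, not recommended values.

% Sketch of an NLMS adaptive echo canceler update loop (Equation 11.6).
N     = 128;                                 % adaptive filter length
mu    = 0.5;                                 % adaption gain
gamma = 1e-3;                                % protection term
h     = zeros(N, 1);                         % adaptive filter coefficients
h_true = [zeros(10,1); 0.5; -0.3; 0.1; zeros(N-13,1)];   % example hybrid echo path

x = randn(4000, 1);                          % far-end (receive) signal
d = filter(h_true, 1, x);                    % echo picked up on the send side
e = zeros(size(d));
for n = N:length(x)
    xn   = x(n:-1:n-N+1);                    % input vector x(n)
    y    = h' * xn;                          % echo replica
    e(n) = d(n) - y;                         % residual (error) signal
    h    = h + mu * e(n) * xn / (gamma + xn' * xn);   % NLMS update
end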

11.4.2 Double-Talk Detector

The double-talk detector, as the name implies, flags the condition that both the near-end and far-end signals are active simultaneously. The adaptive filter should not be updated under such conditions as it leads to incorrect filter coefficients.

A simple algorithm, often used in practice, compares the current echo signal d(n) to the history of the filter input x(n), and declares double-talk if the absolute value of d(n) exceeds one-half of the maximum of the absolute values of the filter states. That is, a double-talk condition is announced if

|d(n)| ≥ (1/2) max{|x(n)|, |x(n − 1)|, ..., |x(n − N + 1)|}.     (11.8)

Here N is the FIR length, and the factor of one-half is based on the assumption that the hybrid provides an echo return loss of at least 6 dB. The filter adaptation freezes once this condition is detected. The freezing actually is held over for a short time interval, known as the hangover time, after the condition clears.
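The sketch below implements this threshold test (Equation 11.8) with a simple hangover counter in MATLAB; the signals, filter length, hangover duration, and echo path are illustrative.

% Sketch of the double-talk test of Equation 11.8 with a hangover counter.
N = 128;                                     % history length (matches the filter length)
x = randn(4000, 1);                          % far-end signal
d = 0.4 * filter([zeros(10,1); 0.5], 1, x) + 0.02 * randn(4000, 1);   % mic signal: echo + noise
HANGOVER = 240;                              % 30 ms hangover at 8 kHz
hang = 0;
dtd = false(size(d));
for n = N:length(d)
    xmax = max(abs(x(n:-1:n-N+1)));          % largest recent far-end magnitude
    if abs(d(n)) >= 0.5 * xmax               % near-end level too high: declare double-talk
        hang = HANGOVER;
    end
    dtd(n) = hang > 0;                       % adaptation should freeze while this is true
    hang = max(hang - 1, 0);
end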

An alternative method for double-talk detection is the cross-correlation method. This is based on estimating the cross-correlation function between the two signals and declaring double-talk if the maximum of its absolute values falls below a prescribed threshold. Specifically, the correlation function between d(n) and x(n) is evaluated first:

r(k) = E{x(n) d(n + k)} / ( sqrt(E{x²(n)}) sqrt(E{d²(n)}) ).     (11.9)

Double-talk is declared if max{|r(k)|} ≤ T, k = 0, 1, ..., N − 1, where T is a double-talk correlation threshold, and E{·} is the expectation operator, which can be approximated by time averaging.

It takes a finite time interval to stop the filter adaption after the onset of double-talk since its detection is based on the computation of signal energies. The filter coefficients may get corrupted during this interval. There are two techniques to solve this problem. One approach is to save the most recent "good" coefficients that were calculated before the occurrence of double-talk, and replace the corrupted coefficients with the saved ones. The problem with this approach is that several copies of the coefficients spaced out in time need to be saved, as it is difficult to figure out when actually the double-talk occurred. A second technique is to employ an ECAN that has two separate filters: an online (or foreground) filter that synthesizes the echo replica to cancel out the actual echo and an offline (or background) filter that only adapts to the echo path but is not in the signal path. If it is determined that the offline filter yields better echo cancelation performance than the online filter, then the coefficients of the online model are refreshed by those of the offline model. This online–offline approach provides superior double-talk protection and avoids filter divergence during double-talk conditions. But it is more expensive as it requires two separate filters.

11.4.3 Nonlinear Processor

The function of the NLP is to further reduce the residual echo level that remains after imperfect cancelation so that the far-end talker does not hear any hybrid echo when the near end is quiet. Conceptually, it is a device that blocks low-level signals and passes high-level signals. As an option, comfort noise is transmitted to the far end when the NLP is operated.

The NLP should be deactivated when near-end speech is present; otherwise it will distort the near-end speech. Moreover, it should be activated only when the level of the residual is significant, since its primary role is to annihilate this residual.

A typical input–output characteristic of the NLP is shown in Figure 11.5. According to ITU Recommendation G.168, the suppression threshold level of the NLP, which is equal to the highest level of a sine-wave signal at a given moment that is just suppressed, should be adaptive and be in the range 15–21 dB below the level of the far-end received signal.

11.4.4 Comfort Noise Generator

When the NLP is activated, it cuts off not only the residual echo but also the ambient noise that is inevitably present during normal conversation. The far-end listener hears nothing but pure silence during that period. There are two issues associated with this mode of operation. First, the person at the far end may think that the line is dead and hang up. Second, the background noise appears suddenly when the NLP is deactivated subsequently due to the onset of near-end speech. The frequent activation and deactivation of the NLP therefore result in an unpleasant modulation of the background noise as heard by the far-end user. Consequently, the perceived quality of the speech deteriorates. To circumvent these problems, the NLP must insert an artificially generated comfort noise whose characteristics should be similar to that of the ambient noise.


FIGURE 11.5 Transfer function of the NLP.

The background noise may be adequately described by its energy and spectral content. Since the noise is usually time varying, these parameters have to be estimated frequently. It is also necessary to average the parameter estimation over a period of time in order to prevent abrupt changes in the characteristics of the generated comfort noise. The appropriate amount of averaging depends on the degree of nonstationarity of the ambient noise.

The comfort noise generator usually consists of a noise-shaping filter, whose spectral characteristics mimic those of the background noise, driven by a white noise source. The required filter order depends on the spectral nature of the ambient noise and the desired signal bandwidth.

Before the filter parameters can be estimated, a voice activity detector (VAD) is employed to determine the intervals during which only the background noise is present. A set of autocorrelation values is computed during these inactive voice segments. Then, the required filter coefficients (predictor or reflection) can be obtained using the LD algorithm, for example.
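A possible MATLAB sketch of such a comfort noise generator is shown below; it assumes the Signal Processing Toolbox functions xcorr and levinson are available, and the noise segment and filter order are placeholders.

% Sketch: estimate a noise-shaping filter from a background-noise segment
% (flagged by the VAD) and drive it with white noise to produce comfort noise.
noise_seg = 0.01 * randn(1600, 1);           % example background-noise segment (VAD inactive)
order = 10;
r = xcorr(noise_seg, order, 'biased');       % autocorrelation lags -order..order
r = r(order+1:end);                          % keep lags 0..order
[a, sigma2] = levinson(r, order);            % predictor coefficients and error power (LD algorithm)
cng = filter(sqrt(sigma2), a, randn(800, 1));   % comfort noise with matched level and spectrum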

11.5 Acoustic Echo Canceler

Hands-free telephone communication systems are prevalent today in mobile telephony, teleconferencing, video conferencing, and distance learning. Acoustic echoes arise in such systems due to coupling of the audio signal from the loudspeaker to the microphone. These echoes are just as annoying as line echoes if the round-trip delay is large. Even if this delay is small, high-pitched squealing, also known as howling, can occur due to acoustic feedback if the parties at both ends are using hands-free communications systems with the microphone placed too close to the loudspeaker. The echo and howling problems can be corrected by incorporating an acoustic echo canceler (AEC) in the terminal equipment. ITU Recommendation G.167 specifies the performance requirements for such systems.

As in the case of line ECANs, AECs also need to be deployed in a split arrangement to provide echo protection to both ends. The AEC at the near end provides echo and howling protection to the people at the far end and vice versa. Figure 11.6 is a typical functional diagram of the AEC. Here the adaptive filter synthesizes the acoustic echo, which is subtracted from the microphone output, consisting of the near-end speech and the actual acoustic echo. But the echo attenuation provided by the adaptive filter is typically not sufficient. Hence an NLP is employed to reduce the residual echo that is not canceled by the AEC. It uses nonlinear processing, typically a center clipper, to suppress the echo to a level that is normally not perceivable by the subject at the far end of the conversation.

An additional functional unit is typically needed in the AEC to prevent howling that ensues if the closed-loop system formed by the transmission medium and the two echo paths oscillates due to positive feedback. In such a case an unwanted strong sinusoidal signal, whose frequency depends on the loop delay, circulates in the closed loop along with speech signals. This can be very annoying and therefore has to be suppressed by the AEC.

Since a howling condition results in strong sinusoidal signals, it can be easily detected using a second-order adaptive FIR filter, for example. There are a number of methods to prevent the unwanted oscillatory behavior of the system due to howling.

One such method to prevent howling is to slightly shift the frequency of the transmitted signal after the echo cancelation. ITU Recommendation G.167 specifies a maximum value for this frequency shift. But it also emphasizes that the frequency shift should be avoided if the terminal is likely to be used on

FIGURE 11.6 AEC functional block diagram.

Page 299: Principles of Speech Coding. - Ogunfunmi - Narasimha. 2010

Signal Processing in VoIP Systems 285

connections that include line ECANs, which will not be able to cope with the time-varying tail echoes implied by the frequency shift. An alternative method is to gradually attenuate the signal in one or both of the paths when a howling condition is detected, and gradually remove the attenuation once the condition clears. Since G.167 allows up to 6 dB of attenuation in the paths, the gain control of the paths is a straightforward and simple solution to the howling problem. A third technique is to employ an adaptive notch filter to suppress the howling frequency and disable the filter when it is no longer needed. The drawback of this method is that the notch filter will distort the desired signal frequencies while suppressing the howling frequency.

It should be noted that the design of the AEC adaptive filter is more complicated than that of the line ECAN since the acoustic reflection path can be "electrically" quite long. The speed of sound is about 343 m/s, which is much lower compared to that of electromagnetic waves. Also, there can be multiple echoes within a room due to high reflectivity of the walls. For example, if the loudspeaker and microphone are colocated on one wall, and if there are eight significant reflections with a distant wall that is 10 m away, the total duration of the echo response is 8 × 10/343 = 0.233 s. This is quite long compared to tail lengths of 32–128 ms typically encountered with line echoes, and requires 233 ms × 8 taps/ms = 1864 taps in the adaptive filter. Also, the acoustic echo paths can be highly time varying due to the movements of objects or the handset equipment itself. (Note that a mere 4.3 cm change in the length of the echo path moves the impulse response by 1 tap at the 8 kHz sampling frequency.) This necessitates highly agile adaptive filters that are difficult to implement, especially in the speech environment where the eigenvalue spread of the correlation matrix is rather large. Sub-band techniques can be effective in this case as they generally perform better when the filter is long and the input signal is correlated.
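The arithmetic of this example is easy to reproduce; the short program below recomputes the echo response duration, the corresponding number of taps at 8 kHz (the small difference from the 1864 figure quoted above comes from rounding 0.233 s to 233 ms before multiplying), and the path-length change corresponding to one tap. The constants are those of the example, not of any particular room.

    #include <stdio.h>

    int main(void)
    {
        const double c = 343.0;        /* speed of sound in m/s        */
        const double fs = 8000.0;      /* sampling frequency in Hz     */
        const double distance = 10.0;  /* distance to the far wall, m  */
        const int reflections = 8;     /* significant reflections      */

        double duration = reflections * distance / c;   /* ~0.233 s    */
        double taps = duration * fs;                    /* ~1866 taps  */
        double cm_per_tap = 100.0 * c / fs;             /* ~4.3 cm     */

        printf("echo duration = %.3f s, taps = %.0f, cm per tap = %.1f\n",
               duration, taps, cm_per_tap);
        return 0;
    }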

11.6 Jitter Buffers

Since individual voice packets experience varying delays in a VoIP network, their arrivals at the receiver will be irregular even though they are generated uniformly at the transmitter. This delay variation, known as jitter, must be absorbed in a buffer so that voice samples can be played out in the receive-end DAC at the exact (uniform) sampling rate of the transmit-end ADC. The function of the jitter buffer therefore is to provide larger delays for packets that take a shorter time to travel through the network, and shorter delays for packets that take a longer time, thereby transforming the varying network jitter into a fixed delay. The size of the jitter buffer is a trade-off between packet loss and buffering delay; late arriving packets will be discarded if the buffer is too short, while the overall speech-path delay will be unnecessarily extended if it is too long.



FIGURE 11.7 Principle of jitter buffering.

Figure 11.7 illustrates the principle of jitter buffering. Packets are emitted at a uniform rate and arrive at the receiver at irregular intervals. The difference between the two is the varying network delay. These packets have to be played out at the same uniform rate as the emission rate, as shown by the playout line in the diagram. The difference between the playout time and the emission time represents the constant playout delay incurred by the packets, whereas the difference between the playout time and the arrival time is the variable buffer delay provided by the jitter buffer. Packets that arrive after their scheduled playout time are lost.

Jitter buffers can be broadly classified as fixed or adaptive, based on the buffering discipline. A fixed jitter buffer maintains a constant size, whereas an AJB adjusts its size dynamically to optimize the above-mentioned trade-off. However, care is needed in the adaptation process since adjustments in the middle of an active speech segment will distort the underlying speech signal. Pauses between talk spurts are a natural place to perform the buffer adaptation as it will not significantly affect the perceived speech quality. However, it may not always be possible to wait for the silence intervals if buffer spill is imminent due to large delay variation within talk spurts. In such cases, the alteration can be carried out within talk spurts by using well-known speech TSM algorithms.

Figure 11.8 illustrates the three jitter buffer strategies described above. Here, the y-axis denotes just the delay experienced by the packet, and not the actual time. That is, we have subtracted the (constant slope) emission time from the values. The solid lines indicate the network delay incurred by the packets, whereas the dotted lines depict the playout delay, which is defined as the sum of the network delay and the buffer delay, for these packets. The difference between the two is the buffering delay provided by the jitter buffer.



FIGURE 11.8 Jitter buffer strategies: (a) fixed playout schedule; (b) playout delay adaptation only during silence periods; and (c) playout delay adaptation within talk spurts.

In the fixed jitter buffering scheme, the buffer size, which essentially determines the playout delay, is chosen at the beginning of the call and is held constant for the entire conversation. There is a trade-off between buffering delay and packet loss in this case, as illustrated in Figure 11.8a. If the buffer is small, the playout delay is also small, but a large proportion of the arriving packets is lost. This naturally deteriorates the voice quality. For an optimized buffer size, there is an evenhanded trade-off between the percentage of packets lost and the magnitude of the playout delay. On the other hand, if the jitter buffer size is increased so that there is no packet loss, the playout delay is correspondingly increased, which also results in poor voice quality. Since the network jitter varies during the length of the conversation, it is difficult to choose the proper size for the jitter buffer at the beginning of the call.

An AJB with adjustment of the silence intervals is depicted in Figure 11.8b. It behaves just like a fixed jitter buffer during a typical talk spurt. But the buffer size is altered for the next talk-spurt duration based on the network statistics gathered so far. This is equivalent to simply altering the duration of the silence interval between the talk spurts; it is lengthened if a larger buffer is desired or shortened for a smaller desired buffer capacity.
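One widely used way of gathering these network statistics (following the adaptive playout estimators described in the literature, such as the INFOCOM paper listed in the Bibliography) is to keep exponentially weighted running estimates of the packet delay and of its variation, and to set the playout delay for the next talk spurt to the mean plus a safety margin of a few times the variation. The sketch below is illustrative; the smoothing constant and the margin factor are typical but not mandated values.

    #include <math.h>

    #define ALPHA 0.998   /* smoothing constant; close to 1 gives slow tracking */

    struct playout_stats {
        double d;   /* smoothed mean network delay (ms) */
        double v;   /* smoothed delay variation (ms)    */
    };

    /* Update the running statistics with the delay observed for one packet
     * (arrival time minus the time stamp carried in the RTP header). */
    static void update_stats(struct playout_stats *s, double delay)
    {
        s->d = ALPHA * s->d + (1.0 - ALPHA) * delay;
        s->v = ALPHA * s->v + (1.0 - ALPHA) * fabs(delay - s->d);
    }

    /* Playout delay to apply to the packets of the next talk spurt. */
    static double next_playout_delay(const struct playout_stats *s)
    {
        return s->d + 4.0 * s->v;
    }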

It is necessary to detect the onset of a talk spurt to use this method. This can be accomplished in a number of ways. Transmitters that possess VAD circuits set a marker bit in the transmitted RTP packet to indicate the first packet of a talk spurt. Alternatively, assuming that the transmitter does not send any RTP packets during silence intervals, it is possible to deduce the beginning of a talk spurt by observing the sequence numbers and time stamps; the sequence numbers differ by one for consecutive packets, whereas a big jump in their time stamps signals a talk spurt. But, if the transmitter does not have the VAD circuit, it will not be able to discriminate between silence and active speech, and will transmit normal RTP packets during silence intervals. In such a case, the receiver has to include the VAD circuit to mark talk-spurt boundaries.

The AJB that adjusts the buffer size only during silence periods requires a VAD either at the transmitter or at the receiver to recognize the talk spurts. Furthermore, it does not perform well in networks with rapid delay variations. Better results can be obtained in such networks if the playout delay is also adapted within talk spurts, as illustrated in Figure 11.8c. Buffer modification during a talk spurt, however, is more complicated. It implies the alteration of the length of an active speech segment, as opposed to a silence interval, without changing the natural pitch period of the underlying speech signal. Sophisticated TSM algorithms are generally employed to accomplish this. They perform the necessary shrinking or stretching by deleting or inserting a number of pitch periods within the active speech segment.

11.7 Clock Skew

The algorithms described above for computing the playout time are based only on the network delay and jitter statistics and do not take into account the actual fill level of the buffer. They perform effectively in a synchronized system where the receiver clock is locked to that of the transmitter. As most practical receivers operate in the asynchronous mode, the actual buffering delay experienced by a typical packet will be different than what it was intended to be.

Assume an ideal jitter-free network and a fixed jitter buffer of capacity B milliseconds. Assume further that the first received packet is placed at the center of the buffer. Then, in the synchronous case, this packet, and all subsequent packets, will incur a constant buffering delay of (B/2) milliseconds. (Note that in an ideal synchronous network B can be set to zero, which implies that the packets are played out as soon as they arrive.) But if the read clock is slower than the write clock, the actual buffering delay builds up to the maximum capacity of the buffer, and eventually the buffer will overflow. The variation of this delay will be from zero to B milliseconds for a circular buffer design. And when the buffer overflows, a speech segment, equal in duration to that of the buffer, will be deleted. The delay variation will be the same if the read clock is faster; the only difference in this case is that the buffer underflows as a result of the buffering delay shrinking with time, and consequently the speech segment will be repeated instead. This deletion or repetition of a slice of the speech waveform is known as a slip.

The uneven playout, due to the buffering delay variation, in reality distorts the speech signal. But its effect on the subjective quality is minimal because the time scales involved are fairly long. However, the distortion caused by the slips will be more pronounced, as it results in audible clicks due to the discontinuity of the speech signal.

The frequency of slips depends on the clock skew and the buffer capacity. If the fractional frequency offset between the transmitter and receiver clocks is ε and if the buffer capacity is B seconds, then the duration between slips can be shown to be B/ε seconds. For example, assuming ε = 10⁻³, which means that the clocks differ by 1000 ppm, and B = 100 ms, the duration between slips works out to 100 s.

Thus, although it is theoretically possible to employ a very small receive buffer in an ideal jitter-free network, such a scheme is not practical when the clocks are not synchronized, because the frequency of slips increases as the buffer size decreases. On the other hand, increasing the buffer size implies larger playout delays, which also adversely affect the subjective quality.

It is generally difficult to estimate this clock skew and convert the sampling rate of the received stream to a new rate to account for the skew. Hence an integrated AJB, where both the jitter and the clock skew issues can be simultaneously addressed, is the preferred solution in modern VoIP systems.

11.8 Packet Loss Recovery Methods

Packet loss is an unavoidable phenomenon in VoIP networks. In fact, loss rates in the range 1–5% are routinely observed in uncontrolled IP networks such as the Internet. The loss mainly occurs at the intermediate routers due to network congestion. Packets may also be lost at the receiver if they arrive after their scheduled playout time, in which case they are simply discarded, or if there is no room in the jitter buffer to place them. In order to maintain an acceptable speech quality, it is essential to employ either forward error correction (FEC) techniques to recover the lost packets or signal processing algorithms to bridge the created gaps.

11.8.1 Transmitter-Based FEC Techniques

In the FEC method, the transmitter sends error correction information in addition to the normal speech packets. If the speech packets are lost in the transmission, they can usually be recovered using the error correction information. There are two FEC methods specified by the IETF: media-independent FEC and media-specific FEC.

The media-independent FEC scheme, specified in RFC 2733, is a generic method that is applicable to the transmission of any type of media including speech, video, or multimedia. Here, the transmitter generates one parity packet every n media packets by applying an exclusive-or (XOR) operation across the group, as shown in Figure 11.9. The parity packets are sent along with the media packets. (The receivers that do not implement FEC can just ignore them.) If any of the packets within that group to which the FEC is applied is lost, then it can be recovered by performing a second XOR operation on the correctly received packets of the group. This scheme is easy to implement and does not require much computational resources. But it needs additional bandwidth for transmitting the FEC packets. Further, the speech-path delay is increased, as the receiver must wait for the arrival of all (n + 1) packets to implement the required error correction.
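The XOR operation at the heart of this scheme is illustrated by the sketch below. For simplicity it assumes payloads of equal length and exactly one lost packet per group; RFC 2733 additionally defines an FEC header and padding rules for unequal payload lengths, which are omitted here.

    #include <string.h>

    /* Build the parity payload for a group of n equal-length media payloads. */
    static void fec_build_parity(unsigned char *parity,
                                 unsigned char *const payload[], int n, int len)
    {
        int i, j;
        memset(parity, 0, (size_t)len);
        for (i = 0; i < n; i++)
            for (j = 0; j < len; j++)
                parity[j] ^= payload[i][j];
    }

    /* Recover a single lost payload by XORing the parity payload with the
     * n - 1 payloads of the group that were received correctly. */
    static void fec_recover(unsigned char *lost, const unsigned char *parity,
                            unsigned char *const received[], int n_received, int len)
    {
        int i, j;
        memcpy(lost, parity, (size_t)len);
        for (i = 0; i < n_received; i++)
            for (j = 0; j < len; j++)
                lost[j] ^= received[i][j];
    }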

The media-specific FEC scheme, specified in RFC 2198, provides redundancy by transmitting multiple copies of the original data. The primary data packet is sent with the normal high bandwidth coding, while its copy is low bit rate encoded, to conserve bandwidth, and sent in the following packet, as depicted in Figure 11.10. If there is no packet loss, the copies are simply ignored at the


FIGURE 11.9 Media-independent FEC scheme. (From Perkins, C., Hodson, O., and Hardman, V., IEEE Network Magazine, 12(5), 40–48, 1998. With permission.)



FIGURE 11.10 Media-specific FEC scheme. (From Perkins, C., Hodson, O., and Hardman, V., IEEE Network Magazine, 12(5), 40–48, 1998. With permission.)

receiver. However, upon the loss of a packet, the copy can be used to recover the lost information as illustrated. This scheme has the advantage of adding just a single packet decoding delay. But, because of the low-bit-rate encoding and decoding employed, the computational complexity is higher and the speech quality is somewhat compromised.

11.8.2 Receiver-Based PLC Algorithms

These techniques do not require the assistance of the transmitter and can therefore be used instead of, or in addition to, the FEC method. Here the receiver substitutes a lost packet with an estimate of it, often based on past and future decoded packets. This estimation can be done in many ways leading to a variety of error concealment algorithms.

An easy substitution scheme is to simply replace a lost packet with a copy of the previous packet. Gradual fading of the repeated packets may be employed to yield better subjective quality in burst loss situations. This scheme is essentially an elementary form of signal interpolation. It is quite satisfactory, and computationally attractive, at low packet loss rates. More complex interpolation schemes, however, are necessary at higher loss rates to provide adequate speech quality. These can be broadly classified as waveform substitution, pitch waveform replication, and TSM. Two of these schemes are discussed briefly here.

The waveform substitution method bridges the gap created by the missing packet with a suitable segment of the speech waveform found in already decoded packets, or optionally in future packets. It is illustrated in Figure 11.11. A template block T of length (say 4 ms) is chosen at the end of the packet just before the gap. This is compared with the previously decoded



FIGURE 11.11 Waveform substitution method.

waveform at least one packet period away, and the block that matches it best (call it M) is found by correlation or AMDF methods. Then, the waveform segment of length L from the end of M is used to bridge the packet gap as shown in the diagram. With this technique, the phase discontinuity is minimized at the beginning of the gap, but is not properly addressed at the end of it.
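A bare-bones version of the template search is sketched below. The template taken just before the gap is slid over the previously decoded history (in practice the search window starts at least one packet period back, as noted above) and each position is scored; a correlation-based score is used here, although an AMDF-based score works equally well. The segment of length L that follows the best match is then copied into the gap. The names and the particular normalization are illustrative.

    /* Return the start index in history[] of the block that best matches the
     * template of length tmpl_len taken from just before the gap. */
    static int find_best_match(const short *history, int hist_len,
                               const short *tmpl, int tmpl_len)
    {
        int pos, k, best_pos = 0;
        double best_score = -1.0e30;

        for (pos = 0; pos + tmpl_len <= hist_len; pos++) {
            double corr = 0.0, energy = 1.0e-6;
            for (k = 0; k < tmpl_len; k++) {
                corr   += (double)history[pos + k] * (double)tmpl[k];
                energy += (double)history[pos + k] * (double)history[pos + k];
            }
            if (corr / energy > best_score) {   /* one possible normalization */
                best_score = corr / energy;
                best_pos = pos;
            }
        }
        return best_pos;
    }
    /* The gap is then bridged with the samples that follow the match, e.g.
     * memcpy(gap, history + best_pos + tmpl_len, L * sizeof(short));        */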

TSM methods generally yield better performance. An example of this technique is illustrated in Figure 11.12. Here, the packet before the gap is head-expanded, the one after the gap is tail-expanded, and the two expansions are merged using the overlap-and-add scheme, with fade-in and fade-out gains.


FIGURE 11.12 TSM method.


The optimum point of overlap is chosen based on maximizing the correlation of the overlapped segments. But this may yield a bridged segment whose length is slightly different from the gap. This should not cause any problems if the jitter buffer adaptation is also performed using waveform expansion or compression methods, as they typically lengthen or shorten a packet by an amount slightly different from the requested correction.
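The merging step itself amounts to a cross-fade of the two expanded segments, as in the sketch below; head[] holds the expansion grown from the packet before the gap, tail[] the expansion grown from the packet after it, and ov is the chosen overlap length (all names are illustrative).

    /* Overlap-and-add with linear fade-out/fade-in gains over ov samples
     * (ov is assumed to be at least 2). */
    static void overlap_add(const short *head, const short *tail,
                            short *out, int ov)
    {
        int n;
        for (n = 0; n < ov; n++) {
            double w = (double)n / (double)(ov - 1);   /* ramps from 0 to 1 */
            out[n] = (short)((1.0 - w) * head[n] + w * tail[n]);
        }
    }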

11.9 Summary

It is becoming increasingly apparent that all forms of communications, including voice, will be transported through the ubiquitous packet-switched IP network, which inevitably imparts numerous impairments to the speech signal. These include delay, jitter, packet loss, and clock skew. To improve the perceived speech quality in such networks, it is essential to mitigate these effects using appropriate signal processing algorithms.

The increased transport delay in VoIP networks renders normally tolerable echoes more annoying. Hence, the deployment of ECANs is virtually mandatory in such networks. We discussed two different types of cancelers: line ECANs to eliminate hybrid echoes and AECs to facilitate hands-free telephone conversations.

Jitter buffers are essential to smooth out the inevitable delay variations caused by the network routers. Fixed jitter buffers are simple but perform poorly as the network delay can vary substantially over the duration of a typical conversation. Thus, most modern VoIP systems employ AJBs that automatically adjust the buffer delay based on the observed network jitter. We talked about two methods to minimize the distortion effects caused by the adjustment of the buffer delay in the midst of a conversation: buffer delay adaptation only during silence periods and TSM of speech that permits delay adaptation even within talk spurts.

The frequency offset between the transmitter and receiver clocks disturbs the equilibrium at the jitter buffer and causes buffer spills. It is generally difficult to assess this offset as it will be clouded by the network jitter. We showed, however, that both the network jitter and clock skew can be corrected by a properly designed AJB.

Packet loss can be a major source of impairment in long-distance packet-switched networks, and it is essential to use loss concealment algorithms to alleviate its effects in VoIP systems. While transmitter-based FEC methods can be used to correct isolated packet losses, receiver-based signal processing algorithms are generally preferred as they can work independent of the transmitter. We discussed several methods in this category, including waveform substitution, pitch waveform replication, and TSM. They perform well in moderate random packet loss situations but degrade dramatically in burst loss conditions.


Bibliography

1. ITU-T Recommendation G.114, One-Way Transmission Time, International Telecommunication Union, Geneva, 2003.

2. ITU-T Recommendation G.167, Acoustic Echo Controllers, International Telecommunication Union, Geneva, 1993.

3. ITU-T Recommendation G.168, Digital Network Echo Cancellers, International Telecommunication Union, Geneva, 2004.

4. ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of Voice Frequencies, International Telecommunication Union, Geneva, 1993.

5. ITU-T Recommendation G.711, Appendix I, A High Quality Low-Complexity Algorithm for Packet Loss Concealment with G.711, International Telecommunication Union, Geneva, 1999.

6. ITU-T Recommendation G.726, Adaptive Differential Pulse Code Modulation (ADPCM) of Voice Frequencies, International Telecommunication Union, Geneva, 1990.

7. ITU-T Recommendation G.729, Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic-Code-Excited-Linear-Prediction (CS-ACELP), International Telecommunication Union, Geneva, 1996.

8. RFC 2198, RTP Payload for Redundant Audio Data, IETF, September 1997.

9. RFC 2733, An RTP Payload Format for Generic Forward Error Correction, IETF, December 1999.

10. Ramjee, R., J. Kurose, D. Towsley, and H. Schulzrinne, Adaptive playout mechanisms for packetized audio applications in wide-area networks, Proceedings of IEEE INFOCOM 94, Toronto, pp. 680–688, June 1994.

11. Liang, Y.J., N. Farber, and B. Girod, Adaptive playout scheduling and loss concealment for voice communication over IP networks, IEEE Transactions on Multimedia, 5(4), 532–543, 2003.

12. Goodman, D.J., G. Lockhart, O.J. Wasem, and W.-C. Wong, Waveform substitution techniques for recovering missing speech segments in packet voice communications, IEEE Transactions on Acoustics, Speech and Signal Processing, 34(6), 1440–1448, 1986.

13. Wasem, O.J., D.J. Goodman, C.A. Dvorak, and H.G. Page, The effect of waveform substitution on the quality of PCM packet communications, IEEE Transactions on Acoustics, Speech and Signal Processing, 36(3), 342–348, 1988.

14. Sanneck, H., A. Stenger, B. Younes, and B. Girod, A new technique for audio packet loss concealment, Proceedings of IEEE GLOBECOM 96, pp. 48–52, November 1996.

15. Perkins, C., O. Hodson, and V. Hardman, A survey of packet loss recovery techniques for streaming audio, IEEE Network Magazine, 12(5), 40–48, 1998.


12 Real-Time DSP Implementation of ITU-T G.729/A Speech Coder

12.1 Introduction

In this book, we present both theoretical and practical aspects of speech coding. One of the practical aspects is the real-time implementation of these speech coders on digital signal processing chips (DSP processors).

This chapter describes the design and implementation of an application providing multiple channels of real-time speech encoding and decoding on the TMS320C6X DSP processor. The application is based on the ITU-T speech coding Recommendation G.729/A with application development framework RF3 from Texas Instruments (TI).

The G.729/A (reduced complexity "CS-ACELP") speech coding standard provides an algorithm to encode 16-bit linear quantized speech data sampled at 8 kHz to an 8 kbps data stream. The ITU-T speech coding standard G.729/A was modified to meet TI's eXpressDSP algorithm standard (or "XDAIS"), which allows it to be used in TI's multichannel reference framework (RF3). The hardware used was the TMS320C6X evaluation module (EVM). Apart from this, new algorithms are implemented for some of the signal processing intensive tasks.

One of the drawbacks of using RF3 with the C6x is that the TI example application is not directly compatible with the EVM hardware. RF3 was designed to run directly only on C54x and C6x DSP Starter Kit (DSK) hardware. Compatibility with the C6x EVM can however be achieved by adding a custom device driver controller to the example application. This component, known as the low-level I/O driver or LIO, is a hardware-dependent library, which controls the DSP's external peripherals related to audio data transfer. LIO puts a standardized wrapper around the device driver, allowing it to be compatible with RF3.

In order to write LIO modules for other hardware devices, see Reference [1], namely the TI Application Report SPRA802, "Writing DSP/BIOS (Binary Input Output Studio) Device Drivers for Block I/O."


Our objective of completing a real-time application running multiple channels of the G.729/A vocoder∗ on the TI TMS320C6X DSP was accomplished successfully. Profiling results of the code are also presented. We also improved the performance and reduced the complexity by implementing some signal processing intensive parts more efficiently. Similar methods have been used for video and audio codec implementations [2,3].

This chapter is organized as follows. In Section 12.2, we briefly discuss the ITU-T G.729/A speech vocoder standard. In Section 12.3, we describe our implementation using the RF3 and other software and hardware required. Finally, we summarize our major findings and make some conclusions.

12.2 ITU-T G.729/A Speech Coding Standard

The ITU has standardized many speech coders. One of the popular ones is the G.729/A vocoder standard [4]. This reduced complexity version of the vocoder has been developed for multimedia simultaneous voice and data applications, although the use of the vocoder is not limited to these applications. Details are given in Chapter 9.

The general description of the coding/decoding algorithm is similar to that of the full version G.729. The bit allocation is the same as that given in G.729. It also has the same delay (speech frame of 10 ms and lookahead of 5 ms). The major algorithmic changes to the full version of Recommendation G.729 are summarized in Table 12.1.

The perceptual weighting filter uses the quantized LP filter parameters and it is given by W(z) = A(z)/A(z/γ) with a fixed value of γ = 0.75. Open-loop pitch analysis is simplified by using decimation while computing the correlations of the weighted speech. Computation of the impulse response of the weighted

TABLE 12.1

Summary of the Principal Routines That Have Been Changed

G.729 Routine Name          G.729A Routine Name
Coder_ld8k()                Coder_ld8a()
Decod_ld8k()                Decod_ld8a()
Pitch_ol()                  Pitch_ol_fast()
Pitch_fr3()                 Pitch_fr3_fast()
ACELP_Codebook()            ACELP_Code_A()
Post()                      Post_Filter()

∗ The terms TMS320, C6X, eXpressDSP, and XDAIS are all trademarks of TI.


synthesis filter W(z)/A(z), computation of the target signal, and updating the filter states are simplified since W(z)/A(z) is reduced to 1/A(z/γ).

The search of the adaptive codebook is simplified. The search maximizes the correlation between the past excitation and the backward filtered target signal (the energy of filtered past excitation is not considered). The search of the fixed algebraic codebook is simplified. Instead of the nested-loop-focused search, an iterative depth-first tree search approach is used. At the decoder, the harmonic postfilter is simplified by using only integer delays.
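Computing the denominator A(z/γ) of the weighting filter described above only requires scaling the ith LP coefficient by γ^i, as in the floating-point sketch below (the reference code performs the equivalent operation in fixed point; the function name here is illustrative).

    #define LP_ORDER 10

    /* Given the LP coefficients a[0..LP_ORDER] of A(z), with a[0] = 1, compute
     * the coefficients aw[i] = gamma^i * a[i] of the bandwidth-expanded
     * polynomial A(z/gamma) used in W(z) = A(z)/A(z/gamma); G.729A fixes
     * gamma at 0.75. */
    static void weight_az(const float a[LP_ORDER + 1], float gamma,
                          float aw[LP_ORDER + 1])
    {
        float fac = 1.0f;
        int i;
        for (i = 0; i <= LP_ORDER; i++) {
            aw[i] = fac * a[i];
            fac *= gamma;
        }
    }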

12.3 TI TMS320C6X DSP Processors

The TI VelociTI architecture used in the C6X DSP processor [5,6] is a modification of the very long instruction word (VLIW) architectural style that was developed many years ago. It is also a reduced instruction set computer (RISC) processor because there are only 44 instructions compared to other DSP processors with more than 100 instructions. The instructions are well suited for intensive signal processing tasks.

Figure 12.1 shows a block diagram of the C6X architecture. The CPU fetches VelociTI advanced VLIW (256 bits wide) to supply up to eight 32-bit instructions to the eight functional units during every clock cycle. The VelociTI VLIW architecture features controls by which not all eight units have to be supplied with instructions if they are not ready to execute. The first bit of every 32-bit instruction determines if the next instruction belongs to the same execute packet as the previous instruction or whether it should be executed in the following clock as a part of the next execute packet. Fetch packets are always 256 bits wide; however, the execute packets can vary in size. The variable-length execute packets are a key memory-saving feature, distinguishing the C67x CPU from other VLIW architectures.

The CPU has two sets of functional units. Each set contains four units and a register file. The first set contains functional units .L1, .S1, .M1, and .D1; the second set contains units .D2, .M2, .S2, and .L2. The two register files each contain 16 32-bit registers for a total of 32 general-purpose registers. The two sets of functional units, along with two register files, compose sides A and B of the CPU (see the functional block and CPU diagram in Figures 12.1 and 12.2). The four functional units on each side of the CPU can freely share the 16 registers belonging to that side. Additionally, each side has a single data bus connected to all the registers on the other side, by which the two sets of functional units can access data from the register files on the opposite side. While register access by functional units on the same side of the CPU as the register file can service all the units in a single clock cycle, register access using the register file across the CPU supports one read and one write per cycle.

The C67x CPU executes all C62x instructions. In addition to C62x fixed-point instructions, six of the eight functional units (.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining two functional units (.D1 and .D2) also execute the new data load LDDW (load double word) instruction, which loads 64 bits per CPU side for a total of 128 bits per cycle.

Another key feature of the C67x CPU is the load/store architecture, where all instructions operate on registers (as opposed to data in memory). Two sets of data-addressing units (.D1 and .D2) are responsible for all data transfers between the register files and the memory. The data address driven by the .D units allows data addresses generated from one register file to be used to load or store data to or from the other register file. The C67x CPU supports a variety of indirect addressing modes using either linear- or circular-addressing modes with 5- or 15-bit offsets. All instructions are conditional, and most can access any one of the 32 registers. Some registers, however, are singled out to support specific addressing or to hold the condition for conditional instructions (if the condition is not automatically "true"). The two .M functional units are dedicated for multiplies. The two .S and .L functional units perform a general set of arithmetic, logical, and branch functions with results available every clock cycle.

The processing flow begins when a 256-bit-wide instruction fetch packet is fetched from a program memory. The 32-bit instructions destined for the individual functional units are "linked" together by a "1" bit in the least significant bit (LSB) position of the instructions. The instructions that are "chained" together for simultaneous execution (up to eight in total) compose an execute packet. A "0" in the LSB of an instruction breaks the chain, effectively placing the instructions that follow it in the next execute packet. If an execute packet crosses the fetch-packet boundary (256 bits wide), the assembler places it in the next fetch packet, while the remainder of the current fetch packet is


FIGURE 12.1 Architecture of the TI C6000 DSP chip. (Courtesy of Texas Instruments.)



FIGURE 12.2 Detailed block diagram of the TI C6000 DSP CPU. (Courtesy of Texas Instruments.)

padded with no operation (NOP) instructions. The number of execute packets within a fetch packet can vary from 1 to 8. Execute packets are dispatched to their respective functional units at the rate of one per clock cycle and the next 256-bit fetch packet is not fetched until all the execute packets from the current fetch packet have been dispatched. After decoding, the instructions simultaneously drive all active functional units for a maximum execution rate of eight instructions every clock cycle. While most results are stored in 32-bit registers, they can be subsequently moved to memory as bytes or half-words as well. All load and store instructions are byte-, half-word-, or word-addressable. The CPU can be configured as little or big endian.

The original VelociTI architecture has been enhanced into the VelociTI.2 architecture as in the C64x CPU. Here are some of the enhancements: (i) register file enhancements: 64 (up from 32) registers in total; (ii) data path extensions: 64-bit load/store data path; (iii) packed data processing: dual 16-bit arithmetic on six functional units; (iv) quad 8-bit arithmetic on four functional units; (v) additional functional unit hardware and instruction set extensions for communications, video, and imaging applications; (vi) increased orthogonality; and (vii) increased code density.

The C64x CPU is 100% compatible with the C62x CPU. The C64x brings the highest level of performance for addressing the demands of intensive real-time signal processing applications. At clock rates of 1.1 GHz and greater, the C64x can process information at a rate of nearly 9 billion instructions per second. The C64x VelociTI.2 extensions improve performance of the C62x/C67x VelociTI architecture by a factor of 8 in broadband communications and a factor of 15 in speech and image processing applications.


12.4 TI’s RF and DSP Algorithm Standard

TI has developed reference frameworks (RF), which are used as guidelines in developing software for any multichannel applications that are compliant with the XDAIS standard. There are five different RFs (RF1–RF5).

The TI algorithm standard XDAIS and multichannel framework RF3 can be thought of as time-saving resources to assist both algorithm and application developers alike. To demonstrate the usefulness and practicality of the RF, TI provides an example application implementing a pair of XDAIS-compliant algorithm modules in a two-channel real-time configuration. The two applications in the example are a volume controller and an FIR filter used to process stereo streaming audio data from the EVM board's codec.

XDAIS provides developers with a foundation in the form of a set of rules and guidelines governing the implementation and use of algorithms. Following the standard ensures that algorithms are easily portable to any application supporting TI DSP hardware. The complete set of rules and guidelines for the standard are outlined in Reference [7], "TMS320 DSP Algorithm Standard Rules and Guidelines," SPRU352D.

As a basic requirement of XDAIS, all compliant algorithms must implement an abstract interface called "IALG." The interface provides, among other things, a binary standard for calling functions supported by all algorithms. Although written in "C" code, the IALG interface is extendible much the way classes are in C++. This extension is done through the use of a structure of function pointers similar to a virtual function table or v-table. Extending the IALG interface gives engineers a mechanism for adding unique functionality to any algorithm.

XDAIS-compliant algorithms must also define the collection of data fields, which represent their state. Combining the state information with the v-table makes XDAIS-compliant algorithms look like object-oriented programs. An algorithm following the standard can be compiled into a binary component and supplied to an application developer/engineer along with the proper interface header file. The application developer needs to create "instances" of the algorithm, state information and the v-table, and use them as needed.

RF1 is the most basic RF offered. This framework uses a minimal footprint to support between 1 and 3 channels of processing. It is designed as a completely static framework with no dynamic memory allocation.

RF3, on the other hand, benefits from greater flexibility and increased numbers of channels, at the expense of a larger footprint (see Reference [8]). RF3 also uses the DSP/BIOS for memory and resource management. We chose to use RF3 based on the desired channel count and the need for flexibility over minimized code size. It was also because it offers enough channels (up to 10) and good flexibility for our application.

In addition to the reference source code for the framework, TI includes with RF3 two sample XDAIS-compliant algorithms: a simple FIR filter and a



FIGURE 12.3 Visual representation of the RF3 example application. (Note: Left and right channel processed independently.)

signal level or "volume" control. The two example algorithms combined with the framework make up a complete reference example application. Running the example application demonstrates independent channel processing of a stereo signal by applying an LPF to the left channel and an HPF to the right, with independent volume control on each. A set of GUI-driven slider controls is used to set volume levels for each channel independently [9]. The signal flow in the example application results in the incoming stereo signal being split into left and right channels, processed independently, and then recombined prior to being output. Figure 12.3 shows a graphical overview of the RF3 example application.

The simple example supplied by TI was modified as shown in Figure 12.4. This example was modified to run on the TI EVM hardware board. The simple two-channel model was then extended to provide the additional channels needed for the vocoder application.

12.5 G.729/A on RF3 on the TI C6X DSP

A block diagram of the RF3 is shown in Figure 12.5. In order to make RF3 compatible with the C6x EVM hardware, additional components must be added to the picture in Figure 12.5. The significance of all the RF3 components and how they relate to our example application running on the C6x EVM hardware is discussed in References [8–10].


FIGURE 12.4 Modified signal flow diagram for G.729/A on RF3. (Note: Left and right channel processed independently.)



FIGURE 12.5 Block diagram of an RF3-based application. (Courtesy of Texas Instruments.)

One of the drawbacks to using RF3 with the C6x is that the example application is not directly compatible with the EVM hardware. RF3 was designed to run directly on TI's C54x and C6x DSK (not EVM) hardware only. Compatibility with the C6x EVM can however be achieved by adding a custom device driver controller to the example application. This component known as "LIO" is a hardware-dependent library, which controls the DSP's external peripherals related to audio data transfer. LIO puts a standardized wrapper around the device driver allowing it to be compatible with RF3.

For more information on writing LIO modules for other hardware devices, see Reference [1], namely the TI Application Report SPRA802, "Writing DSP/BIOS Device Drivers for Block I/O." We now discuss two of the most important modules needed.

12.5.1 IALG Interface

IALG is the abstract interface that all XDAIS algorithms are required to implement. The interface is defined in the header file "Ialg.h" located in the "include" folder. The IALG interface, although not technically a component in the "Src" folder, is closely tied to the modules located there.

Part of the interface is a structure "IALG_Fxns" containing pointers to functions. This structure acts as a table of virtual function pointers (v-table) that get filled in by the overriding algorithm. It is up to the application developer to define an implementation for these functions based on a particular algorithm's desired behavior. Algorithm developers have the ability to extend the IALG interface by embedding the IALG_Fxns structure into a super-structure containing additional function pointers. This behavior of extending


an existing interface is common to object-oriented languages such as C++. Figure 12.6 shows the IALG_Fxns structure defined in "ialg.h."

FIGURE 12.6 IALG_Fxns structure as defined in ialg.h.
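The v-table idea and its extension can be pictured with the sketch below. The field names are paraphrased for illustration and are not copied from TI's ialg.h; only the structure of the arrangement, a base table of function pointers embedded as the first member of a derived table, reflects the mechanism described above.

    /* Illustrative base v-table: a structure of function pointers that each
     * compliant algorithm fills in with its own implementations. */
    typedef struct BaseFxns {
        void *implementationId;                          /* identifies the algorithm   */
        int  (*algAlloc)(void *params, void *memTab);    /* report memory requirements */
        int  (*algInit)(void *handle, void *memTab, void *params);
        void (*algFree)(void *handle, void *memTab);
    } BaseFxns;

    /* An algorithm extends the interface by embedding the base v-table as the
     * first member of a larger structure and appending its own entries. */
    typedef struct EncoderFxns {
        BaseFxns base;                                   /* base interface comes first */
        int (*encodeFrame)(void *handle, const short *pcm, unsigned char *bits);
    } EncoderFxns;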

Another important feature of the IALG interface is a standard for algorithms to request memory at run time from a heap space. The structure "IALG_MemRec" defines fields used to determine the size, alignment, type, and location of memory required. Figure 12.7 shows the fields of the data structure that the algorithms use to request memory blocks.
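The memory request record can be pictured along the following lines; the field names below are paraphrased rather than copied from the TI header, but they correspond to the size, alignment, type, and location attributes mentioned above.

    /* Illustrative memory-request record: the algorithm fills in one record per
     * memory block it needs, and the framework performs the allocation and
     * writes the resulting base address back. */
    typedef struct MemRecSketch {
        unsigned int size;       /* number of bytes requested                   */
        int          alignment;  /* required alignment of the base address      */
        int          space;      /* desired location: internal or external RAM  */
        int          attrs;      /* scratch or persistent                       */
        void        *base;       /* filled in by the framework after allocation */
    } MemRecSketch;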

12.5.2 ALGRF

Algorithmic reference framework (ALGRF) is a library module used in RF3 to govern how heap memory gets allocated and initialized. RF3 relies on ALGRF for creating and maintaining instances of an algorithm. Heap allocation from ALGRF can be done from internal or external memory for both scratch and persistent memory. In RF3, scratch memory refers to memory whose contents are not maintained between successive function calls. This type of memory can be shared among all algorithm instances in an application. Persistent memory, on the other hand, is guaranteed to retain state between function calls, and is therefore not shared between threads. Choosing scratch memory when possible helps to reduce overall memory requirements for multichannel applications.

The ALGRF project builds an output library file "algrf.l62" that links into the RF3 application. For further details of the ALGRF implementation, refer to References [8–10] and the project files located in the "ALGRF" folder.

FIGURE 12.7 IALG_MemRec struct used by algorithms for requesting heap memory.


12.6 Running the RF3 Example on EVM

After making these necessary modifications to RF3 in order to support the EVM, the example application should build and run correctly. When the application is run, the RF3 example supports only two channels; the framework, however, can be expanded to yield additional channels. The subject of increasing channels is covered in relation to the G.729/A vocoder application in a later section of this chapter.

12.7 RF3 Resource Requirements

Using RF3 as a framework incurs overhead in terms of memory and CPU cycle requirements. Determining the number of possible channels an application can support requires knowing the overhead costs of the framework. Code Composer Studio (CCS) provides tools to measure both memory usage and clock cycle requirements. Determining the requirements for RF3 is the subject of this section.

12.7.1 RF3 Memory Requirements

In the application project settings, the linker options tab allows for specification of a map file using the –m option. When using this option, building the application project generates a map file showing the total memory size and portions used for all memory types. Figure 12.8 shows the memory requirements for the framework.

The EVM has five distinct memory banks listed in the name column in Figure 12.8. In Figure 12.8, the origin, length, and amount used are listed for each memory bank. The internal program memory (IPRAM) section, for instance, is 10000 hex bytes long, which translates to 64 KB in decimal. Of the 64 KB available, the framework occupies almost half. The internal data memory (IDRAM) section is also 64 KB in length; however, the amount used is closer to 50 KB. The 50 KB in Figure 12.8 are somewhat misleading however,

FIGURE 12.8 Memory requirements for the RF3 framework.


FIGURE 12.9 IDRAM properties panel showing internal heap size.

because 32 KB of it is actually reserved as an internal heap. This heap memory can be allocated at run time as needed. Taking this into account, the actual static data memory requirement for the framework is around 16 KB. Changing the internal heap size can be done through the DSP/BIOS IDRAM properties panel. Figure 12.9 shows the internal heap memory settings.

Setting the stack size appropriately is also important to optimizing memory requirements for an application. Figure 12.10 shows the stack size field in the Memory Section Manager properties panel. The remaining memory sections SBSRAM, SDRAM0, and SDRAM1 refer to external memory banks that are present on the EVM. A 32 KB block of SDRAM0 is reserved as an external heap. The setting is controllable through SDRAM's property panel.

Accessing external memory requires adding wait states to a program's execution time. This undesired side effect makes internal memory more attractive when execution time is critical. To assist in optimizing memory allocation for an application, the developer can use the "CODE_SECTION" pragma∗ to specify which memory bank a particular section of the code is allocated from.

FIGURE 12.10 Stack size settings in the Memory Section Manager properties panel.

∗ For more information on pragma directives refer to the CCS built-in tutorial or help wizard.
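A typical use of the pragma looks like the fragment below; the function name and section name are illustrative, and the named section must still be mapped to the desired memory bank (for example, IPRAM) in the linker command file.

    /* Place a time-critical routine in a named code section so that the linker
     * command file can map that section to internal program memory. */
    #pragma CODE_SECTION(encode_channel, ".fast_text")
    void encode_channel(short *pcm_frame, unsigned char *bitstream);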


FIGURE 12.11 Percentage of available clock cycles used by the RF3 framework.

12.7.2 RF3 Clock Cycle Requirements

To determine the clock cycle requirements for the framework, we require that the application be run without the volume and filter algorithms. In CCS, the DSP/BIOS "CPU Load Graph" is used to monitor the percentage of available clock cycles used by the application. For the C6701 processor running at 133 MHz, the program execution clock has 133 million cycles per second. Figure 12.11 indicates the percentage of available clock cycles used up by the framework. Surprisingly, the RF3 framework uses less than 3.5% of the available cycles. On the 133 MHz DSP, this equates to roughly 4 million cycles per second.∗

12.8 Details of Our Implementation

The previous sections of this chapter have described the TI C6X processor, the RF3, and its application to the EVM hardware. Specific details of the G.729/A application are in the Appendix. The focus will now shift to adapting the G.729/A reference code to run in the RF3. More details can be found in Reference [5].

12.8.1 Adapting, Building, and Running the G.729/A Code

The reference "C" source code provided by ITU for the G.729/A vocoder is written to be a single channel non-real-time application. Both the encoder and decoder are combined into a single command line application, which processes data from test vector files.

There are several modifications required in order to get a multichannel implementation that operates on accumulated audio data samples in real time.

∗ The 3.5% figure was achieved when the framework was compiled with full optimization without the volume or filter algorithms running.


By default, the vocoder example does not lend itself to working well in a multichannel architecture like the example provided in RF3. In order to make use of RF3, the reference code needs to be restructured to make it more "object-oriented"-like. Due to the complexity of the code modifications, it is strongly recommended that the early stages of development be carried out with a general-purpose compiler on a PC or workstation rather than on the TI compiler in CCS. The reason for doing this is to minimize the time spent recompiling and running test vectors on the EVM hardware. Since there are numerous changes necessary in the source code, it makes sense to run the ITU test vectors to make sure that the vocoder algorithm did not get broken in the process. This checking can be directly run on the EVM; however, the intended use of the EVM is for processing real-time data from the codec. As a result, reading and writing data from test vector files is quite slow. Since one of the objectives is to maintain bit-exactness, the test vectors should be run frequently throughout code development. For this reason, a compiler such as Microsoft Visual Studio is recommended initially. Once the vocoder source code has been modified to support the XDAIS standard, the algorithms can be built to run in RF3, and development can then be transitioned over to CCS.

12.8.2 Defining the Data Type Sizes for the Vocoder

The G.729/A reference code is designed to yield equivalent numerical output regardless of hardware platform. Since data type sizes in the C language can vary from one hardware platform to the next, the reference code includes the header file "typedef.h," which defines standards for data sizes based on platform. This file contains conditional data size definitions, which are selected by defining the appropriate hardware platform symbol in the application project's build settings. It is the developer's responsibility to ensure that the proper platform symbol gets defined. The symbol required for the Microsoft C compiler, for example, is "_MSC_VER."

The ITU-T G.729/A file "typedef.h" does not include a set of definitions for the C6x processor. The proper type definitions need to be added to this file by the developer, and a symbol should be chosen to select these types. Figure 12.12 shows the modifications to "typedef.h" required for the C6x processor.

In this case the symbol chosen for the C6x DSP processor is “_C6X_.”

FIGURE 12.12 Modifications to Typedef.h to support the TMS320 C6x.
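A plausible form of this addition is shown below, guarded by the _C6X_ symbol. On the C6x compiler, short is 16 bits and int is 32 bits, so the ITU basic types can be mapped as indicated; the exact block used in the authors' modified typedef.h (Figure 12.12) may differ in detail.

    #if defined(_C6X_)
    typedef short Word16;   /* 16-bit signed integer on the C6x */
    typedef int   Word32;   /* 32-bit signed integer on the C6x */
    typedef int   Flag;
    #endif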


12.8.3 Early Development Using Microsoft Visual Studio

The ITU reference software combines the encoder and decoder together into a single executable. For a practical vocoder, the encoder and decoder should be separate processing units. In a typical real-world application, the encoder and decoder can be processing independent data streams.

Some of the sources from the ITU reference code have the function declarations organized into a few centralized header files. In order to make the organization process simpler, the function declarations were first decentralized from the original header files. This was accomplished by ensuring that each source file had a corresponding header file of the same name. For example, the source file "GainPred.c" now has a unique header file "GainPred.h." Following this process, a set of files is created that are not tightly coupled to one another.

In order to separate the encoder and decoder into independent modules, the source files were first divided into three categories: encoder-specific files, decoder-specific files, and common files. From the three categories of source files, three separate Microsoft Visual Studio library projects were then created and used to build independent encoder and decoder algorithm modules. The three library projects created for the vocoder are titled "G729A_CommonLib," "G729A_DecodeLib," and "G729A_EncodeLib." Each of the project folders for these library components can be found in the "MSProjects" directory located inside the CCS root directory "ti."

The three individual library projects used to build the vocoder modules are described as follows:

• G729A_CommonLib—This Common Library project contains files common to both the encoder and decoder components. However, the source code for this library does not contain any static state information for either component. The common library is simply a collection of common support function calls that the encoder or decoder libraries use or link to. The list of source files and header files used in G729A_CommonLib is shown in Figure 12.13.

• G729A_EncodeLib—The G729A_EncodeLib project is a library containing source code unique to the encoding algorithm. In order to make this library multichannel friendly, all of the encoder's state information was collected into a single data structure. Having state information contained in a structure allows the RF3 framework to create each channel with a separate instance of the encoder structure. The definition for the encoder state structure can be found in the header file "g729AEncState.h." Figure 12.14 shows all the fields that make up the encoder state.

Since ultimately the vocoder must implement the "IALG" interface in order to work with RF3, it made sense to begin implementing the encoder library where applicable. The file "g729A_scu_ialg.c" contains the source code needed for initializing instances of the encoder


FIGURE 12.13 Files used to build G729A_CommonLib project.

FIGURE 12.14 Encoder state definition.


FIGURE 12.15 Files used to build G729A_EncodeLib project.

state structure. This file is modeled after the "vol_ti_ialg.c" source file from the RF3 example application.

The file "g729A_scu_ialg.c" contains some statements that are not needed for the Microsoft Visual Studio project but are required for the EVM implementation. For this reason, a conditional statement "#ifndef HOST_CODE_ONLY" was added to make the file compatible in both situations. Figure 12.15 lists all of the source files and header files used to build the encoder library.

• G729A_DecodeLib—This library component contains all code specific to the decoder implementation. Following the same reasoning used in the encoder, a structure was created to encapsulate the decoder state. In the application, each channel of the decoder has its own instance of the state structure. Figure 12.16 shows the fields representing the decoder state structure.


FIGURE 12.16 Decoder state definition.

Similar to the file "g729A_scu_ialg.c" used in the encoder, "ig729A_scu_ialg.c" contains the state initialization for instances of the decoder.∗ The list of source files used to build the decoder library is shown in Figure 12.17.

12.8.4 Microsoft Visual Studio Encoder Project

In order to test the encoder library, an application is required to run the test vectors. To make this possible, a new Microsoft Visual Studio project was created at the same directory level as the library projects, entitled "G729A_Encoder." This MS Windows console application contains a single source file, "G729A_Encoder.c," which implements the program's entry point "main." Both the encoder library and common library files are linked into the application. The function main takes two command line arguments, which provide the input data vector file name and the output bit stream vector name. The main loop of the program reads from the input test vector file one frame at a time and encodes the audio data into an 8 kbps stream. The resulting bit stream is then written to the output file. Figure 12.18 shows the main loop for the encoder.
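A condensed sketch of such a loop is shown below. It assumes the ITU reference-code entry points Pre_Process, Coder_ld8a, and prm2bits_ld8k, the initialization calls Init_Pre_Process and Init_Coder_ld8a, and the reference constants L_FRAME, PRM_SIZE, and SERIAL_SIZE; details may differ from the actual "G729A_Encoder.c."

/* Condensed sketch of the encoder console application's frame loop
   (assumed form, based on the ITU reference coder). */
#include <stdio.h>
#include "typedef.h"
#include "ld8a.h"                        /* reference-code declarations */

int main(int argc, char *argv[])
{
    FILE   *f_speech, *f_serial;
    Word16 *new_speech;                  /* frame pointer set up by the coder */
    Word16  prm[PRM_SIZE];               /* analysis parameters for one frame */
    Word16  serial[SERIAL_SIZE];         /* sync word + size word + 80 "bits" */

    if (argc != 3) return 1;
    f_speech = fopen(argv[1], "rb");     /* input speech test vector */
    f_serial = fopen(argv[2], "wb");     /* output bit stream vector */
    if (f_speech == NULL || f_serial == NULL) return 1;

    Init_Pre_Process();                  /* reference-code initialization */
    Init_Coder_ld8a(&new_speech);

    /* Read one 80-sample frame at a time and encode it into the 8 kbps
       bit stream, as described for the main loop in Figure 12.18. */
    while (fread(new_speech, sizeof(Word16), L_FRAME, f_speech) == L_FRAME) {
        Pre_Process(new_speech, L_FRAME);    /* high-pass filter and scaling */
        Coder_ld8a(prm);                     /* encode one frame             */
        prm2bits_ld8k(prm, serial);          /* pack parameters into "bits"  */
        fwrite(serial, sizeof(Word16), SERIAL_SIZE, f_serial);
    }

    fclose(f_speech);
    fclose(f_serial);
    return 0;
}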

∗ The “i” prefix in this file is used to indicate inverse to the encoder.


FIGURE 12.17 Files used to build G729A_DecodeLib project.

12.8.5 Microsoft Visual Studio Decoder Project

The same principles employed on the encoder are used on the decoder as well. In this case, the application project name is "G729A_Decoder" and the project's main source code file is "DECODER.C." The decoder library and common library are linked into the application. Figure 12.19 shows the main loop for the decoder application.

12.8.5.1 Comparing Test Vectors

The output vector files generated from the encoder and decoder applications need to be compared against the reference vectors provided by ITU. Rather

FIGURE 12.18 G729A_Encoder main program execution loop.


FIGURE 12.19 G729A_Decoder main program execution loop.

than tediously comparing the file contents manually, a program was written to make this binary comparison between any two files. The program's source code is located in the "G729VectorCompare" folder at the same directory level as the encoder and decoder program folders. This is a Windows console program, which takes two command line arguments to specify the two test vectors being compared. When this program runs, it displays a message indicating pass or failure.
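A binary comparison of this kind fits in a few lines of C; the sketch below is a hypothetical, simplified version of such a tool, not the actual source in the "G729VectorCompare" folder.

/* Minimal sketch of a pass/fail binary comparison of two test-vector files. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fa, *fb;
    int   ca, cb;

    if (argc != 3) {
        printf("usage: compare <file1> <file2>\n");
        return 1;
    }
    fa = fopen(argv[1], "rb");
    fb = fopen(argv[2], "rb");
    if (fa == NULL || fb == NULL) {
        printf("FAIL: cannot open input files\n");
        return 1;
    }

    /* Compare byte by byte; any mismatch (including different lengths) fails. */
    do {
        ca = fgetc(fa);
        cb = fgetc(fb);
        if (ca != cb) {
            printf("FAIL: files differ\n");
            return 1;
        }
    } while (ca != EOF);

    printf("PASS: files are identical\n");
    return 0;
}

It might be invoked, for example, as "G729VectorCompare output.bit reference.bit" (file names hypothetical).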

12.8.5.2 Measuring Performance Timing on Microsoft Visual Studio

In order to measure the time required for the console application to encode and decode test vectors, a project "HPTime" was created. This project builds a library file, which uses Windows OS high-performance timers to measure the execution time. The "HPTime" project folder is at the same directory level as the other Windows project folders.
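The Win32 high-performance counter API lends itself to a very small timing wrapper. The sketch below shows the general idea; the wrapper function names are hypothetical and need not match the actual HPTime library.

/* Sketch of elapsed-time measurement with the Windows high-performance timers. */
#include <windows.h>

static LARGE_INTEGER hp_freq, hp_start;

void hp_timer_start(void)                  /* hypothetical wrapper name */
{
    QueryPerformanceFrequency(&hp_freq);   /* counter ticks per second  */
    QueryPerformanceCounter(&hp_start);    /* starting tick count       */
}

double hp_timer_stop_seconds(void)         /* hypothetical wrapper name */
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return (double)(now.QuadPart - hp_start.QuadPart) / (double)hp_freq.QuadPart;
}

A console application can then bracket its encode or decode loop with the start call and the stop call to obtain the elapsed time in seconds.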

12.8.5.3 Automating the Test Vector Comparisons on Windows

Considering the amount of modifications required to separate the encoder and decoder, and to make them usable in a multichannel environment, the test vectors should be processed and checked often. Manually running the encoder, decoder, and vector-compare application for each test vector is a tedious process. This process can be automated, however, by running the applications from scripts rather than manually from the command line. A folder "G729TestVectors" at the same directory level as the encoder and decoder applications contains a set of batch script files for running the test vectors. Running the scripts "EncodeAllVecs.bat," "DecodeAllVecs.bat," and "compareAllVecs.bat" in order will run the entire set of ITU test vectors through the encoder, the decoder, and the vector comparison.∗ Within the "G729TestVectors"

∗ The set of ITU test vectors is included with the reference code example for G.729/A.


folder, there are subfolders containing all of the reference vectors. Automating this process of running and checking the test vectors reduces the development time significantly.

12.9 Migrating ITU-T G.729/A to RF3 and the EVM

The desired use for the G.729/A encoder is not for processing files, but rather for processing data captured from the EVM codec in real time. Similarly, the desired decoder behavior is to provide the codec with output data in real time. In real-world applications, analog speech inputs would pass through a codec and then be processed locally by the G.729A encoder channels; the resulting bit streams would then be transmitted at a reduced bandwidth to remote locations for decoding. Similarly, encoded bit streams from remote locations would be transmitted back to the local G.729A decoder, and the decoded output would be sent to the output codec. Figure 12.20 shows a signal flow diagram for the real-world example case.

Given the limited scope of this project, and the architecture of the multichannel framework, a modified signal flow was used in conjunction with RF3. The processing model was made to behave similarly to the original RF3 example application. In this context, the input speech is encoded on a channel-by-channel basis and then passed directly to the decoder channels where the speech data are restored and sent to the output codec. Figure 12.4 shows the modified signal flow diagram for the vocoder running on the EVM hardware.

12.9.1 Creating a New Application

Starting from a known working implementation on the PC makes the process of migration of the vocoder to the TI EVM board much easier.

FIGURE 12.20 Real-world G.729/A vocoder example signal flow. (Note: Left and right channels are processed independently.)


In order not to upset the working EVM-supported RF3 example application, a copy of the entire "RFs" directory was made and renamed to "Elen299referenceframeworks_G729A."

12.9.1.1 Adapting the Vocoder Library Files for the EVM

The concept of the encoder, decoder, and common libraries used in the Microsoft Visual Studio vocoder application can be directly applied to the implementation on the EVM. The original RF3 example application used the Vol_ti and FIR_ti library projects to build the volume and FIR filter modules, respectively. For the G.729/A application, the Vol_ti and FIR_ti library projects were replaced with three new ones, "G729A_CommonLib," "G729A_SCU_67x," and "IG729A_SCU_67x." These new projects are equivalent to the PC library projects "G729A_CommonLib," "G729A_EncodeLib," and "G729A_DecodeLib," respectively.

In order to make the encoder and decoder compatible with RF3, the implementation for the IALG interface must be completed. This implementation requires adding the following project source files to the encoder, "g729A_scu_ig729A_vt.c," "g729A_scu_ig729A.c," and "g729A_scu.c," and the equivalent files "ig729A_scu.c," "ig729A_scu_ig729A_vt.c," and "ig729A_scu_iig729A.c" to the decoder. Figure 12.21 shows the source files required to build the encoder, decoder, and common libraries for RF3.

12.9.1.2 G.729/A Application for RF3

Having added the new library projects for the G.729/A encoder and decoder modules, it becomes necessary to create a new application project to use them. In the "apps" directory, the "RF3" example is replaced with "rf_3_Dual_G729A_67x." This is the project directory for a two-channel G729A vocoder application targeted for the TI C6x EVM. There are several subdirectories within the project directory.

12.9.1.3 algG729A and algInvG729A (Function Wrappers)

The two subdirectories "algG729A" and "algInvG729A" contain wrapper functions to the vocoder interface. These are a direct replacement for the interface wrappers to the TI volume and FIR filter modules.

12.9.1.4 appModules

This subdirectory contains an equivalent set of source files from the RF3 example application appModules directory. For the most part, the files in the G.729/A appModules directory are direct copies of the files from the example application, with references to the TI modules changed to reference the encoder and


FIGURE 12.21 Source files to build "G729A_CommonLib," "G729A_SCU_67x," and "IG729A_SCU_67x."

decoder modules. However, there are some additional changes as well. The most notable changes are the following:

• appResources.h—Sets the data frame length equal to the G.729A frame size. This is defined by ITU to be 80 samples.

• appThreads.c/.h—Defines the length of the intermediate data buffer to be equal to the G.729 frame size + 2. This is the buffer that stores the bit stream data between the encoder and decoder modules. It may seem counterintuitive that such a large buffer is required for a bit stream that requires only 1/8 the space of the input frame. The reason for this buffer size is that the ITU reference code encodes each bit of the stream into a 16-bit data word representing a 1 or a 0. The extra two 16-bit words are used for frame sync and frame size. The following definitions from the G.729A reference code give the bit and sync word values (a packing sketch is given after this list):
– #define SYNC_WORD (short)0x6b21
– #define BIT_0 (short)0x007f /* definition of zero-bit in bit stream */
– #define BIT_1 (short)0x0081 /* definition of one-bit in bit stream */

For information on the bit stream fields, refer to the "Read.Me" file included in the ITU reference code.

• ThrAudioproc.c/.h—Initializes and runs the encoder and decoder.
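As referenced in the appThreads item above, the following is an illustrative sketch of how one encoded parameter can be expanded into this 16-bit-per-bit format. It is modeled on the packing used by the reference code, but the function and variable names here are illustrative only; it assumes the Word16 type from "typedef.h" and the BIT_0/BIT_1 values listed above.

/* Illustrative sketch: expand one parameter value of "no_of_bits" bits into
   no_of_bits 16-bit words (BIT_0 or BIT_1), most significant bit first. */
#include "typedef.h"

#define BIT_0 (short)0x007f
#define BIT_1 (short)0x0081

static void pack_bits(Word16 value, Word16 no_of_bits, Word16 *bitstream)
{
    Word16 i;
    for (i = 0; i < no_of_bits; i++) {
        Word16 bit = (value >> (no_of_bits - 1 - i)) & 1;
        bitstream[i] = bit ? BIT_1 : BIT_0;   /* one whole word per bit */
    }
}

Each 10-ms frame produces 80 such bit words, so adding the sync and size words gives exactly the frame size + 2 buffer length mentioned above.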

12.9.1.5 C67xEMV_RF3 (Application Project)

The application project for the G.729A vocoder implementation requires only a few changes from the RF3 example application. The main difference is the set of files required to build the vocoder application. Figure 12.22 shows the source files required to build the G.729/A application.

In addition to the source files, the vocoder application requires linking in the G.729/A encoder and decoder library files. Figure 12.23 shows the changes to the "link.cmd" file necessary to link the encoder and decoder libraries.

FIGURE 12.22 Files used to build G.729/A vocoder application.


FIGURE 12.23 link.cmd file for the G.729/A vocoder application.

12.9.1.6 Building the G.729/A Vocoder Application

Following the modifications outlined in this section, we should get the application to a state where it compiles with no errors; however, an error will be generated when the linker is running. The reason for this error is that the combined memory from the framework and vocoder modules exceeds the IPRAM space where they are allocated. A quick fix to this problem is to open the properties panel for the DSP/BIOS Memory Section Manager and change the .text section allocation to SBSRAM memory. This will move the compiled program code into the sufficiently larger external memory. The drawback to doing this is that the program execution time will slow down considerably due to the additional wait states required for external memory access. In fact, not even a single channel of the vocoder will run in real time from external memory. Optimizing the code to run in real time is the subject of the next section.

12.10 Optimizing G.729/A for Real-Time Execution on the EVM

Without proper optimization, the G.729A vocoder will not run in real time on the TI EVM. There are several steps that need to be taken in order to get the best performance on the EVM hardware. The optimization process can be broken into three categories: "Project settings," "DSP/BIOS settings," and "Code changes."

12.10.1 Project Settings

The most basic optimization technique is to "tweak" the project settings. Setting the optimization level tells the compiler to make certain decisions when building the program, such as removal of unnecessary data, pipelining


FIGURE 12.24 Build options file level optimization on full.

software loops, and so on. Each of the project files, including libraries as well as applications, has a user-configurable build options panel. By default, all of the RF3 example project files have optimizations turned off in the build options panel. Turning the optimization level to full requires changing the build option's "Opt. Level" setting to -o3. This is the highest optimization possible at the file level. Additional optimization can be achieved by also changing the setting for "Program Level Opt." The program level settings do not appear to have much effect on the program execution speed. Figure 12.24 shows the -o3 optimization level setting in the project build options. The project optimization settings alone, however, are not enough to get the application executing in real time.

12.10.2 DSP/BIOS Settings for Optimization

In order to get optimal performance from the hardware, the program code should reside in IPRAM. The combined memory requirements from the G.729/A libraries and RF3 framework, however, do not allow for this possibility. Another method must be used in order to get good performance (such as we get if the program is run from IPRAM) while using external memory for code storage. To solve this, the TI DSP provides a caching mode that allows it to read program code from external memory in blocks and store it internally at run time. This allows the cached code sections to run from internal memory. The drawback to this method is the overhead required in moving data from external memory to the cache. The caching mode can be enabled through the DSP/BIOS global settings properties panel, by selecting the "Cache Enable" setting from the Program Cache Control field. Doing this turns the entire IPRAM section (IPMEM) into a large cache reserve. Figure 12.25 shows the cache enable setting in the DSP/BIOS.


FIGURE 12.25 Enable the program cache control.

When using this feature, the developer must ensure that all program memory sections are allocated externally. The hardware will then cache blocks of the program to internal memory as needed. The DSP/BIOS Memory Section Manager allows the developer to specify where code and data sections are allocated in memory. Figure 12.26 illustrates how to use the DSP/BIOS Memory Section Manager properties panel to allocate the program sections from external memory.

12.10.3 Code Changes for Optimization

Even with the optimizations described in Sections 12.10.1 and 12.10.2, further work needs to be done to get the G.729A vocoder processing data in real time. The next optimization step comes from analyzing the source code and looking for "problem" areas. Without even doing any functional analysis, just

FIGURE 12.26 Allocating the application code from external memory.


browsing through the application source code reveals one section of code likely to benefit from optimization. This section of code is located in the file "basic_op.c."

The ITU G.729/A reference code implements a set of functions that emulate 16-bit logical and arithmetic hardware operations in a file called "basic_op.c." These basic operations are the foundation on which the vocoder is built. The reason for ITU using these functions is to make the G.729/A source code compatible with any generic processor. Unfortunately, the basic operations require many instructions to complete operations that should otherwise take a single instruction given the proper hardware. One way to solve this problem is through the use of "intrinsic" functions in the program code. Intrinsic instructions are a means by which a programmer can directly call individual DSP assembly instructions from C code. The majority of the implementation of the basic operations can be replaced with the TI C6x intrinsic instructions. In fact, TI provides a translation for the basic operations to intrinsic functions in a header file titled "gsm.h."∗ As an example, the basic operation L_add is redefined to the intrinsic "_sadd" in the following statement taken from the file gsm.h.

#define L_add(a,b) (_sadd((a),(b)))  /* int sat addition */

Following this methodology, it stands to reason that the entire contents of gsm.h can be cut and pasted into the header file "basic_op.h." Doing so makes the source code in the file "basic_op.c" obsolete, and it can then be removed from the G729A_CommonLib project. Since the intrinsic functions are specific to the TI DSP, the developer may want to conditionally include the gsm.h definitions in the "basic_op.h" header file, allowing the header file to be used in multiple contexts. In order to make the "basic_op.h" header file compatible with both the Microsoft Visual Studio version of the application and the C6x-specific version, the conditional statement "#ifndef HOST_CODE_ONLY" was used to signal the compiler to use the intrinsic-based function definitions. This symbol needs to be defined only in the Microsoft-based version.
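The resulting header might be organized along the lines of the sketch below. Only L_add and L_sub are shown here, and the full set of mappings is the one found in TI's gsm.h rather than this sketch.

/* Sketch of the conditional structure in basic_op.h after the change. */
#include "typedef.h"

#ifndef HOST_CODE_ONLY
/* TI C6x build: map the ITU basic operations onto single intrinsics
   (the intrinsics are built into the TI compiler; mappings follow the
   style of TI's gsm.h, shown for two operations only). */
#define L_add(a, b)  (_sadd((a), (b)))   /* 32-bit saturated addition    */
#define L_sub(a, b)  (_ssub((a), (b)))   /* 32-bit saturated subtraction */
#else
/* Host (Microsoft Visual Studio) build: keep the original ITU C-code
   prototypes, implemented in basic_op.c. */
Word32 L_add(Word32 L_var1, Word32 L_var2);
Word32 L_sub(Word32 L_var1, Word32 L_var2);
#endif

With this arrangement the same header serves both builds: the Visual Studio projects define HOST_CODE_ONLY and keep the portable C implementations, while the EVM build falls through to the intrinsics.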

Once the 16-bit basic operations have been optimized, the focus shifts to another similar group of functions representing 32-bit operations. The files "oper_32b.h" and "oper_32b.c" contain functions that are built completely from the 16-bit basic operations. Since the basic operations are now defined as intrinsic functions in the "basic_op.h" file, the functions in "oper_32b" can now be inlined.† This optimization forces the compiler to directly insert code for the 32-bit operations, rather than branching to a function address each time they are called.

In the unoptimized version of the basic operations, a global variable is used to flag overflow conditions during the emulated operations. In several

∗ The TI header file gsm.h is located at "ti\C6000\cgtools\include."
† Inlining functions for the C6x compiler requires that they not call other functions except intrinsic functions.


places throughout the G.729/A source code, this flag is checked and used to make decisions on how data processing should proceed. When using the optimized version of the basic operations, however, the overflow flag does not get updated. Instead, most of the substituted intrinsic functions use saturation to indicate the equivalent overflow condition. Saturation conditions can be determined by examining the result of an operation, or by checking the hardware saturation flag. The TI C6x control status register (CSR) defines bit 9 as the saturation indicator. To check the hardware saturation flag from C code, the CSR should be declared as follows in the source code: "extern cregister volatile unsigned int CSR." The cregister keyword is a special indicator to the compiler that this is referencing a DSP-specific hardware register. The developer can get the status of the saturation bit by taking the bitwise "AND" of 0x200 with the CSR register.
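Put together, the check described above can be expressed in a few lines of C; this sketch only builds with the TI compiler, and the helper name is hypothetical.

/* Sketch of reading the C6x saturation flag from C (TI compiler only). */
extern cregister volatile unsigned int CSR;   /* control status register */

#define CSR_SAT_BIT 0x200                     /* bit 9: saturation indicator */

static int saturation_occurred(void)          /* hypothetical helper */
{
    return (CSR & CSR_SAT_BIT) != 0;
}

For example, bit 9 can be cleared before a block of intrinsic operations and tested afterwards, giving a rough functional replacement for the reference code's global overflow flag.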

The source code optimizations discussed should be enough to get the vocoder to execute in real time for two channels. The next section deals with determining the performance characteristics for the two-channel implementation.

12.11 Real-Time Performance for Two Channels

After making the necessary optimizations, the two-channel G.729/A vocoder application should run in real time. The hardware resource requirements for the RF3 framework were determined in an earlier section. It is now necessary to determine the resource requirements per channel for the G.729/A algorithm.

12.11.1 Memory Requirements for G.729/A

The memory requirements can again be determined by compiling the project with the -m option. Figure 12.27 shows the generated memory output for the combined two-channel G.729/A vocoder and RF3 framework.

FIGURE 12.27 Memory requirements for two-channel G.729/A vocoder and RF3.


FIGURE 12.28 Heap space use in a two-channel G.729/A vocoder application.

From Figure 12.27, it is apparent that no IPRAM is used for the application. It was determined that using only external program memory was a requirement for the hardware cache feature. A large portion of data memory, however, is used by the application. This amount reflects an 8 KB block for the heap and 4 KB for the stack. Adjusting the stack size is ultimately a trial-and-error process. For RF3 applications, the required heap size can be determined easily through the UTL diagnostic features. Opening the DSP/BIOS "Message Log" will display diagnostic messages from the UTL library module.∗ The diagnostic messages reveal the percentage of allocated heap space used by the application. Figure 12.28 shows the internal heap memory usage, which is 83% of the reserved 8 KB for the two-channel vocoder application. From this information, the minimum heap size per channel can be estimated to be slightly less than 3.5 KB.

12.11.2 Clock Cycle Requirements for G.729/A

In addition to the memory requirements, it is important to determine the CPU clock cycle load that the G.729/A application exerts. The clock cycle load can be determined from the DSP/BIOS "CPU Load Graph." Figure 12.29 shows the DSP clock cycle usage for the two-channel vocoder application.

The measured load in Figure 12.29 represents both vocoder channels as well as the additional load from the framework. Subtracting 3.4% incurred by the

∗ The level of diagnostics must be set up in the project build settings.


FIGURE 12.29 CPU load for a two-channel G.729/A application.

framework and dividing the remaining load by the number of channels yields 11.15% per channel.

12.11.3 Resource Requirements Summary

In summary, the observed hardware resource requirements per channel for G.729/A are 11.15% of available clock cycles (at 133 MHz) and 3400 bytes of internal heap memory. In addition to the heap memory required for the vocoder state information, an additional 1200 bytes are required for the creation of the required DSP/BIOS objects (pipes and SWI). Based solely on the available DSP clock cycles, the predicted number of vocoder channels would be 8, resulting in a 92.6% processor load. However, factoring in the available IDRAM reduces the channel count by 1. Seven channels of the vocoder require approximately 32 KB of memory, which occupies almost the entire internal heap.

12.12 Checking the Test Vectors on the EVM

Before taking further steps to get additional vocoder channels, it makes sense to first run all of the ITU test vectors on the EVM. Doing so ensures that the hardware-specific code changes made would still result in a bit-exact implementation of the ITU Recommendation. Running the test vectors on the EVM requires modifying the source code to process data directly from the vector files rather than the codec. Adding a few extra lines of code to the file "thrAudioproc.c" allows for conditionally running the vocoder with test vector files. The symbol "DEBUG_WITH_VECTOR," when defined in thrAudioproc.c, selects test vector file processing over codec data processing.
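The shape of that conditional is roughly as follows; this is a schematic sketch with hypothetical names, and the actual code appears in Figure 12.30.

/* Schematic sketch of switching between codec data and test-vector data
   at compile time (names hypothetical). */
#include <stdio.h>
#include "typedef.h"

#define FRAME_LENGTH 80                /* G.729A frame size in samples */

static FILE *vectorFile;               /* opened elsewhere when testing */

static int getInputFrame(Word16 *frameBuffer)
{
#ifdef DEBUG_WITH_VECTOR
    /* Test-vector mode: pull one frame from the vector file over the PCI bus. */
    return fread(frameBuffer, sizeof(Word16), FRAME_LENGTH, vectorFile) == FRAME_LENGTH;
#else
    /* Normal mode: the frame was already delivered through the RF3 receive
       pipe; nothing to do in this sketch. */
    (void)frameBuffer;
    return 1;
#endif
}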


FIGURE 12.30 Conditional code from the file thrAudioproc.c used to run test vectors on EVM.

When running the vocoder with the vector files, each file is run one at a time, and the resulting output is compared with the reference data. The thing to remember when processing the vector files from the EVM is that reading file data over the PCI bus is quite slow. It is therefore not surprising to see that the vector processing runs much slower than real time. Figure 12.30 shows the conditional code from the function "thrAudioprocRun," which runs the test vectors through the vocoder.

12.13 Going Beyond a Two-Channel Implementation

The original motivation for using the RF3 framework was to support as many channels of the G.729A vocoder as possible on a single C6x processor. A limitation of the EVM hardware is that the codec only supports two audio channels. The RF3 framework, on the other hand, can be modified to process additional channels. A slight change in the framework architecture was made to allow the EVM and RF3 to work together for an N channel system. On the input side, the left codec channel is fanned out to all the odd processing channels (1, 3, 5, . . .), and the right input to all the even channels (2, 4, 6, . . .). On the output side, a single channel is selected and routed to both sides of the stereo codec. The channel selection occurs through a user-controllable slider implemented through the app.gel script's "set_active_channel" command. Figure 12.31 shows the revised N channel G.729/A application signal flow.


FIGURE 12.31 Revised signal flow graph for the N channel G.729/A application. (Note: At the codec output, a single vocoder channel's data is stuffed into both the left and right channels.)

12.13.1 Adding Channels

Increasing the number of vocoder channels in the G.729/A application requires changes to both the DSP/BIOS and application source code. This section outlines the necessary steps required for the addition of a single channel to an arbitrary "N" channel application. These steps should be followed for each additional channel added to the vocoder application. In this section, the symbol N represents the channel number being added, indexed from 1.

12.13.2 DSP/BIOS Changes for Adding Channels

The following changes should be made to the DSP/BIOS configuration:

1. Insert a new SWI object "swiAudioproc(N−1)." The new object's function, priority, and mailbox field should be set the same as the existing swiAudioproc objects. The object's "arg0" field should be set to the value (N−1).

2. The existing "swiRxSplit" and "swiTxJoin" objects should have the most significant zeroed bit of the mailbox field set to one.∗ This change forces the "receive split" and "transmit join" threads to hold off executing until all processing channel threads complete.

∗ Due to the physical constraint of the mailbox bit field (16 bits), the total possible number of channels could never exceed this limitation. More elaborate tricks can be done with the mailbox fields of the DSP/BIOS objects to obtain greater numbers of channels using RF3.


3. Insert new pipRx(N−1) and pipTx(N−1) objects. For pipRx(N−1), change the "nrarg0" field to _swiAudioproc(N−1), and the "nwarg1" field to the value of the previous channel's "nwarg1" times two. For the pipTx(N−1) object, change "nwarg0" to _swiAudioproc(N−1), and "nrarg1" to the previous channel's "nrarg1" value times two.

4. Increase the internal heap memory size by at least 4 KB to accommodate the additional channel.

12.13.2.1 Source Code Changes for Adding Channels

The following source code changes should be made in order to support an additional channel:

1. In the "appResources.h" file, increment the value of "NUMCHANNELS" to accommodate the additional channel.

2. In the "appBiosObjects.h" file, add declarations for the new pipTx(N−1) and pipRx(N−1) objects.

3. In "thrAudioproc.c," add an additional initialization term to the "ThrAudioproc thrAudioproc[NUMCHANNELS]" structure.

4. In "thrTxJoin.c," add the newly created pipTx(N−1) identifier to the ThrTxJoin initialization.

5. In the "thrRxSplit.c" file, add the newly created pipRx(N−1) object to the thrRxSplit struct initialization.

6. In the "thrControl.c" file, update the Int deviceControlsIOArea[] struct initialization to include a default value for the new channel.

7. In the app.gel file, update the set_active_channel function to add an extra channel to the channel selector.

12.13.3 Running Seven Channels of the Vocoder on the EVM

In an earlier section, it was determined that the vocoder should be able to run seven channels in real time on the EVM. Following the steps outlined in this section to add five more channels to the initial two-channel vocoder application should confirm this determination. Figures 12.32 through 12.34 show the observed resource usage for seven channels of the vocoder.

Based on Figure 12.32, the peak DSP cycle usage is less than 82% for seven channels. Considering only this measurement and the 11.15% load-per-channel estimate suggests the possibility of inserting another channel. Examining the IDRAM usage in Figure 12.33 reveals that 65,516 bytes of the 65,536 bytes available were used by the application. Based on Figure 12.33 for memory allocation, an additional channel is not possible. Viewing the UTL messages in Figure 12.34 reveals that 97% of the reserved heap space is being used by the seven-channel application.


FIGURE 12.32 CPU load graph for the seven-channel G.729/A vocoder.

FIGURE 12.33 Memory requirements for the seven-channel G.729/A vocoder.

12.13.4 Getting Eight Channels on the G.729/A Application

In an earlier section, it was determined that roughly 18% of the available DSP clock cycles are available to support an additional processing load. Clearly, this is sufficient bandwidth to handle an additional channel at 11.15%. Adding

FIGURE 12.34 Percentage of heap memory used in seven-channel G.729/A vocoder application.


an eighth channel, however, requires that the available IDRAM is also sufficient. One possible way to reclaim IDRAM is by reducing the stack size. Given the vocoder's data memory requirements per channel and the allocated stack size, it is not likely that reducing the stack would be sufficient to support an additional channel. A more reasonable approach would be to try to reduce the amount of memory required per vocoder channel. This has the benefit of memory savings across all channels. Making a decision on what can possibly be cut out requires a careful analysis of the encoder and decoder state information and the context in which it is used. If some of the state information is not completely necessary, it could possibly be eliminated, or allocated as scratch memory rather than as persistent memory.

12.13.5 Going beyond Eight Channels in the G.729/A Application

Assuming that enough IDRAM could be freed up to allow for greater than eight channels, the limiting factor would then become the load exerted on the DSP. Getting additional channels requires making additional optimizations to the vocoder source code. CCS provides profiling tools that can be used to determine which parts of the code exert the biggest load on the DSP.

In CCS, enabling a profiling session allows the developer to choose functions to be measured for execution time and other statistics. From the statistics gathered, a determination can be made as to which functions would benefit most from further optimization. Profiling the code is an iterative process, which should be done from the top down. The remainder of this section describes the actual profiling done on the G.729/A application, and the results discovered.

12.13.5.1 Profiling the Vocoder from the Top Level

Starting at the highest level, a profiling session was run which compared the G.729/A encoder versus decoder CPU clock cycle requirements. Figure 12.35 shows the profiler session results for the functions "g729A_apply" and "ig729A_apply," which implement the algorithms for the encoder and decoder, respectively. To give the session a frame of reference, the function "thrAudioprocRun" was profiled as well, since it represents the thread from which the channel processing algorithms are run.

From Figure 12.35, it can be seen that each function was executed five times, indicating five frames of data processed on a single channel. The total count

FIGURE 12.35 CCS profiling session for the encoder and decoder algorithms.


of clock cycles occupied by all five executions is indicated in the column "Incl. Tot . . .." Shifting over three columns to the right to "Incl. Ave . . ." shows the average count of cycles occupied over the five iterations. The average count of clock cycles per frame processed was 157,673. A few quick calculations show that this value does indeed make sense. First, consider that the DSP clock speed is 133 MHz, or 133,000,000 cycles per second. Since each frame is 80 samples taken at 8 kHz, the data collection time elapsed per frame is 1/100 of a second. So for each frame there are 133,000,000/100, or 1,330,000, DSP clock cycles available. Each channel takes on average 157,673 cycles to process a single frame of data. For seven channels the total average requirement is 1,103,711 cycles, which is roughly an 83% load on the DSP. This number is close to the actual DSP clock cycle usage recorded earlier, which was 81.88%.

Viewing the function "thrAudioprocRun" as a combination of the encoder and decoder algorithms shows that most of the processor load is spent encoding the data. Comparing "g729A_apply" with "ig729A_apply" shows that the encoder requires roughly four times the processing power of the decoder, since on average the encoder requires 124,369 cycles per frame. Based on these observations, it makes sense to spend more time profiling the encoder to look for potential optimization points.

12.13.5.2 Profiling the Encoder

The encoder's main function "Coder_ld8a" was broken down into more granular functions, which were then profiled in a new session. The raw data were then exported from the session to a Microsoft Excel spreadsheet. The statistics acquired from the spreadsheet were then used to prioritize each function in terms of its need for optimization.

In this case the encoder processed four data frames; during execution, some functions execute only once per frame while others execute multiple times. In order to take this into account, the last column in the spreadsheet calculated the average cycles per frame for each function. The functions were then sorted from lowest optimization priority to highest. From this ranking it was determined that the function "ACELP_Code_A" is the highest priority candidate for optimization.

Table 12.2 displays some important data fields from the encoder profiling session. The fields displayed correspond to function name, code size, number of times executed, total clock cycles used, average clock cycles used, average clock cycles multiplied by number of times executed, and average cycles per data frame, respectively.

12.13.5.3 Profiling ACELP_Code_A

Focusing on the "ACELP_Code_A" function, it can be further decomposed into more granular functions to be profiled in a new session. Table 12.3 shows the results of the decomposition and profiling session. From this analysis, it is


TABLE 12.2
Encoder Profiling Session Statistics

Areas                Code Size   Rn     Total     Average   Total Average Cycle   Average Cycle/Frame
Check_Parity_Pitch        128     4       120          30           120                      30
Enc_lag3                  336     8       422          52           416                     104
Update_exc_err            776     8      1411         176          1408                     352
Lag_window                180     4      2046         511          2044                     511
Corr_xy2                  828     8      2710         338          2704                     676
G_pitch                  1464     8      3940         492          3936                     984
Weight_Az                 176    24      4893         203          4872                    1218
Int_qlpc                  516     8    12,507        1563        12,504                    3126
Levinson                 2688     4    12,607        3151        12,604                    3151
Residu                    292    16    15,090         943        15,088                    3772
Autocorr                 1236     4    17,197        4299        17,196                    4299
Qua_gain                 3552     8    25,686        3210        25,680                    6420
Pitch_ol_fast            3428     4    41,177      10,294        41,176                  10,294
Az_lsp                   1944     4    46,621      11,655        46,620                  11,655
Pitch_fr3_fast            616     8    57,114        7139        57,112                  14,278
Qua_lsp                   120     4    81,035      20,258        81,032                  20,258
Syn_filt                  876    56   126,972        2267       126,952                  31,738
ACELP_Code_A              444     8   147,926      18,490       147,920                  36,980
Coder_ld8a               8308     4   557,397     139,349       557,396                 139,349

clear that the bulk of the processing power is used by the "D4i40_17_fast" function, making it a prime candidate for optimization.

Following the methodology outlined in this section, the entire program can be systematically broken down and analyzed in a meaningful fashion. Using the top-down approach helps the developer to narrow down the focus to only the most critical functions. The developer will then know where to best spend energy for the actual code optimization.

TABLE 12.3
Details of Encoder Profiling Session Statistics

Areas            Size    Rn     Total     Average   Total Average Cycle   Average Cycle/Frame
Cor_h_X           840    18    51,013        2834        51,012                    1136
Cor_h            4060     9    33,182        3686        33,174                    7372
D4i40_17_fast    9284     9   103,077      11,453       103,077                  22,906
ACELP_Code_A      460     9   159,963      17,773       159,957                  35,546


12.14 Conclusions

From our implementation, the TI TMS320C6701 EVM proved to be a useful platform for developing the multichannel G.729A vocoder application. The application development time and effort were minimized by adopting the XDAIS algorithm standard, and using the RF3 along with the tools available in TI CCS and Microsoft Visual Studio. Getting to a multichannel vocoder implementation on the EVM took several steps. The first step was modifying the RF3 example application to work with the TI TMS320 C6x EVM. From there, an XDAIS-compliant version of the ITU G.729/A vocoder algorithm was created. Finally, the vocoder and modified RF3 framework were combined to create a two-channel G.729/A application to run on the EVM.

Afterwards, more channels were added to the two-channel application. The end result was a seven-channel G.729/A application, which matched the expected outcome based on available hardware resources. This result confirms an estimate from Reference [11], a TI publication, "G.729/A Speech Coder: Multichannel TMS320C6x Implementation," which lists the measured performance of the G.729A vocoder as 18.2 MHz. This 18.2 MHz per channel would yield approximately 7.3 channels on a 133-MHz C6x DSP. Finally, we found that reducing the vocoder state variable data and further optimizing the source code could potentially yield additional channels.

References

1. Texas Instruments Application Report SPRA802, Writing DSP/BIOS Device Drivers for Block I/O, February 2002.

2. Tokunbo Ogunfunmi, Implementation of a H.263++ video codec in real-time application on the TI C6000 DSP, Proceedings of the TI Developer Conference, Houston, TX, February 2004.

3. Shi-Lei Han and Tokunbo Ogunfunmi, Embedded Ogg Vorbis decoder: An efficient implementation on the TI TMS320C6416 DSP, Proceedings of the TI Developer Conference, Dallas, TX, March 2007.

4. ITU-T Recommendation G.729 (03/96), Coding of Speech at 8 kbit/s using conjugate structure algebraic-code-excited linear-prediction (CS-ACELP).

5. James Foote and Tokunbo Ogunfunmi, Internal Technical Report, Speech Codec DSP Project, Santa Clara University, Santa Clara, CA, November 2002.

6. Texas Instruments Data Sheet, TMS320 C6X DSP Data Sheet, SPRS088B, September 2001.

7. Texas Instruments Application Report, TMS320 DSP Algorithm Standard Rules and Guidelines, SPRU352D, January 2001.

8. Texas Instruments, Steve Blonstein, Reference Frameworks for eXpressDSP Software: A White Paper, SPRA094, March 2002.

9. Texas Instruments, Davor Magdic, Alan Campbell, and Yvonne DeGraw, Reference Frameworks for eXpressDSP Software: RF3, A Flexible, Multi-Channel, Multi-Algorithm, Static System, SPRA793B, March 2002.

10. Texas Instruments, Alan Campbell, Davor Magdic, Todd Mullanix, and Vincent Wan, Reference Frameworks for eXpressDSP Software: API Reference, SPRA147, March 2002.

11. Texas Instruments Application Report, Chiouguey Chen and Xiangdong Fu, G.729/A Speech Coder: Multichannel TMS320C62x Implementation, SPRA564B, February 2000.

Bibliography

1. Texas Instruments Application Report, TMS320C6201/6701 Evaluation Module User's Guide, SPRU269D, December 1998.

2. Texas Instruments, Stig Troud, Making DSP Algorithms Compliant with the TMS320 DSP Algorithm Standard, SPRA579B, November 2000.

3. Razvan Ungureanu, Bogdan Costinescu, and Costel Ilas, ITU-T G.729A Implementation on StarCore SC140, AN2151/D Rev. 0, July 2001.

4. Texas Instruments Application Report, Code Composer Studio Getting Started Guide, SPRU509, 2001.


13
Conclusions and Future Directions for Speech Coding

13.1 Summary

The field of speech processing is very wide and is changing rapidly in several important ways. In this book, we have presented an introduction to the principles of speech coding. Speech signals are very interesting and form the basic method of human communication.

Now, in this final chapter, we summarize the previous chapters and make a few comments on the future directions for speech coding. We began in Chapter 1 with an introduction to speech coding. We discussed speech signal characteristics, classification of speech, and models for generation of speech signals. We reviewed some of the principles behind the varieties of speech coders available today. We then discussed various means of measuring speech quality and summarized speech coding standards.

In Chapter 2, we reviewed the significant signal processing results that are very widely used in speech processing and coding.

In Chapter 3, we focused on sampling theory and a few related topics as they apply to the subject of speech coding.

In Chapter 4, waveform coding and quantization are discussed extensively. In particular, we explained the theoretical basis for the μ-law and A-law logarithmic quantizers that have been standardized for speech coding by the ITU.

Chapter 5 deals with the principles of differential coding and delineates the ITU G.726 ADPCM standard. Deltamodulation, which is a particular differential coding system that uses just a 1-bit quantizer, was also discussed in this chapter.

In Chapter 6, we explained the principles behind the theory of linear prediction. It is a very powerful and widely used technique in the field of signal processing. It is the basis of many of the speech coding algorithms that are popular today.

In Chapter 7, we presented the LPC model of speech generation. This model is based on the acoustic model of speech generation using vocal tract excitation. The concept of CELP speech coders, which is the basis of many of the modern speech coders, was also introduced.


In Chapter 8, we introduced the technique of VQ, which has found wide use in speech, audio, image, and video compression. We showed where VQ is used in speech coding standards. This follows scalar quantization, which was discussed in Chapter 3.

Optimal search of VQ is computationally very expensive. We discussed search methods and structures introduced to reduce memory and computations. Algorithms for suboptimal searching include MSVQ, split VQ, conjugate VQ, PVQ, and PVQ-MA.

The powerful AbS method of speech coding was presented in Chapter 9. Many speech coders are based on this method. There are two types of AbS encoders: open loop and closed loop. Examples such as LPC, RPE-LTP, and MELP are open-loop AbS speech coders that analyze the speech input and extract parameters such as LPC coefficients, gain, pitch, etc., which are then quantized and transmitted or stored for synthesis later. In a closed-loop AbS system, a decoder is embedded in the encoder. The parameters are extracted by encoding and then they are decoded and used for synthesizing the speech. The synthetic speech is compared with the original speech, and the MSE, for example, is minimized (in a closed loop) to further choose the best parameters in the encoding. The CELP is an example of an AbS speech coder.

In Chapter 10, the iLBC speech codec designed for robust voice communication over the Internet using IP was presented. The codec is robust to degradation in speech quality due to lost frames, which occurs in connection with lost or delayed IP packets. The iLBC is suitable for real-time communications such as telephony and videoconferencing, streaming audio, archival, and messaging. It is commonly used for VoIP applications such as Skype, Yahoo Messenger, and Google Talk, among others.

In Chapter 11, we focused on signal processing in VoIP systems. We addressed the issue of impairments to speech quality in VoIP networks and discussed signal processing algorithms to mitigate their effects.

In Chapter 12, a real-time implementation of the ITU G.729A speech coder on a DSP chip was presented. Some of the issues that are important for multichannel real-time implementations were discussed.

We hope the material presented here enables the reader to have a firm grasp of the principles and be able to delve more deeply into the subject of speech coding.

Now, we discuss the future research directions for speech coding.

13.2 Future Directions for Speech Research

The advances made in speech coding in the last 40 years or so have been dramatic and have improved human communication by speech.


The speech codec standardization process is somewhat detached from the latest speech research results. The best ideas from speech research do not necessarily make it into the codec standards.

Recent papers have discussed the future of speech coding research [1–3,4]. Also, recall from Chapter 1 that the coding efficiency (or bit rate) for speech

and the performance of speech coders are usually judged by one or more of the following factors:

• Speech quality (intelligibility)
• Communication delay
• Tandem of speech coders
• Computational complexity of implementation
• Power consumption
• Robustness to noise (channel noise, signal fading, and intersymbol interference)
• Robustness to packet losses (for packet-switched networks)

Improving these factors can lead to an increase in the performance of speech coders and forms the basis of many of the speech coding research topics. We will point out some of the main issues in speech coding research for the future.

Intelligibility (speech quality) is measured by whether the speech is easily understandable. Speaker identifiability and naturalness of the speech quality are other measures. New measures of intelligibility apart from the common MOS have been introduced. However, there is a need for new performance measures that incorporate the codec and the many heterogeneous network performance measures.

One of the possible topics for speech coding research in the future is the development of bandwidth-scalable and quality-scalable speech coders. Quality is measured by SNR. With the many choices in communication channels over which speech can be transmitted, there is a need for variable-bandwidth and variable-quality speech codecs. These choices include the PSTN, digital cellular networks, wireline packet-switched networks, WLAN, and so on.

For example, in video coding such as ITU-T H.264 and MPEG-4, there is a base layer, and additional enhancement layers can be added for improved video quality. The base layer may be better error-protected than the enhancement layers so that there is a minimum level of quality maintained even in the presence of transmission errors. In speech coding, furthermore, we can have a base layer that codes telephone band speech (from 200 to 3400 Hz) and enhancement layers that code speech at frequencies higher than 3400 Hz. If bandwidth is not sufficient, the enhancement layer may be dropped.

Some newer variable rate digital cellular codecs offer variable bandwidth rates and may change the coded band on a frame-by-frame basis. This may mean that the bandwidth scalability option may become less of an issue for


these cellular codecs. For example, digital cellular carriers may prune bit rates for wireless access points as the number of users increases as a form of scalability.

In the future, mobile ad hoc networks may attempt to save battery power by using unequal error protection and allowing intermediate nodes to prune bit rates.

Video coding methods already offer multiple transmitted streams at variable, scalable quality and bandwidth requirements. Speech coding methods can also offer such a diversity advantage by using multiple transmitted bit streams. This will be critical for voice-over mobile ad hoc networks, since nodes along a route may leave at any time, and the re-establishment of a route can take at least 100 ms. Multiple descriptions coding is an example of a diversity-based coding method. Here, the total bandwidth available is split into multiple side channels with bit rates that are a fraction of the full rate. In the future, efficient, high-quality multiple descriptions codecs for speech will need to be developed. Also, this method of multiple descriptions coding may need to be compared with simple repeated transmission of a single description bit stream over multiple routes for reduction in bit rates.

For example, a new 8–32 kbps scalable wideband speech and audio codec for the ITU-T G.729EV standard has recently been proposed [5]. This combined coder for speech and audio is indicative of future trends.

Several different types of speech coders are currently deployed in several different types of communication networks. Tandeming is the connection of a speech codec with itself or with another speech codec. As these different speech coders are deployed in a heterogeneous network, degradations accumulate and may appear in the output speech signal, especially if the codecs use postfiltering.

The PSTN backbone consists of a wired, circuit-switched network that uses time-division multiplexing. This network was not designed for carrying packet-switched speech.

The G.711, the speech codec most often used in the PSTN, was designed to work with several asynchronous tandem speech codecs like itself. Up to 8 G.711 codecs have been shown to work in tandem without degradations. However, tandem connections of G.711 with other codecs can lead to loss in end-to-end voice quality performance. With VoIP moving to the forefront, codecs other than G.711 may be relied upon for voice coding in the backbone network. Also, it is common for mobile-to-mobile calls to have asynchronous tandem connections of different codecs because cellular network phones may have different codecs. Future research on how to avoid these degradations will be useful.

Transcoding is defined as a change of format (translate and code) from one speech codec standard to another. It may typically involve decoding of the transmitted encoded speech in one format and then re-encoding it to the new format. Transcoding of speech from one format to another may lead to


degradation in the quality of speech. Research into new transcoding methods in order to avoid these degradations will be useful.

Furthermore, we already see that packet-switched voice is being carried over WLAN links, which again may not use G.711. For VoIP and voice-over WLANs, it seems that tandem-free operation may be the exception rather than the rule. To improve the performance of asynchronous tandems, temporary fixes such as removing or modifying any postfilters present should be pursued, plus additional work on parameter transcoding of different codecs is needed. In addition, for multiple-cascaded WLAN access points or mesh networks used for voiced speech communications, special issues may arise that need to be resolved to maintain speech quality.

Some of the research issues here include adapting speech codecs designed for the PSTN for use in newer networks such as VoIP and voice-over WLAN and voice-over multihop networks, adding comfort noise to standards such as G.711, G.726, and G.728 as described in Chapter 11, adding packet-loss concealment to other speech codec standards such as G.711, etc.

Due to the many different channels used for speech communications, error-prone channels, especially for packet speech, need to have some sort of packet-loss concealment. It was shown in Chapter 11 on VoIP and also in Chapter 10 on iLBC that packet-loss concealment helps make the speech coder more robust to packet errors. For future research, we need to use schemes that take advantage of everything that is known about the current state of the speech codec and the coded speech frame in order to improve their performance. That is important especially as we strive to maintain conversational speech quality over more heterogeneous networks.

When background noise additively corrupts the speech signal to be coded, the noise and the speech are both coded together. As more speech codecs become based on LP-based speech production models (such as LPC and CELP), this model can be forced upon nonspeech signals like the background noise. This will create artifacts and lower the quality of the coded speech. One solution is to use speech enhancement (or noise cancellation) techniques to suppress the noise components. The speech enhancement methods can help the speech model parameter extraction processes used in low-bit-rate speech coders.

Further research needs to be done to determine appropriate speech enhancement methods for different speech codecs [6,7].

References

1. Gibson, J., Speech coding methods, standards and applications, IEEE Circuits and Systems Magazine, pp. 30–49, Fourth Quarter, 2005.

2. Atal, B.S., V. Cuperman, and A. Gersho, eds, Advances in Speech Coding, Springer, Berlin, 1990.

3. Atal, B.S. and L.R. Rabiner, Speech research directions, AT&T Technical Journal, 65(5), 75–88, 1986.

4. Benesty, J., M.M. Sondhi, and Y. Huang, eds, Handbook of Speech Coders, Springer, Berlin, 2008.

5. Ragot, S., B. Kovesi, D. Virette, R. Trilling, and D. Massaloux, A 8–32 kbit/s scalable wideband speech and audio coding candidate for ITU-T G.729EV standardization, Proceedings of the IEEE ICASSP, pp. I1–I4, 2006.

6. Loizou, P., Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, FL, 2007.

7. Koo, B., J.D. Gibson, and S.D. Gray, Filtering of colored noise for speech enhancement and coding, IEEE Transactions on Signal Processing, 39, 1680–1684, 1991.

Bibliography

1. Atal, B.S., The history of linear prediction, IEEE Signal Processing Magazine, 23(2), 154–161, March 2006.


Index

μ-Law, 81, 91
  compressor curves, 82
  continuous, 86, 87
  encoding/decoding tables, 92
  segmented approximations, 86

A

AbS. See Analysis-by-Synthesis (AbS)
Acoustic echo canceler (AEC), 283–284
  adaptive filter, 284, 285
  additional functional unit, 284
  echo attenuation, 284
  howling condition, 284, 285
  ITU Recommendation (G.167), 284–285
  removing attenuation, 285
Adaptive codebook, 181, 202, 214, 215, 221, 245
  in adaptive VQ, 202
  in CELP bit stream, 212
  in closed-loop LTP analysis, 181
  in decoder, 212
  gains quantization, 235
  pitch period in, 212
  in speech coding, 202
  uses, 210, 211
Adaptive codebook search, 180–182, 231–233. See also Closed-loop pitch search
  boundaries, 231, 232
  complexity version, 232
  G.729, 231, 232
  interpolated past excitation, 233
  optimal delay, 231, 232
  subframes, 231
Adaptive deltamodulation (ADM), 129–130
Adaptive differential PCM (ADPCM), 13
Adaptive filter (AF), 278, 279
  adaption gain, 280, 281
  coefficient, 280
  FIR, 279, 280
  LMS algorithm, 280
  NLMS algorithm, 280
  signal vectors, 280
Adaptive jitter buffer (AJB), 276, 286, 288
Adaptive predictive coding (APC), 14
Adaptive VQ, 202
  codebook, 202
  two-stage cascaded, 202
ADC. See Analog-to-digital converter (ADC)
ADM. See Adaptive deltamodulation (ADM)
ADPCM. See Adaptive differential PCM (ADPCM)
AEC. See Acoustic echo canceler (AEC)
AF. See Adaptive filter (AF)
AJB. See Adaptive jitter buffer (AJB)
A-Law, 82, 90, 93
  compressor, 83
  compressor curve positive half, 83
  encoding/decoding tables, 92
  equation, 83
  segmented approximations, 86, 89
Algorithmic reference framework (ALGRF), 303
  heap allocation, 303
  IALG_MemRec struct, 303
  scratch memory, 303
ALGRF. See Algorithmic reference framework (ALGRF)
AMDF. See Average magnitude difference (AMDF)
Analog-to-digital converter (ADC), 61, 66
Analysis-by-Synthesis (AbS), 14, 182, 207, 336
  CELP encoder components, 207, 208
  closed-loop, 207, 209
  encoder, 208
  parameters, 207
  perceptual weighting filter, 207, 209


APC. See Adaptive predictive coding (APC)
AR. See Autoregressive (AR)
Autocorrelation matrix
  properties, 138–139
Autoregressive (AR), 47, 139
  models for speech signals, 47
  process, 47
  Yule–Walker equations, 47
Average magnitude difference (AMDF), 55

B

Backward linear prediction (BLP), 135, 143, 151
  backward linear predictor, 143, 144
  cross-correlation vector, 145–146
  M×1 tap-input vector, 145
  M×1 tap-weight vector, 145
  minimum PEP, 145
  predicted value, 143
  prediction-error filter, 143, 144
Backward prediction-error filter (BPEF), 143, 146
  backward prediction error, 146, 147
  backward predictor, 146
  input sample vector, 146
  predictor coefficients relationship, 146–147
  weight vector, 147
  Wiener–Hopf equations, 147
Bark spectral distortion (BSD), 15
Binary Input Output Studio (BIOS), 295
Binary search VQ, 197–198
BIOS. See Binary Input Output Studio (BIOS)
Bit allocation, 251
Bits per second (bps), 17
BLP. See Backward linear prediction (BLP)
BPEF. See Backward prediction-error filter (BPEF)
bps. See Bits per second (bps)
BSD. See Bark spectral distortion (BSD)
Buffering delay variation. See Voice-over Internet Protocol (VoIP)—jitter

C

CA. See Companding advantage (CA)
CableLabs. See Cable Television Laboratories (CableLabs)
Cable Television Laboratories (CableLabs), 243
CCS. See Code composer studio (CCS)
CDMA. See Code Division Multiple Access (CDMA)
CELP. See Code-excited linear prediction (CELP)
CELP AbS structure, 208
  decoder, 210
  encoder, 210
  FS 1016 coder, 210–216
  linear prediction model, 208
  perceptual weighted error, 209
CELP-based coders, 174. See also Hybrid speech coders
  bit rates, 176
  encoder structure, 177
  excitations, 175
  excitation source model, 177, 178
  fixed codebooks, 176, 178
  low-delay, 175
  LTP pitch synthesis filter, 177
  LTP transfer function, 178
  MELP coder, 175
  naturalness, 175
  optimum parameters, 179
  periodic impulse, 175
  pitch parameters, 179
  speech coders, 176
  STP formant synthesis filter, 177, 178
  STP transfer function, 177
  synthesis filter parts, 176
  vector quantization, 176
  VSELP coder, 176
Centroid rule, 98
Cepstrum, 48
Clock skew, 288–289
  delay variation, 289
  slip frequency, 289
Closed-loop differential quantizer, 112, 113
  quantization errors, 112
  reconstruction error, 112


Closed-loop pitch search, 180, 232. See also Adaptive codebook search
  LTP analysis diagram, 181
  LTP parameters, 181
  pitch period, 182
CNG. See Comfort noise—generation (CNG)
Codebook design, 192
Code composer studio (CCS), 304
Code Division Multiple Access (CDMA), 10
Code-excited linear prediction (CELP), 10, 335, 336
Cognitive model, 268, 269, 270
  aggregates disturbances, 270, 271
  bad intervals identification, 270
  brain loudness perception, 270
  mimic ear resolution, 269
  perceptual subtraction, 270
  remove filter influence, 269
  remove gain variation, 270
  sub-blocks, 270
  time alignment, 269
  transforms to MOS-LQO, 271
Comfort noise, 282, 339
  background noise, 283
  generation (CNG), 217
  generator, 279, 282, 283
  issues, 282
  NLP activation, 282
  VAD, 283
Communication, 1
  performance issues, 17
  speech signal, 1
Companding advantage (CA), 84–85
Compandor, 77
  characteristics for 4-bit quantizer, 77
Conjugate-structure algebraic code-excited linear prediction (CS-ACELP), 217, 219
  decoder principles, 221
  encoder principles, 220
  signal flow, 236
Conjugate VQ, 201
  advantages, 201
  codebooks, 201
Continuously variable slope deltamodulation (CVSD), 130
Continuous signal
  aliasing, 64–65
  DAC, 63
  in-band distortion, 64–65
  practical reconstruction, 63
  spectrum, 62
Continuous-time (CT) signal, 29
CS-ACELP. See Conjugate-structure algebraic code-excited linear prediction (CS-ACELP)
CT signal. See Continuous-time (CT) signal
CVSD. See Continuously variable slope deltamodulation (CVSD)

D

DAC. See Digital-to-analog converter (DAC)
DAM. See Diagnostic acceptability measure (DAM)
Decoder, 11, 69
Decoder project, 312
  execution loop, 313
  G729A_Decoder, 312
  G729VectorCompare, 313
  test vectors comparison, 312, 313–314
Decoding process, iLBC’s, 251–252
  decoded excitation signal construction, 254
  excitation decoding loop, 253–254
  LPC filter reconstruction, 252–253
  multistage adaptive-codebook decoding, 254
  start state reconstruction, 253
Deltamodulation, 125, 131, 335
  adaptive, 129
  linear, 125
Department of Defense (DoD), 210
DFT. See Discrete Fourier Transform (DFT)
Diagnostic acceptability measure (DAM), 15
Diagnostic rhyme test (DRT), 15
Differential coding, 111, 131
  closed-loop, 114, 116
  open-loop, 112
  simple difference signal, 111


Differential pulse code modulation (DPCM), 185, 198
Digital filter structures, 38
  all-pole digital filter, 40
  all-zero digital filter, 40
  filter implementation, 38
  FIR digital filter, 41
  FIR filter output, 40
  FIR transversal filter, 38
  IIR digital filter, 39, 40
  lattice filter, 38
Digitally linearizable companding laws, 85
Digital signal processing (DSP), 8, 31, 32, 295
  advantages, 32–34
  digital filter structures, 38
  DT signals, 31, 32
  exercise problems, 57–58
  Fourier transform, 37
  speech processing, 27
  Z transforms, 36
Digital-to-analog converter (DAC), 63
  frequency response, 64
  impulse response, 64
  reconstruction filter spectral magnitude, 64
Discontinuous transmission (DTX), 217
Discrete Fourier Transform (DFT), 38
Discrete-time Fourier transform (DTFT), 37, 48
Discrete time (DT) signal, 31. See also Digital signal processing
  impulse train signal, 35
  linear convolution, 36
  periodic signal Fourier coefficients, 35
  representation, 32
  sampled version, 35
  sampling, 31, 34, 35
  shifted unit pulse, 31
  unit pulse, 31
  unit step digital signal, 32
  Z transform, 36
Distortion measure, 188–189, 190
  autocorrelation function, 192
  Hamming, 191
  Itakura–Saito, 192, 194
  linear prediction, 191
  MSE, 190–191
  perceptual, 192
  WMSE, 191
DoD. See Department of Defense (DoD)
Double-talk detector (DTK det), 279, 281
  condition, 281
  correlation function, 281
  filter adaptation, 281
  filter coefficients, 281–282
  hangover time, 281
  online–offline approach, 282
DPCM. See Differential pulse code modulation (DPCM)
DRT. See Diagnostic rhyme test (DRT)
DSK. See DSP Starter Kit (DSK)
DSP. See Digital signal processing (DSP)
DSP Starter Kit (DSK), 295
DTFT. See Discrete-time Fourier transform (DTFT)
DTK det. See Double-talk detector (DTK det)
DT signal. See Discrete time (DT) signal
DTX. See Discontinuous transmission (DTX)

E

EMBSD. See Enhanced modified bark spectral distance (EMBSD)
Encoder project, 311
  execution loop, 312
  G729A_Encoder, 311
Encoding process, iLBC’s, 247
  autocorrelation coefficients, 247–248
  bit allocation, 251
  codebook search, 250, 251
  encoder, 247
  frame positions, 248
  LPC analysis, 247
  LPC coefficient, 248, 249
  packetization, 251
  perceptual weighting filter, 249
  residual computation, 249
  speech preprocessing, 247
  start state identification, 249
  start state quantization, 249–250
Enhanced modified bark spectral distance (EMBSD), 15
Enhanced variable rate coder (EVRC), 10
Enhancer outline, iLBC’s, 256
  enumeration, 257–258


  objective, 256
  PSSQ’s linear combination, 257
ETSI. See European Telecommunication Standards Institute (ETSI)
European Telecommunication Standards Institute (ETSI), 12
Evaluation module (EVM), 295, 304, 307. See also RF3
  C6x hardware, 302
  compatibility, 301, 302
  conditional code, 325
  G.729/A real-time execution, 318
  ITU-T G.729/A migration, 314
  memory banks, 304
  running RF3, 304
  running seven channels, 327
  running test vectors, 324–325
EVM. See Evaluation module (EVM)
EVRC. See Enhanced variable rate coder (EVRC)
Excitation codebook, 208, 210, 211, 218, 235
Excitation source, 165, 166, 176, 177, 178
  gain, 165
  LPC systems, 166, 167
  MELP coder, 168, 169
  modeling, 166
  MPE-LPC model, 167, 168
  pitch estimation, 166
  pitch period, 165
  RELP coder, 168–170
  RPE-LPC model, 170–171
  speech analysis model, 166
  synthesis filter, 165
Exclusive-or (XOR), 290
Exercises, 24, 68, 102–108, 131–133, 162–163, 182, 204, 240, 272
eXpressDSP algorithm standard (XDAIS), 295, 300
  compliant algorithms, 300
  IALG interface, 300
  modified signal flow, 301
  RF1, 300
  RF3, 300, 301

F

Fast Fourier transform (FFT), 170
FEC. See Forward error correction (FEC)
FFT. See Fast Fourier transform (FFT)
Finite impulse response (FIR) filter, 38, 41, 135, 136, 279, 280
FIR filter. See Finite impulse response (FIR) filter
Fixed codebooks, 176
FLP. See Forward linear prediction (FLP)
Form factor, 126, 127
Forward error correction (FEC), 289–290
Forward linear prediction (FLP), 135, 139, 151
  AR process, 139, 140
  correlation matrix, 142
  cross-correlation vector, 142
  forward linear predictor, 139, 141
  forward prediction error, 140
  FPEF, 139, 140
  M×1 tap-input vector, 141
  M×1 weight vector, 141
  minimum PEP, 141
  prediction-error filter, 139
  Wiener–Hopf equations, 142–143
Forward prediction error filter (FPEF), 139, 141
Fourier transform, 8, 35, 37, 43, 48, 62
FPEF. See Forward prediction error filter (FPEF)
Frequency shift key (FSK), 118
FS 1016 CELP coder
  adaptive codebook, 214, 215
  automatic gain control, 215, 216
  bit allocation, 214, 215
  bit stream dividing, 212
  decoder, 212
  encoder, 211
  excitation codebook, 211
  excitation sequence, 210
  first-order smoothing filters, 216
  LTP parameters, 214
  overlapping codebook, 211, 212
  PCM speech, 210
  perceptual weighting filter, 212
  pitch multiplication problem, 215
  pitch periods, 214, 215
  pitch period index, 211, 212
  pitch period resolutions, 215
  postfilter, 212, 213
  short-term LPA, 210


  speech output, 216
  square root operation, 216
  stochastic codebook gain index, 211, 212
  weighted MSE, 211
FSK. See Frequency shift key (FSK)
Full search VQ, 196
  computational complexity, 196
  memory storage, 196

G

G.729/A, 314, 315, 318
  adapting library files, 315
  algG729A and algInvG729A, 315
  application for RF3, 315
  appModules, 315, 316
  appResources.h, 316
  appThreads.c/.h, 316–317
  C67xEMV_RF3, 317
  changing TI modules, 315, 316
  link.cmd file, 318
  source files, 316, 317
  ThrAudioproc.c/.h, 317
G.729/A on TI C6X DSP, 301
  drawback using RF3, 302
  LIO, 302
G.729/A, optimization, 318
  application code allocation, 320
  basic_op.c, 321
  cache enable setting, 320
  code changes, 320, 321–322
  DSP/BIOS settings, 319–320
  getting eight channels, 328, 329
  intrinsic functions, 321
  options file level, 319
  project settings, 318
  saturation conditions, 322
G.729/A, real-time performance, 322
  clock cycle requirements, 323–324
  CPU load, 324
  heap memory usage, 323
  IPRAM, 323
  memory requirements, 322–323
  resource requirements summary, 324
  UTL diagnostic features, 323
G.729/A, vocoder, 306
  CPU load graph, 328
  creating new application, 314, 315–318
  data type sizes, 307
  heap memory used, 328
  memory requirements, 328
  modifications, 306–307
  modified signal flow, 301, 314
  N channel signal flow, 326
  optimization, 318–321
  profiling ACELP_Code_A, 330, 331
  profiling encoder, 330, 331
  profiling vocoder, 329–330
  real-time performance, 322–324
  running seven channels, 327
  running test vectors, 307
GIPS. See Global IP Sound (GIPS); Global IP Solutions (GIPS)
Global IP Solutions (GIPS), 243
Global IP Sound (GIPS), 243
Global System Mobile (GSM), 10
Granular noise, 72, 129
GSM. See Global System Mobile (GSM)
GSM-adaptive multirate coder (GSM-AMR), 10
GSM-AMR. See GSM-adaptive multirate coder (GSM-AMR)

H

Hybrid speech coders, 11

I

IALG interface, 300, 302, 308, 315. See also RF3
  IALG_Fxns, 302–303
  IALG_MemRec, 303
IDRAM. See Internal data memory (IDRAM)
IETF. See Internet Engineering Task Force (IETF)
IIR filter. See Infinite impulse response (IIR) filter
iLBC. See Internet low-bit-rate codec (iLBC)
Infinite impulse response (IIR) filter, 38
Internal data memory (IDRAM), 304, 305, 329
Internal program memory (IPRAM), 304, 323


Internet Engineering Task Force (IETF), 12
Internet low-bit-rate codec (iLBC), 10, 243, 244
  adaptive codebook, 246
  advantages, 245, 246
  algorithm, 245–246
  bit allocation, 251
  CELP coders, 246
  computational complexity, 244
  decoding process, 251–254
  encoder input, 244–245
  encoding process, 247–251
  enhancement techniques, 255–256
  enhancer outline, 256–258
  frame/block lengths, 243
  GIPS, 244
  PLC techniques, 254–255
  postfiltering, 259
  real-time communications, 243
  speech coding system, 243
  structure development, 243
  synthesis, 259
Internet Protocol (IP), 243
Interpolation filters, 62
  convolution theorem, 63
  interpolation formula, 63
  LPF, 62
Interpolation formula, 63
IP. See Internet Protocol (IP)
IPRAM. See Internal program memory (IPRAM)
Itakura–Saito distortion measure, 192, 194
  autocorrelation function, 192
  time-varying weight matrix, 192
ITU (International Telecommunication Union), 12, 13
ITU G.711 PCM standards, 91
  μ-law to linear conversion, 93, 94
  A-law to linear conversion, 95
  conversion between codes, 91
  eight-bit μ-law code, 92
  G.711 A-law quantizer, 93
  G.711 μ-law quantizer, 93
  linear to A-law conversion, 94–95
  linear to μ-law conversion, 92, 94
  quantization step, 93, 95
ITU G.726 ADPCM algorithm, 118, 119
  adaptive decoder, 119
  adaptive quantizer, 119–120
  normalized quantizer, 120
  predictor adaption, 123
  predictor structures, 123
  quantizer adaption speed control, 121
  quantizer scale factor adaption, 121
  speed control parameter estimation, 121–125
ITU G.729A speech decoder
  adaptive gain control, 239
  high-pass filtering, 238
  long-term postfilter, 237–238
  signal upscaling, 238
  tilt compensation, 238–239
ITU G.729A speech encoder, 217, 218
  adaptive-codebook search, 231, 234, 235
  adaptive-codebook structure, 231
  autocorrelation, 230
  closed-loop pitch search, 232
  coarse pitch estimation, 229
  delay ranges, 230
  impulse response computation, 230–231
  LPC to LSP coefficients, 225–226
  LSP coefficients interpolation, 228
  maxima of correlation, 229–230
  open-loop pitch-lag estimation, 229
  optimum integer delay, 232
  past excitation, 233
  perceptual weighting, 228
  search boundaries, 231–232
  target signal computation, 231
ITU G.729 speech decoder, 235
  adaptive gain control, 239
  CS-ACELP signal flow, 236
  high-pass filtering, 238
  integer and fractional parts, 236, 237
  long-term postfilter, 237–238
  parameter decoding, 235
  parity bit, 235–236
  postprocessing, 237
  short-term postfilter, 238
  signal upscaling, 238
  steps, 235
  tilt compensation, 238–239


ITU G.729 speech encoder, 217, 218
  17-bit fixed codebook, 234
  adaptive-codebook, 221
  adaptive-codebook search, 231, 233, 234
  adaptive-codebook structure, 231
  algebraic codebook structure, 233
  autocorrelation, 230
  autocorrelation computation, 224, 225
  bit allocation, 221, 223
  closed-loop pitch search, 232
  coarse pitch estimation, 229
  conjugate roots, 226
  CS-ACELP decoder principles, 221
  CS-ACELP encoder principles, 219–220
  CS-ACELP signal flow, 222
  decoder function, 221
  delay ranges, 230
  fixed-codebook, 221
  HPF, 219
  impulse response computation, 230
  input speech signal, 219
  interpolated past excitation, 233
  LD algorithm, 225
  low-pass filtered weighted speech, 229
  LPA and quantization, 223
  LPC to LSP coefficients, 225–226
  LPF coefficients, 228
  LP residual signal, 229
  LSP coefficient advantage, 225
  LSP coefficients interpolation, 227–228
  LSP coefficients quantization, 226–227
  LSP to LPC coefficients, 228
  maxima of correlation, 229–230
  open-loop pitch-lag estimation, 229
  optimum integer delay, 232
  original applications, 219
  perceptual weighting, 228
  polynomials, 225, 226
  preprocessing, 223
  search boundaries, 231, 232
  subframes, 231, 232
  transmitted parameters indices, 234
  windowing, 223–224
ITU-T G.729/A speech coding standard, 296
  adaptive codebook search, 297
  perceptual weighting filter, 296–297
  principal routines, 296
ITU-T speech coder, 216–217
  80-bit frame, 217
  CS-ACELP coder, 217, 218
  G.729, 217
  G.729A, 217
  G.729B, 217
  postprocessing filter, 219

J

Jitter, 276
  adaptive jitter buffer (AJB), 276, 286
  buffering principle, 286
  buffer strategies, 287
  distribution, 66
  fixed jitter buffer, 286, 287, 288
  free network, 289
  network, 288
  packet, 244, 267, 276
  pulse position, 168, 169
  sampling clock, 61, 65
Jitter buffers, 285, 286
  adaptive, 286, 287, 288
  buffer modification, 288
  fixed, 286, 287–288
  fixed playout schedule, 287
  function, 285
  packets, 285, 286
  playout delay adaptions, 287
  principle, 286
  size, 285
  strategies, 286–287
  talk spurt, 287, 288

L

LARs. See Log area ratios (LARs)
LBG algorithm. See Linde–Buzo–Gray (LBG) algorithm
L-D algorithm. See Levinson–Durbin (L-D) algorithm
LD-CELP. See Low-delay CELP (LD-CELP)
LDM. See Linear deltamodulation (LDM)


L-D recursion. See Levinson–Durbin (L-D) recursion
Least mean square (LMS) algorithm, 118, 280
Least significant bit (LSB), 298
Leibnitz rule, 98
Levinson–Durbin (L-D) algorithm, 135, 150, 160
  method A, 158
  method B, 158–159
Levinson–Durbin (L-D) recursion, 150, 159–160
  application, 157–158
  BLP, 151
  declaring ways, 151–152
  FLP, 151
  inverse L-D algorithm, 160–161
  inverse Levinson’s recursions, 162
  L-D algorithm, 150–151, 158–159
  Levinson’s recursions, 162
  normal equations, 150
  validity stages, 152–157
Linde–Buzo–Gray (LBG) algorithm, 195–196
  centroid, 195
  classification, 195
  codevector updating, 196
  compute average distortion, 196
  initialization, 195
  termination test, 196
Line echo canceler, 278
  adaptive filter, 279–280
  AF, 278
  comfort noise generator, 282–283
  DTK det, 279, 281–282
  hybrids, 278
  line ECAN, 279
  NLP, 279, 282, 283
  PSTN echo generation, 278
  replica echo signal, 278
  split ECANs, 278, 279
Line spectral frequency (LSF), 178, 219
Line spectrum frequency (LSF), 135
Line spectrum pairs (LSPs), 9
Linear deltamodulation (LDM), 125
  optimum 1-bit quantizer, 126, 127
  optimum SNR, 127
  optimum step size, 126–127
  SNR for sinusoidal inputs, 128–129
  special cases, 128
Linear prediction, 135
  distortion measure, 191
  forward, 139
  linear estimation, 135
  LPA, 9
  prediction error, 135
  relation with AR modeling, 142
  types, 135
  Wiener filters, 135–137
Linear prediction distortion measure, 191
  PARCOR coefficients, 192
  Yule–Walker equations, 191
Linear predictive analysis (LPA), 8, 223
  long-term, 214
  short-term, 210
Linear predictive coding (LPC), 9, 165
  excitation sources, 165
  speech generation model, 166
Linear, time-invariant (LTI) systems, 8, 27
  convolution, 29–30
  CT signal, 29
  differential equation models, 30
  DT convolution, 34
  impulse response in, 28
  linear convolution, 34
  linearity, 27
  LTI properties testing, 28
  response to stochastic process, 44
  sifting, 28
  time-invariance, 27
  time-invariant property, 28
Linear transversal filter. See Wiener filters—FIR
Lloyd–Max I algorithm, 99
Lloyd’s algorithm, 192, 194, 195
  average distortion, 194
  boundary sets, 193
  centroid, 193
  classification, 194
  codevector updating, 194
  initialization, 194
  optimality conditions, 193
  splitting method, 195
  termination test, 194
  WMSE distortion measure, 194


LMS algorithm. See Least mean square (LMS) algorithm
Log area ratios (LARs), 192
Logarithmic companding, 80, 95
  μ-law, 81
  μ-law and A-law comparison, 90
  A-law, 82
  approximations, 80
  companding advantage, 84
  logarithmic curve, 80
  SNR, 80
Long-term prediction (LTP), 176, 179, 214
Low-delay CELP (LD-CELP), 175
Low-pass filter (LPF), 62, 100, 168
LPA. See Linear predictive analysis (LPA)
LPC. See Linear predictive coding (LPC)
LPC-10 Federal Standard, 171
  bit allocation, 174
  CELP-based coders, 174–179
  closed-loop pitch search, 180–182
  coded speech, 174
  decoder, 172
  encoder, 171, 172
  excitation source model, 177, 178
  FS-1015 speech coder, 172, 173
  LP lattice filter, 174
  PCM speech signal, 172, 173
  perceptual error weighting, 179–180
  pitch estimation, 180
  pitch period estimation, 173
  single voiced frame, 173, 174
  voicing detector, 173
LPF. See Low-pass filter (LPF)
LSB. See Least significant bit (LSB)
LSF. See Line spectral frequency (LSF); Line spectrum frequency (LSF)
LSPs. See Line spectrum pairs (LSPs)
LTI systems. See Linear, time-invariant (LTI) systems
LTP. See Long-term prediction (LTP)
LTP pitch synthesis filter, 177
LTP transfer function, 178

M

MA prediction VQ (PVQ-MA), 185
MATLAB® signal processing blockset, 259
  commands, 262, 265
  decoded packets, 265
  encoder and decoder blocks, 259, 260
  iLBC demo model, 259, 260, 262
  iLBC demo model library, 261
  input speech signal, 265
  Lossy Channel setup, 261
  original speech signal, 262
  output speech signal, 262, 263–264, 266–267
Mean opinion score (MOS), 10, 14
  bit rate vs., 20
  mapping of E-Model into MOSs, 16
Mean square error (MSE), 66, 72, 190
MELP coder. See Mixed-excitation linear prediction (MELP) coder
Mixed-excitation linear prediction (MELP) coder, 168, 169, 175
  high-pass region, 168
  LPF, 168
  pulse position jitter, 168
Modern LPC systems, 166, 167
  limitations, 167
  parameters, 166
  Speak and Spell, 167
MOS. See Mean opinion score (MOS)
Moving Picture Experts Group (MPEG), 13
MPEG. See Moving Picture Experts Group (MPEG)
MPE-LPC model. See Multipulse excitation LPC (MPE-LPC) model
MPLPC. See Multipulse-excited LPC (MPLPC)
MSE. See Mean square error (MSE)
MSVQ. See Multistage vector quantization (MSVQ)
Multichannel reference framework (RF3), 295, 300, 302, 304, 325. See also Evaluation module (EVM)
  ALGRF, 303
  clock cycle requirements, 306
  G.729/A on TI C6X DSP, 301, 302
  IALG interface, 302–303
  ITU-T G.729/A migration, 314
  memory requirements, 304–305


  running, 304
  two-channel implementation, 325–327
  visual representation, 301
Multimode speech coders, 11
Multipulse excitation LPC (MPE-LPC) model, 167, 168, 176
  algorithm, 167
  pulse sequence, 167
Multipulse-excited LPC (MPLPC), 14
Multistage vector quantization (MSVQ), 185, 198, 199, 201
  disadvantages, 199
  K-stage decoder, 200
  K-stage encoder, 200
  memory requirement, 201
  minimizing distance, 199–200
  number of bits, 200
  quantized value, 199
  rotation matrix operation, 199
  search methods, 201
  single-stage vector quantizer, 199

N

Nearest neighbor rule, 98
NLMS algorithm. See Normalized least mean square (NLMS) algorithm
NLP. See Nonlinear processor (NLP)
Nonlinear processor (NLP), 279, 282, 284
  suppression threshold level, 282
  transfer function, 283
Nonuniform quantizer, 75–79
  4-bit nonuniform quantizer, 76
  compandor, 77
  compressor characteristic, 78
  implementation methods, 76
  MSE, 77, 78
  nonuniform quantizing methods, 76
  performance, 77
  SNR, 80
No operation (NOP), 299
No-overload conditions, 72
  MSE, 72, 73
  quantizer performance, 72–73
NOP. See No operation (NOP)
Normalized least mean square (NLMS) algorithm, 280
Nyquist sampling theorem, 61
  continuous signal spectrum, 62
  convolution theorem, 62
  DT sequence, 61
  Fourier transform relationship, 62

O

Output levels. See Quantizer—reconstruction levels
Overload distortion, 72

P

Packetization, 251, 276, 277
Packet loss, 244, 254, 286, 289–290, 293
  rate, 259, 262–264, 265–267
  receiver-based PLC algorithms, 291
  transmitter-based FEC techniques, 290
Packet loss concealment (PLC), 243, 339
PAQM. See Perceptual audio quality measurement (PAQM)
PARCOR. See Partial correlation coefficients (PARCOR)
Partial correlation coefficients (PARCOR), 9, 135, 157, 192
PCM. See Pulse-coded modulation (PCM)
PDF. See Probability density function (PDF)
PEP. See Prediction error power (PEP)
Perceived speech quality, 276
  acoustic echo, 277
  delay effect, 276, 277
  electrical line echo, 277
  ITU guidelines, 277
  low-delay echo, 277
  pauses, 286
Perceptual audio quality measurement (PAQM), 267
Perceptual error weighting, 179
  translating effect of [W(z)], 180
  weighted STP [H(z)], 179
  weighting filter [W(z)], 179
Perceptual evaluation of speech quality (PESQ), 15, 267, 269
  applications, 271
  cognitive model, 268, 269, 270


  conversion to psychophysical representation, 269, 270
  signal delay calculation, 268
  sub-blocks, 270, 271
Perceptual speech quality measures (PSQM), 244
Periodograms, 48, 49
  autocorrelation function, 49
  windowed signal, 48
PESQ. See Perceptual evaluation of speech quality (PESQ)
Pitch, 3
  detection methods, 4
  languages, 7
  ranges, 4
Pitch estimation, 166, 180
  coarse, 180, 229
  methods, 55
  open-loop pitch analysis, 180
Pitch period, 12, 54, 165, 167, 172, 174, 180, 214
  AMDF method, 55, 56
  autocorrelation methods, 55–56
  estimation methods, 54
  frequency-domain methods, 55
  index, 172, 211, 212
  mixed time- and frequency-domain methods, 55
  time-domain methods, 55
Pitch-period-synchronous sequences (PSSQ), 256, 257
Plain old telephone service (POTS), 9
PLC. See Packet loss concealment (PLC)
PLC techniques, iLBC’s, 254, 272. See also Packet loss
  current frame not received, 254–255
  frames received, 254
  previous frame lost, 255
POTS. See Plain old telephone service (POTS)
Power spectral density (PSD), 8, 43, 47, 49
  autocorrelation sequence, 43, 44
  function, 43, 44, 47
  properties, 43
Prediction error power (PEP), 135, 141, 153
Prediction gain, 53, 114, 115, 117
  closed-loop, 115, 116, 117
  LDM, 127
Predictive coding, 114
  adaptive, 14
  adaptive prediction, 117–118
  closed-loop predictor, 115, 117
  difference signal, 113
  difference signal variance, 116
  first-order DPCM coder, 115–116
  generalization, 113, 115
  linear, 9, 10, 165
  prediction coefficients, 114
Predictive vector quantization (PVQ), 185, 202. See also Quantizer—vector
Predictors relationship, 147, 148–150
  complex conjugates, 148
  Wiener–Hopf equation, 147–148
Probability density function (PDF), 41, 72, 193
Profiling ACELP_Code_A, 330, 331
  ACELP_Code_A function, 330
  session statistics, 331
Profiling encoder, 330, 331
  ACELP_Code_A function, 330
  Coder_ld8a function, 330
  encoder profiling statistics, 331
Profiling vocoder, 329–330
  CCS profiling session, 329
  thrAudioprocRun function, 330
PSD. See Power spectral density (PSD)
PSQM. See Perceptual speech quality measures (PSQM)
PSQM/PSQM+ to PESQ
  evolution, 267, 268
  PSQM+ improvements, 268
PSSQ. See Pitch-period-synchronous sequences (PSSQ)
PSTN. See Public-switched telephone network (PSTN)
Public-switched telephone network (PSTN), 9, 275
  backbone, 338
  network requirement, 275
  TDM circuits, 275


Pulse-coded modulation (PCM), 7, 13, 223
PVQ. See Predictive vector quantization (PVQ)
PVQ-MA. See MA prediction VQ (PVQ-MA)

Q

Quantization, 69
  adaptive, 99–101
  Lloyd–Max I algorithm, 99
  noise, 69
  noise model for performance analysis, 70
  nonuniform quantizing methods, 76
  optimum, 95–97
  output signal mean square value, 71
Quantizer, 70, 71
  adaption speed control, 121–122
  adaptive, 100
  algorithms for VQ, 198–202
  decision level, 71
  deployment, 70
  error transfer functions, 72
  in linear deltamodulation
  Lloyd–Max, 97
  Lloyd–Max I algorithm, 99
  multistage vector, 199
  nonuniform, 75
  optimum 1-bit, 126, 127
  performance, 70, 72
  reconstruction levels, 71
  scalar, 186
  scale factor adaption, 121
  SNR, 73–74
  transfer function, 71
  two-stage cascaded vector, 202
  uniform, 73–75
  vector, 186–188, 202

R

Random signals
  autocorrelation function, 67
  reconstruction error mean square value, 67
  sampling, 67
Real-time transport protocol (RTP), 243
Receiver-based PLC algorithms, 291
  gradual fading, 291
  TSM methods, 292–293
  waveform substitution, 291–292
Reduced instruction set computer (RISC), 297
Regular-pulse excitation LPC (RPE-LPC) model, 170
  modified, 171
  pulse position, 170
Regular-pulse-excited LPC (RPLPC), 14
RELP. See Residual-excited linear prediction (RELP)
Residual-excited linear prediction (RELP), 168, 170
  FFT-based vocoder, 170
  LP coefficients, 168, 170
  receiver, 169
  residual encoding, 168
  residual errors, 168
  speech quality, 170
  transmitter, 169
Residual vector quantizers. See Multistage vector quantization (MSVQ)
RF3. See Multichannel reference framework (RF3)
RF3 clock cycle requirements, 306
RF3 memory requirements, 304
  accessing external memory, 305
  EVM, 304
  heap memory, 305
  IDRAM, 304, 305
  IPRAM, 304
  stack size settings, 305
RHS. See Right-hand side (RHS)
Right-hand side (RHS), 73
RISC. See Reduced instruction set computer (RISC)
RPE-LPC. See Regular-pulse excitation LPC (RPE-LPC)
RPE-LTP coder. See also Regular-pulse excitation LPC (RPE-LPC) model—modified
RPLPC. See Regular-pulse-excited LPC (RPLPC)
RTP. See Real-time transport protocol (RTP)


S

Sampling clock jitter
  amplitude error, 65–66
  effect, 65–66
  mean square error (MSE), 66
  SNR, 66
Sampling theory
  interpolation filters, 62
  Nyquist sampling theorem, 61
  random signals, 67
  sampling clock jitter, 65
Scalar quantization, 185, 186
  decoder output, 186
  waveform quantization, 186
Segmented companding laws, 85
  μ-law and A-law comparison, 90
  approximation, 86
  doubling rule, 89
  eight-segment approximation, 87
  parameter A variation, 90
  parameter μ variation, 88
  segmented approximation to A-law, 89
  segment end points, 87
Short-time Fourier transform (STFT), 8, 47–48
Side tone. See Perceived speech quality—low-delay echo
SIFT. See Simple inverse filtering tracking (SIFT)
Signaling system number 7 (SS7), 275
Signal-to-noise ratio (SNR), 7, 117, 127, 128
  large signal levels, 83
  nonuniform quantizer, 75, 80
  sine wave of frequency, 66
  uniform quantizer, 73–74, 75
Signal to quantization noise ratio (SQNR), 66
Simple inverse filtering tracking (SIFT), 4
SNR. See Signal-to-noise ratio (SNR)
Speech analysis, 1, 8–9, 267
Speech coder(s), 10, 13
  bandwidth attribute comparisons, 23
  CELP, 176, 210
  communication networks, 9, 16
  FS-1015, 172–174
  goal, 9
  hybrid speech coders, 11
  ITU G.729, 217
  ITU G.729/A, 332
  ITU G.729/B, 217
  LPC-based, 167
  multimode speech coders, 11
  parametric coders, 10, 12, 14
  performance comparisons, 21, 22
  waveform coders, 10, 11, 13–14
Speech coders performance, 17
  communication delay, 18
  computational complexity, 18
  power consumption, 18
  robustness to noise, 18
  robustness to packet losses, 18
  speech quality, 14, 17
Speech coding, 1, 9, 335, 337
  algorithms, 19–20
  classification, 10, 11
  digital cellular codecs, 337–338
  efficiency, 17
  efficiency factors, 337
  history, 9
  intelligibility, 337
  linear predictive model, 12
  methods, 338
  mobile ad hoc networks, 338
  multiple descriptions coding, 338
  narrowband, 11
  packet-switched voice, 339
  PCM, 13
  PSTN backbone, 338
  quality, 337
  speech coders, 10, 13
  standards, 12, 18
  tandeming, 338
  transcoding, 338–339
Speech modeling, 7
  unvoiced sounds, 7
  voiced sounds, 7
Speech quality, 13, 14, 17, 176
  DAM, 15
  DRT, 15
  EMBSD, 15
  E-Model, 16
  measuring, 14–16
  MOS, 14
  perceptual evaluation, 15


  PESQ, 15
  RELP coders, 170
Speech signals, 1, 8, 335
  acoustic model, 1
  category, 2
  characteristics, 3
  classes, 3
  female, 4, 5, 6
  frequency-domain methods, 53
  generation, 1
  human vocal system, 2
  male, 3, 5, 6
  narrowband, 7
  PSD, 50
  spectral envelope determination, 49
  spectrogram, 50
  time-domain methods, 51
  voiced/unvoiced classification, 51
  voiced/unvoiced decision making, 54
Speech synthesis, 1
  model, 8
Split VQ, 201
SQNR. See Signal to quantization noise ratio (SQNR)
SS7. See Signaling system number 7 (SS7)
STFT. See Short-time Fourier transform (STFT)
Stochastic codebooks. See Fixed codebooks
Stochastic signal processing, 41
  autocorrelation function, 42
  autocovariance function, 42
  cross-correlation function, 41
  cross-covariance function, 42
  power spectral density, 43
  stationary DT, 41, 42
STP transfer function, 177

T

TDM. See Time-division multiplex (TDM)
Telecommunications Industry Association (TIA), 12, 176
Texas Instruments (TI), 295
TI. See Texas Instruments (TI)
TIA. See Telecommunications Industry Association (TIA)
Time-division multiplex (TDM), 275
Time-domain methods, 51, 55
  frame energy, 52
  low- to full-band energy ratio, 52
  peakiness of speech, 53
  periodic similarity, 51
  prediction gain, 53
  pre-emphasized energy ratio, 52
  spectrum tilt, 53
  zero crossing, 52
TI TMS320C6X DSP processors, 297
  C67x CPU functional units, 297–298
  data-addressing units, 298
  enhancements, 299
  execute packets, 299
  fetch packets, 297, 299
  load/store architecture, 298
  LSB, 298
  TI C6000 DSP chip, 298
  TI C6000 DSP CPU, 299
  VelociTI VLIW architecture, 297, 299
Transmitter-based FEC techniques, 290
  media-independent FEC scheme, 290
  media-specific FEC scheme, 290, 291
Two-channel implementation, 325
  adding channels, 326
  changes in DSP/BIOS, 326–327
  changes in source code, 327
  N channel signal flow, 326
  N channel system, 325

V

VAD. See Voice activity detector (VAD)
Vector quantization (VQ), 176, 185, 186, 189–190
  adaptive VQ, 202
  algorithms, 185
  average number of bits, 188
  binary search, 197–198
  codebook, 187
  codebook design, 192
  codeword, 187–188
  conjugate VQ, 201
  distortion measure, 189, 190
  encoder–decoder system, 186
  full search, 196
  input vector, 187
  LBG algorithm, 195–196


  Lloyd’s algorithm, 192, 194, 195
  mapping, 187
  maximum number of bits, 188
  MSVQ, 198–201
  output vector, 187
  overall distortion measure, 188–189
  PVQ, 185, 202
  scalar quantization, 185
  split VQ, 201
  suboptimal search, 198
  training vectors, 188
  transmission rate, 188
Vector sum excited linear prediction (VSELP), 176
  coder, 176
Very-long instruction words (VLIW), 297
Video coding methods, 338
VLIW. See Very-long instruction words (VLIW)
Vocoder. See Speech coder(s)—parametric coders
Vocoder development, Microsoft Visual Studio, 308–314
  decoder project, 312–314
  encoder and decoder separation, 308
  encoder project, 311, 312
  encoder state definition, 309
  G729A_CommonLib, 308, 309
  G729A_DecodeLib, 310–311, 312
  G729A_EncodeLib, 308, 310
  g729A_scu_ialg.c, 310
  IALG interface, 308, 310
  ITU reference code, 308
  library projects, 308, 311
  measuring performance timing, 313
  test vector comparison, 313–314
Voice activity detector (VAD), 217, 283, 288
Voice-over Internet Protocol (VoIP), 10, 276, 293
  impairments, 276
  jitter, 276
  line ECAN, 279
  one-way delay, 277–278
  packetization, 251, 276, 277
  signal processing algorithms, 276
  voice encoding standards, 276
VoIP. See Voice-over Internet Protocol (VoIP)
VQ. See Vector quantization (VQ)
VQ applications in standards, 203
VSELP. See Vector sum excited linear prediction (VSELP)

W

Waveform coding, 69
  PCM, 13
  speech signals, 11
Weighted Mean Square Error (WMSE), 191
Wide-sense stationary (WSS), 43
Wiener filters, 135–137, 142, 147. See Levinson–Durbin (L-D) recursion—normal equation
  error signal, 136
  FIR, 136
  linear filter, 136
  (M + 1) × 1 coefficient vector, 147
  (M + 1) simultaneous equations, 143, 147
  N×1 cross-correlation vector, 137
  N × N autocorrelation matrix, 137
  performance function, 136
  single matrix relation, 142
  statistical criterion, 136
Wiener–Hopf equations, 142–143
Window function. See also Windowing
  Barnwell, 47
  Bartlett, 45
  Blackman, 46
  Chen, 47
  Hamming, 45
  Hanning, 45, 46
  Kaiser, 46
  Rectangular, 45
Windowing, 45, 223–224
Wireless local area networks (WLANs), 16
Wireless metropolitan area networks (WMANs), 16
WLANs. See Wireless local area networks (WLANs)


WMANs. See Wireless metropolitan area networks (WMANs)
WMSE. See Weighted Mean Square Error (WMSE)
World Wide Web (WWW), 16
WSS. See Wide-sense stationary (WSS)
WWW. See World Wide Web (WWW)

X

XDAIS. See eXpressDSP algorithm standard (XDAIS)

XOR. See Exclusive-or (XOR)

Y

Yule–Walker equations, 47, 191