EFFICIENT VECTOR QUANTIZATION OF LPC
PARAMETERS FOR HARMONIC SPEECH CODING
by
Bhaskar Bhattacharya
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in the School
of
Engineering Science
© Bhaskar Bhattacharya 1996
SIMON FRASER UNIVERSITY
October, 1996
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL

Name: Bhaskar Bhattacharya

Degree: Doctor of Philosophy

Title of thesis: Efficient Vector Quantization of LPC Parameters for Harmonic Speech Coding

Examining Committee: Dr. John Jones, Chairman

Dr. Vladimir Cuperman, Senior Supervisor
Professor, Engineering Science, SFU

Dr. Paul Ho, Supervisor
Associate Professor, Engineering Science, SFU

Dr. Jacques Vaisey, Supervisor
Assistant Professor, Engineering Science, SFU

Dr. Jim Cavers, Internal Examiner
Professor, Engineering Science, SFU

Dr. Sanjit K. Mitra, External Examiner
Professor, Electrical and Computer Engineering, University of California, Santa Barbara

Date Approved: October 11, 1996
PARTIAL COPYRIGHT LICENSE
I hereby grant to Simon Fraser University the right to lend my thesis, project or extended essay (the title of which is shown below) to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. I further agree that permission for multiple copying of this work for scholarly purposes may be granted by me or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without my written permission.
Title of Thesis/Project/Extended Essay
"Efficient Vector Quantization of LPC Parameters for Harmonic Speech Coding"
Author:
(signature)
(name)
October 11, 1996 (date)
Abstract
The present thesis deals with the problem of efficient (in bit rate and computational
complexity) quantization of Linear Prediction Coding (LPC) parameters for low bit
rate speech coding. The thesis introduces a new LPC quantization technique based on
Multi-Stage Vector Quantization (MSVQ) combined with a multi-candidate M-L
search. The resulting procedure is assessed by evaluating the quantization spectral
distortion on a speech data-base and by evaluating the subjective speech quality of a
low-rate speech coder which employs the MSVQ LPC quantization.
The general structure of MSVQ is described along with a geometrical interpreta-
tion to provide insight into the structure of the reproduction alphabet in MSVQ. In
particular, it is shown that MSVQ codevectors provide a tiling of the sample space
with repetitive patterns. Two tree-search techniques are suggested and one of them,
the M-L search technique, is studied in more detail.
The experimental results obtained with MSVQ indicate that transparent quan-
tization of LSFs (Line Spectral Frequencies - an efficient LPC representation) can
be achieved with just 22 bits/vector with computational complexity comparable to
the Split VQ at 24 bits/vector. Alternatively, transparent quantization of LSFs can
be done using 24 bits/vector (as is done using Split VQ) at a much lower computa-
tional complexity. Several results relating performance and complexity trade-offs are
reported showing that MSVQ is a very flexible approach which provides a wide range
of performance-complexity trade-offs and good robustness.
The performance of MSVQ codes has been studied under channel error condi-
tions, with codebook index ordering by pseudo-Gray coding. It is shown that while VQ
based systems have lower average spectral distortion and a lower percentage of 2-4
dB outliers even with transmission errors, scalar quantization may lead to a lower
percentage of 4 dB outliers particularly at high error rates.
The performance of the MSVQ codes has also been studied for the effects of language
and input spectral shape. It has been shown that MSVQ codes become more robust as
the number of stages is increased.
Finally, one of the MSVQ codes developed here has been used to implement a
1800 bps speech coder using a harmonic coding of excitation and a very coarse 0-bit
quantization of harmonic spectral shape. The speech quality of the 1800 bps coder
was better than that of the 2400 bps LPC-10e coder.
Acknowledgements
I would like to thank Prof. Vladimir Cuperman for all his guidance and patience
throughout this work. His suggestions were very helpful during the course of this research.
I also thank Dr. Jacques Vaisey and Dr. Paul Ho for being on my advisory committee
and making constructive criticism of the work.
I wish to express my heartfelt gratitude to my wife Roma for all her encourage-
ment and tolerance, and to all my friends, particularly Peter Lupini, Aamir Husain,
and Yingbo Jiang, for the exciting discussions that made research a lively occupation.
I also obtained a lot of help in keeping my spirits up from my friends Hong Shi and
Jacqueline Duffy, my sincere thanks to them.
Contents
Abstract ................................................................... iii
Acknowledgements ........................................................ v
List of Tables .............................................................. x
List of Figures ............................................................. xi
1 Introduction ........................................................... 1
1.1 Speech Coding Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Waveform Coders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Parametric Coders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Speech Coding Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Motivation and Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 A Brief Review of Speech Coding Literature ........................ 13
2.1 Source Coding and Rate Distortion Theory . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Analysis-by-synthesis Speech Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Transform Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Sinusoidal Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Relative Merits and Demerits of Different Coding Strategies . . . . . . . . . 26
3 Quantization of LPC parameters ..................................... 28
3.1 Choosing an Appropriate Spectral Representation . . . . . . . . . . . . . . . . . . 29
3.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Pre-emphasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Bandwidth Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 High Frequency Compensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Vector Quantization of LPC Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Stochastic VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Techniques Exploiting Interframe Correlations . . . . . . . . . . . . . . 40
3.4 Constrained (suboptimal) VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Tree Structured VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Classified VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.3 Product Code VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.4 Basis Vector VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.5 Multi-Stage VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.6 Partitioned VQ (Split VQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Multi-Stage VQ of LPC Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Suboptimality of Sequential Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.1 Optimality conditions for sequential search . . . . . . . . . . . . . . . . . 62
4.2 Search Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.1 Search Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.2 Detailed Analysis of The Search Complexity . . . . . . . . . . . . . . . . 71
4.3 Codebook Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Centroid Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Outlier Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Choice of Parameter Representation and Distance Measure . . . . . . . . . . 75
4.5 Performance and Complexity Trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.6 Robustness Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.1 Effect of Language and Input Spectral Shape . . . . . . . . . . . . . . . 80
4.6.2 Performance in the presence of channel errors . . . . . . . . . . . . . . . 82
4.7 Improved Codebook Designs for Multi-Stage VQ . . . . . . . . . . . . . . . . . . . 85
4.7.1 Iterative Sequential Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.7.2 Simultaneous Joint Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.8 Recent Developments in MSVQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 A Low Rate Spectral Excitation Coder .............................. 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Architecture of a Very-Low Rate Spectral Excitation Coder . . . . . . . . . 91
5.2.1 Treatment of Unvoiced Segments . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Computation of the Unquantized Residual . . . . . . . . . . . . . . . . . . . 93
5.4 Estimation and Quantization of Harmonic Parameters . . . . . . . . . . . . 94
5.4.1 Pitch Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Modelling of Harmonic Phases . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.3 Estimation and Quantization of Harmonic Magnitudes . . . . . . . . . 105
5.5 An 1800 bps Spectral Excitation Coder . . . . . . . . . . . . . . . . . . . . . 111
5.5.1 Evaluation of Coder Performance . . . . . . . . . . . . . . . . . . . . . 113
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6 Conclusion and Future Directions .................................... 115
A Linear Prediction ...................................................... 117
A.1 Conceptual Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.2 Equivalent Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.2.1 Computation of Line Spectral Frequencies . . . . . . . . . . . . . . . . 125
A.3 Maximum Entropy Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
B Quantization ........................................................... 131
B.1 Scalar Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
B.1.1 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.2 Robust Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
B.1.3 Optimum Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
B.2 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
B.2.1 Vector Quantizer Performance . . . . . . . . . . . . . . . . . . . . . . . 144
B.2.2 Optimum VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
B.2.3 VQ Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
C Pitch Computation Algorithm ........................................ 152
D List of Citations ....................................................... 156
References ................................................................ 159
List of Tables
1.1 Digital Speech Coding Standards . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Some important ITU-T recommendations . . . . . . . . . . . . . . . . . . . 10
3.1 Some early scalar quantization results . . . . . . . . . . . . . . . . . . . . . 31
3.2 Channel error performance of Basis Vector VQ . . . . . . . . . . . . . . . . 50
4.1 MSVQ Configurations and Rates Producing an Average Spectral Distortion of 1 dB . . . . 80
4.2 Spectral Distortion Performance over Different Languages and Input Spectral Shapes . . . . 81
4.3 Percentage of Outliers (2-4 dB) for Different Languages and Input Spectral Shapes . . . . 81
4.4 Average Spectral Distortion for Different Error Rates and Codes . . . . . . 84
4.5 Percentages of Outliers for Different Error Rates and Codes . . . . . . . . . 84
5.1 Bit Allocation for the 1800 bps coder . . . . . . . . . . . . . . . . . . . . . . 113
5.2 MOS results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
C.1 Values of empirical constants used in 1800 bps coder . . . . . . . . . . . . . 155
List of Figures
1.1 A classification of speech coders . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A generalized predictive coder . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The Source-Filter Parametric Coder . . . . . . . . . . . . . . . . . . . . . . 4
1.4 LPC-10 Speech Synthesis Model . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 A schematic diagram of the CELP coder . . . . . . . . . . . . . . . . . . . . 7
1.6 Historical bit rates of toll quality coders . . . . . . . . . . . . . . . . . . . . 8
2.1 The primary parameters of R-D theory . . . . . . . . . . . . . . . . . . . . 15
2.2 A Generalized Analysis-by-Synthesis System . . . . . . . . . . . . . . . . . 18
2.3 Computational structure of the CELP coder . . . . . . . . . . . . . . . . . 19
2.4 Schematic diagram of a Transform Coder . . . . . . . . . . . . . . . . . . . 22
3.1 Spectral envelope of speech without (solid line) and with (dash line) high frequency compensation . . . . 35
3.2 SIVP coding system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 A tree-searched VQ for m = 3 . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 A tree structured VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Classified VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 The Split VQ Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Structure of a two-stage two dimensional VQ . . . . . . . . . . . . . . . . . 58
4.2 A sequentially searched multi-stage VQ . . . . . . . . . . . . . . . . . . . . 59
4.3 Voronoi regions for a two-stage MSVQ . . . . . . . . . . . . . . . . . . . . . 61
4.4 Growing Tree search of a three stage VQ . . . . . . . . . . . . . . . . . . . 65
4.5 M-L Tree search of a three stage VQ . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Failure of multi-candidate search in a 2-stage VQ . . . . . . . . . . . . . . . 67
4.7 Failure of M-L search in a 3-stage VQ . . . . . . . . . . . . . . . . . . . . . 68
4.8 Performance of LSF-6+6 MSVQ with M-L search . . . . . . . . . . . . . . 71
4.9 Performance comparison of LAR and LSF codebooks with M-L search . . . 77
4.10 Spectral distortion of M-L Tree searched MSVQ at 24 bits/vector . . . . . 78
4.11 M-L search performance versus search complexity for different rates . . . . 79
4.12 Performance over different languages and input spectral shapes . . . . . . . 82
5.1 Magnitude spectrum of a voiced speech segment and corresponding LPC residual . . . . 90
5.2 A conceptual schematic of a spectral excitation coder . . . . . . . . . . . . 92
5.3 Analysis of SEC parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Performance of the geometric pitch detector . . . . . . . . . . . . . . . . . . 100
5.5 Pitch pulses marked by the pitch detector . . . . . . . . . . . . . . . . . . . 101
5.6 Difference between measured and predicted phase changes for a voiced frame . . . . 103
5.7 Difference between measured and predicted phase changes for an unvoiced frame . . . . 104
5.8 Frequency sampling points for a P-point DFT . . . . . . . . . . . . . . . . 106
5.9 Log magnitude spectrum templates for voiced and unvoiced speech . . . . . 109
5.10 A Low bit rate Spectral Excitation Coder . . . . . . . . . . . . . . . . . . . 112
A.1 Linear Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.2 Stepped cylinder model of the vocal tract . . . . . . . . . . . . . . . . . . . 123
A.3 Transformation of Predictor coefficients to LSFs . . . . . . . . . . . . . . . 124
A.4 Plots showing relationships between LSFs and other parameters . . . . . . 128
B.1 A typical mid-tread scalar quantizer . . . . . . . . . . . . . . . . . . . . . . 132
B.2 Additive noise model of quantization . . . . . . . . . . . . . . . . . . . . . . 133
B.3 Compander model of nonuniform quantization . . . . . . . . . . . . . . . . 136
B.4 A uniform joint pdf over a rectangular region (shown shaded) along with the marginal pdf's . . . . 143
B.5 A vector quantizer satisfying the necessary conditions . . . . . . . . . . . . 149
Chapter 1
Introduction
Recent advances in Multimedia Communication and the real possibility of an impend-
ing integrated services network have generated a lot of interest in digital coding of
speech. With increasing demand on the bandwidth, more and more emphasis is being
placed on low bit rate speech coders. The present thesis addresses an important prob-
lem in low bit rate speech coding - that of efficient quantization of LPC parameters.
A low bit rate coder based on harmonic excitation is also presented that produces
good speech quality at rates below 2 kb/s.
1.1 Speech Coding Techniques
Detailed reviews of different speech coding techniques can be found in [36, 23, 41].
A brief overview is presented below. Speech coding algorithms can be categorized in
different ways depending on the criterion used. The most common classification of
coding systems divides them into two main categories: waveform coders and parametric
coders. The waveform coders, as the name implies, try to preserve the waveform being
coded and pay no attention to the fact that the signal being coded is speech. The
parametric coders, on the other hand, depend upon a parsimonious description of
speech using a priori knowledge about how the signal was generated at the source. The
idea is that certain physical constraints of the signal generation can be quantified, and
turned to advantage in efficiently describing the signal. This implies that the signal
must be fitted into a specific mold and parameterized accordingly. These coding
techniques which exploit constraints of signal generation are also called source coders
or vocoders (VOice CODERS).
Some coders use a mixture (or hybrid) of these two approaches. They use a synthe-
sis filter that models the vocal tract but attempt to quantize the excitation sequence
through a waveform matching procedure. We have put these coders under the cat-
egory of parametric coders in our classification. A broad classification of different
speech coders is shown in Fig. 1.1.
[Figure: a tree classifying Speech Coding Systems into Waveform Coders (Time Domain: DM, DPCM, ADPCM, APC, VPC; Frequency Domain: SBC, ATC) and Parametric Coders (Direct Speech Encoding: STC, MBE; Excitation Encoding: Open Loop: LPC-10, RELP, SEC; Closed Loop: MP-LPC, CELP, VSELP; Mixed: PWI, TFI).]
Figure 1.1: A classification of speech coders
1.1.1 Waveform Coders
The waveform coders operate either in the time domain or in the frequency domain
and can be classified accordingly.
1.1.1.1 Time Domain Waveform Coders
The time domain waveform coders are all predictive coders, in that they code infor-
mation that cannot be predicted from already reconstructed speech signals. They
Figure 1.2: A generalized predictive coder
evolved from DM (Delta Modulation) [58] which uses a first order fixed predictor
and a one-bit adaptive quantizer, to VPC (Vector Predictive Coding) [24] which uses
a vector predictor and a vector quantizer for the error sequence. APC (Adaptive
Predictive Coding) [9, 10, 11] is a technique that uses a scalar, higher (> 1) order,
predictor to predict both short-term and long-term structures of speech signal and
optionally uses a filtered quantization error feedback to control noise spectrum. A
schematic diagram of a generalized APC coder is shown in Fig. 1.2.
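The predictive principle shared by these coders, quantize only what the predictor cannot anticipate, can be sketched with the simplest member of the family, delta modulation. The following is a toy illustration, not code from any of the cited coders; the step-size adaptation rule and its constants are our own assumptions:

```python
def dm_encode(x, step=0.1, alpha=1.5):
    """One-bit adaptive delta modulation of a sample sequence x.

    Uses a first order fixed predictor (the previous reconstructed sample)
    and a one-bit quantizer on the prediction error; the step size grows on
    runs of equal bits and shrinks on alternations (constants illustrative).
    """
    bits, recon = [], []
    pred = 0.0
    for s in x:
        b = 1 if s >= pred else 0        # sign of the prediction error
        pred += step if b else -step     # reconstruction = prediction + quantized error
        bits.append(b)
        recon.append(pred)
        if len(bits) >= 2:               # adapt the quantizer step size
            step = step * alpha if bits[-1] == bits[-2] else step / alpha
    return bits, recon
```

A decoder needs only the bit stream and the same adaptation rule, which is what makes the scheme a one-bit-per-sample waveform coder.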
1.1.1.2 Frequency Domain Waveform Coders
Sub Band Coding (SBC) [21] divides the speech spectrum into four or five sub-bands
using a bank of bandpass filters. Each sub-band is translated to base-band by a
single-sideband modulation process, resampled at its Nyquist rate, and encoded by
adaptive quantization or ADPCM. In the receiver, the sub-bands are decoded, mod-
ulated back to their original position in the frequency domain, and summed to give a
reconstruction of the original signal. The spectral shape of the quantization noise is
controlled by bit-allocation.
In Adaptive Transform Coding (ATC) [113], the speech signal is subdivided into
blocks and a transform is applied to each block. The transform coefficients are adap-
tively quantized and transmitted to the receiver where they are decoded and inverse
transformed to obtain the waveform.
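The analyze-quantize-synthesize round trip of a transform coder can be sketched as follows. This is an illustrative example using an orthonormal DCT-II/DCT-III pair and a single uniform quantizer; a real ATC system adapts the bit allocation per coefficient, which is not shown here:

```python
import math

def dct(x):
    """Orthonormal DCT-II of one block (a typical transform for ATC)."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(xn * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n, xn in enumerate(x))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

def idct(X):
    """Inverse transform (DCT-III) reconstructing the block."""
    N = len(X)
    return [sum((math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) * Xk
                * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k, Xk in enumerate(X))
            for n in range(N)]

def quantize(X, step=0.5):
    """Uniform quantization of the transform coefficients."""
    return [step * round(Xk / step) for Xk in X]
```

Because the transform is orthonormal, the quantization error in the reconstructed waveform is bounded by the coefficient quantization error.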
1.1.2 Parametric Coders
Right from its introduction [57, 8, 77], linear prediction has been very successful in
coding speech. A very popular model used for speech production is the source-filter
model. The sound generating mechanism (the source) is assumed to be linearly sep-
arable from the intelligence-modulating vocal tract (the filter) (Fig. 1.3). The speech
signal, s(n), is analyzed to compute a set of excitation control parameters, J(n), and a
set of synthesis filter control parameters, a(n). The output of the excitation generator,
e(n), when passed through the synthesis filter produces reconstructed speech, ŝ(n).
[Figure: an excitation generator driving a synthesis filter]
Figure 1.3: The Source-Filter Parametric Coder
Despite the success of the source-filter model, some coders do not use it, and
attempt to model the speech signal as a whole. Thus, the class of parametric coders
can be further subdivided into those that attempt to model the speech directly, and
those that attempt to model the excitation sequence and the synthesis filter separately.
1.1.2.1 Direct Speech Encoding
A powerful speech modelling technique uses a sum of sinusoids model to represent
speech signals. This is represented by
    s(n) = Σ_m A_m(n) cos(θ_m(n))        (1.1)

where m is the harmonic number and the summation is taken over the number of
harmonics, which varies with time.
This was first introduced by Hedelin [55] and later developed by Almeida and
Tribolet [3], McAulay and Quatieri [82, 83], and Marques, Almeida and Tribolet [80].
This technique has been called Harmonic Coding and Sinusoidal Transform Coding
(STC) by different authors.
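The synthesis side of Eq. (1.1) can be rendered for a single frame as the following sketch. It assumes constant amplitudes and linearly evolving phases within the frame; actual sinusoidal coders interpolate A_m(n) and θ_m(n) smoothly across frame boundaries, and the function name and arguments here are our own:

```python
import math

def synthesize_frame(amps, freqs, phases, n_samples):
    """Sum-of-sinusoids synthesis, s(n) = sum_m A_m cos(theta_m(n)).

    Each harmonic m has constant amplitude A_m and a linear phase track
    theta_m(n) = 2*pi*f_m*n + phi_m (f_m is a normalized frequency).
    """
    return [sum(A * math.cos(2.0 * math.pi * f * n + p)
                for A, f, p in zip(amps, freqs, phases))
            for n in range(n_samples)]
```

The number of entries in `amps` (the number of harmonics) typically changes from frame to frame with the pitch, as the text notes.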
A slightly different form of sinusoidal speech modelling was done by Griffin and
Lim [54]. A closed loop estimation was done for pitch and harmonic magnitudes. The
speech spectrum was divided into voiced and unvoiced bands and voiced and unvoiced
components of a speech frame were synthesized differently. The voiced component
was synthesized in the time domain using Eq. (1.1) and the unvoiced component was
computed from a synthetic DFT using the overlap-add method [53]. They were added
together to form the synthetic speech signal. This technique, although performed
directly on the speech signal is called Multi Band Excitation (MBE). One version of
MBE, called improved MBE (IMBE) [16], was subsequently adopted by INMARSAT
as a standard for satellite voice communication. Another version [85] is currently
under consideration for the TIA half-rate TDMA digital cellular standard. Typical
bit rates for sinusoidal coders range from 4.1 kb/s to 9.6 kb/s.
1.1.2.2 Excitation Encoding
The oldest parametric coder is the Channel Vocoder by Dudley [31]. It exploits the
insensitivity of the aural mechanism to phase, and only attempts to reproduce the
short time power spectrum of the speech waveform. The spectral envelope of the
speech is measured with a bank of filters and ascribed wholly to the vocal tract filter,
while the excitation is estimated to be either a quasi-periodic pulse train, or noise.
In recent coders, that use excitation modelling, the synthesis filter is computed
from a linear prediction analysis of segments of speech and uses what are called LPC
parameters. A variety of techniques are used to represent the excitation signal. So,
the problem in this class of coders is how to quantize the LPC parameters and the
excitation most efficiently. In some coders the excitation is chosen in a closed loop
fashion so as to minimize a perceptually significant distortion between the original and
synthetic speech, and some others use an open loop approach without any reference
to the synthetic speech. There are also some mixed approaches where a classifier is
used and different classes are dealt with in an open or closed loop manner (Fig. 1.1).
[Figure: an excitation generator driving a synthesis filter to produce speech]
Figure 1.4: LPC-10 Speech Synthesis Model
Open loop techniques
The oldest speech coding standard, LPC-10 (U.S. Government Federal Standard 1015)
[103, 18], uses a 10th order synthesis filter, and pulses and random sequences as the
excitation (Fig. 1.4). The LPC parameters are represented as reflection coefficients
and are scalar quantized. Regular pulses at pitch intervals are used as excitation for
voiced portions and a white random sequence is used for unvoiced portions of the
speech being coded. The energy distribution is maintained by a gain parameter.
A modification of the LPC-10 called RELP (Residual Excited Linear Prediction)
[106] uses a quantized low-pass filtered version of the residual as the excitation and
avoids the problem of classification and computation of pitch.
The Spectral Excitation Coder (SEC) [25] uses a sum-of-sinusoids model to syn-
thesize the excitation signal which is passed through an LPC based synthesis filter
to produce speech. Since the residual is more spectrally flat than speech itself, it
offers advantages in quantizing the harmonic magnitudes over conventional sinusoidal
coders.
Closed loop techniques
The hybrid coders CELP (Code Excited Linear Prediction) [12] and VSELP (Vector
Sum Excited Linear Prediction) [44] employ the same source-filter model (Fig. 1.3)
but the excitation is selected from a fixed and an adaptive codebook in a closed loop
fashion known as analysis by synthesis. A schematic structure of the CELP coder is
shown in Fig 1.5.
VSELP models the excitation sequence as a linear combination of a fixed set of
[Figure: CELP schematic showing the input speech and adaptive codebook paths]
Figure 1.5: A schematic diagram of the CELP coder
M basis vectors:

    u_i(n) = Σ_{m=1}^{M} θ_im v_m(n)

where 0 ≤ i ≤ 2^M − 1 and 0 ≤ n ≤ N − 1. The linear combination coefficients θ_im
are restricted to either +1 or −1. This simplifies the procedure of codebook search
for optimum innovation and also makes the system comparatively robust to bit errors
as a single bit error only affects one component. Computational complexity is also
reduced for a joint optimal search of the VSELP codebook and the adaptive codebook
as it requires orthogonalization of a small (typically 10) number of basis vectors only.
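The ±1 basis-vector combination described above can be sketched in a few lines. This is a toy illustration with our own function and variable names; in real VSELP the basis vectors are trained and have the excitation frame length:

```python
def vselp_codevector(i, basis):
    """Build codevector u_i from M basis vectors with +/-1 weights.

    Bit m of the index i selects the sign of basis vector m, so M basis
    vectors generate 2**M codevectors, and flipping one bit of i flips
    the sign of exactly one basis vector's contribution.
    """
    M = len(basis)
    signs = [1.0 if (i >> m) & 1 else -1.0 for m in range(M)]
    return [sum(s * v[n] for s, v in zip(signs, basis))
            for n in range(len(basis[0]))]
```

The single-sign-flip property is exactly the bit-error robustness argument made in the text.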
MP-LPC (Multi Pulse LPC) [5] and RPE (Regular Pulse Excitation) [67] are pre-
cursors to CELP that use codebooks of pulse trains whose positions and amplitudes
are determined in a closed loop fashion.
Mixed techniques
Different approaches may be applied in modelling different segments of the exci-
tation. In particular, advantage can be taken of the apparent periodicity of
the voiced portions of speech. The techniques Prototype Waveform Interpolation
(PWI) [63, 64] and Time Frequency Interpolation (TFI) [97] use open loop frequency
domain interpolation techniques to model the gradually changing pitch cycles of a
voiced excitation while using closed loop techniques like CELP for unvoiced segments
which are difficult to model parametrically due to lack of specific spectral structures.
1.1.3 Speech Coding Standards
A summary of different speech coding standards currently in use is shown in Table 1.1.
The ITU-T (formerly CCITT) has also passed some recommendations (Table 1.2) for
digital coding of speech. The progression of toll/near-toll quality speech coding can
be seen in Fig. 1.6 where bit rates of toll quality coders have been plotted with the
year of their introduction.
Figure 1.6: Historical bit rates of toll quality coders
1.2 Motivation and Original Contributions
1.2.1 Motivation
For low bit rate speech coders that employ the source-filter model, a large portion of
the bit rate is invested in coding synthesis filter parameters. Obviously, one way to
improve synthetic speech quality at low bit rates will be to minimize the number of
Rate (kb/s)  Application                              Coding Algorithm                                            Year Adopted
64           PSTN (1st Generation)                    Pulse Code Modulation (PCM)                                 1972
32           PSTN (2nd Generation)                    Adaptive Differential PCM (ADPCM)                           1984
16           PSTN (3rd Generation)                    Low Delay Code Excited Linear Predictive Coding (LDCELP)    1992
16           INMARSAT Standard B (Maritime)           Adaptive Predictive Coding (APC)                            1985
13           Pan European Digital Mobile Radio (DMR)  Regular Pulse Excitation Long Term Prediction (RPE-LTP)     1991
             Cellular System (GSM)
9.6          Skyphone (Aeronautical)                  Multi-Pulse Linear Predictive Coding (MPLPC)                1990
8            North American DMR (Mobile)              Vector Sum Excited Linear Predictive Coding (VSELP)         1992
6.7          Japanese DMR (Mobile)                    VSELP                                                       1993
6.4          INMARSAT Standard M (Land-Mobile)        Multi-Band Excitation (MBE)                                 1993
4.8          U.S. Government Federal Standard 1016    CELP                                                        1991
4.8          NASA MSAT-X (Mobile Satellite)           Vector Adaptive Predictive Coding (VAPC)                    1991
2.4          U.S. Government Federal Standard 1015    Linear Predictive Coding (LPC-10)                           1977

Table 1.1: Digital Speech Coding Standards
Recommendation | Code Rate (kb/s) | Algorithm
G.711          | 64               | PCM
G.726          | 16, 24, 32, 40   | ADPCM
G.728          | 16               | LD-CELP
G.729          | 8                | ACELP

Table 1.2: Some important ITU-T recommendations
bits used to represent LPC parameters while keeping the spectral distortion within
acceptable limits. The bits thus saved can be used for a better representation of the
excitation.
It has already been reported [86] that a spectral distortion of less than 1 dB
is required for transparent quantization of LPC parameters. Paliwal and Atal [86]
achieved transparent quantization of LPC parameters using a highly constrained VQ
structure, split VQ. This was somewhat surprising, since multi-stage VQs had previously
failed to achieve transparent quantization of spectral parameters, and split VQ is
clearly a constrained version of multi-stage VQ (MSVQ).
An analysis of the multi-stage VQ showed that a sequentially searched MSVQ has
to be severely constrained for the sequential search to be optimal. Clearly, a better
search could be performed, and as we show in this thesis, the M-L search provided the
best performance-complexity trade-off in obtaining transparent quantization of LPC
parameters.
An 1800 bps spectral excitation coder was also implemented to show the effectiveness
of the new efficient LPC quantizer in achieving a moderate-quality coder (better than
LPC-10e) at a low bit rate.
1.2.2 Original Contributions
The original contributions reported in this thesis are as follows.
For the first time, M-L search was combined with MSVQ, resulting in a very
efficient, low complexity, suboptimal VQ (section 4.5).
It was demonstrated for the first time that transparent LPC quantization could
be done using an MSVQ with a large number of small stages. In fact, the memory
complexity was reduced to a total of only 60 codevectors for a quantizer that
achieved transparent quantization at 30 bits/vector (Fig. 4.11).
A method for designing LPC quantizers with very low computational complexity
was indicated, resulting in complexity lower than that of the only transparent quantizer
known at that time (split VQ by Paliwal and Atal) (sections 4.2-4.4).
A transparent LPC quantizer was designed at 22 bits/vector which was the
lowest rate transparent LPC quantizer at that time (Table 4.1).
MSVQ with M-L search was shown to result in a VQ that is robust with respect to
transmission errors, speakers, and languages. It was the first time that a VQ
with proven robustness was obtained (section 4.6).
In designing the harmonic coder, a 0-bit harmonic magnitude shape quantizer
was used, which helped to achieve a low bit rate for the coder (section 5.4.3).
A new geometric pitch detector with low computational complexity was designed.
The pitch detector can provide the locations of individual pitch pulses, which
is useful in pitch synchronous algorithms (section 5.4.1).
After publication of the first set of results at ICASSP 1992, our work has been very
widely referenced (a partial list of citations is given in Appendix D). Several companies,
such as Rockwell and Texas Instruments, have integrated our LPC quantizer into their
products. Also, to the best of our knowledge, the new DoD standard 2400 bps coder
uses our LPC quantizer.
The following is the list of publications that resulted from the work reported in
this thesis.
1. B. Bhattacharya, W. LeBlanc, S. Mahmoud, and V. Cuperman. Tree Searched
Multi-Stage Vector Quantization for 4kb/s Speech Coding. ICASSP, pp. 1-105
- 1-108, San Francisco, March 1992.
2. W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman. Efficient
Search and Design Procedures for Robust Multi-Stage VQ of LPC Parameters
for 4 kb/s Speech Coding. IEEE Trans. Speech and Audio Processing, Vol. 1,
No. 4, pp. 373-385, Oct. 1993.
3. V. Cuperman, P. Lupini, and B. Bhattacharya. Spectral Excitation Coding of
Speech at 2.4 kb/s. ICASSP, pp. 496-499, Detroit, May 1995.
Chapter 2
A Brief Review of Speech Coding Literature
There is a vast literature on information theoretic aspects of coding but, as
we point out, not much of it is directly relevant in the context of speech coding. Three
major speech coding techniques are also reviewed here, and an attempt is made to
identify their shortcomings in order to obtain pointers to a successful design of a low
bit rate speech coder.
2.1 Source Coding and Rate Distortion Theory
The main concern of source coding theory is how best to map source symbols to
channel symbols assuming a perfect channel. This involves assigning channel symbols
to source symbols such that the average symbol length is minimum. Consider a
discrete memoryless source with symbols {x_1, x_2, ..., x_M} and corresponding symbol
probabilities {P(x_1), P(x_2), ..., P(x_M)}. The entropy (average information per symbol)
of this source is given by

H(x) = -\sum_{i=1}^{M} P(x_i) \log P(x_i)    (2.1)

Usually, the base of the logarithm is 2 and hence entropy is measured in bits/symbol.
If the channel alphabet is binary {0, 1}, then we need a minimum average of H(x)
bits per symbol to encode this source. If the source is correlated and modelled as a
stationary, ergodic, Markov process, the entropy is lower than that given by Eq. (2.1),
and can be written (for a first order process) as

H(x) = -\sum_{j} \sum_{k} P(c_k, c_j) \log a_{kj}    (2.2)

where c_j and c_k are successive states of the Markov process, P(c_k, c_j) is the joint
probability of occurrence of the state pair (c_k, c_j), and a_{kj} is the transition probability
from state c_j to state c_k, i.e. a_{kj} = P(c_k, c_j)/P(c_j). It should be borne in mind that
an m-th order Markov process can be reduced to a first order process by considering
an m-th extension of the source alphabet, and hence Eq. (2.2) applies to all stationary,
ergodic, Markov processes. The bit rate indicated by H(x) is the minimum bit rate
required to represent x without any distortion. Often, due to system constraints, the
information needs to be represented at a lower bit rate and the data must be compressed.
This is the problem of source coding, where a set of source symbols is
to be mapped to a set of reproduction symbols with lower entropy. The rate-distortion
function R(D) gives the minimum bit rate at which information can be coded with
an average distortion of D or less. The distortion-rate function D(R) defines the
minimum distortion achievable for a given coding rate R.
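As a concrete illustration of Eqs. (2.1) and (2.2), the following minimal sketch (the function names are my own, not the thesis') estimates both the memoryless entropy and the first-order Markov entropy of a symbol sequence; for a correlated source the latter is lower, as predicted:

```python
import math
from collections import Counter

def entropy(seq):
    """H(x) = -sum_i P(x_i) log2 P(x_i) for a memoryless source (Eq. 2.1)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def markov_entropy(seq):
    """First-order Markov entropy (Eq. 2.2):
    H = -sum_{j,k} P(c_k, c_j) log2 a_kj, with a_kj = P(c_k, c_j)/P(c_j)."""
    pairs = Counter(zip(seq, seq[1:]))
    states = Counter(seq[:-1])
    n = len(seq) - 1
    h = 0.0
    for (cj, ck), c in pairs.items():
        p_joint = c / n
        a_kj = c / states[cj]          # transition probability from cj to ck
        h -= p_joint * math.log2(a_kj)
    return h

# A correlated binary source: long runs make the first-order Markov
# entropy much lower than the memoryless entropy.
seq = "0000111100001111000011110000"
print(entropy(seq))        # close to 1 bit/symbol
print(markov_entropy(seq)) # well below 1 bit/symbol
```

The gap between the two values is exactly the redundancy a predictive coder can exploit.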
Assume a discrete message source x with alphabet size M and a reproduction
alphabet y with alphabet size N. A deterministic mapping {x} -> {y} of source
symbols to reproduction symbols can be completely specified by an assignment matrix,
or by a table with entries denoted by Q(j | i) to indicate that the source symbol x_i is
mapped to the reproduction symbol y_j. The probability that the symbol y_j occurs is
given by

P(y_j) = \sum_{i=1}^{M} Q(j | i) P(x_i)    (2.3)

where Q(j | i) represents the deterministic assignment {x} -> {y}. Note that if we
require the Q(j | i) to have the normalization

\sum_{j=1}^{N} Q(j | i) = 1    (2.4)

then the function Q(j | i) behaves just like a conditional probability and, in fact,
is mathematically indistinguishable from a conditional probability even though the
process considered is deterministic. The function Q(j | i) is called a conditional
assignment function.
A single-letter distortion measure [96] is given by an M x N matrix with elements
d(i, j) which reflect the cost if symbol x_i is reproduced as symbol y_j. The average
distortion, D, over all possible source and reproduction symbols can then be written
as

D = \sum_{i=1}^{M} \sum_{j=1}^{N} P(i, j) d(i, j)    (2.5)

where P(i, j) = Q(j | i) P(x_i) is the joint probability of occurrence of source symbol x_i
and reproduction symbol y_j. When D is given a numerical value, it is called a fidelity
criterion. The primary parameters of rate distortion theory are shown in Fig. 2.1.
Figure 2.1: The primary parameters of R-D theory
Given the input probability distribution p(x) and the distortion measure, the average
distortion is a function of the conditional assignment function Q(j | i). A
conditional assignment Q(j | i) is called D-admissible if it results in an average
distortion that is upper bounded by D. We define the set of D-admissible assignments,
Q_D, as

Q_D = { Q(j | i) : D(Q) <= D }    (2.6)

where D(Q) denotes the average distortion of Eq. (2.5) resulting from the assignment Q.
The mutual information between source messages and reproduced messages is

I(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} P(i, j) \log \frac{P(i, j)}{P(x_i) P(y_j)}    (2.7)

Using the relationship

P(i, j) = Q(j | i) P(x_i)    (2.8)

and Eq. (2.3), the mutual information can be written as

I(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} Q(j | i) P(x_i) \log \frac{Q(j | i)}{P(y_j)}    (2.9)

Thus I(x, y) is dependent on the conditional assignment function Q(j | i) and the input
probability distribution p(x).
The rate distortion function, R(D), is defined as the minimum of I(x, y) over the
set of D-admissible conditional assignments Q_D that produce an average distortion
less than or equal to D, i.e.

R(D) = \min_{Q \in Q_D} I(x, y)    (2.10)
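For a toy discrete source, the quantities entering this definition, the average distortion of Eq. (2.5) and the mutual information of Eq. (2.9), can be computed directly. The sketch below (helper names and the binary example are mine):

```python
import numpy as np

def avg_distortion(P_x, Q, d):
    """Average distortion D = sum_{i,j} Q(j|i) P(x_i) d(i,j)  (Eq. 2.5)."""
    return float(np.sum(P_x[:, None] * Q * d))

def mutual_information(P_x, Q):
    """I(x, y) = sum_{i,j} Q(j|i) P(x_i) log2( Q(j|i) / P(y_j) )  (Eq. 2.9),
    with P(y_j) = sum_i Q(j|i) P(x_i)  (Eq. 2.3)."""
    P_y = P_x @ Q
    P_joint = P_x[:, None] * Q
    mask = P_joint > 0                      # skip zero-probability pairs
    P_y_grid = np.broadcast_to(P_y, Q.shape)
    return float(np.sum(P_joint[mask] * np.log2(Q[mask] / P_y_grid[mask])))

# Toy example: binary source, binary reproduction, Hamming distortion.
P_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
Q_identity = np.eye(2)   # lossless assignment: D = 0, I = H(x) = 1 bit
print(avg_distortion(P_x, Q_identity, d), mutual_information(P_x, Q_identity))
```

Sweeping over assignments Q with D(Q) <= D and taking the minimum of I(x, y) traces out the R(D) curve for this source.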
It is evident that in order to apply rate distortion theory to determine performance
limits of speech coders, the major difficulty encountered is in defining the terms in-
volved. It is not clear how to define what constitutes the source and reproduction
alphabets, and hence it is not possible to talk about the source probability distribu-
tion, entropy of the source, or a distortion measure between source and reproduction
symbols. It is generally observed that unvoiced speech can be coded in a perceptually
accurate manner at a very low bit rate [68] and that required for voiced speech is
usually relatively high. This shows that most probably the voiced portions of speech
carry more information compared to the unvoiced portions contrary to one's first im-
pression that unvoiced segments have a high information rate because of their lack of
an obvious structure. In fact, if one attempts to compute entropy of different speech
segments using quantized PCM speech and a Markov model, it is highly probable that
the results will show unvoiced segments as the main carriers of information.
De and Kabal [30] made some attempts to apply rate distortion theory to speech
coding using cochlear models and a perceptual distortion measure called cochlear
discrimination [29]. The performances of four different speech coders - 4.8 kb/s CELP,
8 kb/s VSELP, 16 kb/s wide-band CELP, and 32 kb/s ADPCM, were studied and
compared with their rate-distortion performance limits. The results [30] showed that
the perceptual quality obtained by the 4.8 kb/s, 8 kb/s, 16 kb/s, and 32 kb/s coders
can be achieved at 1.5 kb/s, 4 kb/s, 5.4 kb/s, and 20 kb/s respectively according to
the rate distortion curve computed by them. Considering present-day research
goals, these results seem quite reasonable.
A review is presented below of three different speech coding philosophies along
with their merits and demerits.
2.2 Analysis-by-Synthesis Speech Coding
Analysis-by-synthesis, as the name implies, involves analysis and synthesis. This
further implies that these coders are parametric coders that require an analysis to
compute model parameters. Analysis-by-synthesis is a general approach in which some
or all of the model parameters are estimated by systematically searching a parameter
space for a close match between synthesized and original speech. The search is carried
out by starting with speech being synthesized using an initial set of parameters and
then changing the parameter set and resynthesizing the same segment of speech until
all points in the parameter space have been visited. The set of parameters that
produced synthetic speech closest to the original speech according to some chosen
distortion measure is transmitted to the decoder.
Essentially, the analysis-by-synthesis technique can be applied to any parametric
speech coder that satisfies (or can be constrained to satisfy, without making the
synthetic speech of unacceptable quality) the following two conditions.
1. The parameter space should be finite.
2. It should be possible to quantize the parameter space into a finite set of points.
It should be noted that since all parameters need to be quantized anyway before
transmission to achieve finite bit-rate, essentially all parametric coders can be imple-
mented in the analysis-by-synthesis fashion. The actual choice of which parameters
(if any) to estimate using analysis-by-synthesis technique depends on the ease with
which the parameter space can be searched and is determined by a complexity vs.
quality tradeoff. Also, for some parameters (e.g. synthesis filter parameters for a
source-filter model), direct computation techniques may exist obviating the need to
do a search.
It should be borne in mind that the distance computation (using a chosen distor-
tion measure) between original and synthetic speech during the search can be made
either in time domain or in a transform domain (e.g. frequency domain). The general
structure of an analysis-by-synthesis system is shown in Fig. 2.2.
Figure 2.2: A Generalized Analysis-by-Synthesis System
Although analysis-by-synthesis coders belong to a general class as defined above,
the term usually refers to the more specific class of parametric coders that employ
a linear prediction based synthesis filter in a source-filter configuration. The first
practical A-by-S system was the multi-pulse LPC (MP-LPC) [5] where the excitation
sequence was modelled as a sequence of pulses whose positions were determined in an
A-by-S manner. After the optimal positions are determined, the pulse magnitudes are
computed. In Regular Pulse Excitation (RPE-LPC) [67], the excitation sequence is a
sequence of regularly spaced pulses where the position of the first pulse and the pulse
amplitudes are encoded. The most important and popular form of A-by-S coding is
known as Code Excited Linear Prediction (CELP) coding.

Figure 2.3: Computational structure of the CELP coder
The rudimentary structure of a CELP coder has already been discussed in Chapter 1
along with a schematic diagram (Fig. 1.5). A computationally efficient structure
is obtained by pushing the perceptual weighting filter W(z) through the summation
sign, giving rise to the weighted input speech signal and the weighted short term
synthesis filter W(z)/A(z). This structure is shown in Fig. 2.3.
The computation of the synthetic vector is simplified in this model by separating
the zero-input response (ZIR) and zero-state response (ZSR) of the synthesis filter.
As shown in Fig. 2.3, only the filtered codevector ŷ(n) depends on the code vector
being filtered, while u(n) and r(n) depend only on the filter parameters. Therefore, a
target vector y(n) is calculated as y(n) = s_w(n) - u(n) - r(n), which is matched with
ŷ(n) to search for an appropriate code vector. Since ŷ(n) constitutes only the ZSR of
the synthesis filter, it can be computed as a matrix-vector multiplication of a code
vector c with a fixed (for the duration of a subframe) lower triangular Toeplitz
impulse response matrix H:

\hat{y} = H c, \qquad H = \begin{bmatrix} h(0) & 0 & \cdots & 0 \\ h(1) & h(0) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ h(N-1) & h(N-2) & \cdots & h(0) \end{bmatrix}

where h(n) is the impulse response of the weighted synthesis filter and N is the
dimension of the subframe vector. The special structure of the impulse
response matrix facilitates a low complexity computation of the filtered code vectors.
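As an illustrative sketch (not the thesis' implementation), the ZSR computation and the resulting codebook search can be written as follows; the filter coefficients and codebook used are hypothetical:

```python
import numpy as np

def impulse_response(a, N):
    """Impulse response h(0..N-1) of an all-pole weighted synthesis filter
    with denominator coefficients a = [1, a_1, ..., a_p]."""
    p = len(a) - 1
    h = np.zeros(N)
    h[0] = 1.0
    for n in range(1, N):
        for k in range(1, min(n, p) + 1):
            h[n] -= a[k] * h[n - k]
    return h

def filtered_codevector(h, c):
    """Zero-state response y_hat = H c, with H the lower triangular
    Toeplitz matrix built from the impulse response h."""
    N = len(c)
    H = np.zeros((N, N))
    for i in range(N):
        H[i, :i + 1] = h[:i + 1][::-1]   # H[i, j] = h(i - j), j <= i
    return H @ c

def celp_search(target, h, codebook):
    """Select the codevector (and optimal gain) maximizing the normalized
    correlation with the target y(n) = s_w(n) - u(n) - r(n)."""
    best_idx, best_gain, best_score = None, 0.0, -np.inf
    for idx, c in enumerate(codebook):
        y = filtered_codevector(h, c)
        corr, energy = float(target @ y), float(y @ y)
        if energy > 0.0 and corr * corr / energy > best_score:
            best_idx, best_gain, best_score = idx, corr / energy, corr * corr / energy
    return best_idx, best_gain
```

Because H is lower triangular Toeplitz, ŷ can equivalently be obtained by ordinary convolution truncated to N samples, which is what makes the search cheap in practice.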
Most of the research in CELP has been directed towards complexity reduction,
and many different techniques have been investigated to that end. Computational
complexity can be reduced by introducing some structure in the stochastic codebook,
albeit with some loss in performance due to the suboptimality introduced by the structural
constraint. Several suboptimal codebook structures have been studied in an attempt
to reduce complexity. A widely used technique is the use of sparse codebooks where
most of the elements are zeros. Sparse codebooks were first independently proposed
by Davidson and Gersho [27] and Lin [71]. Lin [71] also suggested an overlapped
codebook technique where each code vector is a subsequence derived from a longer
sequence of random numbers. Each code vector is obtained by shifting a fixed length
selection window over the longer sequence by one or more samples. Substantial savings
in computation and storage can be obtained by this technique. Sparse codebooks can
also be combined with overlapped codebooks and elements of the codevectors can be
restricted to take on only binary or ternary values. The DoD FS-1016 4.8 kb/s CELP
coder is an example where a sparse, ternary, overlapped codebook is used [19].
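A minimal sketch of such a codebook follows; the sizes and sparsity are illustrative choices loosely modelled on an FS-1016-style design, not taken from the standard itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def overlapped_ternary_codebook(size, dim, shift=2, sparsity=0.77):
    """Overlapped, sparse, ternary stochastic codebook: codevectors are
    shifted windows over one long ternary sequence (values -1, 0, +1), so
    storage is roughly size*shift samples instead of size*dim."""
    base = rng.standard_normal(size * shift + dim)
    # Keep only the largest ~(1 - sparsity) fraction of samples, sign-quantized.
    keep = np.abs(base) > np.quantile(np.abs(base), sparsity)
    ternary = np.sign(base) * keep
    return [ternary[k * shift : k * shift + dim] for k in range(size)]

cb = overlapped_ternary_codebook(size=512, dim=60)
print(len(cb), len(cb[0]))   # 512 codevectors of dimension 60
```

Filtering a shifted codevector reuses most of the previous result, which is where the computational saving of the overlapped structure comes from.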
Other structured codebooks leading to some complexity reduction are lattice codebooks
[46] and algebraic codebooks [2]. In these structures, regularly spaced arrays are
used as codebooks obviating the need to store them. Since these codevectors can
be generated in an orderly fashion, there is a predetermined correspondence between
lattice points and binary words.
A different complexity reduction technique is used in VSELP [44, 43]. The coder,
adopted as the North American standard (IS-54) for digital cellular communication,
contains two VSELP excitation codebooks, with 2^M and 2^N codevectors respectively.
These are constructed from sets of M and N basis vectors. In IS-54, both M and N
are 7, giving rise to 128 codevectors in each codebook. Gerson [43] also reported a 4.8
kb/s VSELP coder using a single excitation codebook with M = 10 basis vectors. The
following description assumes a single excitation codebook for brevity; the extension
to multiple codebooks is straightforward. Defining v_m(n) as the m-th basis vector and
u_i(n) as the i-th codevector, each from the VSELP codebook, then:

u_i(n) = \sum_{m=1}^{M} \theta_{im} v_m(n)    (2.13)

where 0 <= i <= 2^M - 1 and 0 <= n <= N - 1.
Thus, each codevector in the codebook is a linear combination of the M basis
vectors. The coefficients \theta_{im} are equal to +1 if bit m of codeword i is a 1 and
equal to -1 if the corresponding bit in the codeword is 0. This special structure of
the VSELP codebook lends itself to a fast search, as only the basis vectors need to be
filtered: the filtered codevectors are formed as sums and differences of the filtered
basis vectors, since the linear combination coefficients are restricted to either +1 or -1.
The VSELP coder also uses an adaptive codebook as in CELP (Fig. 1.5). The adaptive
codebook is a sequence of past excitation, a suitable segment of which is used to form
the current excitation along with the contribution(s) from the excitation codebook(s).
The adaptive codebook and the VSELP codebook(s) are jointly searched by searching
the adaptive codebook first and orthogonalizing the filtered basis vectors with respect
to the chosen adaptive codevector. In general this will be a highly computationally
intensive operation but is feasible for the VSELP structure.
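Eq. (2.13) and the linearity property that enables the fast search can be sketched as follows (function names are my own):

```python
import numpy as np

def vselp_codevector(i, basis):
    """u_i = sum_m theta_im v_m (Eq. 2.13); theta_im = +1 if bit m of
    codeword i is 1, else -1. basis is an (M, N) array of basis vectors."""
    M = basis.shape[0]
    theta = np.array([1.0 if (i >> m) & 1 else -1.0 for m in range(M)])
    return theta @ basis

# Fast-search property: filtering is linear, so filtering the M basis
# vectors once gives every filtered codevector as sums and differences.
rng = np.random.default_rng(0)
M, N = 3, 8
basis = rng.standard_normal((M, N))
H = np.tril(rng.standard_normal((N, N)))     # any linear (filtering) operator
i = 5
direct = H @ vselp_codevector(i, basis)      # filter the codevector
fast = vselp_codevector(i, basis @ H.T)      # combine pre-filtered basis vectors
print(np.allclose(direct, fast))             # -> True
```

Only M filtering operations are thus needed for a codebook of 2^M vectors; in addition, consecutive codewords in Gray-code order differ in one bit, so each filtered codevector can be updated from the previous one by a single add or subtract.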
There are many different ways codebooks have been structured (e.g. multi-stage
VQ, split VQ, etc.). Various structured codebook design techniques have been dis-
cussed in detail by LeBlanc and Mahmoud [70].
Figure 2.4: Schematic diagram of a Transform Coder
2.3 Transform Coding
Transform coding, as the name implies, deals with the problem in a transform domain.
Speech is first transformed into a suitable set of parameters which are quantized and
inverse transformed to obtain decoded speech. In general, it is not necessary that
the speech signal itself be transformed and quantized but a parametric description of
the signal may be obtained first and it may be useful to quantize the parameters in
a transform domain instead of using a straightforward quantization. The parametric
representation would in general be derived from an appropriate model of the speech
signal. The general block diagram of a transform coder is shown in Fig. 2.4. Blocks
T1 to Tn are the transforms and Q1 to Qn are the respective quantizers. The analysis
block may be just an identity operator. Transform coders are useful when the elements
of the input vector are highly correlated to each other and a transform can achieve
decorrelation and energy compaction such that most of the signal energy is contained
in a subset of the transform coefficients. An adaptive bit allocation technique can
then be used for efficient coding. The quantizers Q1, Q2, ..., Qn may be scalar [113]
or vector [22] quantizers.
An insight into the transform coding process can be obtained by considering the
simple case where each quantizer Q j is a scalar quantizer, and only one transform A
is applied to a block of input samples of length N. The transform equation is written
as
y = A x (2.14)
The minimization of the average distortion in a transform coder involves (i) the choice
of an optimum bit assignment rule, and (ii) the choice of an optimum transform A.
The variances of the transform coefficients are different in general, and the bit rate
R_i (bits/sample) required to quantize the coefficient y_i of variance \sigma_i^2 such that the
average mean squared distortion is upper bounded by D_i can be written as

R_i = \delta + \frac{1}{2} \log_2 \frac{\sigma_i^2}{D_i}    (2.15)

The second term in the above equation is the rate distortion bound for i.i.d. Gaussian
variables, and \delta is a correction term that takes into account the performance of
practical quantizers and any deviation from a Gaussian distribution. It is easy to
show that the optimum bit assignment for quantizing the transform coefficients for
minimum average distortion is given by

R_i = \bar{R} + \frac{1}{2} \log_2 \frac{\sigma_i^2}{\left( \prod_{j=1}^{N} \sigma_j^2 \right)^{1/N}}    (2.16)

where \bar{R} is the average bit rate in bits/sample. With an optimum bit assignment, the
average distortion can be written as

D = 2^{2\delta} \left( \prod_{j=1}^{N} \sigma_j^2 \right)^{1/N} 2^{-2\bar{R}}    (2.17)
Let R_xx and R_yy be the covariance matrices of the input signal and the transform
coefficients respectively. Then,

\det R_{yy} \le \prod_{j=1}^{N} \sigma_j^2    (2.18)

for any transform A, and

\det R_{xx} = \det R_{yy}    (2.19)

for any unitary transform A. The variances \sigma_j^2 are the diagonal elements of R_yy. We
also have

\det R_{xx} = \prod_{j=1}^{N} \lambda_j    (2.20)

where \lambda_j are the eigenvalues of R_xx. Observing equations (2.17)-(2.20), it can be seen
that minimum distortion is achieved if the variances \sigma_j^2 are equal to the eigenvalues
\lambda_j. The Karhunen-Loeve transform (KLT) has the desired property.
Assuming that the quantizer parameter \delta in Eq. (2.15) remains the same whether
time domain or transform domain samples are quantized, the coding gain of transform
coding over PCM can be written as

G_{TC} = \frac{\sigma^2}{\left( \prod_{j=1}^{N} \sigma_j^2 \right)^{1/N}}    (2.21)

For unitary transforms, the signal variance \sigma^2 is equal to the average of the variances
of the transform coefficients:

\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} \sigma_j^2    (2.22)

Thus, the gain of transform coding over PCM is the ratio of the arithmetic and
geometric means of the variances of the transform coefficients. The maximum gain is
achieved if the transform is the KLT, and G_KLT is equal to one only if all eigenvalues are
equal, i.e. if the signal process is white noise.
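Eqs. (2.21)-(2.22) can be checked numerically. The sketch below (my own, using an assumed AR(1) source) computes G_TC from the eigenvalues of R_xx, which are the KLT coefficient variances:

```python
import numpy as np

def ar1_covariance(rho, N):
    """Covariance matrix of a unit-variance AR(1) process: R[i,j] = rho^|i-j|."""
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def transform_coding_gain(R):
    """G_TC = arithmetic mean / geometric mean of the transform-coefficient
    variances (Eq. 2.21); the KLT diagonalizes R, so those variances are
    the eigenvalues of R_xx."""
    lam = np.linalg.eigvalsh(R)
    return float(np.mean(lam) / np.exp(np.mean(np.log(lam))))

print(transform_coding_gain(ar1_covariance(0.9, 16)))  # > 1 for a correlated source
# For white noise (rho = 0) the gain collapses to unity:
print(transform_coding_gain(ar1_covariance(0.0, 16)))  # -> 1.0
```

The gain grows with the block length N, approaching the inverse spectral flatness of the source.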
Comparing transform coding with predictive coding, it can be shown [28] that

G_{KLT} = \left( \prod_{j=1}^{N} G_p(j) \right)^{1/N}    (2.23)

where G_p(j) is the prediction gain of an optimal j-th order predictor. The maximum
transform coding gain G_KLT is thus the geometric mean of the predictor gains. The
predictor gains increase monotonically with the order of the predictor. Hence,
the transform coding gain is always smaller than the predictor gain if a transform
coder with a block length of N is compared with a predictive coder employing an N-th
order predictor. The asymptotic coding gain for transform coding is the same as that
for DPCM [23] and is equal to the spectral flatness measure of the given signal [59].
This means that transform coding may achieve the same degree of signal decorrelation
as linear prediction.
The speech signal is essentially non-stationary. Therefore, one needs to compute the
KLT matrix for every block of samples being coded and transmit the transform matrix
to the decoder for optimal transform coding. This is a highly expensive operation
considering the computational complexity and resulting bit rate of the coder. It has
been shown [93, 113] that the Discrete Cosine Transform (DCT) performs almost as
well as the KLT, enabling one to use a fixed transform matrix. Zelinski and Noll [113]
have shown that adapting the bit assignment to local signal statistics gives an extra
SNR gain of 4 to 6 dB compared to a fixed bit assignment transform coder.
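A sketch of a fixed-transform coder's bit allocation (Eq. (2.16)), using an orthonormal DCT in place of the signal-dependent KLT; the AR(1) parameters and block length are illustrative:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix, a fixed stand-in for the KLT."""
    n, k = np.meshgrid(np.arange(N), np.arange(N))
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return C

def bit_allocation(variances, R_avg):
    """Optimal rule R_i = R_avg + 0.5*log2(sigma_i^2 / geometric mean)
    (Eq. 2.16); negative allocations are clipped to zero in practice."""
    log_gm = np.mean(np.log2(variances))
    R = R_avg + 0.5 * (np.log2(variances) - log_gm)
    return np.maximum(R, 0.0)

# DCT-coefficient variances of an AR(1) block, then the bit assignment.
rho, N = 0.9, 8
idx = np.arange(N)
Rxx = rho ** np.abs(idx[:, None] - idx[None, :])
C = dct_matrix(N)
var = np.diag(C @ Rxx @ C.T)
bits = bit_allocation(var, R_avg=2.0)
print(np.round(bits, 2))   # more bits go to high-variance (low-frequency) coefficients
```

Recomputing `var` from short-term statistics of each block and reallocating the bits is the adaptive scheme that yields the reported 4 to 6 dB gain over a fixed assignment.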
2.4 Sinusoidal Coding
An important class of coders, generally called sinusoidal coders, has emerged in recent
years as a promising choice for bit rates below 4 kb/s. These coders use a sinusoidal
representation of speech by expressing the synthetic speech as a sum of sinusoids:

\hat{s}(n) = \sum_{m=1}^{M(n)} A_m(n) \cos \theta_m(n)    (2.24)
The best known systems based on sinusoidal coding are Sinusoidal Transform Coder
(STC), and Multi-Band Excitation (MBE) Coder. While STC uses a sinusoidal model
to synthesize both voiced and unvoiced speech, MBE uses the sinusoidal representa-
tion only for the voiced part of the speech. The unvoiced segments are synthesized in
the frequency domain for MBE while the voiced segments are synthesized in the time
domain using a sinusoidal model as in Eq. (2.24). The major difference between the
two techniques is the computation of the harmonic frequencies and their magnitudes.
In MBE, the frequencies and magnitudes are evaluated in a closed loop fashion in
the frequency domain as the solution to an optimization problem. The cost function
is defined as the squared error between the windowed speech spectrum and the synthetic
spectrum. If S_w(\omega) is the windowed speech spectrum, A_k are the harmonic
magnitudes, \omega_0 is the fundamental frequency, and W(\omega) is the window spectrum,
then the error to be minimized is given as

E = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| S_w(\omega) - \sum_k A_k W(\omega - k\omega_0) \right|^2 d\omega    (2.25)

Minimizing the above expression under the assumption of an orthogonal window (i.e.
\frac{1}{2\pi} \int_{-\infty}^{\infty} W^*(\omega - k\omega_0) W(\omega - l\omega_0) d\omega = 0 for k \ne l), the harmonic magnitudes
can be written as

A_k = \frac{\int S_w(\omega) W^*(\omega - k\omega_0) d\omega}{\int |W(\omega - k\omega_0)|^2 d\omega}    (2.26)

where the asterisk (*) indicates the complex conjugate. The optimal values of A_k and
\omega_0 are jointly searched for to obtain the minimum error. This procedure yields a very fine
estimate of pitch as well as spectral magnitudes.
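In the DFT domain, Eq. (2.26) amounts to correlating the speech spectrum with shifted copies of the window spectrum in each harmonic band. The sketch below is my own discretized approximation, not the MBE reference implementation:

```python
import numpy as np

def harmonic_magnitudes(s_windowed, w, omega0_bins, num_harmonics, Nfft=1024):
    """Discrete approximation of Eq. (2.26): in the band around each harmonic
    of the fundamental (omega0_bins, in DFT bins), A_k is the correlation of
    the windowed-speech spectrum with the shifted window spectrum, normalized
    by the window-spectrum energy in that band."""
    S = np.fft.fft(s_windowed, Nfft)
    W = np.fft.fft(w, Nfft)              # window spectrum, centred at bin 0
    half = int(round(omega0_bins / 2))
    A = []
    for k in range(1, num_harmonics + 1):
        center = int(round(k * omega0_bins))
        offsets = np.arange(-half, half + 1)
        Wk = W[offsets % Nfft]           # window spectrum shifted to the harmonic
        Sk = S[(center + offsets) % Nfft]
        num = np.sum(Sk * np.conj(Wk))
        den = np.sum(np.abs(Wk) ** 2)
        A.append(np.abs(num) / den)
    return np.array(A)

# A synthetic harmonic signal: the recovered magnitudes track the
# 1 : 0.5 : 0.25 amplitudes (up to the one-sided-spectrum factor of 1/2).
Nfft, L = 1024, 256
n = np.arange(L)
w = np.hanning(L)
f0 = 32.0                                # fundamental, in bins of the 1024-point DFT
s = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * n / Nfft)
        for k, a in enumerate([1.0, 0.5, 0.25]))
A = harmonic_magnitudes(w * s, w, f0, 3, Nfft)
print(np.round(A, 3))
```

Repeating this for a grid of candidate omega0 values and keeping the one with the smallest residual error of Eq. (2.25) is what gives MBE its fine joint pitch/magnitude estimate.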
In STC, on the other hand, pitch and harmonic magnitudes are estimated in an
independent fashion. This reduces the computational complexity of the algorithm but
also makes it error prone.
The other difference between the two approaches is the way the harmonic phases
are treated. From Eq. (2.24) it can be seen that the angle \theta_m(n) is a function of
time n. In general this allows for arbitrary variation in the harmonic frequencies
(being derivatives of the phases), and it is not necessary that a harmonic relationship be
maintained between them at all times. MBE allows a linear change in fundamental
frequency over an analysis frame, i.e. it allows for a piecewise linear change in
pitch with time. This implies that in the MBE model the harmonic phases change
quadratically with time. The STC approach, however, allows for a piecewise quadratic
change in pitch, thereby allowing a cubic change in the phase values.
The main difficulty in harmonic coders arises from the fact that the number of
harmonics within the band of interest varies with time, thereby requiring techniques to
deal with the problem of quantizing a variable number of parameters with a constant
number of bits (at least in the case of fixed bit rate systems). Several solutions
have emerged recently that handle the issue with adequate quantization performance
[26, 74]. A different approach to this problem has been to obtain a residual signal
with a relatively flat spectrum through LP modeling. In this case the residual signal is
modelled as a sum of sinusoids, and the problem of quantizing the harmonic magnitudes
is reduced to a simple scalar quantization [110] or can be handled with a very simple
quantizer structure [25].
2.5 Relative Merits and Demerits of Different Coding Strategies
The three coding strategies discussed above have been the major focus of research in
speech coding. The huge success of CELP lies in the closed loop nature of its struc-
ture. The analysis-by-synthesis method successfully searches the parameter space and
brings the power of vector quantization to coding excitation sequences in a source-filter
parametric coder. Although use of the perceptual weighting filter takes advantage of
the human auditory perception characteristics and a source-filter model is used, CELP
coders attempt to match original and synthetic waveforms. This is dictated by the
Euclidean distortion measure used to select the optimal excitation in the absence of
any known perceptual distortion measure. Nevertheless, the structure is very flexi-
ble in the sense that it can accommodate any meaningful distortion criterion as they
become available.
The main demerit of CELP is that it places a large emphasis on time domain
behaviour of the synthetic speech (through selection of squared error as the distortion
measure) while our ears are not very sensitive to phase information which is inherently
retained by a time domain description at the expense of many bits. This is the reason
why CELP coders do not work very well below a bit rate of about 4.8 kb/s without
seriously degrading the quality of speech.
The other two coding structures, Transform Coding and Harmonic Coding,
attempt to address the problem of obtaining good perceptual quality by modelling
speech in a perceptually meaningful way. Classical Transform Coding (ATC and
VTC) does not address the question of perception in a direct manner; instead it
focuses on finding a suitable transform that will concentrate all the energy in a specific
number of bins, and quantizes those bins using an adaptive bit-allocation procedure.
Harmonic coders have been the main focus of research in recent years because
they address the issue of perceptual significance in a direct manner. This allows them
a parsimonious parametric description of the speech sound rather than the speech
waveform. The main difficulty encountered is, of course, the lack of detailed knowledge
about what is perceptually important. That such knowledge can reduce the number
of parameters to be considered for quantization can be demonstrated with the simple
example of a single tone. Viewed as a waveform, a proper description of the signal
must include three parameters: (a) amplitude, (b) frequency, and (c) phase at a
given time instant. Perceptually, only the first two parameters suffice as a complete
description of the sound of the tone. This does not apply, however, when a large
number of tones is involved. A classical example is a single pulse. It can
be described as the sum of infinitely many tones added with very definite relative
phases, and the sound produced by this waveform is percussive. When the same
tones are added with random phases, the sound produced is like the hissing sound of
white noise. This brings us to the importance of correct phase modelling in sinusoidal
coders. Currently, phase modelling is the main problem in obtaining toll quality
speech with a low bit rate sinusoidal coder.
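The tone example can be verified numerically. This sketch (mine) builds the same magnitude spectrum with aligned and with random phases and compares the peak-to-RMS (crest) factors:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 512, 64                      # samples, number of equal-amplitude tones
n = np.arange(N)
k = np.arange(1, M + 1)[:, None]

# Aligned (zero) phases yield an impulsive, pulse-like waveform;
# random phases yield a noise-like waveform with identical magnitudes.
aligned = np.sum(np.cos(2 * np.pi * k * n / N), axis=0)
random_ph = np.sum(np.cos(2 * np.pi * k * n / N
                          + rng.uniform(0, 2 * np.pi, (M, 1))), axis=0)

def crest_factor(x):
    """Peak-to-RMS ratio: large for the percussive signal, small for the
    hiss-like one, even though both share the same magnitude spectrum."""
    return float(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

print(crest_factor(aligned), crest_factor(random_ph))
```

The aligned case peaks at M with total power M/2, so its crest factor is M/sqrt(M/2); the random-phase case is several times smaller, which is precisely the perceptual difference a sinusoidal coder loses if it discards phase carelessly.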
Chapter 3
Quantization of LPC parameters
It has already been pointed out in Chapter 1, section 1.2, that a significant number of
bits in a low bit rate speech coder goes to quantizing the LPC parameters, also called
the spectral parameters. The objective of designing a good quantizer is to minimize
the average distortion between unquantized and quantized sets of parameters with a
reasonable amount of complexity. An introductory presentation of scalar and vector
quantization is given in Appendix B.
Given many equivalent representations of spectral parameters (an introduction
to Linear Prediction and equivalent representations is given in Appendix A), and
different quantization techniques, there are several issues that need to be addressed
for designing an efficient quantizer for LPC parameters. Typically, the following
questions need to be addressed.
1. Which parameter representation should be used? Some of the popular choices
have been - (a) reflection coefficients, (b) log area ratios, (c) cepstral coefficients,
(d) arc sine coefficients, and (e) line spectral frequencies.
2. What quantization strategy to use - scalar, vector, or matrix quantization?
Should a hybrid strategy like partly scalar and partly vector quantization be
used?
3. What distortion measure should be used in designing the quantizer?
4. How should the bit allocation be done for scalar quantized parameters?
5. Should any orthogonalizing transform be applied to the parameters before they
are quantized?
6. What should be the structure of the codebook if vector quantization is used?
Should it be a full search or a tree search codebook? Should it be a trained
codebook or a stochastic codebook? Should it be a single stage or a multi-stage
codebook?
7. Should the signal be classified before quantization and different strategies be
adopted for different classes?
Generally, the answers to these questions depend on the particular scenario in
which the quantizer needs to operate and the performance requirements it needs to
satisfy. A brief review of past research in this area is presented below.
3.1 Choosing an Appropriate Spectral Representation
Over the past several years, it has become quite clear that Line Spectral Frequencies
(LSFs) are probably the best candidates for quantizing the speech spectral enve-
lope. Before the introduction of LSFs by Itakura [56] in 1975, it was widely believed
that Log Area Ratios and Reflection Coefficients are the best candidates for spectral
quantization. In particular, Viswanathan and Makhoul [107] studied many equivalent
representations, including (a) linear predictor coefficients a_i, (b) autocorrelation
coefficients of {a_i}, (c) cepstral coefficients of A(z), (d) poles of 1/A(z), and (e)
reflection coefficients k_i; and showed that a transformation of the reflection
coefficients, Log Area Ratios (LARs), was the best choice for quantization among the
parameter sets considered.
Tohkura and Itakura [101] studied the spectral sensitivities of PARCOR (reflection)
coefficients and their transforms for a 10th order linear prediction model. Specifically,
the study compared the spectral sensitivities of Reflection Coefficients
(k_i's), Log Area Ratios (LARs), and Arc Sine Parameters (ASRCs) for efficient scalar
quantization of these parameters. The study showed a monotonically decreasing
sensitivity for the k_i's, with k_1 having the highest sensitivity, twice that of k_2. The
spectral sensitivity of the first reflection coefficient was also speaker dependent whereas
those of the higher order coefficients were less dependent on the speaker. This is an
important observation and poses a problem for the robustness of trained quantizers.
Unlike the k_i's, ASRCs and LARs show much less variation in spectral sensitivity,
with lower order coefficients showing only marginally higher sensitivity than higher
order coefficients. The first order coefficient still showed a high dependence on
speakers, and female speech was seen to have a higher sensitivity for first order
parameters than male speech.
A study of the effects of preprocessing showed that speaker dependence is decreased
by using short analysis windows (about 10 ms) and by spectral smoothing through
autocorrelation windowing.
A compilation of different scalar quantization results from the literature is shown
in Table 3.1, where the quantization distortion is measured as spectral distortion in
dB, defined as

d_{SD} = \sqrt{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 10\log_{10}|A(e^{j\omega})|^2 - 10\log_{10}|A_Q(e^{j\omega})|^2 \right]^2 d\omega }    (3.1)

where A_Q(e^{j\omega}) is the quantized form of the inverse filter A(e^{j\omega}). These results clearly
show that transformed coefficients have better quantization properties when compared
to actual reflection coefficients. For example, while a 3 dB average distortion was
achieved using 24 bits for reflection coefficients [60], a 2 dB average distortion was
obtained with 25 bits using line spectrum frequencies [34].
The main advantage of using LSFs for quantization was pointed out by Kang and
Fransen [62]. They showed that the effects of quantization error in LSFs are localized
around the respective LSFs in question, i.e., if a single LSF is disturbed, the spectral
error at far removed frequencies is practically zero. This is unlike reflection coefficients
where an error in one coefficient produces error in the entire spectrum. However, as
seen from Table 3.1, LARs and LSFs both appear to be suitable representations of
spectral parameters for quantization.
Another issue in choosing an appropriate representation of LPC parameters is
the interpolation properties of the particular representation used. In practical speech
coding systems it is not possible to have spectral information transmitted in a contin-
uous manner. Usually, the interval between two successive transmissions of spectral
information is of the order of tens of milliseconds. This means that while synthesizing
Parameter    SD (dB)                          Comments
RC [51]      3.0 (max)                        Lower bound on number of bits required
LAR [51]     3.0 (max)                        for a maximum spectral distortion
ASRC [51]    3.0 (max)
ASRC [79]    1.0 (avg)                        Minimum deviation method [49], based on
                                              experimentally derived probability densities
ASRC [17]    1.8 (avg)                        Uniform sensitivity quantization [107]
RC [60]      3.0 (avg)                        Minimum deviation method [49], open test
LSF [100]    1.0 (avg)                        Nonuniform quantization, designed to
                                              minimize mean squared error
RC [45]      Not measured                     Only voiced segments; imperceptible
                                              difference
LSFD [7]     1.5 / 1.1 / 0.8 (avg)            Bandwidth expansion is used; no
ASRC [7]     1.4 / 1.1 / 0.8 (avg)            significant difference observed between
LAR [7]      1.4 / 1.1 / 0.8 (avg)            different parameter choices
LSF [34]     3.7 / 2.7 / 2.0 / 1.3 /          Adaptive quantization with backward
             1.0 / 0.7 (avg)                  adaptation

Table 3.1: Some early scalar quantization results
speech, the spectral information needs to be interpolated for good results in order
to avoid steep changes across frame boundaries. Umezaki and Itakura [105] studied
the temporal variations of LPC parameters and found that LSFs are particularly
suitable for interpolation. Atal, Cox and Kroon [7] did a detailed study of the
interpolation properties of different LPC representations and could not find much
difference between LSFs, LARs, and ASRCs. Interpolation in all these domains always produces
a stable synthesis filter.
It is widely believed [86] that a spectral distortion of less than 1 dB (in quantizing
LPC parameters) is required to achieve perceptually transparent spectral quantiza-
tion. From Table 3.1 it is clear that at least 33 bits are necessary to achieve this goal
using scalar quantization.
3.2 Preprocessing
It was observed by Viswanathan and Makhoul [107] that the short-time spectral dynamic
range of speech signals is the single most important factor affecting quantization
properties. There are two popular techniques that address this issue in completely
different ways. Atal and Schroeder [11] describe another technique to improve
the stability of the LPC parameters that is particularly important for finite precision
arithmetic. These techniques are -
i) pre-emphasis,
ii) bandwidth expansion,
iii) high frequency compensation.
3.2.1 Pre-emphasis

Pre-emphasis reduces the spectral dynamic range by decreasing the general slope of
the spectrum [107]. This is done by passing the speech signal through a single zero
filter of the form 1 - \alpha z^{-1}. An optimal value for \alpha is obtained by solving for the
pre-emphasis filter that "whitens" the signal. This is given by the first order linear
predictor,

\alpha = \frac{r_1}{r_0},
r_1 and r_0 being the autocorrelation coefficients of the speech signal at lags 1 and 0
respectively. In practice, researchers have used values of \alpha = 0.8 to 1.0 [88, 6].
3.2.2 Bandwidth Expansion
In bandwidth expansion, reduction in spectral dynamic range is obtained by increasing
the pole bandwidths of the linear predictor [107]. Often the pitch frequency F_0 is
very close to the first formant frequency F_1, causing an underestimation of the
predictor pole bandwidths [102]. This produces extremely high sensitivity to parameter
perturbation; a slight change in parameters produces a big change in the spectral
envelope. An underestimated bandwidth also causes unnatural speech at the decoder.
Tohkura et al. [102] showed that applying bandwidth expansion before quantization
produces lower spectral distortion than expanding the bandwidth of the quantized
spectra at the decoder.
Let the i-th root of A(z) = \sum_{i=0}^{p} a_i z^{-i} be

z_i = e^{(-\pi B_i + j 2\pi F_i) T}.

Then the i-th formant frequency F_i and the corresponding bandwidth B_i are given
by [51, 102]

F_i = \frac{1}{2\pi T} \arg(z_i), \qquad B_i = -\frac{1}{\pi T} \ln |z_i|,

where T is the sampling interval. Bandwidth expansion is achieved by replacing a_i
in A(z) by a_i \gamma^i, where \gamma = e^{-\pi \Delta B T} and \Delta B is the desired bandwidth increase.
The modified polynomial becomes

A'(z) = \sum_{i=0}^{p} a_i \gamma^i z^{-i} = A(z/\gamma).

If the poles of 1/A(z) are at z_i, then the poles of 1/A'(z) are at \gamma z_i.
Hence the new bandwidth is given by

B_i' = -\frac{1}{\pi T} \ln |\gamma z_i| = B_i + \Delta B.

For a sampling rate of 8 kHz and a bandwidth expansion of \Delta B = 10 Hz, the expansion
factor is \gamma = e^{-\pi \cdot 10 / 8000} \approx 0.9961.
Common values for bandwidth expansion are 10-15 Hz [86, 7].
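In the coefficient domain, the expansion amounts to a simple geometric weighting of the predictor coefficients. A Python sketch (our own illustration under the conventions above):

```python
import math

def bandwidth_expand(a, delta_b_hz, fs_hz):
    """Replace a_i by a_i * gamma**i, with gamma = exp(-pi*dB/fs).
    This scales every pole radius |z_i| by gamma and so widens each
    formant bandwidth by delta_b_hz."""
    gamma = math.exp(-math.pi * delta_b_hz / fs_hz)
    return [ai * gamma**i for i, ai in enumerate(a)], gamma
```

Evaluating gamma for a 10 Hz expansion at 8 kHz reproduces the expansion factor of about 0.9961 quoted above.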
3.2.3 High Frequency Compensation
The technique of high frequency compensation (HFC), introduced by Atal and Schroeder
[11], compensates for the loss of high frequency components in sampled speech due to
the use of non-ideal anti-aliasing filters before sampling. The autocorrelation matrix
of the low-pass filtered speech is nearly singular (a typical value of the LINPACK
reciprocal condition estimate is 5.0 x ...). This results in a non-unique solution for the
prediction coefficients, since all practical computations are done using finite precision
arithmetic. That is, different sets of predictor coefficients can approximate the speech
spectrum equally well in the passband of the low-pass filter.
The ill-conditioning of the autocorrelation matrix can be avoided by adding to the
autocorrelation matrix another matrix proportional to the autocorrelation matrix of
high-pass filtered white noise. If the autocorrelation matrix of the speech segment
being analyzed is R, the modified autocorrelation matrix, \hat{R}, is obtained as

\hat{R} = R + \lambda \alpha_M R_n,

where \lambda is a small constant between 0.01 and 0.10, R_n is the autocorrelation matrix of
high-pass filtered noise, and \alpha_M is the minimum mean squared value of the prediction
error. High frequency compensation and bandwidth expansion by 10 Hz have been
used in all our computations of prediction parameters,
where we used the same values as Atal [11] for p_k, i.e., p_0 = 3/8, p_1 = -1/4, p_2 = 1/16,
and p_k = 0 for k > 2. The minimum residual energy \alpha_M is computed from the LPC
coefficients a_i, obtained from the uncorrected signal autocorrelation coefficients r_x(i),
as

\alpha_M = \sum_{i=0}^{l} a_i r_x(i)    (3.11)

where l, the order of this predictor, is usually smaller than the actual order, p, of the
predictors being corrected. Examples of spectral envelopes of speech computed for
\lambda = 0 (uncorrected) and \lambda = 0.05 are shown in Fig. 3.1.
Figure 3.1: Spectral envelope of speech without (solid line) and with (dash line) high frequency compensation
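In terms of autocorrelation sequences, the correction simply adds \lambda \alpha_M times the noise autocorrelation p_k to each lag. A Python sketch of this step (our own illustration, using the p_k values quoted above):

```python
def min_residual_energy(a, r):
    """alpha_M = sum_i a_i r_x(i), Eq. (3.11), for a low-order predictor a."""
    return sum(ai * r[i] for i, ai in enumerate(a))

def hf_compensate(r, alpha_m, lam=0.05):
    """Modified autocorrelation r_hat(k) = r(k) + lam*alpha_m*p_k, with
    p_0 = 3/8, p_1 = -1/4, p_2 = 1/16, and p_k = 0 for k > 2 (Atal [11])."""
    p = [3.0 / 8.0, -1.0 / 4.0, 1.0 / 16.0]
    return [rk + lam * alpha_m * (p[k] if k < len(p) else 0.0)
            for k, rk in enumerate(r)]
```

Only the first three lags are modified; higher lags pass through unchanged.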
3.3 Vector Quantization of LPC Parameters
Theoretically, vector quantization should yield better performance than scalar quan-
tization. One of the major problems in applying VQ to encode spectral parameters
is the very large codebook required to achieve a spectral distortion of less than 1 dB.
Codebooks of such sizes are impossible to implement in real time with the present
day technology. Practically, in most applications, it is very difficult to handle a sin-
gle stage codebook larger than 12 bits. Other problems related to the application of
VQ are possible talker dependence of the performance obtained with trained code-
books and sensitivity to transmission errors. Still, VQ has become the quantization
technique of choice for spectral parameters because of its efficiency at low bit rates.
In this section, we present a summary of previous results reported in the literature
on the application of VQ to spectral parameter quantization. Careful judgement is needed
while comparing different VQ simulation results because some of them are closed
test, i.e., in training, and some are open test, i.e., out of training. Another difficulty
is introduced by the fact that different researchers do not report their results in terms
of the same distortion measure and that makes a comparative analysis particularly
difficult. In the present discussion, performances are compared in terms of spectral
distortion (d_SD) expressed in dB, as defined earlier in Eq. (3.1).
Other popular distortion measures (for two unity gain spectra)
used in the literature are
1. Likelihood ratio (with equivalent representations) [60]:

   d_{LR} = \frac{\alpha}{\alpha_p} - 1, \qquad \alpha = \hat{a}^T R_p \hat{a}, \qquad \alpha_p = r_x(0) \prod_{i=1}^{p} (1 - k_i^2),

   where the hat symbol indicates a quantized variable, \alpha is the residual energy for
   the vector of quantized predictor parameters, and \alpha_p is the minimum residual
   energy obtainable using a predictor of order p. The k_i's denote reflection
   coefficients and R_p is the autocorrelation matrix of the speech signal of order p + 1.
   \alpha can also be computed as

   \alpha = \sum_{n=-p}^{p} r_a(n) r_x(n),

   where r_a(n) and r_x(n) are given by

   r_a(n) = \sum_{i=0}^{p-|n|} \hat{a}_i \hat{a}_{i+|n|}, \qquad r_x(n) = \sum_{m} x(m) x(m + |n|).
2. Modified Itakura-Saito distortion [42]:

   d_{IS} = \frac{\alpha}{\alpha_p} - \ln \frac{\alpha}{\alpha_p} - 1.
3. Squared error (Euclidean distance):

   d(x, \hat{x}) = \| x - \hat{x} \|^2 = (x - \hat{x})^T (x - \hat{x}),

   where x and \hat{x} are the unquantized and quantized parameter vectors respectively.
4. Weighted squared distortion:

   d_W(x, \hat{x}) = (x - \hat{x})^T W(x) (x - \hat{x}),

   where x and \hat{x} are the unquantized and quantized parameter vectors as before and
   W(x) is an appropriate weighting function.
It can be shown [50] that for small distortions, the spectral distortion can be
approximated from the likelihood ratio as

d_{SD} \approx 10 \log_{10} e \cdot \sqrt{2 d_{LR}}    (3.24)
       = 6.14 \sqrt{d_{LR}}.    (3.25)
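In practice, d_SD is evaluated by sampling the two log power spectra on a dense frequency grid. The following Python sketch approximates Eq. (3.1) numerically (an illustration under our own grid choice, not the thesis' implementation):

```python
import math

def _log_power(a, w):
    """10*log10 |A(e^{jw})|^2 for the polynomial A(z) = sum_i a_i z^-i."""
    re = sum(ai * math.cos(w * i) for i, ai in enumerate(a))
    im = sum(-ai * math.sin(w * i) for i, ai in enumerate(a))
    return 10.0 * math.log10(re * re + im * im)

def spectral_distortion(a, a_q, ngrid=256):
    """RMS difference, in dB, between the log spectra of 1/|A|^2 and
    1/|A_q|^2, sampled at ngrid frequencies on (0, pi)."""
    acc = 0.0
    for m in range(ngrid):
        w = math.pi * (m + 0.5) / ngrid   # midpoint grid avoids w = 0, pi
        d = _log_power(a_q, w) - _log_power(a, w)
        acc += d * d
    return math.sqrt(acc / ngrid)
```

A gain mismatch of a factor of 2 between otherwise identical filters yields a constant offset of 20 log10(2), or about 6 dB, at every frequency.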
Buzo et al. [17] attempted VQ of spectral parameters by designing a codebook for
r_a, the autocorrelation of the prediction parameters, at the encoder and a corresponding
codebook of reflection coefficients at the decoder. They obtained a spectral distortion
of 1.8 dB (closed test) with a 10-bit full search codebook. It should be borne in mind
that a closed test performance measure can be very misleading and is not the right
way to test codebook performance. Based on a similar approach but using d_LR as the
distortion measure, Juang et al. [60] obtained a spectral distortion of 2.37 dB (closed
test) and 3.35 dB (open test) with a 10-bit codebook. The difference between the
open and closed tests is rather large and indicates that one has to be very careful in
choosing an appropriate training set. Wong et al. [108] reported a spectral distortion
of 2.56 dB (closed test) using a 10-bit codebook of autocorrelation coefficients.
In assessing the advantages of vector quantization over scalar quantization, Juang
et al. [60] made the important observation that vector quantization produces smoother
error spectral transitions than scalar quantization. This comparison helps in
understanding the difference in speech quality obtained using different quantization
techniques. They concluded that since the error spectrum changes smoothly for VQ,
sustained sounds such as vowels will generally be distorted in a similar manner over
consecutive frames. From a perceptual standpoint, such consistent distortion from
vector quantization does not introduce serious extra effects such as warbles due
to frame transitions.
Other than straightforward VQ of spectral parameters various innovations have
been tried by different researchers. Some of these techniques are described below.
3.3.1 Stochastic VQ
In stochastic VQ [6, 95], the LPC parameter vector is quantized using a Gaussian
codebook by transforming the uncorrelated codebook entries into vectors having
correlations similar to those of the LPC parameter vectors. An important advantage of
using a random codebook is that it provides robust performance across different
speakers and speech recording conditions. In practice, the vector to be quantized is
transformed to a vector with uncorrelated components and quantized using a random
Gaussian codebook.
Let a vector x of dimension N be transformed into a vector u with uncorrelated
components by using an orthogonal rotation with an N x N matrix A,

u = A x.    (3.26)

For x with jointly Gaussian components, the optimal rotation A is given by a matrix
whose rows are the normalized eigenvectors of \Gamma_x, the covariance matrix of x. This
is usually referred to as the Karhunen-Loeve Transform (KLT), and it can be applied to
some extent to non-Gaussian sources [75]. The covariance matrix is given by

\Gamma_x = E[(x - \bar{x})(x - \bar{x})^T],    (3.27)

where E(.) is the expectation operator and \bar{x} = E(x). \Gamma_x can be decomposed into

\Gamma_x = V D V^T,    (3.28)

where V is a matrix whose columns are the normalized eigenvectors of \Gamma_x and D is
a diagonal matrix whose elements are the eigenvalues of \Gamma_x. Therefore, the rotated
vector u is given by

u = V^T x.    (3.29)

It can be shown that the covariance matrix of u is the diagonal matrix D, which
means that u has uncorrelated components. In order for the transformed vector u to
have unity covariance matrix and zero mean, the following transformation is used:

u = D^{-1/2} V^T (x - \bar{x}).    (3.30)
In the stochastic vector quantization method, a vector x is quantized using codevectors
chosen from a codebook of zero mean, unity variance Gaussian entries through the
transformation

\hat{x} = \bar{x} + \beta V D^{1/2} u_k.    (3.31)

The scalar \beta is introduced to allow flexibility in matching the powers of x and \hat{x}. The
mean squared error minimized during the codebook search is given by

E_k = \| x - \bar{x} - \beta V D^{1/2} u_k \|^2    (3.32)
    = \| V^T (x - \bar{x}) - \beta D^{1/2} u_k \|^2.    (3.33)

The simplification in the last step is afforded by the fact that V is unitary. The
optimum codebook gain \beta is computed from Eq. (3.33) by setting dE_k / d\beta = 0,
which, after some simplifications, becomes

\beta = \frac{u_k^T D^{1/2} V^T (x - \bar{x})}{u_k^T D u_k}.    (3.34)
Salami et al. [95] computed a long-term covariance matrix \Gamma_x from a large database
of LPC vectors and used the same covariance matrix for all speech signals. They
reported no improvement when \Gamma_x was updated every LPC analysis frame. The
stochastic VQ technique was tried with LSF difference vectors (difference between
the present unquantized and previous quantized vectors) and LAR vectors. They
reported an average d_SD of 0.8 dB for LSFD using 23 bits/vector, and an average d_SD
of 1.0 dB for LAR using 28 bits/vector.
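The codebook search of Eqs. (3.31)-(3.34) can be sketched as below in Python. We assume the KLT factors V (eigenvector matrix, stored here with V[i][j] the j-th eigenvector's i-th component) and D (eigenvalues) have been precomputed offline from the long-term covariance; the function name and data layout are our own illustration:

```python
import math

def stochastic_vq(x, xbar, V, D, codebook):
    """Pick the Gaussian codevector u_k and gain beta minimizing
    ||V^T (x - xbar) - beta * D^(1/2) u_k||^2  (Eqs. 3.32-3.34),
    then reconstruct x_hat = xbar + beta * V * D^(1/2) * u_k (Eq. 3.31)."""
    n = len(x)
    diff = [x[i] - xbar[i] for i in range(n)]
    w = [sum(V[i][j] * diff[i] for i in range(n)) for j in range(n)]  # V^T(x-xbar)
    sqrtD = [math.sqrt(d) for d in D]
    best = None
    for k, u in enumerate(codebook):
        num = sum(w[i] * sqrtD[i] * u[i] for i in range(n))
        den = sum(D[i] * u[i] * u[i] for i in range(n))
        beta = num / den if den > 0 else 0.0
        err = sum((w[i] - beta * sqrtD[i] * u[i]) ** 2 for i in range(n))
        if best is None or err < best[0]:
            best = (err, k, beta)
    _, k, beta = best
    u = codebook[k]
    xhat = [xbar[i] + beta * sum(V[i][j] * sqrtD[j] * u[j] for j in range(n))
            for i in range(n)]
    return xhat, k, beta
```

Because the optimal gain is solved in closed form per candidate, only the shape of each Gaussian codevector is searched.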
3.3.2 Techniques Exploiting Interframe Correlations
Selective encoding of sub-vectors
Papamichalis and Barnwell [88] considered a variable rate quantization of PARCOR
coefficients where parameter subvectors are transmitted depending on the change in
the analyzed parameter vector. They utilized interframe correlation of parameters
and observed that for many sounds, parameters do not change significantly from one
frame to the next. Several consecutive frames were analyzed at once and all possible
sequences of PARCOR coefficient vectors were examined before a selection was made.
They also observed that the leading coefficients were more perceptually significant
than the trailing ones and must be updated with a higher priority. Three different
distortion measures were investigated - (i) spectral distortion, (ii) mean square log
area ratio distance, and (iii) mean square inverse sine distance. In their study, mean
square LAR distortion performed better than spectral distortion. Up to a maximum
of 16 consecutive vectors were examined with a dynamic programming algorithm
in deciding which subvector should be quantized. It was noted that no significant
perceptual improvement was obtained beyond a depth of 6 stages.
Switched-Adaptive Interframe Vector Prediction
Switched-Adaptive Interframe Vector Prediction (SIVP) [111] considers the time
sequence of LPC parameter vectors as a realization of a stochastic vector process. The
correlation between successive time indexed random vectors is modelled using a first
order predictor, whereby an estimate of the n-th parameter vector is written as

\hat{x}_n = A x_{n-1},    (3.35)

where A is a p x p prediction matrix and x_{n-1} is a zero mean vector at time index
(n - 1). If the vector components have non-zero mean, the mean is subtracted
before prediction. The prediction error vector e_n is given by

e_n = x_n - \hat{x}_n = x_n - A x_{n-1}.    (3.36)

The optimum prediction matrix which minimizes the mean squared prediction error
is given by [24]

A = C_{01} C_{11}^{-1},    (3.37)

where

C_{01} = \frac{1}{N} \sum_{n} x_n x_{n-1}^T, \qquad C_{11} = \frac{1}{N} \sum_{n} x_{n-1} x_{n-1}^T,    (3.38)

E(.) is the expectation operator and N is the number of vectors in the training set.
A schematic diagram of the SIVP technique is shown in Fig. 3.2.

Figure 3.2: SIVP coding system

The classifier works on a codebook of instantaneous correlation vectors r_n. For every
input parameter vector, r_n is computed and an appropriate predictor P_i is chosen.
The index of the chosen predictor matrix is transmitted to the receiver
as side information. The error vector e_n can be quantized using scalar or vector
quantization. Young et al. [111] noted that synthetic speech almost indistinguishable
from the original could be achieved with 26 bits/vector using SIVP combined with
scalar quantization. The same speech quality was obtained with 20 bits/vector when VQ
was used following SIVP.
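The estimation of the prediction matrix A and the resulting predictor can be sketched as below, using a 2-dimensional toy case in Python so the 2x2 inverse can be written out explicitly (a real system would use, e.g., 10-dimensional LSF vectors and a switched set of predictors; the function names are ours):

```python
def sivp_predictor(frames):
    """Estimate the first-order interframe predictor A = C01 * C11^(-1)
    (Eqs. 3.37-3.38) from a training sequence of 2-D parameter vectors;
    the common 1/N factor cancels in the product."""
    def outer_sum(xs, ys):
        return [[sum(x[i] * y[j] for x, y in zip(xs, ys)) for j in range(2)]
                for i in range(2)]
    cur, prev = frames[1:], frames[:-1]
    c01 = outer_sum(cur, prev)       # sum of x_n x_{n-1}^T
    c11 = outer_sum(prev, prev)      # sum of x_{n-1} x_{n-1}^T
    det = c11[0][0] * c11[1][1] - c11[0][1] * c11[1][0]
    inv = [[ c11[1][1] / det, -c11[0][1] / det],
           [-c11[1][0] / det,  c11[0][0] / det]]
    return [[sum(c01[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def predict(A, x_prev):
    """x_hat_n = A x_{n-1}, Eq. (3.35)."""
    return [sum(A[i][j] * x_prev[j] for j in range(2)) for i in range(2)]
```

On a sequence that actually obeys a first-order vector recursion, the estimator recovers the generating matrix and the prediction error e_n vanishes.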
Tree-Searched VQ with Interblock Noiseless Coding
Tree-Searched VQ (a constrained VQ technique described in the next section) with
interblock noiseless coding (TSVQ-IBNC) [84, 90] uses a tree search VQ to exploit
the correlation between vector components (intraframe correlation) and interblock
noiseless coding to exploit the correlation between successive vectors (interframe
correlation). Phamdo and Farvardin [90] designed a tree search VQ where the encoder is
implemented by a tree-searched algorithm, as shown in Fig. 3.3 for a 3-level codebook.
Here, c_i^j is the codevector associated with the i-th node at the j-th level of the tree.
Figure 3.3: A tree-searched VQ for m = 3
Initially the encoder compares the source vector, x, with the two codevectors at
the first level of the tree and, depending on the outcome, advances to one of the
two nodes at that level. It then compares x with the two codevectors accessible
from the present node and advances to the nearest (lower distortion) node. The
process continues till the last level is reached, where the final selection of the
codevector is made. For an m-level TSVQ, the m-bit binary codeword is formed by
the path taken through the tree.

For vectors with little difference, as is the case for successive LSF vectors (because
of interframe correlation), the paths taken through the encoding tree are very similar.
In fact, the path map associated with the codevector of a frame has a sizeable prefix
in common with the path map of its predecessor. Let the length of the greatest
common prefix between two adjacent frames be represented by a random variable K.
Then K = k implies that the codeword of the present frame has k consecutive
bits (in the most significant places) in common with the codeword of the previous
frame, 0 <= k <= m. When k = m, the two codewords coincide exactly. In IBNC, the
value of k is provided to the decoder along with the remaining (suffix) bits, since the
first k bits can be obtained from the previous decoded codeword. Moreover, the
(k + 1)-th bit can be obtained by taking the complement of the (k + 1)-th bit in
the previous codeword. Hence only m - k - 1 bits are required for encoding the suffix
(except when k = m, when no bits are required).
Phamdo and Farvardin [90] used a Huffman code to encode the value of k and
a 13-level TSVQ to encode the LSF vectors. For a frame rate of 100 Hz
(10 ms period), an average spectral distortion of about 4 dB was obtained with 9.34
bits/frame; a spectral distortion of about 3.5 dB was obtained with about 11.4
bits/frame for a 22.5 ms frame period. Scalar quantization of the TSVQ-IBNC error
was done to achieve a spectral distortion of less than 1 dB; about 24.5 bits/frame
were required for 10 ms frames and about 26 bits/frame were necessary for a 22.5 ms
frame.
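The prefix bookkeeping itself is simple; what costs bits is signalling k, which is why a Huffman code over k is used. A Python sketch of the prefix/suffix logic (our own illustration, with path maps as lists of bits):

```python
def ibnc_encode(path, prev_path):
    """Interblock noiseless coding of an m-bit TSVQ path map: send k
    (length of the common prefix with the previous frame's path) plus
    the m-k-1 suffix bits after the first differing bit, which is
    implied (it is the complement of the previous frame's bit)."""
    m = len(path)
    k = 0
    while k < m and path[k] == prev_path[k]:
        k += 1
    suffix = path[k + 1:] if k < m else []
    return k, suffix

def ibnc_decode(k, suffix, prev_path, m):
    """Rebuild the current path map from k, the suffix bits, and the
    previously decoded path map."""
    if k == m:
        return list(prev_path)
    return list(prev_path[:k]) + [1 - prev_path[k]] + list(suffix)
```

When two successive frames share a long prefix, only a handful of suffix bits need to be transmitted.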
3.4 Constrained (suboptimal) VQ
Vector quantization is a very powerful quantization technique and we have already
mentioned (see Appendix B) that no quantizer can outperform a VQ. However, there are
significant computational and storage costs associated with a VQ. For a quantizer of
dimension k with resolution r bits per vector component, the codebook has

N = 2^{rk}

code vectors. The memory required to store the codebook, as well as the computation
required to search the codebook with an additive distortion measure, is proportional
to

kN = k 2^{rk},

and grows exponentially with both r and k.
In the case of quantization of LPC parameters for speech coding, using a 10-th
order LPC model and 24 bits per vector, the codebook needs to have 2^24, or roughly
16.7 million, vectors. The computational and storage complexity involved is very
large. Since this is the order of resolution required in most practical applications
of VQ to LPC quantization, it is imperative that a suboptimal solution be used.
Traditionally, the suboptimal solution is a VQ that is constrained structurally or
otherwise to reduce the computation and storage requirements to a tractable size.
There are many suboptimal VQ techniques one might consider for quantizing LPC
parameters. Only those that lead to a fixed rate coder are considered here. Variable
rate VQ techniques like pruned tree-structured VQ and entropy constrained VQ were
not considered in this work. Also, only memoryless VQ techniques are considered in
this study.
Some of the suboptimal VQ techniques are
i) tree structured VQ [17, 75, 42],

ii) classified VQ [42],

iii) product code VQ [94, 17, 42],

iv) basis vector VQ [44, 43],

v) multi-stage VQ [17, 42], and

vi) partitioned VQ (split VQ) [42, 86].
A brief description of the above techniques is presented below. A good review of
different constrained VQ techniques can be found in [42].
3.4.1 Tree Structured VQ
Tree structured VQ is a very effective way to reduce search complexity in vector
quantization, but the price is paid in terms of a large storage complexity and some
performance degradation. The search is performed in stages, and in each stage a large
number of codevectors is eliminated from the search. The structure of a tree structured
codebook is shown in Fig. 3.4.

The encoder first searches the root codebook C* and finds the minimum distortion
test vector (code vector). For a balanced m-ary tree, the resulting index i indicates
the codebook C_i to search in the next stage. The search of codebook C_i
yields the next stage index j. Assuming the next stage is the last stage (as shown
in Fig. 3.4), a search of codebook C_{i,j} produces the code vector that is the quantizer
output.
The decoder does not need to have the test vectors and is identical to a conventional
VQ. However, if a progressively better approximation is sought where the quantized
vector is desired to be updated after each stage is searched, the complete sequence
of indices may be transmitted and in this case the decoder will need to have all the
stage codebooks as well.
An m-ary tree with d stages is said to have breadth m and depth d. If the
codebook size is N = m^d, then only md distance computations are required, instead
of the m^d distance computations needed with an unstructured codebook.

Although the search complexity is quite low, the storage complexity is high compared
to an unstructured codebook. In addition to storing the m^d code vectors, the
test vectors for each stage of the tree must be stored (at least in the encoder). The
number of nodes in stage k is m^{k-1}, hence the total number of nodes is

\sum_{k=1}^{d} m^{k-1} = \frac{m^d - 1}{m - 1}.

Since each node stores m test or code vectors, the total number of vectors to be stored
is

m \cdot \frac{m^d - 1}{m - 1}.

For a binary tree, the search complexity is reduced by a factor of 2^d / (2d), while the
storage complexity is 2(2^d - 1) vectors - slightly less than double the storage required
Figure 3.4: A tree structured VQ
for an unstructured VQ of the same size. We already know that a large number (> 2^20)
of code vectors is needed to quantize LPC vectors using an unstructured codebook.
Use of a tree structured codebook would lead to a further increase in the storage
complexity of the VQ, and is not an attractive alternative to consider.
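The staged search described above can be sketched as a greedy walk down a binary tree. In the Python illustration below (our own data layout, not that of any cited coder), each internal node stores two test vectors and two children, and each leaf is a code vector:

```python
def tsvq_encode(x, tree):
    """Greedy tree-searched VQ: descend the tree, at each node taking the
    branch whose test vector is nearer to x. Internal nodes are dicts
    {"test": [t0, t1], "child": [c0, c1]}; leaves are code vectors.
    Returns the binary path map and the selected code vector."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    path, node = [], tree
    while isinstance(node, dict):
        i = 0 if dist(x, node["test"][0]) <= dist(x, node["test"][1]) else 1
        path.append(i)
        node = node["child"][i]
    return path, node

# Example: a depth-2 binary tree over 1-D "vectors" with four code vectors.
example_tree = {
    "test": [[-1.0], [1.0]],
    "child": [
        {"test": [[-1.5], [-0.5]], "child": [[-1.5], [-0.5]]},
        {"test": [[0.5], [1.5]], "child": [[0.5], [1.5]]},
    ],
}
```

Note that the greedy descent need not find the globally nearest code vector, which is the performance degradation mentioned above.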
3.4.2 Classified VQ
Classified VQ (Fig. 3.5) is similar to a two stage tree structured VQ where the first
stage of the tree structured VQ is replaced by a classifier. The classifier produces a
codebook index for the codebook to be used and a search of the selected codebook
produces a codevector index. Both of these indices are transmitted to the decoder
where the code vector representing the quantized vector is retrieved from the indicated
codebook.
Figure 3.5: Classified VQ
The classifier is designed to partition the input space according to some statistic
of the signal being quantized, and it may not always be easy to determine the best
way to design a classified VQ. The codebooks can have different sizes depending on
which class they represent and the tolerable distortion for that class.

The storage complexity of classified VQ is at best the same as that of an
unstructured VQ, since the codebooks C_1 to C_L must include the same code vectors at
least once to have the same reproduction alphabet as an unstructured VQ. Search
complexity is reduced by a factor of L only, and there are no good ways of classifying
LPC vectors into L classes when L is larger than about 3 to 6.
3.4.3 Product Code VQ
A product code VQ is a collection of quantizers each of which is applied to a feature
vector derived from the vector to be quantized. The features are defined in such a
way that the collection of all the features completely defines the vector.
Given a vector x of dimension k > 1, let f_i denote the function that extracts the
i-th feature. That is,

\phi_i = f_i(x), \quad i = 1, 2, \ldots, N_f,    (3.42)

where \phi_i is the i-th feature and N_f is the number of features being extracted. Then
it should be possible to define a reconstruction function r(.) such that

x = r(\phi_1, \phi_2, \ldots, \phi_{N_f}).    (3.43)

In product code VQ, a separate quantizer is designed for each of the \phi_i's.
Each feature vector could be easier to quantize because it takes on values in a more
compact region of k-dimensional space or has a lower dimensionality. If the features
could be defined such that they are independent of each other, the coding complexity
can be greatly reduced without any performance penalty.
The quantizers are in general dependent on each other, i.e., the reproduction value
for one feature vector depends on the reproduction values of other feature vectors. If
so desired, independent quantizers can be used for some (or all) of the feature vectors
with some degradation in performance.
Two common product codes are shape-gain VQ [94, 42] and mean-removed VQ
(or mean-residual VQ) [42]. Shape-gain VQ was first used by Sabin and Gray [94],
where the gain and spectral shape of the short term filter were jointly quantized using
one scalar and one vector quantizer. They reported an improvement in performance
compared to the gain separated VQ (another product code with independent
quantizers) introduced by Buzo et al. [17]. In mean-removed VQ, the mean of the vector
elements is subtracted from each element and the mean and the residual vector are
quantized together. This is equivalent to removing the DC bias from a signal and
making it zero-mean. Both mean-removed VQ and shape-gain VQ decompose the
input vector into one scalar and one vector component.
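A mean-removed product code can be sketched as below; this is an illustrative sketch under assumed names (the scalar quantizer levels and toy shape codebook are not from the thesis):

```python
import numpy as np

def mean_removed_encode(x, mean_levels, shape_codebook):
    # Product code: a scalar quantizer for the vector mean and a
    # vector quantizer for the zero-mean residual ("shape").
    m = x.mean()
    im = int(np.argmin((mean_levels - m) ** 2))      # scalar feature
    r = x - m                                        # zero-mean residual
    ir = int(np.argmin(np.sum((shape_codebook - r) ** 2, axis=1)))
    return im, ir

def mean_removed_decode(im, ir, mean_levels, shape_codebook):
    # Reconstruction function r(.): add the quantized mean back.
    return mean_levels[im] + shape_codebook[ir]

mean_levels = np.linspace(-1.0, 1.0, 16)             # 4-bit mean quantizer
shape_codebook = np.array([[0.5, -0.5, 0.0], [0.0, 0.3, -0.3]])
im, ir = mean_removed_encode(np.array([1.1, 0.2, 0.5]),
                             mean_levels, shape_codebook)
xhat = mean_removed_decode(im, ir, mean_levels, shape_codebook)
```

The two features (mean and shape) are quantized independently here; a jointly searched version, as in shape-gain VQ, would pick the pair minimizing the overall distortion.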
For quantizing LPC coefficients, one still needs a large codebook of vectors even
after removing the gain or the mean from the LPC vector. So, it is not expected to
lead to significant reduction in search and storage complexity to make it an attractive
technique for LPC quantization.
3.4.4 Basis Vector VQ
A basis vector VQ has a codebook where all vectors in the codebook are linear
combinations of a smaller number of basis vectors:

    c_j = Σ_{i=1}^{M} b_i^{(j)} v_i,   j = 1, ..., N,

where the v_i's are the basis vectors and the b_i^{(j)}'s are scalar coefficients.
A particularly interesting basis vector codebook is used in the VSELP coder (see
Chapter 2, page 21 for a brief description) where the linear combination coefficients
(the b_i's) are restricted to the values -1 and +1. This leads to an enormous reduction
in the search complexity of the codebook. The storage complexity is also very low as
only the basis vectors need to be stored.
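The storage saving can be illustrated with a brute-force sketch of such a codebook (illustrative only; VSELP itself uses a much faster update-based search, and the toy basis below is an assumption):

```python
import numpy as np
from itertools import product

def basis_vq_encode(x, basis):
    # All 2**M code vectors are sums sum_i b_i v_i with b_i in {-1, +1};
    # only the M basis vectors are stored.  Brute-force search sketch:
    # the point here is the storage saving, not the search itself.
    best_d, best_signs = None, None
    for signs in product((-1.0, 1.0), repeat=len(basis)):
        c = np.dot(signs, basis)            # candidate code vector
        d = float(np.sum((x - c) ** 2))
        if best_d is None or d < best_d:
            best_d, best_signs = d, signs
    return best_signs, best_d

basis = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # M = 3 -> 8 code vectors
signs, d = basis_vq_encode(np.array([1.4, -0.6]), basis)
```

Storing M basis vectors in place of 2^M code vectors is what makes the scheme attractive despite the constrained reproduction alphabet.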
Basis Vector Design
The error function minimized in a basis vector VQ can be written as

    D = Σ_n (x_n - c_n)^T (x_n - c_n),

where M is the number of basis vectors and c_n = Σ_{i=1}^{M} b_i^{(n)} v_i is the
chosen code vector. Minimizing with respect to each v_m leads to a set of coupled
linear equations, which are solved compactly by writing each code vector in terms of
a selection matrix, B_n, and a column vector of stacked basis vectors V:

    c_n = B_n V.

When this is substituted in the expression for distortion over the entire training set
and the total distortion is minimized with respect to V, the optimal basis vectors for
the given training set are obtained as

    V = ( Σ_n B_n^T B_n )^{-1} ( Σ_n B_n^T x_n ).
Performance of Basis Vector VQ

To determine the suitability of basis vector VQ for LPC quantization, a basis vector
VQ with a resolution of 10 bits/vector was designed using 20000 LAR vectors obtained
from recordings from FM radio stations.
As mentioned earlier, the coefficients b_i are restricted to take values from
the set {-1, +1} in VSELP. In general, if the coefficients are restricted to two values,
they need not be equal to -1 and +1 but could be any set of two scalars known to
the decoder. It may be pointed out here that only the ratio of the two numbers in the
set is significant. It is easy to show that the sets {a, b} and {ka, kb} are equivalent in
the sense that the constant k can be absorbed in the gain term.
The performance was measured on 3987 test vectors outside the training set, and a
minimum average spectral distortion of 3.86 dB was obtained with the coefficient set
{-0.7, 1}, compared to 3.61 dB for a full search unstructured VQ.
The performance of basis vector VQ was also measured under channel transmission
errors by simulating uniformly distributed random bit errors at error rates of 1% and
5%. The results are given in Table 3.2. The performance of basis vector VQ under
channel error conditions is seen to be very poor and it was not studied further.

    Error Probability p_e | Full Search VQ SD (dB) | Basis Vector VQ SD (dB)
    0.00                  | 3.61                   | 3.86

Table 3.2: Channel error performance of Basis Vector VQ
3.4.5 Multi-Stage VQ
Multi-Stage VQ is the main subject of study in this thesis and is discussed in detail in
Chapter 4. We just mention here that in multi-stage VQ, each reproduction vector y_i
is obtained by summing up one code vector from each stage of a multi-stage codebook:

    y_i = Σ_{k=1}^{L} c_{j_k}^{(k)},

where L is the number of stages and c_j^{(k)} is the j-th code vector from the k-th
stage codebook.
3.4.6 Partitioned VQ (Split VQ)
In partitioned VQ, the parameter vector to be quantized is partitioned into a number
of subvectors of fixed, predetermined dimensions, and each subvector is coded with
an independent VQ. Let x = [x_1, x_2, ..., x_p]^T be a parameter vector of dimension p
to be quantized using a split VQ scheme. Then x is partitioned as

    x = [x_1^T, x_2^T, ..., x_L^T]^T,

where x_i is a subvector of dimension l_i such that

    Σ_{i=1}^{L} l_i = p.

The scheme is shown in Fig. 3.6. The partitioning of the vector is equivalent to using
a product VQ where each split VQ codeword is a vector in R^{l_1} × R^{l_2} × ... × R^{l_L}.

Paliwal and Atal [86] used split VQ on LSF vectors and obtained a spectral
distortion of 1.03 dB using 24 bits/vector and a weighted Euclidean distortion measure.
The spectral distortion was 1.19 dB for the same code rate when a simple Euclidean
distortion measure was used. It is easy to show that split VQ is a particular case of
Figure 3.6: The Split VQ Scheme
Multi-Stage VQ. To demonstrate it by example, let the vector x of dimension p be
partitioned into two sub-vectors, x_1 of dimension l_1 and x_2 of dimension l_2, such
that l_1 + l_2 = p:

    x = [x_1^T, x_2^T]^T.

The vector x_1 is quantized to x̂_1 with a codebook C_1 of dimension l_1 and the vector
x_2 is quantized to x̂_2 with a codebook C_2 of dimension l_2. If c_j^1 is the chosen code
vector from codebook C_1 and c_k^2 is the code vector chosen from codebook C_2, the
quantized vector x̂ is given by

    x̂ = [ (c_j^1)^T, (c_k^2)^T ]^T.
The same reproduction alphabet can be achieved by a multi-stage codebook in
which each stage codebook is derived from the corresponding split VQ codebook by
extending each stage code-vector to dimension p by adding 0 elements at appropriate
positions. For the current example, the equivalent multi-stage codebooks can be
derived as

    C'_1 = { [ (c_i^1)^T, 0^T ]^T ; i = 1, ..., N_1 }
    C'_2 = { [ 0^T, (c_i^2)^T ]^T ; i = 1, ..., N_2 }

where 0 denotes an all-zero vector of the appropriate dimension, and N_1 and N_2 are
the number of codevectors in codebooks C_1 and C_2 respectively. Now, the quantized
vector x̂ can be written as

    x̂ = [ (c_j^1)^T, 0^T ]^T + [ 0^T, (c_k^2)^T ]^T.
This shows that partitioned VQ is merely a constrained version of multi-stage VQ
where some elements of each stage code-vector are forced to be zeros.

Although partitioned VQ is a constrained version of the already suboptimal
multi-stage VQ, an important advantage of partitioned VQ is that each codebook is
constrained into subspaces that are orthogonal to each other. This makes the
quantization error (measured in squared Euclidean distance) additive with respect to the
quantization errors from each partitioned vector:

    ||x - x̂||^2 = Σ_{i=1}^{L} ||x_i - x̂_i||^2.
This important property allows each partitioned codebook to be searched
independently, and it is guaranteed that the best reproduction vector within the reproduction
alphabet will be found irrespective of the order in which the search is made. In this
sense, the search procedure for a partitioned VQ is always optimal, since a sequential
search of all partition codebooks is equivalent to an exhaustive search of the codebook.
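The independent, additive-error search of the partitions can be sketched as follows (a minimal illustrative sketch with assumed function names and toy sub-codebooks, not the thesis's code):

```python
import numpy as np

def split_vq_encode(x, sub_codebooks):
    # Each sub-codebook is searched independently; since the partitions
    # occupy orthogonal subspaces, the squared errors simply add, and
    # searching the parts in any order yields the overall best code word.
    x = np.asarray(x, dtype=float)
    indices, err, start = [], 0.0, 0
    for cb in sub_codebooks:                     # cb is an (N_i, l_i) array
        li = cb.shape[1]
        d = np.sum((cb - x[start:start + li]) ** 2, axis=1)
        j = int(np.argmin(d))
        indices.append(j)
        err += float(d[j])                       # additive error property
        start += li
    return indices, err

cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])         # quantizes x_1 (dim 2)
cb2 = np.array([[2.0], [4.0]])                   # quantizes x_2 (dim 1)
idx, err = split_vq_encode([0.9, 1.2, 3.5], [cb1, cb2])
```

The total search cost is the sum of the sub-codebook sizes rather than their product, which is the practical appeal of the split structure.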
Since multi-stage VQ is less suboptimal than split VQ, it is expected that a lower
distortion would be obtained with a multi-stage VQ compared to a split VQ at the
same code rate.
Chapter 4
Multi-Stage VQ of LPC
Parameters
There are several forms of constrained vector quantization techniques, as mentioned
in Chapter 3, but multi-stage VQ (MSVQ) has several advantages. MSVQ is simple
to implement, and efficient search strategies can be found, as will be described here,
that reduce the search complexity appreciably without sacrificing much performance
compared to a full search, unstructured VQ.
A multi-stage VQ consists of a set of triples

    (C_i, Q_i, P_i),   i = 1, ..., L,

where L is the number of stages,

    C_i = { c_{j_i}^{(i)} ; j_i = 1, ..., N_i }

is the i-th stage codebook, Q_i is the mapping used with the i-th stage codebook, and

    P_i = { S_{j_i} ; j_i = 1, ..., N_i }

is the corresponding partition of R^n such that Q_i(x) = c_{j_i}^{(i)} if and only if
x ∈ S_{j_i}. The number of code vectors in C_i, which equals the number of cells in P_i,
is denoted by N_i. The code vectors comprising the codebook C_i and the cells
comprising the partition P_i are indexed with the subscript j_i, where j_i is a member of
the i-th index set J_i = {1, 2, ..., N_i}. In practice, each Q_i(·) is realized as a
composition of an encoder mapping E_i(·) and a decoder mapping D_i(·), viz.,
Q_i(x) = D_i(E_i(x)). The i-th encoder mapping E_i : R^n → J_i is defined as
E_i(x) = j_i if and only if x ∈ S_{j_i}.

For each source vector, the indices produced by the encoder mappings of each stage
are concatenated to form an index L-tuple

    j^L = (j_1, j_2, ..., j_L).

Each L-tuple j^L is a product code word and is an element of the Cartesian product of
the stage-wise index sets,

    j^L ∈ J_1 × J_2 × ... × J_L.

The decoder parses the received L-tuple code word and the decoder mappings
D_i : J_i → C_i recover from each stage-wise index j_i the corresponding code vector
c_{j_i}^{(i)}. The quantized representation x̂ of the input source vector x is formed by
summing up exactly one vector from each codebook,

    x̂ = Σ_{k=1}^{L} c_{j_k}^{(k)}.                        (4.1)

Here c_n^{(k)} is the n-th code vector from the k-th stage. The size of the reproduction
alphabet in an MSVQ is

    N = Π_{i=1}^{L} N_i.                                  (4.2)
Usually, the number of code vectors in each stage is an integral power of 2,

    N_i = 2^{r_i},

where r_i is the resolution of the i-th stage in bits/vector. We will indicate the structure
of an MSVQ by mentioning the resolution of each stage starting from the lowest order
stage to the highest order stage. The parameter representation used will also be
mentioned. Hence a codebook will be named as

    ⟨parameter⟩-r_1 × k_1 + r_2 × k_2 + ... + r_m × k_m,

where k_1 + k_2 + ... + k_m = L, to indicate that it has k_1 stages having 2^{r_1} code
vectors each, followed by k_2 stages having 2^{r_2} code vectors each and so on, the
total number of stages being L. If any
of the k_i's are equal to 1, it will be omitted for brevity. For example, the codebook
LSF-12+12 is a two stage codebook, each stage having a resolution of 12 bits/vector,
i.e., each stage has 4096 code vectors, and the vectors are Line Spectral Frequencies (LSF).
Similarly, the codebook LAR-4+6+2x4 is a 6-stage codebook of LAR vectors
with the first stage having 16 code vectors, the second stage having 64 code vectors and
the subsequent four stages having 4 code vectors each. The total number of bits, R,
required to quantize a vector using a given codebook can easily be found by computing
the sum indicated in the name of the codebook:

    R = r_1 × k_1 + r_2 × k_2 + ... + r_m × k_m.
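The naming convention can be decoded mechanically; the following helper is an illustrative sketch (the function name is an assumption, not the thesis's notation):

```python
def codebook_bits(name):
    # Parse a codebook name of the form <rep>-r1xk1 + r2xk2 + ...,
    # where a bare term r stands for a single stage of r bits (k = 1).
    terms = name.split('-', 1)[1].split('+')
    R = stages = 0
    for t in terms:
        r, _, k = t.partition('x')
        r, k = int(r), int(k) if k else 1
        R += r * k              # R = r1*k1 + r2*k2 + ...
        stages += k             # total number of stages L
    return R, stages
```

For the two examples in the text, LSF-12+12 gives R = 24 bits over 2 stages, and LAR-4+6+2x4 gives R = 4 + 6 + 2×4 = 18 bits over 6 stages.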
In an unstructured VQ, each reproduction vector can be placed anywhere in the
input sample space independent of other reproduction vectors, but the MSVQ is
structurally constrained and the reproduction vectors are not all independent of each
other. It is worthwhile to take a look at the MSVQ structure as it provides insight for
choosing a proper search algorithm. For a two stage MSVQ, the 2nd stage vectors are
added to each of the first stage vectors (Fig. 4.1) to form the reproduction vectors.
Hence, the pattern of reproduction vectors around each of the vectors in the first stage
is the same and the input space is filled with a repeating pattern of a set of vectors.
In other words, the higher order codebook defines a tile, and placement of the tile is
governed by the next lower order codebook. This is also evident if we write the set of
reproduction vectors Y = { y_i ; i = 1, ..., N } as

    Y = { c_j^{(1)} + c_k^{(2)} ; j = 1, ..., N_1, k = 1, ..., N_2 },

where N = N_1 N_2. In the case of codebooks with more than two stages, the codebook
for the highest stage defines the smallest tile. When these tiles are placed according to
the vectors in the next lower order codebook, a larger tile is obtained which is placed
according to the vectors of the next lower order codebook and so on. The difference
between this tiling of input space by the reconstruction vectors and traditional tiling
of space is that in case of MSVQ, the tiles can overlap each other and they do not fill
up the entire space.
Traditionally, a multi-stage VQ is searched in a sequential manner, the basic idea
being that the sum at each stage provides a closer approximation to the input vector
over the sum at the previous stage. The quantization process using a multi-stage
(a) First stage codevectors (b) Second stage codevectors
(c) The final reproduction alphabet
Figure 4.1: Structure of a two-stage two dimensional VQ
Figure 4.2: A sequentially searched multi-stage VQ
VQ with sequential search is shown in Fig. 4.2. The quantizer (C_1, Q_1, P_1) quantizes
the source vector x^1 = x, and (C_{p+1}, Q_{p+1}, P_{p+1}) quantizes the error vector
(also called the residual vector)

    x^{p+1} = x^p - Q_p(x^p)

from the preceding stage (C_p, Q_p, P_p), for 1 ≤ p < L. The vectors x^p and the
quantizer mappings Q_p(·) are related according to

    x = Σ_{p=1}^{L} Q_p(x^p) + x^{L+1},

where x^{L+1} is the error from the last stage as well as the total error from all stages.
The quantized vector, x̂, is given by

    x̂ = Σ_{p=1}^{L} Q_p(x^p).
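The sequential quantization process described above can be sketched as follows (an illustrative sketch with assumed function names and toy two-stage codebooks):

```python
import numpy as np

def msvq_sequential_encode(x, stage_codebooks):
    # Sequential search: stage p quantizes the residual x^p left by the
    # previous stages, x^{p+1} = x^p - Q_p(x^p).
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in stage_codebooks:                         # cb is an (N_i, p) array
        j = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(j)
        residual = residual - cb[j]                    # error passed on
    return indices, residual                           # residual = x^{L+1}

def msvq_decode(indices, stage_codebooks):
    # xhat is the sum of exactly one code vector from each stage.
    return sum(cb[j] for j, cb in zip(indices, stage_codebooks))

stage1 = np.array([[0.0, 0.0], [4.0, 4.0]])            # coarse placement
stage2 = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]])
idx, err = msvq_sequential_encode([3.8, 4.4], [stage1, stage2])
xhat = msvq_decode(idx, [stage1, stage2])
```

By construction, x̂ plus the final residual reproduces the input exactly, which mirrors the relation between x, the Q_p mappings, and x^{L+1} above.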
It is quite obvious that this procedure is suboptimal and that the best reconstruction
vector will not be chosen all the time. Several examples of this can be found
in Fig. 4.1.
The performance of a sequentially searched MSVQ is suboptimal due to three
reasons.
i) The codebook structure is constrained and is not flexible enough to pro-
duce an optimal set of reproduction vectors for a given input pdf (or
training set). This restriction comes solely from the structure and does
not depend on the availability of a suitable technique to compute optimal
reconstruction vectors. In other words, if a suitable technique was avail-
able to compute a set of optimum reconstruction vectors for a given input
distribution and output alphabet size, we would not, in general, be able
to design a set of MSVQ codebooks that will result in the same set of
reconstruction vectors.
ii) The sequential search algorithm is suboptimal and in many cases fails to
find the best reproduction vector for a given input vector and a multi-
stage codebook. This is particularly true if the number of stages is large
(≥ 3).
iii) The sequential design procedure that is commonly used in designing an
MSVQ is suboptimal. In other words, most of the time there exists
an MSVQ with the same alphabet size that will perform better than a
sequentially designed one for the same input distribution.
It is believed that the following requirements need to be met to achieve transparent
quantization of LPC parameters [86]:
1. the average spectral distortion should be less than 1 dB,
2. the number of outliers with spectral distortion of 2 dB or more should be less
than 2 percent,
3. there should be no outlier with spectral distortion above 4 dB.
The fact that the split VQ was the first multi-stage VQ to achieve this goal using the
smallest number of bits (24 bits) at that time clearly shows a lack of understanding of
multi-stage structures, since a less constrained multi-stage structure should perform
better than a split VQ. The main reason why split VQ outperformed a less constrained
MSVQ is the use of sequential search, which is inadequate for the MSVQ structure.
4.1 Suboptimality of Sequential Search
The suboptimality of sequential search can be readily seen in Fig. 4.3 where partitions
of the input space has been shown for a two-stage MSVQ. The first stage codevectors
¹A parameter is said to be quantized transparently when the speech produced by a coder using quantized parameters is perceptually indistinguishable from that produced using unquantized parameters.
Figure 4.3: Voronoi regions for a two-stage MSVQ
are marked with a solid square, the second stage codevectors are shown as arrows, and
the reproduction alphabet is shown with solid circles (•) and crosses (×). Reproduction
vectors with a common parent or predecessor are marked with the same symbol.
The two first stage codevectors partition the space into two regions marked R_1^{(1)}
and R_2^{(1)}, and the final reproduction vectors partition the space into regions marked
R_{ij}^{(2)}, where i, j = 1, 2. The regions corresponding to the reproduction vectors
having a common predecessor are also shown with the same shade. All regions have
been drawn assuming the nearest neighbour rule. It can be seen that for a sequential
search, all input vectors in R_1^{(1)} will be quantized to one of the reproduction vectors
in the dark shaded regions. Therefore, all input vectors in the white shaded region
of R_1^{(1)} will be mapped to a suboptimal reproduction vector. It is easy to see that all
input vectors in the white shaded regions in R_1^{(1)} and all input vectors in the dark
shaded regions of R_2^{(1)} will be quantized to a suboptimal reproduction vector. In fact
the conditions under which sequential search is optimal are quite severe, as shown in
the next section.
4.1.1 Optimality conditions for sequential search
Before stating the optimality conditions for a sequential search, we define a quantity
called the predecessor of a reproduction vector.

Definition 4.1 Let y_j be a reproduction vector for an L-stage VQ, such that

    y_j = c_{l_1}^{(1)} + c_{l_2}^{(2)} + ... + c_{l_L}^{(L)},

where l_i is the index of the codevector from the i-th codebook C^{(i)}. The k-th
predecessor, y_j^k, of y_j is defined by

    y_j^k = c_{l_1}^{(1)} + c_{l_2}^{(2)} + ... + c_{l_k}^{(k)}.

The null vector 0 can be considered the 0-th predecessor of all reproduction vectors.
It should be noted that more than one reproduction vector can have the same set of
lower order predecessors and the highest order predecessor is the reproduction vector
itself. Now we can state the optimality conditions for sequential search.
Theorem 4.1 The necessary and sufficient condition that a multi-stage codebook is
optimally searched using a sequential search procedure is that for all reproduction
vectors y_j, j = 1, ..., N, and any input vector x,

    d(x, y_j) < d(x, y_i)  ⟹  d(x, y_j^k) < d(x, y_i^k),   ∀k, ∀y_i^k ≠ y_j^k,

where y_j^k is the k-th predecessor of y_j.
Proof: We prove the sufficiency condition first. Given an L-stage MSVQ,
C = { C^{(i)} ; i = 1, ..., L }, for which

    d(x, y_j) < d(x, y_i)  ⟹  d(x, y_j^k) < d(x, y_i^k),   ∀k, ∀y_i^k ≠ y_j^k   (4.10)

for any input vector x, suppose that sequential search is not optimal.
That is, there exists a y_m such that

    d(x, y_m) < d(x, y_n),                                                        (4.11)

where y_n is the reproduction vector found through sequential search. Let the vectors
y_m and y_n have common predecessors up to stage p - 1, where p = 1 for no common
predecessor. Since y_n is found through a sequential search,

    d(x, y_n^p) ≤ d(x, y_m^p).                                                    (4.12)

But from Eq. (4.10) and Eq. (4.11),

    d(x, y_m^k) < d(x, y_n^k),   ∀k, ∀y_n^k ≠ y_m^k.

This clearly contradicts Eq. (4.12) for k = p. Hence, no such y_m exists that
satisfies Eq. (4.11); in other words, sequential search is optimal for this codebook.
This proves the sufficiency condition.
Now we prove the necessary condition. Given an L-stage MSVQ, C = { C^{(i)} ; i =
1, ..., L }, and the fact that sequential search is optimal for this codebook, let y_n be the
nearest reproduction vector for a given input vector x. That is, Q(x) = y_n, and hence

    d(x, y_n) ≤ d(x, y_j),   ∀j.

Assume that the condition of the theorem does not hold for this codebook.
Therefore, there exists a y_m such that

    d(x, y_m^k) < d(x, y_n^k)                                                     (4.17)

for some k, with y_m^k ≠ y_n^k.

From the fact that the codebook was searched sequentially, and Eq. (4.17), the
k-th predecessor of y_m will be chosen over the k-th predecessor of y_n while searching
stage k. Since the k-th predecessor y_n^k will be discarded, none of its successors can be
chosen in a sequential search. This contradicts our original assumption that y_n was
found through sequential search. This proves the necessary condition.
4.2 Search Strategy
The performance of a multi-stage VQ can be improved by using a multi-candidate
search procedure. The basic idea of this procedure is to retain multiple candidates
instead of one best candidate in the search of each stage. Two different approaches
are possible - (a) Growing Tree search, and (b) M-L Tree search [4].
In the Growing Tree search, M_1 candidates corresponding to the lowest M_1
distortions, d(x, c_{j_i}^1), i = 1, ..., M_1, are retained from searching the first stage. For each
of these vectors in the first stage, the second stage is searched and M_2 candidates are
retained from each search. Thus at the end of searching the second stage we have
M_1 × M_2 code vector pairs as possible candidates for the final reproduction vector.
The search is continued till the last stage, with M_j candidates retained from each
search of the j-th stage. After having searched the last stage, we are left with
Π_{i=1}^{L} M_i candidate reproduction vectors for the input vector. The one having the
lowest distortion among all these candidates is chosen as the final reproduction vector. This is
shown in Fig. 4.4, where each rectangle represents a codebook search and each small
circle represents a code vector retained.
In the M-L Tree search method, M_1 candidates are retained from the first stage
and the second stage is searched M_1 times, once for each candidate from the previous
stage, so that we have M_1 × N_2 distortion values for as many vector pairs searched.
Out of these M_1 × N_2 vector pairs, the M_2 vector pairs corresponding to the lowest M_2
Figure 4.4: Growing Tree search of a three stage VQ (M_1 = 5, M_2 = 3, and M_3 = 2)
distortions,

    d(x, c_{j_1}^1 + c_{j_2}^2),   i = 1, ..., M_2,

are retained for searching the next stage. For each of these M_2 vector pairs, the third
stage with N_3 code vectors is searched, giving rise to M_2 × N_3 distortion values,

    d(x, c_{j_1}^1 + c_{j_2}^2 + c_{j_3}^3).

Out of these, the M_3 vector triplets corresponding to the lowest M_3 distortion values
are retained for searching the next stage, and so on. That is, M_j vector j-tuples are
retained at the end of searching the j-th stage, and when all the L stages have been
searched, the best reproduction vector is chosen as the one with the lowest distortion
from the M_L vector L-tuples. This is shown in Fig. 4.5 where, as before, each rectangle
represents a codebook search and each small circle represents a code vector retained.
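The M-L Tree procedure just described can be sketched as follows (an illustrative sketch; the function name and the toy codebooks are assumptions for the example):

```python
import numpy as np

def msvq_ml_search(x, stage_codebooks, M):
    # M-L Tree search: after each stage keep only the M partial sums
    # (index j-tuples) with the lowest distortion, then extend each
    # survivor with every code vector of the next stage.
    x = np.asarray(x, dtype=float)
    survivors = [(float(np.dot(x, x)), np.zeros_like(x), [])]
    for cb in stage_codebooks:
        cands = []
        for _, partial, path in survivors:
            sums = partial + cb                      # all extensions
            dists = np.sum((x - sums) ** 2, axis=1)
            cands += [(float(dists[j]), sums[j], path + [j])
                      for j in range(len(cb))]
        cands.sort(key=lambda t: t[0])
        survivors = cands[:M]                        # keep the M best
    return survivors[0][2], survivors[0][0]          # best L-tuple, distortion

stage1 = np.array([[0.0, 0.0], [4.0, 4.0]])
stage2 = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]])
path, d = msvq_ml_search([3.8, 4.4], [stage1, stage2], M=2)
```

With M = 1 this degenerates to the sequential search; with M large enough it becomes exhaustive over the product codebook.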
It is not difficult to see that neither of these methods is guaranteed to find the
best possible reproduction vector for every input vector, but on average, these
methods find a better reproduction vector most of the time compared to a sequential
search. An example of failure of the Growing Tree search and the M-L Tree search
is shown in Fig. 4.6. In this figure, the vectors chosen in the Growing Tree search
method are shown with thicker lines, with the first stage code vectors in solid and the
second stage code vectors in dashed lines. Both M_1 and M_2 were equal to 2 in this
example. For the Growing Tree search, the final reproduction vector is chosen from
Figure 4.5: M-L Tree search of a three stage VQ
Figure 4.6: Failure of multi-candidate search in a 2-stage VQ (M_1 = M_2 = 2)

{y_1, y_2, y_3, y_4} and for M-L Tree search the choice is made from {y_1, y_2}. In either
case, the best reconstruction vector y_b is not found.
The Growing Tree search is guaranteed to perform no worse than a sequential
search for every input vector, because the sequential search is a special case of a
Growing Tree search in which the number of candidates retained at every stage, M_j,
is equal to 1. The M-L Tree search, on the other hand, is not guaranteed to do so,
because some paths are pruned early in the search and are eliminated from the choice.
An example with a 3-stage codebook where the M-L Tree search fails to select the
best reconstruction vector y_b is shown in Fig. 4.7. The candidates selected by the M-L
search method at each stage are shown with thick lines, and different stages of the
codebook are shown with different line styles. Note that in this example, a sequential
search would have found the best reconstruction vector.
Figure 4.7: Failure of M-L search in a 3-stage VQ (M_1 = M_2 = M_3 = 2)
4.2.1 Search Complexity

The computational complexity of the search for a reproduction vector corresponding
to a given input vector is called the search complexity. The actual search complexity
for a given search technique is a function of the distortion measure used, but as a first
step towards computing the actual search complexity, we will compute the number of
distance computations involved for both the Growing Tree and the M-L Tree search
methods. We consider an L-stage MSVQ with N_i code vectors in stage i, each code
vector having a dimension p, and M_j the number of candidates we wish to retain
at stage j.
Growing Tree Search Complexity

For the Growing Tree search, the number of candidates retained at any stage must
be less than the number of code vectors available at that stage. The first stage is
searched only once, involving a distance computation for N_1 vectors. The second
stage is searched M_1 times and the number of distance computations involved is M_1 N_2.
Thus, the total number of distance computations, N_d^G, for the entire search is given
by

    N_d^G = N_1 + M_1 N_2 + M_1 M_2 N_3 + ... = Σ_{i=0}^{L-1} ( Π_{j=0}^{i} M_j ) N_{i+1},

where M_0 = 1.
M-L Tree Search Complexity

For the M-L Tree search, the number of candidates available at each stage depends
on the number of candidates retained at the previous stage, and it is sometimes possible
to specify an M_j for which M_j candidates are not available (particularly when a single
M is specified for all stages). In that case, all the available candidates are retained
for consideration during the search of the following stage. While searching stage j, the
number of distance computations done is K_{j-1} N_j, where K_{j-1} is the actual number of
candidates retained from the previous stage. So, if M_j > K_{j-1} N_j, the actual number
of candidates retained at the j-th stage is K_j = K_{j-1} N_j. The total number of distance
computations, N_d^{ML}, for the entire search is given by

    N_d^{ML} = N_1 + K_1 N_2 + K_2 N_3 + ... + K_{L-1} N_L = Σ_{i=0}^{L-1} K_i N_{i+1}   (4.19)

where K_0 = 1 and

    K_i = min (M_i, K_{i-1} N_i).                                                         (4.20)
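The two distance-computation counts can be evaluated directly (an illustrative sketch; function names are assumptions):

```python
def growing_tree_distances(N, M):
    # N_d^G = N_1 + M_1 N_2 + M_1 M_2 N_3 + ... (M_0 = 1): the number of
    # searches of stage i+1 is the product of candidates kept so far.
    Nd, kept = 0, 1
    for Ni, Mi in zip(N, M):
        Nd += kept * Ni
        kept *= Mi
    return Nd

def ml_tree_distances(N, M):
    # Eq. (4.19)-(4.20): N_d^ML = sum K_i N_{i+1}, K_0 = 1,
    # K_i = min(M_i, K_{i-1} N_i) since at most K_{i-1} N_i tuples exist.
    Nd, K = 0, 1
    for Ni, Mi in zip(N, M):
        Nd += K * Ni
        K = min(Mi, K * Ni)
    return Nd

# Three stages of 16 code vectors, 4 candidates kept per stage:
g = growing_tree_distances([16, 16, 16], [4, 4, 4])    # 16 + 64 + 256
ml = ml_tree_distances([16, 16, 16], [4, 4, 4])        # 16 + 64 + 64
```

For this toy configuration the Growing Tree count is 336 against 144 for the M-L Tree, and the gap widens rapidly with more stages, which illustrates the exponential versus linear growth discussed next.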
It can be seen that the search complexity increases only linearly with M_j and L
for the M-L Tree search, but increases exponentially for the Growing Tree search.
Hence, the M-L Tree search is preferable in most practical applications. It should be
remembered, though, that the Growing Tree search performs better than the M-L
Tree search: for every input vector, if the M-L Tree search can find the best
reproduction vector, so can the Growing Tree search, whereas the converse is not
true. In this thesis, we explore the properties of the M-L Tree search only, since the
Growing Tree search is impractical in most cases due to its high search complexity.
We define the search complexity, C_S, on a logarithmic scale, as

    C_S = log_2 (N_d N_MAC)                                                       (4.21)

where N_d is the number of distance computations and N_MAC is the number of
arithmetic operations for a single distance computation. In counting arithmetic operations,
a multiply-accumulate is considered a single operation and separate additions and
subtractions are neglected.
The MSVQ codebook structure is sufficiently complex that no search algorithm of
lower complexity than a full search of all reproduction vectors exists today that will
always find the best reproduction vector. A full search of course defeats the purpose
of having an MSVQ in the first place, as the search complexity becomes equal to that
of an unstructured codebook with the same resolution in bits/vector as the MSVQ.
Our experiments show that M-L search very quickly approaches the performance of
a full search with relatively small values of M_i. Since there are many possible ways
the M_i's can be chosen, we have kept M constant over all stages to limit the choices
to a reasonable number. Fig. 4.8 shows the result of M-L search on LSF vectors
using an LSF-6+6 codebook. The full search performance is shown as the
Figure 4.8: Performance of LSF-6+6 MSVQ with M-L search
horizontal line. It can be seen that performance very close to full search is obtained
with M = 8.
Based on the above observations, we chose M-L Tree search for further investiga-
tion in this thesis.
4.2.2 Detailed Analysis of The Search Complexity
Weighted Mean Squared Distortion
Let y_i^k be one of the reproduction vectors selected as a candidate at the k-th stage.
That is,

    y_i^k = c_{l_1}^1 + ... + c_{l_k}^k,

where c_{l_p}^p is a selected code vector from the p-th stage. While searching the
(k + 1)-th stage, a candidate reproduction vector is formed by adding a code vector
c_j^{k+1} from the (k + 1)-th stage to the candidate vector y_i^k from the previous stage.
Thus, an approximation to the input vector while searching the (k + 1)-th stage is

    x̂^{k+1} = y_i^k + c_j^{k+1}.
The weighted mean squared distortion (WMSE) at stage k + 1 is given by

    d(x, x̂^{k+1}) = (x - y_i^k - c_j^{k+1})^T W (x - y_i^k - c_j^{k+1}),

where W is a symmetric weighting matrix. Expanding,

    d(x, x̂^{k+1}) = [ (x - y_i^k)^T - (c_j^{k+1})^T ] W [ (x - y_i^k) - c_j^{k+1} ]
                  = (u_i^k)^T W (u_i^k) - 2 (u_i^k)^T W c_j^{k+1} + (c_j^{k+1})^T W c_j^{k+1}
                  = (u_i^k)^T W (u_i^k) + (c_j^{k+1} - 2 u_i^k)^T W c_j^{k+1},

where u_i^k = x - y_i^k.

The first term is d(x, y_i^k) and is already known from computations in the previous
stage, except for the first stage. While searching the first stage, y_i^0 = 0 and u_i^0 = x,
and it takes (p^2 + p) multiply-adds to compute the first term once. The second term
requires (p^2 + p) multiply-adds for every stage. Using Eq. (4.19) for the number of
distance computations and the definition of search complexity (Eq. (4.21)), we can write

    C_S = log_2 [ (p^2 + p) ( 1 + Σ_{i=0}^{L-1} K_i N_{i+1} ) ],

where K_i is given by Eq. (4.20).
In most cases, the weighting matrix W is a diagonal matrix. In this case, a
matrix-vector multiplication requires p multiply-adds instead of p^2 multiply-adds, and
consequently the factor (p^2 + p) in the above equations gets replaced by 2p. Hence, for
a diagonal weighting matrix, the search complexity is given by

    C_S = log_2 [ 2p ( 1 + Σ_{i=0}^{L-1} K_i N_{i+1} ) ].                         (4.28)

The above estimate of complexity is correct only for a fixed weighting matrix W, but
usually, for perceptually significant distortion measures, W may be a function of the
input vector x and has to be computed once for each x. However, this is not very
significant in comparison to the computation required for the distortion computations.
We will neglect this extra computation and use Eq. (4.28) as our estimate of
search complexity.

It should be noted that the search complexity for split VQ with a weighted mean
squared distortion measure is

    C_S = log_2 ( Σ_{i=1}^{L} 2 p_i N_i ),

where p_i is the dimension of the i-th codebook.
Mean Squared Distortion

For mean squared distortion, the weighting matrix W = I, and we have

    d(x, x̂^{k+1}) = (u_i^k)^T (u_i^k) + (c_j^{k+1} - 2 u_i^k)^T c_j^{k+1}.

The first term again is known from computations for the previous stage, except for
the first stage, where it takes p multiply-adds. The second term also takes p
multiply-adds, neglecting the subtraction; if the code vector energies, (c_j^k)^T c_j^k, are
precomputed and stored along with the codebook, then there is no subtraction involved
and it takes exactly p multiply-adds for the second term. Recalling equations (4.19)
and (4.21) again, we can write the search complexity as

    C_S = log_2 [ p ( 1 + Σ_{i=0}^{L-1} K_i N_{i+1} ) ].
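The precomputed-energy trick for one stage of MSE search can be sketched as follows (illustrative sketch; the function name and toy codebook are assumptions):

```python
import numpy as np

def mse_stage_search(u, cb, cb_energy):
    # d = u^T u + (c - 2u)^T c.  The u^T u term is common to all code
    # vectors, so ranking needs only c^T c - 2 u^T c; with the energies
    # c^T c stored, each candidate costs the p multiply-adds of u^T c.
    scores = cb_energy - 2.0 * (cb @ u)
    j = int(np.argmin(scores))
    return j, float(np.dot(u, u) + scores[j])    # actual squared error

cb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
cb_energy = np.sum(cb ** 2, axis=1)              # stored with the codebook
j, d = mse_stage_search(np.array([0.8, 0.6]), cb, cb_energy)
```

Because u^T u is common to every candidate, it can even be omitted during the argmin and added back only when the winning distortion is needed.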
4.3 Codebook Design

The codebooks were designed using the generalized Lloyd algorithm as outlined in
Appendix B. For designing stage k, the training vectors were the error vectors from
the previous stages,

    x_n^{(k)} = x_n - Σ_{p=1}^{k-1} c_{j_p}^p,

where c_{j_p}^p is the codevector selected from the p-th stage while quantizing x_n.
Centroid computation for the weighted mean squared error is done as explained in the
next subsection.
4.3.1 Centroid Computation
Let x_n be the vectors in a partition whose centroid c is to be computed. The total
distortion, D, over all vectors in the partition is given by

D = Σ_n (x_n − c)^T W_n (x_n − c).     (4.34)

In most cases W_n is a symmetric matrix.
The vector c yielding the minimum value of D can be found by a variational technique,
for example by computing the derivative of Eq. (4.34) with respect to c and setting
it to zero. This gives

c = (Σ_n W_n)^{-1} (Σ_n W_n x_n).
If W_n is also diagonal, then let

Σ_n W_n = diag(a_1, a_2, ..., a_p)

and

Σ_n (W_n x_n) = [b_1, b_2, ..., b_p]^T.

Then the components of the centroid are simply c_i = b_i / a_i, i = 1, ..., p.
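For the diagonal case, the centroid reduces to a componentwise ratio of weighted sums, which can be sketched as follows (a NumPy sketch with illustrative data; the diagonal weights are stored as row vectors):

```python
import numpy as np

def wmse_centroid(vectors, weights):
    """Centroid minimizing sum_n (x_n - c)^T W_n (x_n - c) when every
    W_n is diagonal; `weights` holds the diagonals as row vectors."""
    a = np.sum(weights, axis=0)            # a_i: diagonal of sum_n W_n
    b = np.sum(weights * vectors, axis=0)  # b_i: components of sum_n W_n x_n
    return b / a                           # c_i = b_i / a_i

# With all weights equal, the centroid reduces to the ordinary mean.
x = np.array([[1.0, 2.0], [3.0, 6.0]])
print(wmse_centroid(x, np.ones_like(x)))  # -> [2. 4.]
```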
4.3.2 Outlier Weighting
One of the problems in VQ of LPC parameters is that some input vectors are poorly
represented and result in a spectral distortion much larger than the average. These
are called outliers. The outlier performance can be significantly improved by adding a
number of copies of each outlier to the training sequence. Equivalently, the weight of
the outlier may be increased during centroid computation by multiplying the weights
by (SD)^r, r > 1, and retraining the codebook. It was found that this approach does
not increase the average SD significantly. A value of r = 2 was used in designing
the codebooks. This resulted in an increase of average SD by about 0.01 dB and a
reduction of the 2-4 dB outliers by about 1% for the LSF-4x6 codebook.
4.4 Choice of Parameter Representation and Distance Measure
It has already been pointed out in the previous chapter that LSFs are the best param-
eter representation to use for vector quantization. However, LARs give performance
close to LSFs and may result in lower implementation complexity. We will investigate
both LARs and LSFs and choose one of the parameter sets for more detailed
investigation. For measuring codebook performance we will use a perceptually significant
distortion measure such as Spectral Distortion, but this is a very expensive distortion
measure to use for codebook search in implementation. For this reason, we will use
mean squared error (MSE) and weighted mean squared error (WMSE) for the code-
book search. For the M-L search procedure, the final selection of the reproduction
vector can be made using SD without incurring too much computational cost.
Various weighting matrices for LSFs [86,91,62,104] were evaluated in this work. It
was found that the weighting in [62, 104] performed slightly better than the weighting
in [86] and significantly better than the weighting described in [91]. The weighting
matrix entries are given by [62, 104]

w_i = u(f_i) g(τ_i),     (4.39)

where g(τ_i) = (τ_i/τ_max)^{1/2} for τ_i ≥ τ_crit, g(τ_i) = (τ_i/τ_crit)(τ_crit/τ_max)^{1/2} for τ_i < τ_crit, and

u(f_i) = 1,                                      f_i < f_crit,
u(f_i) = 1 − (f_i − f_crit)/(f_s/2 − f_crit),    f_crit ≤ f_i ≤ f_s/2,     (4.40)

where τ_i = τ(f_i) is the group delay of the ratio filter (Eq. (A.49)) at a frequency corresponding
to the frequency f_i of the i-th LSF, f_crit = 1000 Hz, τ_crit = 1.375 ms, τ_max = 20
ms, and the sampling rate f_s = 8000 Hz.
The weighting function (Eq. (4.39) and (4.40)) originally proposed by Kang and
Fransen [62] is derived from their study of spectral sensitivities of LSFs and perceptual
considerations. It has already been shown (Fig. A.4) that the group delay of the
ratio filter is large near spectral peaks. Kang and Fransen showed that the spectral
sensitivities of LSFs are proportional to the square root of the corresponding group
delays, hence the weighting coefficients can be written as

w_i = k [τ(f_i)]^{1/2},     (4.41)

where k is the constant of proportionality. The weighting coefficients are generally
normalized to values between 0 and 1, and the normalized coefficients would then be

ŵ_i = [τ(f_i)/τ_max]^{1/2}.     (4.42)

This weighting function (Eq. (4.42)) does not take into account our hearing sensitivity
which is high at spectral peaks and low at spectral valleys [35]. The group delay is low
for flat portions of the LPC spectrum (Fig. A.4) and particularly for an absolutely
flat spectrum where all LSFs are equally spaced, the group delay equals 11/8000
s or 1.375 ms assuming a 10th order LPC and a sampling rate of 8000 Hz. The
weighting function is assumed to be linear for group delays below 1.375 ms. The final
modification to the weighting function comes from a model, u(f_i), of our gradual loss
in hearing resolution with increasing frequency.
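As a rough illustration, the weighting just described might be computed as below. This is only a sketch: the exact roll-off of u(f) and the handling of group delays below τ_crit are assumptions here, since only the constants f_crit = 1000 Hz, τ_crit = 1.375 ms, and τ_max = 20 ms are fixed by the text.

```python
import numpy as np

F_CRIT, F_S = 1000.0, 8000.0          # Hz
TAU_CRIT, TAU_MAX = 1.375e-3, 20e-3   # s

def u(f):
    """Hearing-resolution roll-off: unity below f_crit, assumed to fall
    linearly towards the Nyquist frequency above it (the exact slope is
    an assumption of this sketch)."""
    if f < F_CRIT:
        return 1.0
    return 1.0 - (f - F_CRIT) / (F_S / 2.0 - F_CRIT)

def lsf_weight(f, tau):
    """w_i = u(f_i) g(tau_i): square-root law above tau_crit, and an
    assumed linear segment below it (continuous at tau_crit)."""
    if tau >= TAU_CRIT:
        g = np.sqrt(tau / TAU_MAX)
    else:
        g = (tau / TAU_CRIT) * np.sqrt(TAU_CRIT / TAU_MAX)
    return u(f) * g
```

A spectral peak (large group delay) at a low frequency thus receives a weight near 1, while flat, high-frequency regions are de-emphasized.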
Various weightings for LARs were attempted. Static weightings proportional to the
spectral sensitivities [101] of LARs were tried, with no consistent decrease in spectral
distortion or number of outliers.
The performance of all codes was evaluated using the root mean square spectral
distortion (SD) between x and x̂ (implemented as in [7] and [86]), given by

SD = [ (1/(n_1 − n_0 + 1)) Σ_{n=n_0}^{n_1} ( 10 log_10 ( |A_q(e^{j2πn/N})|² / |A(e^{j2πn/N})|² ) )² ]^{1/2} dB,
Figure 4.9: Performance comparison of LAR-6x4 and LSF-6x4 codebooks with M-L search
where n_0 and n_1 correspond to 125 Hz and 3.1 kHz, respectively. In practice, an
N = 256 point FFT was utilized to compute A(e^{j2πn/N}) and A_q(e^{j2πn/N}), and n_0 and
n_1 were 4 and 100, respectively.
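The SD computation just described can be sketched as follows (a NumPy sketch; `a` and `a_q` are the direct-form LPC coefficient vectors [1, a_1, ..., a_p], and the bin limits follow the text):

```python
import numpy as np

def spectral_distortion(a, a_q, n0=4, n1=100, N=256):
    """RMS log spectral distortion in dB between the LPC spectra 1/|A|^2
    and 1/|A_q|^2, evaluated on FFT bins n0..n1 (4 and 100 correspond to
    125 Hz and 3.1 kHz for N = 256 at a sampling rate of 8 kHz)."""
    A = np.fft.fft(a, N)       # A(e^{j 2 pi n / N}); a = [1, a_1, ..., a_p]
    Aq = np.fft.fft(a_q, N)
    d = 10.0 * np.log10(np.abs(Aq[n0:n1 + 1]) ** 2 /
                        np.abs(A[n0:n1 + 1]) ** 2)
    return np.sqrt(np.mean(d ** 2))

# Identical coefficient sets give zero distortion.
print(spectral_distortion([1.0, -0.9], [1.0, -0.9]))  # -> 0.0
```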
Log area ratios and line spectral frequencies were compared for different multi-
stage VQ configurations using spectral distortion and search complexity as the criteria.
The training database consisted of a total of 374,317 vectors and the test database
consisted of 121,200 vectors unless otherwise noted, all extracted from English speech.
Fig. 4.9 shows a typical result for a configuration having four stages of 6-bit codebooks.
The test vectors for this plot were a subset (FM-train) of the training set. For the
curves marked LAR-SD and LSF-SD, the final choice of the reproduction vector was
based on SD, and for the curve marked LSF-WMSE, weighted squared distortion was
used for all the choices. It can be seen from Fig. 4.9 that a system based on LSFs achieves
a spectral distortion of 1 dB at a search complexity about 4 to 8 times lower than
the LAR based system. However, in a real time system, computation of LSFs is more
difficult than the computation of LARs. In Fig. 4.9 and the subsequent figures, the
points marked on the SD versus complexity curves correspond to M = 1, 2, 4, 8, etc.
Figure 4.10: Spectral distortion of M-L tree searched MSVQ at 24 bits/vector for various configurations
The number of candidates retained at each stage were the same for all stages. It is
evident that the performance of the LSF codebook is better than the LAR codebook,
hence only LSF codebooks were used for further study.
4.5 Performance and Complexity Trade-offs
In the traditional sequentially searched MSVQ, the design has been oriented toward
the largest implementable codebooks and the smallest number of stages. For exam-
ple, a quantizer using 24 bits per vector would typically be implemented using two
12-bit codebooks each having 4096 code vectors. Increasing the number of stages for
sequentially searched codebooks leads to a quick degradation of performance. The
introduction of tree search for multi-stage VQ leads to a significant increase in per-
formance, particularly for configurations having a relatively large number of small
codebooks.
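The M-L tree search itself is compact: at each stage, every surviving candidate is extended by every codevector of the stage, and only the M best partial paths are kept. A minimal unweighted-MSE sketch (the toy codebooks and names are illustrative):

```python
import numpy as np

def ml_search(x, codebooks, M):
    """M-L tree search of a multi-stage VQ: extend every surviving
    candidate by every codevector of the current stage and keep only
    the M partial paths with the smallest squared error."""
    cands = [(x, [])]                       # (residual, index path)
    for cb in codebooks:
        nxt = [(u - c, path + [j])
               for u, path in cands
               for j, c in enumerate(cb)]
        nxt.sort(key=lambda t: float(np.dot(t[0], t[0])))
        cands = nxt[:M]                     # M best survivors
    return cands[0][1]                      # index path of the winner

# Toy two-stage example with illustrative 1-bit codebooks.
cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])
cb2 = np.array([[0.0, 0.1], [0.1, 0.0]])
print(ml_search(np.array([1.05, 1.0]), [cb1, cb2], M=2))  # -> [1, 1]
```

With M = 1 this reduces to the sequential search; with M equal to the stage codebook size times the number of survivors it approaches a full search.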
Fig. 4.10 shows spectral distortion, SD, versus search complexity, Cs, for four
multi-stage VQ configurations and one split VQ configuration using LSFs. One of the
Figure 4.11: M-L tree search performance versus search complexity for rates 22-30 bits/vector
best configurations in terms of the trade-off between complexity and performance in
Fig. 4.10 is LSF-6x4, which achieves a spectral distortion of about 1 dB at a complexity
more than 8 times lower than LSF-12x2. Moreover LSF-6x4 requires storing only 256
codevectors as compared to 8192 codevectors required by the LSF-12x2 configuration.
In all configurations shown in Fig. 4.10, there are no 4 dB outliers and the percentage
of outliers greater than 2 dB is under 1% for SD < 1 dB. The LSF-12x2 (split)
VQ used the same partitioning of the LSF vector (4 LSFs in the first part and 6 in the
second) as in [86] and obtained virtually equivalent performance to [86]. The
LSF-4x6, LSF-6x4, and LSF-8x3 codes all obtained superior performance compared
to split VQ at lower computational complexity and much lower memory requirements.
It is interesting to note that using 6 stages with only 16 codevectors in each stage (96
vectors total), imposes a structural constraint which degrades the performance less
than partitioning the vector into two sub-vectors and coding each with a 4096 level
full search code.
Fig. 4.11 shows spectral distortion versus search complexity at rates of 22-30
bits/vector. Note that the 28 bits/vector system (LSF-4x7) has very low complexity
at a spectral distortion of 1 dB and a memory requirement of only 112 code vectors.
The following conclusions can be drawn from Fig. 4.11.
1. As the number of stages increases and the number of codewords per stage decreases,
more is gained from the M-L algorithm as M is increased.

2. The 22-bit (LSF-11x2) code at M = 2 obtains virtually identical spectral distortion
compared to the 24-bit split VQ code, albeit with a larger search complexity.
Table 4.1: Different MSVQ configurations obtaining an average SD performance of approximately 1 dB. The bit rate is given in bits/vector.

                                        % Outliers
Code               Bit Rate  SD (dB)   2-4 dB   > 4 dB
LSF-11x2              22      1.04      0.67     0.00
LSF-12x2 (split)      24      1.04      0.53     0.00
LSF-6x4               24      1.04      0.47     0.00
LSF-4x6               24      1.04      0.59     0.00
LSF-2x13              26      1.03      1.49     0.01
LSF-4x7               28      1.05      0.80     0.00

The complexities and rates required to obtain near 1 dB average SD for various
codes are shown in Table 4.1. The LSF-4x7 code offers relatively low memory
and computational complexity, with the possibility of obtaining lower SD by using a
larger value of M. Virtually every MSVQ code considered obtains lower memory and
computational complexity than split VQ, as expected.

4.6 Robustness Issues

There are good intuitive reasons to believe that increasing the number of stages will
lead to improved robustness on noisy channels and across different talkers and languages.
In this section we present the results related to robustness issues.

4.6.1 Effect of Language and Input Spectral Shape

Vector quantization has often suffered from robustness problems whereby the performance
of the VQ may be quite poor on data not represented in the training sequence.
Table 4.2: Spectral Distortion Performance over Different Languages and Input Spectral Shapes: (a) German, (b) Italian, (c) Norwegian, (d) Noisy English, and (e) TIMIT-test speech data base

Table 4.3: Percentage of Outliers (2-4 dB) for Different Languages and Input Spectral Shapes: (a) German, (b) Italian, (c) Norwegian, (d) Noisy English, and (e) TIMIT-test speech data base

                                              % outliers (2-4 dB)
Code               Bits/vector   M     Cs     (a)   (b)   (c)   (d)   (e)
LSF-12x2 (split)       24        -   16.32   1.40  0.56  1.70  2.69  0.53
LSF-6x4                24       32   16.92   1.14  1.07  1.28  4.35  0.29
LSF-8+6x3              26        8   15.13   0.26  0.47  0.78  2.51  0.11
LSF-3x9                27        8   13.35   0.35  0.43  0.85  1.52  0.23

Tables 4.2 and 4.3 show the average SD performance and percentage of outliers respectively
for test sets of (a) German, (b) Italian, (c) Norwegian, (d) Noisy English,
and (e) TIMIT-test speech data bases. The foreign language database includes IRS
weighted speech which was used for testing codecs in the CCITT 16 kb/s low-delay
competition.

It can be seen that higher rate codes involving smaller codebooks at each stage and
a larger number of stages are more robust than lower rate codes using a smaller
number of relatively larger codebooks at each stage.

A plot of SD versus Cs is shown in Fig. 4.12 for the LSF-6x4 and LSF-3x9 codes.
Note that the spread in SD around the 1 dB distortion region is much smaller for the
LSF-3x9 code compared to the LSF-6x4 code.

Figure 4.12: Performance of two codes over different languages and input spectral shapes

It is apparent that robust VQ can be accomplished by adding suitable structure
to the code while impairing average performance only slightly. Possible explanations
for the improved robustness of the structured codebooks are the weak dependence
between code vectors and the training set, and the ability of structured codebooks
to produce spectra not present in the training set. These properties are particularly
important given the fact that both the training and the test set may not be representative
of outliers present in natural speech.
4.6.2 Performance in the presence of channel errors
Good performance in the presence of channel errors is critical for a robust codec.
Distortions due to channel errors can be reduced without using redundancy bits by
appropriately assigning binary code words to each vector in the reproduction alphabet.
For scalar quantization, this can be achieved by assigning a Gray coded binary number
to each of the output levels sorted by magnitude. This concept was extended by Zeger
and Gersho [112] to vector quantization and was called pseudo-Gray coding. Pseudo-Gray
coding is a locally optimal algorithm which effectively reduces the expected
distortion due to channel noise.
If c_i and c_j are the transmitted and received code vectors, the average distortion
due to channel error can be written as

D = Σ_{m=1}^{b} q_m Σ_i p(c_i) (1/C(b, m)) Σ_{c_j ∈ N_m(i)} d(c_i, c_j),

where b is the resolution of the codebook in bits/vector, q_m is the probability that
exactly m bits are in error during transmission, p(c_i) is the probability mass function
of the codevectors, C(b, m) is the number of indices at Hamming distance m, and N_m(i)
is the m-th neighbourhood of i, defined as

N_m(i) = { c_j : H(i, j) = m },

where H(i, j) is the Hamming distance between indices i and j. The total cost, F(c_i),
of a given codevector, measuring its contribution to the total distortion when it is
selected by the encoder, is defined as

F(c_i) = Σ_{m=1}^{b} q_m (1/C(b, m)) Σ_{c_j ∈ N_m(i)} d(c_i, c_j).
The total cost D of the codebook is minimized by first ordering the codevectors
according to decreasing individual cost F(c_i) and then switching the position of each
codevector in sequence with the one that yields the greatest reduction in the total
cost.
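A sketch of this reordering, restricted for brevity to single-bit index errors (m = 1) with a uniform bit-error cost, and using the squared Euclidean distance for d(c_i, c_j); the full algorithm in [112] weights every neighbourhood N_m(i) by q_m:

```python
import numpy as np

def one_bit_cost(codebook, p, b):
    """Per-codevector cost counting only single-bit index errors:
    F(c_i) ~ p(c_i) * sum over j with H(i, j) = 1 of ||c_i - c_j||^2
    (the constant q_1 factor is dropped)."""
    F = np.zeros(len(codebook))
    for i, ci in enumerate(codebook):
        for k in range(b):
            j = i ^ (1 << k)              # flip bit k: Hamming distance 1
            F[i] += p[i] * np.sum((ci - codebook[j]) ** 2)
    return F

def pseudo_gray_pass(codebook, p, b):
    """One greedy pass: visit indices in decreasing cost order and swap
    each with whichever index position most reduces the total cost."""
    cb = codebook.copy()
    order = np.argsort(-one_bit_cost(cb, p, b))
    for i in order:
        best_j, best_cost = i, one_bit_cost(cb, p, b).sum()
        for j in range(len(cb)):
            trial = cb.copy()
            trial[[i, j]] = trial[[j, i]]  # tentative position switch
            c = one_bit_cost(trial, p, b).sum()
            if c < best_cost:
                best_j, best_cost = j, c
        cb[[i, best_j]] = cb[[best_j, i]]
    return cb
```

Because a swap is committed only when it lowers the total cost, each pass is non-increasing in cost, matching the locally optimal character of the algorithm.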
The performance of multi-stage VQ with pseudo-Gray coding is studied in this
section. The scalar quantizer in DoD CELP (FS-1016) was sorted and encoded ac-
cording to Gray code while each stage of the multi-stage VQ was pseudo-Gray coded
(based on mean squared error) using the algorithm by Zeger and Gersho [112].
The error performance in terms of average spectral distortion, percentage of out-
liers within 2 and 4 dB, and percentage of outliers above 4 dB is shown in Tables 4.4
and 4.5.
The increase in the number of outliers is a much better indication of degradation in
performance in the presence of channel errors than average spectral distortion since
errors occur relatively infrequently but may cause a very large (and very audible)
spectral error. The results for the split VQ code are comparable (although slightly
better in outlier performance) to those reported by Paliwal and Atal [87], although the
scalar quantizer performance (especially at the higher error rates) was much better
Table 4.4: Average Spectral Distortion for Different Error Rates and Codes

Code                  Rate (bits/vector)
FS-1016 (scalar)             34
LSF-12x2 (split)             24
LSF-12x2                     24
LSF-3x9                      27
LSF-12+10                    22
LSF-6x4                      24
LSF-6x4 (unsorted)           24

Table 4.5: Percentages of Outliers for Different Error Rates and Codes

Code                 Rate   2-4 dB  > 4 dB   2-4 dB  > 4 dB   2-4 dB  > 4 dB   2-4 dB  > 4 dB
FS-1016 (scalar)      34     11.4    0.01     11.4    0.02     11.8    0.19     15.3    1.7
than that obtained in [87]. Possibly the scalar quantizer indices in [87] were not Gray coded.
Tables 4.4 and 4.5 show that while VQ based systems have lower average spectral
distortion and lower 2-4 dB outliers even with transmission errors, scalar quantization
may lead to lower 4 dB outliers particularly at high error rates.
Performance was only marginally better for the LSF-6x4 pseudo-Gray coded code
compared to the LSF-6x4 unsorted code. Although the first stage had inherent robustness,
since it was initialized using a splitting procedure [33], the subsequent stages
were designed randomly and had no such structure.
4.7 Improved Codebook Designs for Multi-Stage VQ
Although the codebooks reported in this chapter were designed in a sequential manner,
other design strategies exist that lead to a better codebook design. In sequential
design, the error minimized while designing each stage is not the final quantization
error but a partial reconstruction error up to that stage, since it assumes all subsequent
stages to be populated by zero vectors. A stage-(k + 1) code vector is computed as

c^{k+1} = (1/N) Σ_n u_n^k,

where the summation is carried over all N training vectors in the cell being considered
and u_n^k is the partial reconstruction error

u_n^k = x_n − Σ_{p=1}^{k} c_{j_p}^p.

Here c_{j_p}^p is the code vector selected from the p-th stage while quantizing x_n.
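Generating the training set for the next stage of a sequential design can be sketched as follows (a NumPy sketch with unweighted squared error and illustrative data):

```python
import numpy as np

def stage_training_set(X, codebooks):
    """Training vectors for the next stage of a sequentially designed
    MSVQ: partial reconstruction errors u_n^k = x_n - sum_p c_{j_p}^p
    after sequentially encoding through the stages designed so far."""
    U = X.copy()
    for cb in codebooks:
        # nearest codevector at this stage for every current residual
        d = ((U[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        U = U - cb[np.argmin(d, axis=1)]
    return U

# Each stage is then designed by running the generalized Lloyd
# algorithm on these residuals (data below is illustrative).
X = np.array([[2.0, 0.0], [4.0, 0.0]])
cb1 = np.array([[3.0, 0.0]])
print(stage_training_set(X, [cb1]))  # residuals [[-1, 0], [1, 0]]
```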
4.7.1 Iterative Sequential Design
In iterative sequential design [69], called joint design by some authors [20, 13], each
stage is designed with all other stages fixed. The error function minimized is the total
quantization error, and the stage-(k + 1) error u_n^k is now obtained as

u_n^k = x_n − Σ_{p=1, p≠k+1}^{K} c_{j_p}^p,

where K is the total number of stages.
Design of each stage is iterated till convergence, and then all stages are redesigned till
all stages satisfy a preset distortion improvement limit. The starting codebooks for
iterative sequential design are usually derived from a sequential design. The performance
improvement obtained through iterative sequential design is very small compared to
the improvement obtained through M-L search. For codebooks with many stages, the
convergence of the iterative sequential design procedure can be extremely slow as only
a single codebook is optimized after each pass over the training sequence.
4.7.2 Simultaneous Joint Design
Simultaneous joint design of multi-stage codebooks updates all code vectors in each
stage simultaneously [69]. Ideally, a full search of the codebook is done for each
training vector and the error vectors at each stage (computed from codevectors cho-
sen from previous and subsequent stages as in iterative sequential design) are used
to compute new centroids for that stage. The sequential nature of search is lost
and whereas sequentially designed codebooks have monotonically decreasing average
energy, this may not be the case after a few iterations of simultaneous joint design.
Monotonic convergence of the simultaneous joint design procedure is guaranteed if a full
search procedure is used. It has been shown experimentally [69] that convergence
is achieved even if M-L search is used instead of a full search. Since M-L search is
sequential in nature, the codebooks need to be re-ordered according to average energy
before repartitioning of the training sequence in each iteration.
4.8 Recent Developments in MSVQ
In a recent paper [13], Barnes and Frost derived necessary conditions for optimality
of full search Direct Sum Codebooks2. They also presented conditions that need to be
satisfied by an optimal sequential encoder using a direct sum codebook and noted that
it was impractical to do optimal sequential encoding in general since the complexity of
such a coder generally exceeds that of an exhaustive search of the direct sum codebook.
A design algorithm for jointly optimizing all codebook stages (similar to the iterative
sequential design mentioned earlier) was also presented. M-L search was suggested
as an attractive alternative to full search and experimental results were obtained for
memoryless Gaussian and Laplacian sources as well as second order Gauss-Markov
sources using 2-stage and 4-stage MSVQs and a squared error distortion function.
The results showed that performances for codebooks jointly optimized with M-L
search approached those for codebooks that are jointly optimized using exhaustive
search. They also noted that the memory complexity of an MSVQ for Gauss-Markov
sources was lower than that of a full search single stage VQ with equivalent perfor-
mance. Performance of MSVQ with a large number of stages was not investigated. This
work was done independently of our work as evidenced by the date of publication of
the paper.
Some recent work reported on entropy constrained MSVQs [66] shows that entropy
constrained MSVQs can obtain small improvements over fixed rate MSVQs. The
interested reader is directed to a recent publication [14] that presents a detailed review
of the state of the art in multi-stage vector quantization (not necessarily related to
quantization of LPC parameters).
2Historically, the term Multi-Stage VQ always implied a sequential search representing a successive approximation approach. More recently, the term Direct Sum Codebook is being used to acknowledge the fact that the codebooks may not be sequentially searched and the whole reproduction alphabet may be available for search at any time. It is easy to see that for an exhaustive search, the order of the codebooks is irrelevant.
4.9 Summary
The salient features of this chapter have been the following:
The suboptimality of sequential search has been made evident by deriving the
conditions under which a codebook can be optimally searched in a sequential
manner. A multi-candidate search technique has been proposed to mitigate the
problem and it has been shown that performance close to full search can be
obtained with much lower complexity.
The strength of the M-L search technique has been demonstrated by designing
codebooks with a large number of stages, which many researchers did not believe
to be possible. For a 30 bits/vector coder, the storage complexity was only 60
vectors compared to more than 2^20 vectors required in a full search codebook.
The computational complexity was extremely low as well.
A transparent quantizer at 22 bits/vector has been designed for 10th order LSF
vectors. It was the lowest bit rate transparent quantizer for linear prediction
parameters at that time.
It has been shown that multi-stage codebooks with a larger number of stages tend
to be more robust against variations in language and input spectral shape.
Multi-Stage VQs have also been demonstrated to be robust against random
channel errors.
We have shown here that MSVQ with M-L search performs better than split
VQ at lower complexity.
Chapter 5
A Low Rate Spectral Excitation
Coder
In order to test the proposed Multi-Stage Vector Quantization of LPC parameters in
a speech coding application, a low rate speech coder was developed. We have already
discussed quantization of LPC parameters in the previous chapters. To build a speech
coder, the next step is quantization of the excitation.
Past experience has shown that sinusoidal modelling performs very well in low
bit rate applications. Sinusoidal modelling of the LPC residual has several advan-
tages over sinusoidal modelling of speech itself. As pointed out in Chapter 2, a
major problem of sinusoidal coding is quantization of harmonic magnitudes. Since
the LPC residual has a relatively flat spectral envelope compared to the speech signal
(Fig. 5.1), the harmonic magnitudes for modelling the LPC residual may be quantized
very efficiently.
In this chapter we discuss the development of a sinusoidal synthesis model for the
excitation which is used along with the multi-stage VQ of LSFs in implementing a
low bit rate speech coder.
We also introduce a novel 0-bit harmonic magnitude quantization technique that
has been demonstrated to work well giving good quality synthesized speech at 1800
bps.
As will be discussed below, determination of the correct value of pitch is very
important for harmonic coders. We present a new geometric pitch determination
technique that also determines the positions of pitch pulses and can be quite useful
in pitch synchronous algorithms.

Figure 5.1: Magnitude spectrum of a voiced speech segment and corresponding LPC residual
5.1 Introduction
Sinusoidal coders attempt to synthesize speech or prediction residuals as a sum of
sinusoids produced by a bank of harmonic oscillators,
ŝ(n) = Σ_{m=1}^{M} A_m(n) cos θ_m(n),     (5.1)

where M is the number of sinusoids, and A_m(n) and θ_m(n) are the amplitude and
phase functions for the m-th sinusoid. Usually, the sinusoids considered are harmonics
of a fundamental, the pitch frequency. Different models allow birth and death of some
harmonic frequencies in time and they also allow small deviations of the frequencies
from integer multiples of the fundamental. The number of harmonics, M, is a function
of the pitch frequency ω_0 and is given by

M = ⌊π/ω_0⌋ = ⌊P/2⌋,

where P is the pitch period in number of samples.
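A minimal sketch of the oscillator-bank synthesis of Eq. (5.1), with amplitudes and phases held constant over the frame (the actual coder interpolates these parameters; that is omitted here):

```python
import numpy as np

def synthesize_frame(amps, phases, w0, L):
    """Oscillator-bank synthesis:
    s_hat(n) = sum_{m=1}^{M} A_m cos(m * w0 * n + phi_m),
    with amplitudes and phases held constant over the frame."""
    n = np.arange(L)
    s = np.zeros(L)
    for m in range(1, len(amps) + 1):
        s += amps[m - 1] * np.cos(m * w0 * n + phases[m - 1])
    return s

P = 80                          # pitch period in samples
M = P // 2                      # number of harmonics
s = synthesize_frame(np.ones(M), np.zeros(M), 2 * np.pi / P, 2 * P)
# s is periodic with period P by construction.
```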
The problem of estimating and coding the parameters of the sinusoidal model can
be broken into two subproblems:
i) estimation and coding of the harmonic magnitudes, and
ii) estimation and coding of the phase functions.
Since each harmonic is a pure sinusoid, each of the phase functions can be derived from
the corresponding frequency function by simple integration, and thus only the initial
phases are sufficient along with the frequency functions to completely specify the
phases at all times. The amplitude functions, however, are completely independent
and must be estimated and coded for proper reconstruction.
5.2 Architecture of a Very-Low Rate Spectral Ex-
citation Coder
The general conceptual structure of a spectral excitation coder is shown in Fig. 5.2.
The speech signal s(n) is filtered through an analysis filter A(z), whose coefficients are
obtained through linear prediction analysis of the signal, to generate a residual e(n).
The residual is analyzed for harmonic components, and the harmonic magnitudes,
A_m(n), the harmonic phases, φ_m(n), and the harmonic frequencies, ω_m(n), are derived.
These harmonic parameters, along with the LPC parameters a(n), are passed on to
the decoder.
The decoder reconstructs an estimate, ê(n), of the residual by summing up sinusoids
of the given magnitudes and phases. The estimated residual is then passed
through a synthesis filter whose coefficients are obtained from the LPC parameters
received by the decoder.
In a real speech coder, all parameters required to be transmitted to the decoder
need to be quantized. This leaves us with the problems of quantizing harmonic mag-
nitudes, phases, and frequencies along with quantization of LPC parameters. The
Figure 5.2: A conceptual schematic of a spectral excitation coder
MSVQ presented in Chapter 4 was used to quantize the LSFs (transformed LPC parameters),
and quantized LSFs were used to obtain the filter coefficients for both the
analysis and synthesis filters. The analysis and quantization of the sinusoidal model
parameters are presented in the following sections.
5.2.1 Treatment of Unvoiced Segments
It is obvious from our presentation so far that the sinusoidal model can suitably
represent all signals that are periodic in nature. The voiced speech signal, being
quasi-periodic in nature, is a perfect candidate for sinusoidal modelling. The unvoiced
segments have no such periodicity.
The problem is solved by considering each unvoiced segment as a single period of
a periodic signal. As a consequence, all unvoiced segments are synthesized using the
same fundamental frequency. As we will see later, synthesis is done one frame at a
time and the pitch period is set equal to the synthesis frame length for all unvoiced
segments.
5.3 Computation of the Unquantized Residual
The sinusoidal model parameters are computed from a harmonic analysis of the un-
quantized residual. This section outlines the procedure used to generate the unquan-
tized residual.
To generate the unquantized residual e(n), s(n) is passed through A(z) (Fig. 5.2),
which is a time varying filter. Usually, an assumption is made on the quasi-stationarity
of speech and the filter parameters are determined once every L samples. For a low
bit rate coder, the analysis interval typically varies from 20 ms to 40 ms. Since our
coder was targeted to operate at less than 2000 bps, we chose an analysis interval of
40 ms which corresponds to L = 320 for a sampling rate of 8000 Hz.
The analysis window needs to be small enough not to violate the assumption of
local stationarity and at the same time every speech sample needs to be included at
least in one analysis window. Usually, overlapping analysis windows are used to main-
tain smooth transition between LSFs computed for successive analysis frames. Since
the analysis frame size chosen here was already quite large (40 ms), non-overlapping
analysis windows with length equal to that of the analysis frame were chosen.
The large analysis frame length can give rise to an abrupt change in the filter
response, and therefore the filter coefficients need to be interpolated between measurement
points. It is already known [7] that LSFs are an excellent choice for interpolation
of the short term filter parameters. Ideally best results are obtained if filter parame-
ters are updated at every sample but this gives rise to a large complexity in the coder.
Subjective quality tests were done to choose an appropriate interpolation interval. It
was found that no quality difference could be perceived when the interpolation inter-
val was shortened below 2 ms. The LSFs were linearly interpolated every 16 samples
in our coder and held constant between interpolation points, as shown in Eq. (5.2):

f_i(n) = (1 − α_n) f_i^{(k−1)} + α_n f_i^{(k)},    α_n = ⌊n/I⌋ I / L,     (5.2)
where L is the analysis frame length, I is the interpolation interval, and n, 0 ≤ n ≤ L,
is an index within an analysis frame. The organization of the analysis frames, windows,
and interpolation points is shown in Fig. 5.3.
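The piecewise-constant linear interpolation described above can be sketched as follows (a sketch under the assumption that the LSFs run linearly from the previous frame's values to the current frame's across the L-sample frame):

```python
import numpy as np

def interpolate_lsfs(lsf_prev, lsf_curr, L=320, I=16):
    """Linear interpolation of the LSF vector across an L-sample frame,
    updated every I samples and held constant between update points."""
    lsf_prev, lsf_curr = np.asarray(lsf_prev), np.asarray(lsf_curr)
    out = np.empty((L, lsf_prev.size))
    for n in range(L):
        alpha = (n // I) * I / L          # piecewise-constant fraction
        out[n] = (1.0 - alpha) * lsf_prev + alpha * lsf_curr
    return out

f = interpolate_lsfs([0.2], [0.4])
# f[0..15] stay at 0.2; f[16] steps to 0.2 + 0.2 * 16/320 = 0.21
```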
Figure 5.3: Analysis of SEC parameters
5.4 Estimation and Quantization of Harmonic Parameters
The harmonic parameters are derived from an analysis of the unquantized residual.
The unquantized residual is analyzed over synthesis frames. It should be noted that
the analysis frame used to compute LSFs and the unquantized residual is not related to
the synthesis frame as quantization of the residual through sinusoidal modelling is an
independent problem. The harmonic parameters and the short term filter parameters
can be updated at different rates as long as they are updated at the same instants of
time as they were computed during analysis.
The choice of synthesis frame size depends on the rate at which harmonic param-
eters vary in speech since each harmonic parameter is estimated at synthesis frame
boundaries and is interpolated linearly within the frame. The synthesis frame length
was chosen to be a submultiple of the LSF analysis frame for convenience. Preliminary
experiments showed that reducing synthesis frame size below 10 ms (80 samples) does
not produce significant improvement in speech quality. For low pitch talkers (P ≥ 80),
this implies determination of harmonic parameters more than once per pitch period.
For the average male or female talker, an 80 sample synthesis frame implies that
harmonic parameters are estimated once every 2-3 pitch periods. In our coder, a
synthesis frame size L_s of 80 samples, corresponding to 10 ms, was chosen.
5.4.1 Pitch Estimation
Correct pitch estimation is very important in harmonic coders as the fundamental
frequency is determined by the pitch frequency. MBE employs a closed loop esti-
mation of pitch by jointly estimating the harmonic magnitudes and pitch through
minimization of a spectral estimation error function. Closed loop pitch estimation is
very expensive in terms of computation and therefore an open loop pitch estimation
is used here.
The more popular open loop pitch estimation procedures (Normalized Autocorrelation Method, SIFT) [65, 76, 77, 78] use autocorrelation of the speech or residual signal to measure similarity between the original and shifted versions of a speech/residual segment. The distance function minimized in these procedures is

$$E(\tau, \beta) = \sum_{n=0}^{N-1} \left[ s(n) - \beta s(n-\tau) \right]^2 \qquad (5.3)$$

where N is the analysis frame length, $\tau$ is the shift, and $\beta$ is a scaling factor to take into account changes in signal energy with time. The optimum value for $\beta$ is found by setting the partial derivative $\partial E(\tau, \beta)/\partial \beta$ to 0 and solving for $\beta$. This gives

$$\beta_{opt} = \frac{\sum_{n} s(n)\, s(n-\tau)}{\sum_{n} s^2(n-\tau)}.$$
Substituting the optimum $\beta$ in Eq. (5.3), the distance function to be minimized becomes

$$E(\tau) = \sum_{n} s^2(n) - \frac{\left[ \sum_{n} s(n)\, s(n-\tau) \right]^2}{\sum_{n} s^2(n-\tau)}.$$

This is equivalent to maximizing the square of the normalized autocorrelation function

$$R_n^2(\tau) = \frac{\left[ \sum_{n} s(n)\, s(n-\tau) \right]^2}{\sum_{n} s^2(n) \sum_{n} s^2(n-\tau)}.$$

Maximizing the square of the autocorrelation can give wrong results as the correlation may be negative. Since only positive correlations are of interest, the function maximized should be

$$R_n(\tau) = \frac{\sum_{n} s(n)\, s(n-\tau)}{\sqrt{\sum_{n} s^2(n) \sum_{n} s^2(n-\tau)}}.$$

$R_n(\tau)$ is computed for $P_{min} \le \tau \le P_{max}$, where $P_{min}$ and $P_{max}$ are the minimum and maximum pitch values of interest, and the value of $\tau$ that maximizes the function is the value of pitch.
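The open loop estimator above can be sketched as a direct search over candidate lags. This is a minimal illustration of maximizing $R_n(\tau)$, not the thesis implementation; the function and parameter names are ours:

```python
import numpy as np

def autocorr_pitch(s, p_min=20, p_max=140):
    """Open-loop pitch estimate: maximize the normalized autocorrelation
    R_n(tau) over the candidate lag range [p_min, p_max]. Sketch only."""
    s = np.asarray(s, dtype=float)
    best_tau, best_rn = 0, -np.inf
    for tau in range(p_min, p_max + 1):
        a, b = s[tau:], s[:-tau]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom == 0.0:
            continue
        rn = np.dot(a, b) / denom      # keep the sign: only positive correlations qualify
        if rn > best_rn:
            best_rn, best_tau = rn, tau
    return best_tau, best_rn
```

Note that, exactly as the text warns, a plain maximization like this is prone to pitch doubling: every multiple of the true period also scores highly.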
In a real implementation, given a pitch estimation window of length N,, the func-
if no sample outside the estimation window is to be used. An alternative formulation
with negative shifts gives the function to be maximized as
Autocorrelation based pitch estimators have a high computational complexity and
frequent occurrences of pitch doubling/halving. A geometric pitch detector was devel-
oped that works on peak detection on both speech and the residual signal. Subsequent
pitch intervals are marked, and one is able to obtain estimates of subsequent pitch
cycles. These pitch periods may be averaged if one desires to obtain an average pitch
period. The algorithm also maintains a track length of pitch values that fall within a
threshold around the previous pitch value. Usually, the track length increases through-
out a voiced segment and is reset to zero at the beginning of an unvoiced segment. A
default pitch value of 0 is returned for unvoiced frames. The decoder, upon receiving
a value of 0 for pitch, replaces it by the synthesis frame length (80 in our coder) for
synthesizing an aperiodic sequence.
The heart of the pitch estimator is a peak detector that detects peaks in the
speech/residual signal within a pitch estimation window. These peaks are subse-
quently examined and those that mark a pitch cycle are retained. Autocorrelation
computations are performed when peak detection fails or is deemed to be unreliable.
A tracking procedure is used to minimize the occurrence of pitch doubling/halving.
The Peak Detector
Different segments of speech/residual can have larger positive or larger negative peaks.
Usually, one voiced segment has large peaks of one polarity and the peak polarity can
change for different voiced segments (phonemes). Generally, detecting peaks that
represent pitch pulses is a non-trivial problem, but it becomes easier if one is able to
detect one pitch pulse successfully and then search around that peak. In general, the
sample with the largest magnitude within the pitch estimation window represents a
pitch pulse if the sample in question is not a boundary sample. In case the sample
with the largest magnitude is on the boundary, the search for the maximum can be
restricted to a smaller part of the window till a qualifying peak is found.
The peak detection procedure is described below where all operations are carried
out within a pitch estimation window of length $N_w$.
1. Find the maximum sample value, $s_{max}$, and the minimum sample value, $s_{min}$, for the signal s(n) and save the indices.

2. If $|s_{min}| > |s_{max}|$, multiply all samples by -1 (reverse the polarity of all samples so that the largest peak is positive) to obtain a sign corrected signal x(n). Choose the appropriate index from the two indices retained in the previous step as the index, $i_{max}$, to the largest value in the sign corrected signal.
All subsequent references to the signal will mean the sign corrected signal x(n).
3. If the maximum sample is an end sample ($i_{max} = 0$ or $N_w - 1$), then do the search again ignoring $\alpha N_w$ samples from the end where the maximum sample was found. For example, if $i_{max} = 0$, repeat the search for the maximum within samples $\alpha N_w$ to $N_w - 1$. $\alpha$ is a suitable constant less than 1. In the pitch detector implemented, $\alpha$ was 1/3. Iterate this step till a valid peak is found.
4. Once the largest peak in the signal is found, search for other peaks to the left
and right of the largest peak making sure that no successive peaks are separated
by less than $N_g$ samples, where $N_g$, called the guard period, is a function of the
previous value of average pitch, $P_{-1}$, and the number of consecutive voiced
segments, called the tracklength. If two peaks are found within $N_g$ samples,
retain the larger of the two.
5. If the next peak is separated by more than $N_g$ samples from the previously retained peak, keep it if it is larger than or exceeds $T_{low}$ times the previous peak value, where $T_{low}$ is a preset threshold for qualifying lower peaks.
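Steps 1-5 above can be sketched as follows. This is a simplified illustration, not the implemented detector: the leftward scan from the main peak and the adaptive guard period are omitted, and all names are ours.

```python
import numpy as np

def detect_pitch_peaks(s, n_guard, t_low=0.5):
    """Sketch of steps 1-5 of the peak detector, with alpha = 1/3.
    Returns indices of retained peaks in the sign-corrected signal;
    the scan to the left of the main peak is omitted for brevity."""
    x = np.asarray(s, dtype=float)
    if abs(x.min()) > abs(x.max()):          # step 2: make the largest peak positive
        x = -x
    n = len(x)
    lo, hi = 0, n
    i_max = int(np.argmax(x))
    while i_max in (lo, hi - 1) and hi - lo > n // 3:   # step 3: skip boundary maxima
        if i_max == lo:
            lo += n // 3                     # ignore alpha*N_w samples from that end
        else:
            hi -= n // 3
        i_max = lo + int(np.argmax(x[lo:hi]))
    peaks = [i_max]
    for i in range(i_max + 1, n - 1):        # steps 4-5: scan to the right
        is_peak = x[i] > x[i - 1] and x[i] >= x[i + 1]
        if is_peak and x[i] > t_low * x[peaks[-1]]:
            if i - peaks[-1] < n_guard:      # within the guard period: keep the larger
                if x[i] > x[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks
```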
The peak detection algorithm is applied to both the speech and the corresponding
residual signal and the number of pitch cycles obtained from both signals are counted.
If the number of pitch cycles obtained is less than an expected minimum number of
pitch periods (determined from the previous average pitch value), the value of the
threshold, Tlow, for qualifying lower peaks, is reduced and the peak detection procedure
is applied again. If $T_{low}$ goes below a lower limit $T_{min}$ and still the required minimum
number of cycles have not been found, the segment is declared unvoiced and the pitch,
P, is set to 0. It should be noted that the number of pitch cycles obtained from the
residual and the speech signal may not be the same.
Pitch Computation
Once the values of possible pitch periods, p(n), (differences between successive peak indices) within a pitch estimation window have been found for both the residual and the speech signal, the periodicity of the peaks is checked by computing a mean normalized sample standard deviation defined as

$$\sigma_n(\mathbf{p}) = \frac{\sigma(\mathbf{p})}{\mu(\mathbf{p})}$$

where

$$\mu(\mathbf{p}) = \frac{1}{N_p} \sum_{n=1}^{N_p} p(n), \qquad \sigma(\mathbf{p}) = \sqrt{\frac{1}{N_p - 1} \sum_{n=1}^{N_p} \left[ p(n) - \mu(\mathbf{p}) \right]^2}.$$

$N_p$ is the number of possible pitch periods and $\mathbf{p}$ is the vector of possible pitch periods.
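The periodicity check amounts to a coefficient of variation of the candidate periods; a small value means evenly spaced pitch pulses. A minimal sketch (the use of the $N_p - 1$ sample deviation is our assumption):

```python
import numpy as np

def mean_normalized_std(p):
    """Periodicity check: sample standard deviation of the candidate
    pitch periods p(n), normalized by their mean (a coefficient of
    variation). Small values indicate evenly spaced pitch pulses."""
    p = np.asarray(p, dtype=float)
    sigma = p.std(ddof=1) if len(p) > 1 else 0.0   # (N_p - 1) sample deviation
    return sigma / p.mean()
```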
We will use the subscript s to indicate parameters related to the speech signal,
and the subscript e to indicate parameters related to the residual signal.
The normalized standard deviations - $\sigma_n(\mathbf{p}_s)$ for the speech signal, and $\sigma_n(\mathbf{p}_e)$ for the residual signal - are computed along with the respective mean values $\mu(\mathbf{p}_s)$ and $\mu(\mathbf{p}_e)$.

Parameters derived from the speech signal are given priority over those determined from the residual for pitch estimation, and an average pitch is computed through a complex heuristic using a sequence of tests. The sequence of tests performed for
computing average pitch is described in Appendix C.
It has been found that the use of both speech and residual waveform is very helpful
as peaks are sometimes more easily discernible in one compared to the other. Pitch
values from 20 to 140 are considered in this algorithm and the pitch analysis is done
over an analysis window of 240 samples (30 ms).
The algorithm has been found to be very robust under normal recording condi-
tions. No characterization of the algorithm was done under noisy recording conditions.
Figure 5.4 shows typical results from an autocorrelation based pitch detector (labelled
p) with no pitch tracking and the pitch contour obtained from the geometric pitch
detector (labelled p_g) described here which uses pitch tracking. Although the geo-
metric pitch detector uses both speech and residual signals, this does not lead to an
increase in complexity for our coder as the residual is already available and does not
need to be specially computed just for pitch determination.
Figure 5.5 shows a typical plot for a voiced segment where all peaks in the speech
and the residual signal have been marked as detected in this algorithm.
5.4.2 Modelling of Harmonic Phases
There has been two basic approaches to phase modelling in the past. In one approach
[54], the harmonic frequency is assumed to be varying linearly with time between
measurement points on the speech signal, and in another [82, 831, the variation in
frequency is assumed to be quadratic. In the original work by Griffin [54], fundamental
frequency and phases of all harmonics were measured every 20 ms and linear frequency
interpolation was used to satisfy four boundary conditions - two at the start and two
Figure 5.4: Performance of the geometric pitch detector.
at the end of a measurement interval. If $\psi_m$ and $\phi_m$ indicate the measured and predicted phases for the m-th harmonic, and $\omega_m$ indicates the measured frequency of the m-th harmonic, then the boundary conditions are

$$\theta_m(0) = \psi_m(0), \qquad \dot{\theta}_m(0) = \omega_m(0) \qquad (5.11), (5.12)$$

$$\theta_m(L) = \psi_m(L), \qquad \dot{\theta}_m(L) = \omega_m(L) \qquad (5.13), (5.14)$$

where L is the interval between measurements and $\theta_m(t)$ is the model phase.
Obviously, a linear change in frequency, which gives rise to a quadratic interpola-
tion model for phase, cannot satisfy all four conditions as a quadratic has only three
coefficients. Griffin [54] allowed the model frequency track to be raised or lowered
by a small amount Awm to force the phase matching conditions at the expense of
violating the frequency matching conditions. He proposed a model where the phase
Figure 5.5: Pitch pulses marked by the pitch detector.
is given in the continuous time domain by

$$\theta_m(t) = \theta_m(0) + \int_0^t \omega_m(\tau)\, d\tau$$

and

$$\omega_m(t) = \omega_m(0) + \left[ \omega_m(L) - \omega_m(0) \right] \frac{t}{L} + \Delta\omega_m. \qquad (5.16)$$
In the second basic approach, McAulay [82] used a cubic phase interpolation model which could actually satisfy all four boundary conditions. His model can be written as

$$\theta_m(t) = \psi_m(0) + a_m t + b_m t^2 + c_m t^3,$$

giving rise to a quadratic frequency change over the interpolation interval

$$\omega_m(t) = a_m + 2 b_m t + 3 c_m t^2.$$

The parameters $a_m$, $b_m$, and $c_m$ can be solved for the given boundary conditions (Eq. (5.11) - Eq. (5.14)) and the following results are obtained:

$$a_m = \omega_m(0)$$

$$b_m = \frac{3}{L^2} \left[ \psi_m(L) - \psi_m(0) - \omega_m(0) L \right] - \frac{\omega_m(L) - \omega_m(0)}{L}$$

$$c_m = -\frac{2}{L^3} \left[ \psi_m(L) - \psi_m(0) - \omega_m(0) L \right] + \frac{\omega_m(L) - \omega_m(0)}{L^2}.$$
Since the measured phase $\psi_m(t)$ has an uncertainty of $2\pi M_m$, where $M_m$ is an integer, the term $\psi_m(L)$ is replaced by $\psi_m(L) + 2\pi M_m$ in the above equations. The value of $M_m$ is determined from a frequency smoothness criterion such that the functional

$$f(M_m) = \int_0^L \left[ \ddot{\theta}_m(t; M_m) \right]^2 dt$$

is minimized. The functional $f(\cdot)$ measures the deviation of the frequency track from a constant frequency (which would result in $d^2\theta_m/dt^2$ being zero).
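The cubic model and the smoothness criterion can be illustrated as follows. This sketch brute-forces the unwrapping integer $M_m$ over a small range instead of using a closed-form estimate; the function and variable names are ours.

```python
import numpy as np

def cubic_phase_params(psi0, w0, psiL, wL, L):
    """McAulay-Quatieri style cubic phase interpolation: solve
    theta(t) = psi0 + w0*t + b*t**2 + c*t**3 so that all four boundary
    conditions hold, picking the 2*pi*M phase ambiguity that minimizes
    the integral of theta''(t)**2 (the smoothest frequency track)."""
    best = None
    for M in range(-10, 11):                 # brute-force the unwrapping integer
        eta = psiL + 2 * np.pi * M - psi0 - w0 * L
        b = 3 * eta / L**2 - (wL - w0) / L
        c = -2 * eta / L**3 + (wL - w0) / L**2
        # integral of (2b + 6ct)^2 over [0, L], in closed form
        smooth = 4*b*b*L + 12*b*c*L**2 + 12*c*c*L**3
        if best is None or smooth < best[0]:
            best = (smooth, b, c, M)
    return best[1], best[2], best[3]
```

For a constant-frequency track the smoothest choice of M makes b and c vanish, recovering a purely linear phase.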
Although the quadratic interpolation model fails to satisfy all boundary conditions, it is more popular because it needs a smaller number of parameters and can be simplified further, requiring no transmission of parameters other than the pitch period. This is achieved by replacing $\Delta\omega_m$ in equation 5.16 by a suitable random number.
In the discrete time domain, if $\psi_m(n)$ denotes the measured phase of the m-th harmonic, and $\theta_m(n)$ denotes the model phase used in the synthesis equation, the quadratic phase interpolation gives rise to the following equations:

$$\theta_m(n) = \theta_m(0) + \omega_m(0)\, n + \left[ \omega_m(N) - \omega_m(0) \right] \frac{n^2}{2N} + \Delta\omega_m\, n$$

$$\Delta\omega_m = \frac{\psi_m(N) - \phi_m(N)}{N} \qquad (5.24)$$

where n = N is the end of the current frame or, equivalently, the beginning of the next frame. If $\phi_m(n)$ denotes the predicted phase from an assumption of linear frequency change over the frame, then

$$\phi_m(N) = \theta_m(0) + \left[ \omega_m(0) + \omega_m(N) \right] \frac{N}{2}. \qquad (5.25)$$
For a coder completely based on predicted phase, the measured phase $\psi_m(N)$ in equation 5.24 is replaced by the predicted phase $\phi_m(N)$ from equation 5.25, making $\Delta\omega_m = 0$ as expected. In Griffin's original MBE coder [54, 52], the difference, $\psi_m(N) - \phi_m(N)$, between measured phase and predicted phase was quantized and transmitted for every harmonic. In the IMBE (INMARSAT-M) [32] system, the quantity $\psi_m(N) - \phi_m(N)$ is modelled as

$$\psi_m(N) - \phi_m(N) = \lambda\, r_m \qquad (5.26)$$
where $r_m$ is a random number uniformly distributed between $[-\pi, \pi]$, and $\lambda$ is the fraction of unvoiced energy in the frame being synthesized, estimated as the ratio of the number of unvoiced bands to the total number of frequency bands (usually between 6 and 12) used. Thus, encoding of phases is avoided in the IMBE system.
Numerous studies have been made using the sinusoidal synthesis model for speech [3, 16, 54, 55, 80, 82, 83, 85]. All of them use a phase model based on a predicted value
of phase and add a random component to it depending on a voicing probability that
measures how close the speech frame is to being voiced. The voicing probability is
measured based on a goodness of fit of the sinusoidal model [83] or from a normalized
autocorrelation coefficient at the pitch lag [25].
Figure 5.6: Difference between measured and predicted phase changes for a voiced frame.
The difference between the measured phase changes, $\Delta\psi_m = \psi_m(N) - \psi_m(0)$,
for each harmonic and the predicted phase changes, $\Delta\phi_m = \phi_m(N) - \phi_m(0)$, is plotted
against the harmonic number m in Figure 5.6 for a voiced frame. It can be noted that
the deviation from the predicted phase increases with frequency. Phase changes for
harmonic components can be measured over an unvoiced segment if an assumption
is made about the fundamental frequency (pitch period). As already pointed out
earlier, our coder uses a pitch period equal to the synthesis frame length in order
i r CHAPTER 5. A LOW RATE SPECTRAL EXCITATION CODER
to synthesize an unvoiced segment. Under this assumption, the phase changes for each harmonic over an unvoiced frame are shown in Fig. 5.7. The difference between
Figure 5.7: Difference between measured and predicted phase changes for an unvoiced frame.
measured and predicted phase changes for unvoiced speech is random and does not
exhibit any pattern.
In modelling phase changes for each harmonic, the measured phase change, $\Delta\psi_m(N)$, in Eq. (5.24) is replaced by the predicted phase change, $\Delta\phi_m(N)$, plus a random phase $r_m$:

$$\Delta\psi_m(N) = \Delta\phi_m(N) + r_m \qquad (5.27)$$

where

$$r_m = \begin{cases} s_g \dfrac{m}{M} \lambda \pi & \text{for voiced speech} \\ s_g \lambda \pi & \text{for unvoiced speech.} \end{cases} \qquad (5.28)$$

Here $s_g$ is an empirical constant called the phase-scatter gain, $\lambda$ is a uniformly distributed random number between $[-1, 1]$, and M is the number of harmonics for the frame being synthesized. In our coder $s_g$ was set to 1.0.
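A sketch of the dispersion model follows. The exact voiced scatter law of Eq. (5.28) is partly illegible in the source, so the $m/M$ scaling below is our assumption, motivated by the observation in Fig. 5.6 that the deviation from the predicted phase grows with harmonic number; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)   # encoder and decoder would share this seed

def dispersed_phase_change(pred_dphi, m, M, voiced, s_g=1.0):
    """Phase model of Eq. (5.27): the measured phase change is replaced
    by the predicted one plus a random scatter r_m. Voiced harmonics get
    a deviation growing with harmonic number m (an assumption here);
    unvoiced harmonics get a fully random phase in [-pi, pi]."""
    lam = rng.uniform(-1.0, 1.0)
    if voiced:
        r_m = s_g * (m / M) * lam * np.pi
    else:
        r_m = s_g * lam * np.pi
    return pred_dphi + r_m
```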
5.4.3 Estimation and Quantization of Harmonic Magnitudes
The harmonic magnitudes can be estimated in several ways. In MBE [54], the magnitudes are derived from the spectrum of windowed speech. Since the windowed spectrum depends both on the magnitudes and the pitch, both harmonic magnitudes and pitch are estimated as a solution to a joint optimization problem. In STC [82, 83], the harmonic magnitudes are measured from a periodogram obtained with a Hamming window at least 2.5 times the average pitch period and suitably normalized.
In the TFI technique [97], the harmonic magnitudes are measured from a pitch sized
DFT of the speech segment with a rectangular window.
The use of a pitch sized DFT to compute harmonic magnitudes is particularly attractive for sinusoidal synthesis systems because it can provide an exact synthesis
of the frame if unquantized harmonic magnitudes are used along with unquantized
DFT phases. This is shown below starting from the definition of DFT.
Let x(n) be the speech signal and P be the pitch period in number of samples. Also let X(k) be the P-point DFT of x(n) obtained as

$$X(k) = \sum_{n=0}^{P-1} x(n)\, e^{-j \frac{2\pi}{P} k n}, \qquad k = 0, 1, \ldots, P-1.$$

Then the signal x(n) can be written as

$$x(n) = \frac{1}{P} \sum_{k=0}^{P-1} X(k)\, e^{j \frac{2\pi}{P} k n}.$$
If P is odd, the frequency sampling points are as shown in Fig. 5.8(a) and all samples can be paired with their complex conjugate terms except for the one at $\omega = 0$. The case when P is even is shown in Fig. 5.8(b) and similar pairs can be formed except for the samples at $\omega = 0$ and $\omega = \pi$. We can represent x(n) for these two cases as follows.
Case P odd:

$$x(n) = \frac{X(0)}{P} + \sum_{k=1}^{(P-1)/2} \frac{2 \|X(k)\|}{P} \cos\left( \frac{2\pi}{P} k n + \phi_k \right)$$

Case P even:

$$x(n) = \frac{X(0)}{P} + \sum_{k=1}^{P/2 - 1} \frac{2 \|X(k)\|}{P} \cos\left( \frac{2\pi}{P} k n + \phi_k \right) + \frac{X(P/2)}{P} \cos(\pi n)$$

where $\phi_k = \arg[X(k)]$.

(a) P odd (b) P even
Figure 5.8: Frequency sampling points for a P-point DFT
Assuming that the signal energy is practically zero at $\omega = 0$, it is evident that the signal x(n) can be exactly represented as a sum of harmonics

$$x(n) = \sum_{k=1}^{M} A_k \cos\left( \frac{2\pi}{P} k n + \phi_k \right)$$

where $M = \lfloor P/2 \rfloor$ and each harmonic magnitude is given by

$$A_k = \frac{2 \|X(k)\|}{P},$$

except for k = P/2 when P is even, in which case

$$A_{P/2} = \frac{\|X(P/2)\|}{P}.$$
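The exactness of the pitch-sized DFT representation is easy to verify numerically. The sketch below retains the DC term, so the reconstruction is exact even when the energy at $\omega = 0$ is not negligible; the function name is ours.

```python
import numpy as np

def harmonic_analysis(x_cycle):
    """Measure harmonic magnitudes A_k and phases phi_k from a pitch-sized
    DFT of one cycle, then resynthesize the cycle as a sum of harmonics.
    With unquantized magnitudes and phases the reconstruction is exact."""
    P = len(x_cycle)
    X = np.fft.fft(x_cycle)
    M = P // 2
    A = 2.0 * np.abs(X[1:M + 1]) / P
    if P % 2 == 0:
        A[-1] = np.abs(X[M]) / P          # the Nyquist harmonic is not paired
    phi = np.angle(X[1:M + 1])
    n = np.arange(P)
    synth = X[0].real / P + sum(          # DC term plus the M harmonics
        A[k - 1] * np.cos(2 * np.pi * k * n / P + phi[k - 1])
        for k in range(1, M + 1))
    return A, phi, synth
```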
Quantization of harmonic magnitudes poses a special problem in sinusoidal syn-
thesis systems since the number of harmonic magnitudes varies with pitch. Therefore,
the number of magnitudes to be quantized changes with the talker and also within a
sentence spoken by one person. IMBE uses a very elaborate coding scheme to code
this variable dimension magnitude vector with the same number of bits. Recently,
the problem has been addressed by two techniques known as the variable dimension
vector quantization (VDVQ) [26] and the Non-Square Transform Vector Quantization
(NSTVQ) [74].
As has been shown earlier in this chapter (Fig. 5.1), the residual signal has a
relatively flat spectral magnitude and an elaborate quantization scheme may not be
required for reasonable speech quality at a low bit rate. A novel 0-bit quantization of
harmonic spectral shape is used here along with scalar quantization of the energy of
the magnitude vector in quantizing the harmonic spectral magnitude vector.
The coder measures and transmits the value of pitch using 7 bits that can encode
128 pitch values. The pitch value in our coder is allowed to vary between 20 and 140
giving rise to 121 different values. Thus there are 7 unused codes that can be used to
transmit other information whenever a pitch value is not transmitted. We transmit
a pitch code of 0 for unvoiced speech segments. The decoder, upon receiving a zero,
sets the pitch to the synthesis frame length and also takes note of the fact that the
segment was unvoiced. If a non-zero pitch code was received, then the segment was
voiced.
Once the V/UV classification of the segment to be synthesized is known, the
decoder uses two harmonic spectral shape templates - one for voiced and another
for unvoiced speech - to obtain harmonic magnitudes for synthesis by sampling the
templates at appropriate sampling points given by

$$f_i = \frac{i F_s}{P} \text{ Hz}, \qquad i = 1, \ldots, M,$$

where $F_s$ is the sampling frequency in Hz and P is the pitch period in number of samples.
The templates were created from two sets of training vectors, one for voiced and
another for unvoiced speech. Each vector in the training set was 257 points long
corresponding to the frequency range of $[0, \pi]$. Each training vector was created as
follows:
i) Obtain one pitch cycle of voiced segment or synthesis frame size (80 sam-
ples in our coder) of unvoiced segment.
ii) Zero pad to N(= 512) samples and compute FFT magnitudes.
iii) Take the first N/2 + 1 (= 257) elements of the FFT magnitudes to form the
FFT magnitude vector.
iv) Compute norm of the FFT magnitude vector and normalize the vector by
dividing by the norm.
v) Compute logarithm to the base 10 for each element of the normalized
vector.
The template is then computed as the centroid of all normalized log FFT magnitude vectors:

$$\mathbf{c} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{S}_t$$

where $\mathbf{S}_t$ is a training vector of FFT magnitudes processed as above and T is the number of training vectors.
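Steps i)-v) and the centroid computation can be sketched as follows. The small floor inside the logarithm is our addition, to guard against exactly zero spectral bins; the names are ours.

```python
import numpy as np

N = 512  # FFT size; templates have N//2 + 1 = 257 points over [0, pi]

def training_vector(segment):
    """Steps i)-v): one normalized log-magnitude training vector from one
    pitch cycle (voiced) or one synthesis frame (unvoiced) of samples."""
    X = np.abs(np.fft.fft(segment, n=N))[:N // 2 + 1]   # zero-pad, keep [0, pi]
    X = X / np.linalg.norm(X)                           # unit-norm magnitude vector
    return np.log10(X + 1e-12)                          # floor avoids log10(0)

def make_template(segments):
    """Template = centroid of all normalized log FFT magnitude vectors."""
    return np.mean([training_vector(s) for s in segments], axis=0)
```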
Now, the quantization of harmonic magnitudes can be described as follows:
i) Given the harmonic magnitude vector A and the quantized pitch $\hat{P}$ (that has embedded V/UV information), choose the voiced or unvoiced template of log spectral magnitudes, c.

ii) If the value of $\hat{P}$ was less than $P_{min}$, replace the value by the length of the synthesis frame.

iii) Compute the log harmonic magnitude shape vector $x = \{x_i : i = 1, \ldots, M\}$ as follows:
Figure 5.9: Log magnitude spectrum templates for voiced and unvoiced speech
• $f_i = Ni/P$
• $k_i = \lfloor f_i \rfloor$
• $k_j = k_i + 1$
• if $k_j > N/2 + 1$, $k_j = N/2 + 1$
• $x_i = c_{k_i} + (c_{k_j} - c_{k_i})(f_i - k_i) + r_g \lambda$, where $\{c_l : l = 1, \ldots, N/2 + 1\}$ are elements of the template vector c, $r_g$ is the magnitude randomization gain and $\lambda$ is a uniformly distributed random number in the range $[-1, 1]$. The random component is added to avoid excessive similarity of spectra for consecutive speech segments.

iv) Compute the harmonic magnitude shape vector y as $y_i = 10^{x_i}$, $i = 1, \ldots, M$.

v) Compute the gain, $g = \|A\| / \|y\|$.

vi) Scalar quantize g using $b_g$ bits to $\hat{g}$.

vii) Compute the quantized harmonic magnitude vector, $\hat{A} = \hat{g}\, y$.
A value of 0.1 to 0.2 for $r_g$ gave good results. The gain g was quantized by using a uniform quantizer on $\log_{10} g$. It should be noted that the quantization scheme presented above needs to have identical random number sequences generated both at the encoder and the decoder. This can be easily achieved by using identical random
number generators at the encoder and decoder, and initializing them with the same
seed.
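The encoder-side steps i)-vii) can be sketched as follows. Gain quantization is left out, and the 0-based array indexing of the template is a simplification of the 1-based indexing in the text; the function and parameter names are ours.

```python
import numpy as np

def quantize_magnitudes(A, P, template, N=512, r_g=0.2, rng=None):
    """Steps i)-vii) of the 0-bit shape quantizer: sample the stored
    log-magnitude template at the harmonic frequencies f_i = N*i/P,
    optionally add randomization, and carry only a gain. Sketch only."""
    M = len(A)
    i = np.arange(1, M + 1)
    f = N * i / P
    ki = np.floor(f).astype(int)
    kj = np.minimum(ki + 1, N // 2)               # clamp at the template edge
    ki = np.minimum(ki, N // 2)
    x = template[ki] + (template[kj] - template[ki]) * (f - ki)
    if rng is not None:
        x = x + r_g * rng.uniform(-1.0, 1.0, M)   # decorrelate consecutive frames
    y = 10.0 ** x
    g = np.linalg.norm(A) / np.linalg.norm(y)     # in the coder, g is quantized
    return g * y
```

With a flat (all-zero) log template and no randomization, the gain alone recovers a flat magnitude vector exactly, which is the sense in which only the energy is coded.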
An alternate quantization scheme that works as well and does not need synchro-
nized random number generators is as follows.
Encoder:
i) Given the harmonic magnitude vector A and the quantized pitch P (that
has embedded V/UV information), choose the voiced or unvoiced template
of log spectral magnitudes, c.
ii) If $\hat{P} < P_{min}$, replace the value by the length of the synthesis frame.

iii) Compute the log harmonic magnitude shape vector $x = \{x_i : i = 1, \ldots, M\}$ as follows:

• $f_i = Ni/P$
• $k_i = \lfloor f_i \rfloor$
• $k_j = k_i + 1$
• if $k_j > N/2 + 1$, $k_j = N/2 + 1$
• $x_i = c_{k_i} + (c_{k_j} - c_{k_i})(f_i - k_i)$, where $\{c_l : l = 1, \ldots, N/2 + 1\}$ are elements of the template vector c.

iv) Compute the harmonic magnitude shape vector y as $y_i = 10^{x_i}$, $i = 1, \ldots, M$.

v) Compute the gain, $g = \|A\| / \|y\|$.

vi) Scalar quantize g using $b_g$ bits to $\hat{g}$.
Decoder:
i) Given the value of $\hat{P}$ (that has embedded V/UV information), choose the voiced or unvoiced template of log spectral magnitudes, c.

ii) If $\hat{P} < P_{min}$, replace the value by the length of the synthesis frame.

iii) Compute the log harmonic magnitude shape vector $x = \{x_i : i = 1, \ldots, M\}$ as follows:
• $f_i = Ni/P$
• $k_i = \lfloor f_i \rfloor$
• $k_j = k_i + 1$
• if $k_j > N/2 + 1$, $k_j = N/2 + 1$
• $x_i = c_{k_i} + (c_{k_j} - c_{k_i})(f_i - k_i)$, where $\{c_l : l = 1, \ldots, N/2 + 1\}$ are elements of the template vector c.

iv) Compute the harmonic magnitude shape vector y as $y_i = 10^{x_i}$, $i = 1, \ldots, M$.

v) Add random components to the log magnitude shape vector x to generate the randomized shape vector x':

$$x_i' = x_i + r_g \lambda$$

where $r_g$ is the magnitude randomization gain and $\lambda$ is a uniformly distributed random number in the range $[-1, 1]$.

vi) Compute the randomized magnitude shape vector y' as $y_i' = 10^{x_i'}$, $i = 1, \ldots, M$.

vii) Compute the quantized harmonic magnitude vector, $\hat{A} = \hat{g}\, y'$.
5.5 An 1800 bps Spectral Excitation Coder
The schematic diagram of an 1800 bps Spectral Excitation Coder is shown in Fig. 5.10. At the encoder, the speech spectral envelope is estimated every 40 ms using 10th order LPC analysis. The LPC coefficients are transformed to LSFs and quantized using an MSVQ with M-L search [15, 69] at 24 bits/vector. An eight stage MSVQ with eight vectors per stage is used with M = 29 to provide robust transparent quantization of LSFs. The quantized LSFs are interpolated and the analysis filter A(z) is updated once every 2 ms (16 samples) in computing the residual signal e(n). The original speech and the computed residual are used to determine the pitch P using the geometric pitch detector described in subsection 5.4.1. The pitch P is quantized to $\hat{P}$ using 7 bits as follows:

$$\hat{P} = P_{min} - 1 + P_i \qquad (5.42)$$
Figure 5.10: A Low bit rate Spectral Excitation Coder
where $P_i$ is the pitch code transmitted to the decoder. $P_{min} = 20$ in our implementation. The value $P_i = 0$ is reserved to indicate an unvoiced frame.
The harmonic magnitude vector is computed once every 10 ms (80 samples) by
applying a pitch sized rectangular window on the unquantized residual and applying
DFT on the windowed signal. The harmonic magnitude gain g is computed from the
harmonic magnitude vector and sampled voiced or unvoiced log magnitude template
depending on the value of $\hat{P}$ as described earlier. The gain g is quantized to $\hat{g}$ using a scalar logarithmic quantizer with 5 bits (a uniform quantizer on $\log_{10} g$ over the range $[\log_{10} g_{min}, \log_{10} g_{max}]$). $g_{max}$ was 20,000 and $g_{min}$ was 1.0 in our implementation.
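A sketch of the 5-bit logarithmic gain quantizer follows. The exact index mapping was not recoverable from the source, so the uniform mapping over $[\log_{10} g_{min}, \log_{10} g_{max}]$ below is an assumption; the names are ours.

```python
import numpy as np

def quantize_gain(g, bits=5, g_min=1.0, g_max=20000.0):
    """Uniform quantization on log10(g) over [log10(g_min), log10(g_max)],
    as used for the harmonic magnitude gain. Returns (index, ghat)."""
    levels = 2 ** bits
    lo, hi = np.log10(g_min), np.log10(g_max)
    t = (np.log10(np.clip(g, g_min, g_max)) - lo) / (hi - lo)
    idx = int(round(t * (levels - 1)))            # 5-bit code to transmit
    ghat = 10.0 ** (lo + idx * (hi - lo) / (levels - 1))
    return idx, ghat
```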
The information transmitted to the decoder consists of the quantized LSFs, the quantized pitch (which includes V/UV information) and the quantized harmonic magnitude gain.
The decoder computes the harmonic magnitudes from the quantized magnitude
gain, the magnitude randomization gain, and the voiced/unvoiced magnitude tem-
plates stored at the decoder. It also computes the phases for all harmonics using the
phase prediction and dispersion model described earlier.
A magnitude randomization gain of $r_g = 0.2$ was found to give good results and was used in our implementation. The phase scatter gain $s_g$ (see Eq. (5.28)) was set to 1.0.
The quantized residual produced by a sinusoidal oscillator bank using the quan-
tized harmonic magnitudes and phases is passed through a synthesis filter l/A(z) to
produce synthesized speech. The coefficients of the synthesis filter are updated from
the quantized LSFs which are interpolated every 2 ms in synchronization with the
interpolation at the encoder.
The bit allocation for the 1800 bps coder is shown in Table 5.1.

Parameter        Bits   Updates/frame   Rate (bps)
LSFs              24          1             600
Pitch              7          4             700
Exc. Gain          5          4             500
Harmonic Mags      0          0               0
Total                                      1800

Table 5.1: Bit Allocation for the 1800 bps coder

5.5.1 Evaluation of Coder Performance

The performance of the 1800 bps coder was evaluated using an informal Mean Opinion Score (MOS) test. Three codecs - IMBE at 4150 bps, LPC-10e at 2400 bps and our harmonic coder at 1800 bps - were used in this test. Each codec was used to encode 8
sentences - 4 male speakers and 4 female speakers. Coded sentences were played in random order following the original 16-bit PCM sentence through stereo headphones.
Seven participants took part in the informal MOS test which provided 56 ratings
for each codec. The results of the test are shown in Table 5.2. The 1800 bps SEC
performed better for female speech compared to male speech. This is expected since
the number of harmonics in male speech is generally larger than in female speech and
the higher harmonics display a random phase change that is difficult to model. The
1800 bps SEC performed significantly better than 2400 bps LPC-10e and scored more
than 0.3 MOS points higher. IMBE was used as an anchor to validate the MOS scores.
IMBE obtained a MOS score of about 3.5 which is a generally accepted score for IMBE
indicating that the test scores are reliable. The major distortion experienced in the
SEC coder was in nasal sounds. Significant improvement of quality for nasal sounds
should be possible by using a nasal harmonic magnitude template and a classifier.
This, however, was not investigated in this work.
Coder     Rate (bps)   Mean Opinion Score        Variance
                       Male  Female  All     Male  Female  All
IMBE        4150       3.33   3.58  3.46    0.44   0.27   0.36
LPC-10e     2400       2.41   2.88  2.65    0.48   0.27   0.43
SEC         1800       2.84   3.13  2.98    0.33   0.22   0.29

Table 5.2: MOS results

5.6 Conclusions

In this chapter we presented a harmonic coder at 1800 bps that used an MSVQ with M-L search for quantization of LPC parameters developed earlier. The MOS obtained for this coder was significantly higher than that of LPC-10e at 2400 bps. The nasal sounds were the most audibly distorted, showing that the harmonic magnitude shapes for nasal sounds are different from those of non-nasal voiced sounds. This shows that a possible improvement in speech quality can be made by using a separate template for nasal sounds.
Chapter 6
Conclusion and Future Directions
The major focus of this thesis has been efficient quantization of LPC parameters and
a low bit rate (1800 bps) coder has been implemented using the LPC quantization
technique developed here. The structure of a multi-stage vector quantizer has been
analyzed and various search strategies for a multi-stage vector quantizer codebook
along with their inherent problems have been presented.
It has been shown that a multi-candidate search with an appropriate distortion
measure not only provides a lower quantization distortion but it does so at a lower
computational complexity. Several multi-stage structures have been studied and their
performances have been presented. It has been shown that as the number of stages
increases and the number of codewords per stage decreases, more is gained from the
M-L algorithm as M is increased. It has also been shown that transparent coding
of LPC parameters can be done using 22 bits per frame at the same computational
complexity as the 24-bit split VQ [86] which has been considered to be the lowest rate
transparent LPC quantizer so far. At 24 bits/frame, the M-L search technique can achieve transparent quantization at much lower search complexity.
The performance of MSVQ codes has been studied under channel error conditions, with codebook ordering using pseudo-Gray coding. It is shown that while VQ
based systems have lower average spectral distortion and a lower percentage of 2-4
dB outliers even with transmission errors, scalar quantization may lead to a lower
percentage of 4 dB outliers particularly at high error rates.
The robustness of each multi-stage structure has been studied and it has been
found that robust VQ can be achieved by adding suitable structure to the code while
CHAPTER 6. CONCLUSION AND FUTURE DIRECTIONS 116
impairing average performance only slightly. Possible explanation for the improved
robustness of the structured codebooks are weak dependence between code vectors
and the training set, and the ability of structured codebooks to produce spectra not
present in the training set. These properties are particularly important given the fact
that both the training and the test set may not be representative of outliers present
in natural speech.
The low rate spectral excitation coder at 1800 bps uses a novel technique for 0-bit
spectral shape quantization for the excitation signal by hiding the V/UV information
in pitch values. This was achieved because the range of quantized pitch values ($20 \le P \le 140$) does not require the full 7 bits used to represent them.
A new geometric pitch detector was also developed that has much lower compu-
tational complexity compared to an autocorrelation based pitch detector. It can also
provide estimates of individual pitch periods making it suitable for use with pitch
synchronous algorithms.
There are several possibilities for improvement in both the LPC quantizer and
the Spectral Excitation Coder presented here. First, the value of M used in the
M-L search need not be the same for each stage of the codebook. In fact M for
each stage can be chosen from the average quantization distortion from the previous
stage. This should provide some reduction in computational complexity for the same
quantization performance. Next, techniques for selecting the MSVQ structure can be
studied. The codebook size at each stage can be estimated during the design of the
stage by studying the distribution of training vectors at that stage. This of course
applies only to a sequential design of the codebook.
The spectral excitation coder presented in this thesis does not work very well for
nasal sounds. This can be improved by using more spectral templates at the expense
of a small increase in bit rate.
Appendix A
Linear Prediction
Linear Prediction (LP) has become the most widely used method of speech signal
analysis and synthesis since its introduction to speech [9, 78]. The success is usually
attributed to the fact that most sounds from the vocal tract can be modeled by an
all pole structure. Another reason for its success is the nature of human perception
that we perceive spectral peaks rather than spectral nulls. This explains why LP
models work even in the case of nasal and other sounds with spectral zeros. The
LP modeling of speech factorizes the problem of speech coding into two independent
coding problems - coding of the spectral parameters representing the vocal tract and
coding of the excitation to the vocal tract. Out of many possible candidates for
spectral estimation, LP based estimates have been most useful because it addresses
the spectral estimation problem from a deeper point of view, the Maximum Entropy
Principle.
A.1 Conceptual Formulation
Let x be a stochastic process and let X(t) be a column vector of L independent
measurements from L realizations of x,

X(t) = [x_0(t), x_1(t), \ldots, x_{L-1}(t)]^T.   (A.1)

Linear Prediction involves predicting this vector from a set of p previous measurements

X(t-m) = [x_0(t-m), x_1(t-m), \ldots, x_{L-1}(t-m)]^T, \quad 1 \le m \le p.   (A.2)
The forward (in time) predicted value is then written as

\hat{X}_f(t) = \sum_{m=1}^{p} a_m(t)\, X(t-m).   (A.3)
Writing the past observations in matrix form,

\mathcal{X}(t) = [X(t-1), X(t-2), \ldots, X(t-p)],   (A.4)

Eq. (A.3) can be written as

\hat{X}_f(t) = \mathcal{X}(t)\, a(t)   (A.5)

where a(t) = [a_1(t), a_2(t), \ldots, a_p(t)]^T.
The prediction error vector, e(t), can then be written as

e(t) = X(t) - \hat{X}_f(t) = X(t) - \mathcal{X}(t)\, a(t).   (A.6)

The prediction error energy is then given by

\varepsilon(t) = e^T(t)\, e(t).   (A.7)

Let \hat{a}(t) be the set of parameters that minimizes the prediction error, i.e.

\hat{a}(t) = \arg\min_{a(t)} \varepsilon(t),   (A.8)

\hat{\varepsilon}(t) = \min_{a(t)} \varepsilon(t).   (A.9)
From Eqs. (A.7) and (A.8),

\varepsilon(t) = \left[X(t) - \mathcal{X}(t)a(t)\right]^T \left[X(t) - \mathcal{X}(t)a(t)\right].   (A.10)

Setting the partial derivatives with respect to a^T(t) equal to zero,

-2\,\mathcal{X}^T(t)\left[X(t) - \mathcal{X}(t)\hat{a}(t)\right] = 0,   (A.11)

\mathcal{X}^T(t)\mathcal{X}(t)\, \hat{a}(t) = \mathcal{X}^T(t)\, X(t).   (A.12)
Equation (A.12) is known as the normal equation. Now, we can write the forward
predicted value as

\hat{X}_f(t) = \mathcal{X}(t)\hat{a}(t)
             = \mathcal{X}(t)\left(\mathcal{X}^T(t)\mathcal{X}(t)\right)^{-1}\mathcal{X}^T(t)\, X(t)   (A.13)
             = P(t)\, X(t)   (A.14)

where

P(t) = \mathcal{X}(t)\left(\mathcal{X}^T(t)\mathcal{X}(t)\right)^{-1}\mathcal{X}^T(t), \quad P(t) \in \mathbb{R}^{L \times L}.   (A.15)
The following theorem [38] shows that P(t) is a projection operator and \hat{X}_f(t) is the
projection of X(t) on the space spanned by the columns of \mathcal{X}(t), which is the subspace
of past observations.
Theorem A.1 Let W be a subspace of R^n. There is a unique n × n matrix P such
that for each column vector b in R^n, the vector Pb is the projection of b on W. The
projection matrix P can be found by selecting any basis \{a_1, a_2, \ldots, a_k\} for W and
computing P = A(A^T A)^{-1} A^T, where A is the n × k matrix having column vectors
a_1, a_2, \ldots, a_k.
The corresponding minimum prediction error vector is given by

\hat{e}(t) = X(t) - P(t)X(t) = [I - P(t)]\, X(t) = P^{\perp}(t)\, X(t)   (A.16)

where P^{\perp}(t) is the orthogonal complement of P(t).
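The projection interpretation is easy to verify numerically. The sketch below uses toy random data (the matrix shapes and values are our illustration, not the thesis's): the normal-equation solution, the projection operator, and the orthogonality of the error to the past subspace are checked directly.

```python
import numpy as np

# Numerical illustration of Eqs. (A.12)-(A.16): the forward prediction is the
# projection of X(t) onto the span of the past-observation matrix.

rng = np.random.default_rng(1)
L, p = 50, 3
Xmat = rng.standard_normal((L, p))   # columns play the role of X(t-1)...X(t-p)
x = rng.standard_normal(L)           # plays the role of X(t)

a_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ x)   # normal equation (A.12)
P = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T     # projection operator (A.15)

x_f = Xmat @ a_hat                   # forward prediction
assert np.allclose(P @ x, x_f)       # Eq. (A.14): P applied to x gives the prediction
assert np.allclose(P @ P, P)         # P is idempotent, hence a projector
e = x - x_f
assert np.allclose(Xmat.T @ e, 0)    # error is orthogonal to the past subspace
```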
The term \mathcal{X}^T(t)\mathcal{X}(t) in Eq. (A.12) is an estimate of the autocorrelation matrix
of the stochastic process x, and the term \mathcal{X}^T(t)X(t) is an estimate of a vector of
autocorrelation coefficients, since each element in the matrix \mathcal{X}^T(t)\mathcal{X}(t) and the vector
\mathcal{X}^T(t)X(t) is an ensemble average over L realizations. Extending the averaging over
all realizations of x and invoking the properties of stationarity, the normal equation
(A.12) can be written as

R\, a = v   (A.17)
where R is the p × p autocorrelation matrix with elements R_{ij} = r(|i-j|), and
v = [r(1), r(2), \ldots, r(p)]^T is the vector of autocorrelation coefficients.   (A.18)

The autocorrelation matrix, being positive definite, always has an inverse; hence the
predictor coefficients can be computed from the above equations as

a = R^{-1}\, v.   (A.19)
In speech coding, the speech waveform is assumed to be a realization of a stationary
stochastic process. Further, the autocorrelation matrix is estimated from this single
realization under the assumption of ergodicity.
From a discrete signal processing point of view, defining

\hat{x}(n) = \sum_{m=1}^{p} a_m\, x(n-m),

the prediction equation can be written as

\hat{X}(z) = \left(\sum_{m=1}^{p} a_m z^{-m}\right) X(z),

and the prediction error can be written as

E(z) = X(z) - \hat{X}(z) = A(z)\, X(z)

where

A(z) = 1 - \sum_{m=1}^{p} a_m z^{-m}.   (A.24)
Figure A.l: Linear Prediction Model
A(z) is called the inverse filter. The speech analysis and synthesis models can then
be shown as in Fig. A.1.
Comparing with the source-filter model of speech production (Fig. 1.3), the
synthesis filter can easily be identified as the 1/A(z) block. It also gives us a way to
compute an excitation E(z), given the synthesis filter and the speech, X(z), to be
synthesized. This clearly shows the utility of LP analysis: it breaks up a signal x(n)
into a signal e(n) and a filter A(z) which can be quantized independently of each
other.
A.2 Equivalent Representations
There are many different equivalent representations of the filter A(z). Some very useful
ones are Reflection Coefficients, Log Area Ratios, and Line Spectral Frequencies.
They are all derived by considering a backward predictor along with the forward
predictor. For ease of notation, let us write \alpha_i = -a_i, i = 1, 2, \ldots, p. Then Eq. (A.24)
can be written as

A(z) = \sum_{i=0}^{p} \alpha_i z^{-i}   (A.25)

where \alpha_0 = 1. The forward prediction error for a p-th order predictor is then written
as

e_f(n) = \sum_{i=0}^{p} \alpha_i\, x(n-i).   (A.26)
The backward prediction error results from predicting the past from the future and is
written as

e_b(n) = \sum_{i=1}^{p+1} \beta_i\, x(n-i)   (A.27)

where \beta_{p+1} = 1. It can be shown that a relationship exists between the coefficients
\alpha_i and the coefficients \beta_i: one is just the time-reversed sequence of the other [39]. The
reflection coefficients (also known as PARCOR coefficients) are then defined as

k_m = \frac{E\left[e_f^{(m-1)}(n)\, e_b^{(m-1)}(n)\right]}{\sqrt{E\left[\left(e_f^{(m-1)}(n)\right)^2\right] E\left[\left(e_b^{(m-1)}(n)\right)^2\right]}},   (A.28)

i.e., the normalized (partial) correlation between the forward and backward prediction
errors of the (m-1)-th order predictor.
A useful property of the reflection coefficients (RCs) is that the set of reflection coefficients
\{k_1, \ldots, k_p\} for a p-th order predictor is a subset of the coefficients \{k_1, \ldots, k_p, k_{p+1}\}
for the (p+1)-th order predictor. This is not true for the linear prediction coefficients
\alpha_i. However, a recursive relation may be derived between the coefficients of different
orders of prediction:

A_{m+1}(z) = A_m(z) - k_{m+1}\, B_m(z)   (A.29)

where B_m(z) is given by

B_m(z) = z^{-(m+1)}\, A_m(z^{-1}).   (A.30)
These reflection coefficients are closely related to the acoustic reflection coefficients for
the stepped cylinder model [92] of the vocal tract (Fig. A.2). One important property
of the reflection coefficients is that they always lie within the interval [-1, +1]. Thus
the synthesis filter described by the reflection coefficients can easily be checked for
stability when the coefficients are quantized. This cannot be accomplished by a simple
observation of the Linear Prediction coefficients.
Figure A.2: Stepped cylinder model of the vocal tract
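The order-recursive relation above is realized by the classical Levinson-Durbin algorithm, which solves R a = v order by order and yields the reflection coefficients as a by-product. The sketch below is a standard textbook implementation, not the thesis code; the toy autocorrelation sequence is computed from random data.

```python
import numpy as np

# Levinson-Durbin recursion: solves the Toeplitz normal equations R a = v
# order by order, producing the reflection coefficients k_m along the way.

def levinson(r, p):
    """r: autocorrelation values r[0..p].
    Returns (alpha, k, E): inverse-filter coefficients [1, alpha_1..alpha_p],
    reflection coefficients k[1..p], and the final prediction error energy."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]
    k = np.zeros(p)
    for m in range(1, p + 1):
        acc = r[m] + np.dot(a[1:m], r[m-1:0:-1])
        km = -acc / E
        k[m-1] = km
        a[1:m+1] = a[1:m+1] + km * a[m-1::-1]   # order-update of the coefficients
        E *= (1.0 - km * km)                     # error energy shrinks each order
    return a, k, E

rng = np.random.default_rng(2)
x = rng.standard_normal(400)
p = 4
r = np.array([np.dot(x[:len(x)-j], x[j:]) for j in range(p + 1)])
a, k, E = levinson(r, p)

# Cross-check against a direct solve of the normal equations (A.17):
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
assert np.allclose(np.linalg.solve(R, r[1:p+1]), -a[1:])
assert np.all(np.abs(k) < 1) and E > 0   # |k_m| < 1 for a positive definite R
```

Note the sign convention: the returned `a` are the coefficients α_i of the inverse filter A(z) (α_0 = 1), so the predictor coefficients of Eq. (A.19) are −a[1:].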
Although RCs are restricted to values between ±1, the spectral sensitivity for
quantization differs greatly between the regions near ±1 and the region around 0. Therefore
some equivalent representations are often used that have better quantization
properties. These are Log Area Ratios (LARs) and Arc Sine coefficients (ASRCs),
defined as follows:

LAR_i = \log \frac{1 + k_i}{1 - k_i}   (A.31)

and

ASRC_i = \sin^{-1}(k_i).   (A.32)
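The mappings of Eqs. (A.31) and (A.32), together with the LAR inverse, can be written directly; this is a small illustrative sketch (the function names are ours).

```python
import math

# Reflection-coefficient transformations of Eqs. (A.31) and (A.32).

def rc_to_lar(k):
    """Log Area Ratio of a reflection coefficient, |k| < 1."""
    return math.log((1 + k) / (1 - k))

def lar_to_rc(g):
    """Inverse mapping; algebraically equal to tanh(g/2)."""
    return (math.exp(g) - 1) / (math.exp(g) + 1)

def rc_to_asrc(k):
    """Arc Sine coefficient of a reflection coefficient."""
    return math.asin(k)

k = 0.95                      # near the sensitive edge of [-1, +1]
assert abs(lar_to_rc(rc_to_lar(k)) - k) < 1e-12
assert abs(math.sin(rc_to_asrc(k)) - k) < 1e-12
```

Both mappings expand the sensitive regions near ±1, which is why a uniform quantizer in the LAR or ASRC domain behaves like a well-matched nonuniform quantizer in the RC domain.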
Log Area Ratios and Arc Sine Coefficients have more uniform sensitivity properties,
making them more suitable for simple uniform quantization; being derived
from Reflection Coefficients, they can easily be checked for instabilities in the synthesis filter.
Even so, their relationship with the formant structure of the vocal tract transfer
function 1/A(z) is not very straightforward. In fact, a quantization error in only one
coefficient affects the whole spectral envelope. Also, being derived from Reflection
Coefficients, these are essentially parameters operating in the time domain, like the
autocorrelation coefficients. A different representation of the all-pole filter called Line
Spectral Frequencies (LSFs) (i.e., resonant frequencies with an infinite Q, or discrete
frequencies) was introduced in 1975 by F. Itakura [56]. It is worthwhile to note that in
this representation, the vocal-tract-filter parameters are frequency domain parameters,
as they are in the channel vocoder and the formant vocoder.
The LSFs are obtained from the LP coefficients through a transformation de-
scribed below. The main advantage of using LSFs for quantization is the fact that
quantization error in one coefficient results in spectral distortion only around the
neighbourhood of that frequency. Other advantages are their better behaviour under
linear interpolation, and the fact that they may be more readily quantized in accordance
with the properties of auditory perception to save bits (i.e., coarser quantization of
the higher frequency spectral components).
Prediction coefficients may be transformed into LSFs through the decomposition
of the impulse response of the LPC-analysis filter into even and odd time sequences
(Fig. A.3) [62]. This decomposition is reversible because the original impulse response
can be obtained as half the sum of the even and odd time sequences. It is easy to
show that both the even and odd time sequences have roots along the unit circle in
the complex plane. These roots, being on the unit circle, denote resonant frequencies
with infinite Q. Hence, these sequences may be expressed by LSFs.
Figure A.3: Transformation of predictor coefficients to LSFs. (a) Impulse response of a
10th-order LPC analysis filter; (b) time-shifted and time-reversed waveform of (a);
(c) sum of waveforms (a) and (b), an even-symmetric sequence; (d) difference of (a)
and (b), an odd-symmetric sequence.
Let us assume that A_p(z) is given, and construct two (p+1)-th order predictors
P(z) and Q(z) under the conditions k_{p+1} = 1 and k_{p+1} = -1 respectively, i.e.

P(z) = A_{p+1}(z)\big|_{k_{p+1}=1},   (A.33)
Q(z) = A_{p+1}(z)\big|_{k_{p+1}=-1}.   (A.34)

Then, from the recurrence relations (Eq. (A.29)),

P(z) = A_p(z) - z^{-(p+1)}\, A_p(z^{-1}),   (A.35)
Q(z) = A_p(z) + z^{-(p+1)}\, A_p(z^{-1}).   (A.36)
The relationship of these equations with Figure A.3 is easily recognized since B_p(z) =
z^{-(p+1)} A_p(z^{-1}) [42]. The arguments of the complex roots of the difference filter P(z)
and the sum filter Q(z) are called the Line Spectral Frequencies (LSFs); therefore
conversion of LP coefficients to LSFs is, in essence, finding the roots of these two filters.
A.2.1 Computation of Line Spectral Frequencies
The polynomials Q(z) and P(z), being symmetrical and anti-symmetrical respectively,
have roots at z = +1 and/or z = -1 which can be removed by polynomial division:

i) p even:

Q(z) = (1 + z^{-1})\, G_1(z),   (A.37)
P(z) = (1 - z^{-1})\, G_2(z);   (A.38)

ii) p odd:

Q(z) = G_1(z),   (A.39)
P(z) = (1 - z^{-2})\, G_2(z).   (A.40)
Now, G_1(z) and G_2(z) are symmetric and of even order. Since the roots of P(z) and Q(z)
alternate in position on the unit circle, and P(z) always has a root at z = +1 (\omega = 0),
the lowest LSF corresponds to a root of G_1(z).
Let the order of G_1(z) be 2M_1 and the order of G_2(z) be 2M_2. Then M_1 = p/2,
M_2 = p/2 for p even, and M_1 = (p+1)/2, M_2 = (p-1)/2 for p odd. Explicitly
showing the symmetry of the polynomial coefficients,

G_i(z) = \sum_{m=0}^{2M_i} g_i(m)\, z^{-m}, \qquad g_i(m) = g_i(2M_i - m).   (A.41, A.42)

Evaluating on the unit circle, z = e^{j\omega},

G_i(e^{j\omega}) = 2\, e^{-jM_i\omega}\, G_i'(\omega)   (A.43)

where

G_i'(\omega) = \tfrac{1}{2} g_i(M_i) + \sum_{m=1}^{M_i} g_i(M_i - m) \cos m\omega.

The roots of G_1'(\omega) and G_2'(\omega) are the LSFs.
There are several ways the LSFs may be computed. The most common method
is that of Kabal and Ramachandran [61]. In this method, the mapping
x = \cos\omega allows one to express these polynomials in terms of Chebyshev polynomials:

\cos m\omega = T_m(x)   (A.44)

where T_m(x) is an m-th order Chebyshev polynomial in x. Thus, the equations
involving cosines (Eq. (A.43)) can be expressed in terms of Chebyshev polynomials.
This series lends itself to an efficient evaluation [61] which bypasses an expansion in
powers of x. The mapping x = \cos\omega maps the upper semicircle in the z-plane to the
real interval [-1, +1]. Therefore all the roots x_i lie between -1 and +1, with the root
corresponding to the lowest frequency LSF being the one nearest to +1. The series
G_1'(x) is evaluated first on a grid close to +1 for the lowest frequency LSF, and the
search then proceeds alternately on G_1'(x) and G_2'(x), looking for all the LSFs. Once
the roots \{x_i\} of G_1'(x) and G_2'(x) are determined, the corresponding LSFs are given by

\omega_i = \cos^{-1}(x_i).   (A.45)
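For clarity, the sketch below computes LSFs by rooting P(z) and Q(z) directly with numpy instead of the Chebyshev grid search described above; this is an illustrative shortcut, not the efficient method of [61]. The test filter is built from hypothetical stable poles of our own choosing.

```python
import numpy as np

# Illustrative LSF computation: form the sum and difference filters of
# Eqs. (A.35)-(A.36) and take the angles of their upper-half-plane roots.

def lsf(a):
    """a: inverse-filter coefficients [1, alpha_1, ..., alpha_p].
    Returns the p Line Spectral Frequencies in (0, pi), sorted."""
    aa = np.concatenate([a, [0.0]])
    P = aa - aa[::-1]            # difference filter, antisymmetric
    Q = aa + aa[::-1]            # sum filter, symmetric
    w = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # discard the trivial real roots at z = +1 and z = -1 and keep one
        # member of each conjugate pair (positive angle)
        w.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
    return np.sort(np.array(w))

# A stable 4th-order test filter from two conjugate pole pairs (our choice).
poles = np.array([0.9 * np.exp(1j * 0.5), 0.9 * np.exp(-1j * 0.5),
                  0.8 * np.exp(1j * 2.0), 0.8 * np.exp(-1j * 2.0)])
a = np.real(np.poly(poles))
w = lsf(a)
assert len(w) == 4 and np.all(np.diff(w) > 0)
assert np.all((w > 0) & (w < np.pi))
```

For a stable (minimum-phase) A(z), the p angles returned are distinct, strictly increasing, and alternate between roots of the two filters, which is exactly the ordering property the Chebyshev search exploits.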
Other techniques of solving for LSFs starting from LP coefficients are given by
Soong and Juang [99], and Kang and Fransen [62].
Some important relationships of the Line Spectral Frequencies with speech formants,
and one of the root finding techniques, can be understood by rewriting the
expressions for P(z) and Q(z) as follows:

P(z) = A_p(z)\,\left[1 - R_p(z)\right],   (A.46)
Q(z) = A_p(z)\,\left[1 + R_p(z)\right],   (A.47)

where

R_p(z) = \frac{z^{-(p+1)}\, A_p(z^{-1})}{A_p(z)}.   (A.48)

The filter R_p(z) is called the ratio filter.
The ratio filter is an all-pass filter. It can readily be seen that when the phase
angle of the ratio filter is a multiple of 2\pi radians, the amplitude response of the
difference filter, P(z), is zero. On the other hand, when the phase angle of the
ratio filter is an odd multiple of \pi, the amplitude response of the sum filter, Q(z),
is zero. The relationships between the LPC spectrum, the zeros of P(z) and Q(z),
the phase function of the ratio filter, and the group delay of the ratio filter can be
seen in Figure A.4. The x-axis is frequency in Hz in all the plots. The real root of
the difference filter at +1 and that of the sum filter at -1 can also be seen in their
amplitude responses as zeros at 0 Hz and 4000 Hz respectively. The horizontal line
in (c) represents \pi radians. The sampling rate was 8000 Hz for this plot.
It can be seen that the group delay of the ratio filter is large near speech formant
frequencies, and LSFs are close together. Since the phase of the ratio filter has to
Figure A.4: Plots showing relationships between LSFs and other parameters.
(a) Amplitude response of the LPC synthesis filter; (b) amplitude response of P(z)
and Q(z); (c) phase response of the ratio filter; (d) group delay of the ratio filter.
change by \pi from one LSF to the next, it is obvious that the group delays are going
to be larger whenever LSFs are closer together. These also define the location and
bandwidth of the formants.
Now we show that the spectral estimate for a finite data sequence must be an
all-pole spectrum from a maximum entropy point of view.
A.3 Maximum Entropy Principle
Given a set of autocorrelation coefficients for a stationary process, the best estimate
for the underlying spectrum is given by the Maximum Entropy Principle, which says
that the spectrum maximizing the entropy is the best estimate. Fundamentally, it
gives a procedure for constructing the spectrum from a finite set of data points without
making any constraining assumptions about the signal. It can be shown that spectral
estimates based on maximizing entropy provide maximum spectral resolution with
minimum spectral splatter [1].
Let S(\omega) be a power spectral density function. As with continuous probability
distributions, the entropy can be properly defined only as a difference with respect to
some other spectrum S_0(\omega). However, assuming a reference white power spectrum of
unit power density, S_0(\omega) = 1, we can write

H = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega)\, d\omega.   (A.50)
The constraints satisfied by S(\omega) are the given autocorrelation coefficients, i.e.,

r(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega)\, e^{j\omega k}\, d\omega, \qquad k = 0, 1, \ldots, M,   (A.51)

where z = e^{j\omega}. Using Lagrange multipliers and the calculus of variations, we obtain the
Euler equation

\frac{\partial F}{\partial S} = 0   (A.52)

where

F = -\log S(\omega) + \sum_{k=0}^{M} \lambda_k\, S(\omega)\left(e^{j\omega k} + e^{-j\omega k}\right).   (A.53)

This gives

\frac{1}{S(\omega)} = \sum_{k=0}^{M} \lambda_k \left(e^{j\omega k} + e^{-j\omega k}\right).   (A.54)
Using the Fejér-Riesz theorem [89, page 231], the finite trigonometric series in
Eq. (A.54) can be factored, giving

S(\omega) = \frac{\sigma^2}{\left| \sum_{k=0}^{M} \alpha_k\, e^{-j\omega k} \right|^2}.   (A.55)

This clearly shows S(\omega) to be an all-pole spectrum, as obtained by linear prediction.
This coincidence of Linear Prediction and Maximum Entropy spectral estimation is
of course not accidental. Two key factors contribute to their similarity.

1. The definition of entropy contains a logarithm, which is required so that the
entropies of independent events add.

2. Maximizing the entropy requires differentiation with respect to S, leading to the
1/S(\omega) term in Eq. (A.54) and a finite Fourier series. Thus the reciprocal power
spectrum as a function of frequency must be a band-limited function, which is
precisely the property of an all-pole function.

Thus linear prediction actually provides the best spectral estimate in the absence of
any prior information about the data sequence.
Appendix B
Quantization
Quantization is the process of mapping a large (possibly infinite) set of points in a
metric space to a smaller and finite set of points in the same space. An N-point
quantizer is defined as a mapping Q : X \to C, where X is the input set and

C = \{y_1, y_2, \ldots, y_N\}   (B.1)

is the output set or codebook, with size |C| = N. For the special case of X \subseteq R, where
R is the real line, Q is called a scalar quantizer and the output points are simple
scalars, also referred to as output levels or reproduction values. When X is non-scalar,
the quantizer is called a vector quantizer.
B.1 Scalar Quantization
An N-point scalar quantizer partitions the real line (or a subset of it) into N segments
or cells, R_i, i = 1, 2, \ldots, N. The i-th cell is given by

R_i = \{x \in X : Q(x) = y_i\}   (B.2)

where the y_i are the output levels of the quantizer. It follows from this definition that
\bigcup_i R_i = X and R_i \cap R_j = \emptyset for i \ne j. The output values being scalar, we assume that
they are indexed such that

y_1 < y_2 < \cdots < y_N.   (B.3)

The cells R_1 and/or R_N may then be unbounded, depending on whether the input
space X is unbounded or not. The unbounded cells in a quantizer are called overload
cells and the bounded cells are called granular cells. All the overload cells together
form the overload region and all the granular cells together form the granular region.
A scalar quantizer is regular if

a) each cell R_i is an interval of the form (x_{i-1}, x_i) together with one or both
of the endpoints, and

b) y_i \in R_i for each i.

The values x_i are called boundary points, and for a regular quantizer they satisfy the
inequality

x_0 < y_1 < x_1 < y_2 < x_2 < \cdots < y_N < x_N.   (B.4)

A typical symmetric regular quantizer, Q(x), is shown in Fig. B.1. The horizontal
segments in Q(x) are called treads and the vertical discontinuities are called risers.
Figure B.1: A typical mid-tread scalar quantizer
A quantizer can be decomposed into two independent operators working in succession:
an encoder, E, and a decoder, D. The encoder is a mapping E : X \to Z, where
Z is the set of positive integers, and the decoder is the mapping D : Z \to C. Thus if
Q(x) = y_i then E(x) = i and D(i) = y_i. This is the same as saying Q(x) = D(E(x)).
Sometimes a quantizer is assumed to generate both the index i and the output value
y_i, and a decoder is sometimes referred to as an inverse quantizer. In a communication
system, only the index i is transmitted, and the actual output value y_i can be
obtained through a table lookup procedure at the receiving end.
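The encoder/decoder decomposition can be sketched for a uniform mid-tread quantizer; the parameters below (step size, level count) are our illustrative choices.

```python
# A minimal uniform mid-tread scalar quantizer split into an encoder E and
# a decoder D, so that Q(x) = D(E(x)).

def make_uniform_quantizer(step, n_levels):
    """Return (encode, decode) for a symmetric mid-tread quantizer with an
    odd number of levels; indices outside the granular region are clamped."""
    half = (n_levels - 1) // 2
    def encode(x):                        # E: R -> index
        i = round(x / step)
        return max(-half, min(half, i))   # overload cells clamp the index
    def decode(i):                        # D: index -> output level
        return i * step
    return encode, decode

enc, dec = make_uniform_quantizer(0.25, 9)    # levels -1.0, -0.75, ..., +1.0
assert dec(enc(0.6)) == 0.5                   # nearest output level
assert dec(enc(100.0)) == 1.0                 # overload cell: largest level
assert dec(enc(-0.1)) == 0.0                  # mid-tread: zero is a level
```

In a transmission system only `enc(x)` is sent; `dec` is the table lookup at the receiver.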
B.1.1 Performance Measures
The quantization process can be modeled as the addition of a random noise component
e = Q(x) - x to the input sample as indicated in figure B.2. Since the quantization
Figure B.2: Additive noise model of quantization
error is modeled as a random variable, a measure of the performance of a quantizer
must be based on a statistical average of some function of the error. The most common
is the mean squared distortion measure, defined as the expectation of e^2:

D = E[e^2] = \int \left(Q(x) - x\right)^2 p(x)\, dx   (B.5)

where p(x) is the probability density function of the input variable x. Frequently, the
performance of a quantizer is specified by a signal to noise ratio (SNR) defined as

SNR = 10 \log_{10}(\sigma^2 / D)   (B.6)

where \sigma^2 = E[x^2] is the variance of the input signal. For high resolution quantization,
it is useful to write the average distortion as a sum over all quantization cells. For a
given input random variable, x, and a quantizer, Q = \{y_i, R_i;\ i = 1, 2, \ldots, N\}, the
average distortion can be written as

D = \sum_{i=1}^{N} \int_{R_i} (y_i - x)^2\, p(x)\, dx.   (B.7)
For a regular scalar quantizer, this can be written as

D = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} (y_i - x)^2\, p(x)\, dx.   (B.8)

For large N, each interval R_i can be made quite small (with the exception of the
overload intervals R_1 and R_N) and it is reasonable to approximate the pdf p(x) as
being constant within each interval R_i. Approximating p(x) by p(y_i) when x is in R_i,
and p(x) = 0 in the overload regions, the above equation can be simplified to

D \approx \frac{1}{12} \sum_{i=2}^{N-1} p(y_i)\, \Delta_i^3   (B.9)

where \Delta_i = x_i - x_{i-1} is the length of the interval R_i and y_i = (x_i + x_{i-1})/2. The
reason for choosing y_i as the centroid of R_i is given in subsection B.1.3, page 138. For
the special case of uniform quantization, where the decision boundaries are equally
spaced so that \Delta_i = \Delta, the step size of the quantizer, the mean squared error can be
further simplified to

D \approx \frac{\Delta^2}{12} \sum_{i=2}^{N-1} p(y_i)\, \Delta   (B.10)

so that

D \approx \frac{\Delta^2}{12}.   (B.11)
It can also be shown that in this case

E[e\,x] = -D \ne 0,   (B.12)

so the quantization error and the input signal are correlated. The average granular
distortion can also be written as

D_g = \frac{1}{3}\, \gamma^2 \sigma^2\, 2^{-2r}   (B.13)

where \gamma is the loading factor defined by \Delta = 2\sigma\gamma/N, and r = \log_2 N is the quantizer
resolution in bits. This expression shows that the distortion goes to 0 exponentially
as r \to \infty. Now, the SNR can also be written as

SNR = 10 \log_{10}(\sigma^2 / D_g)
    = K + 6.02\, r   (B.14)
where \beta = 1/\gamma is the loading fraction and K = 10 \log_{10}(3\beta^2). This shows that for
the high resolution case, the SNR increases by about 6 dB for each additional bit
used for quantization. The above expression for SNR assumed negligible overload
distortion, which is not true for a loading factor less than 2 or 3. The total average
distortion, D, can then be written as a sum of the granular distortion, D_g, and the
overload distortion, D_o:

D = D_g + D_o.   (B.15)
It can be shown that for a symmetric quantizer, satisfying

Q(-x) = -Q(x),   (B.16)

D_g is a function only of the loading factor \gamma and does not depend on \sigma for a fixed \gamma.
For the high resolution case, the granular distortion for a general class of regular
nonuniform scalar quantizers is given by Bennett's integral,

D_g \approx \frac{1}{12 N^2} \int \frac{p(x)}{\lambda^2(x)}\, dx   (B.17)

where \lambda(x) is the point density function of the nonuniform quantizer.
B.1.2 Robust Quantization
An important concept in quantization is robust quantization, i.e., designing quantizers
whose performance is independent of the input signal pdf, p(x). To discuss robust
quantization, we have to introduce a model for treating nonuniform quantization,
known as the compander model. It can be shown [42] that any regular nonuniform
quantizer can be represented as a nonlinearity F(x), called the compressor, followed
by a regular uniform quantizer and an inverse nonlinearity F^{-1}(x), called the expandor
(Fig. B.3).
Figure B.3: Compander model of nonuniform quantization
The characteristic F(x) is a monotonically increasing odd function of x, ranging
from -V to +V, where V is the overload level of the quantizer. Every nonuniform
quantizer can be modeled in this way with a suitable choice of F(x). It can
be shown that for large N, the average distortion of a nonuniform quantizer can be
written as

D \approx \frac{V^2}{3N^2} \int \frac{p(x)}{g^2(x)}\, dx   (B.18)

where g(x) = F'(x) is the compressor slope function. If the slope function is chosen as

g(x) = \frac{V}{b\,|x|}   (B.19)

then Eq. (B.18) reduces to

D \approx \frac{b^2 \sigma^2}{3N^2}   (B.20)

so that the SNR \sigma^2/D reduces to a constant, 3N^2/b^2, which is independent of p(x).
Integrating Eq. (B.19) gives the compressor function as

F(x) = \frac{V}{b} \log x + c   (B.21)

for x > 0, where c is a constant. This shows that a logarithmic compressor would give
robust performance. It should be borne in mind that Eq. (B.18) neglects overload
noise and SNR will begin to drop when the input power level becomes large enough.
Also, the curve just computed (Eq. (B.21)) is not in fact realizable, since F(0) is
not defined. To circumvent this problem, a modified compressor curve is used which
behaves well for small values of x and retains the logarithmic behaviour elsewhere. A
compressor curve widely used in speech quantization is the \mu-law curve, given by

F(x) = \frac{V \log(1 + \mu x / V)}{\log(1 + \mu)}   (B.22)
for x > 0. For \mu \gg 1 and \mu x \gg V, F(x) approximates Eq. (B.21). \mu-law
companding is used in PCM systems in the United States, Canada, and Japan. Another
robust logarithmic characteristic is the A-law, given by

F(x) = \frac{A x}{1 + \log A} \ \ \text{for } 0 \le x \le V/A, \qquad
F(x) = \frac{V\left(1 + \log(A x / V)\right)}{1 + \log A} \ \ \text{for } V/A \le x \le V.   (B.23)

The A-law characteristic is used in European PCM telephone systems. The parameters
\mu and A control the degree of compression, measured by the companding advantage,
which is the slope of the compressor curve at the origin; the typical values in use
are \mu = 255 and A = 87.6. F(x) being an odd function, its value for x < 0 is given by
F(x) = -F(-x). \mu-law and A-law companding have been adopted as standards for
PCM coding of speech by the ITU-T (formerly known as CCITT) in its G.711 recommendation,
where the logarithmic characteristics are approximated by piecewise linear
functions.
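The μ-law compressor of Eq. (B.22) and its expandor can be sketched directly (the exact G.711 standard uses a piecewise-linear approximation of this curve, not these formulas):

```python
import math

# Continuous mu-law compressor/expandor pair, with V and mu as in the text.

V, MU = 1.0, 255.0

def compress(x):
    """F(x) of Eq. (B.22), extended to x < 0 by odd symmetry."""
    return math.copysign(V * math.log(1 + MU * abs(x) / V) / math.log(1 + MU), x)

def expand(y):
    """Inverse mapping F^{-1}(y)."""
    return math.copysign((V / MU) * ((1 + MU) ** (abs(y) / V) - 1), y)

x = -0.3
assert abs(expand(compress(x)) - x) < 1e-12   # exact inverse pair
assert abs(compress(V)) == V                  # F(V) = V
assert compress(-0.5) == -compress(0.5)       # odd symmetry
```

Small inputs are expanded (slope μ/ln(1+μ) ≈ 46 at the origin, the companding advantage), while large inputs are compressed, which is what flattens the SNR across input levels.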
B.1.3 Optimum Quantization
For a given input pdf and a fixed number of quantization levels, N, an optimal
quantizer is one that produces minimum average distortion. In other words, a quantizer
Q_{opt} is optimal if and only if

E[d(Q_{opt}(x), x)] \le E[d(Q(x), x)] \quad \forall\ N\text{-level quantizers } Q.   (B.24)
The problem of finding an optimum quantizer does not, in general, have a closed form
solution. However, necessary conditions for optimality can be derived that allow us
to use iterative algorithms to design optimal quantizers in some cases. The necessary
conditions are found by factoring the problem into two interdependent problems and
solving them independently. The subproblems that we address are -
1. for a given decoder, what is the optimal encoder; and
2. for a given encoder, what is the optimal decoder.
It can be shown that for a given decoder, the optimal encoder must satisfy the nearest
neighbour condition. That is, for a given codebook (set of output levels), C, the
partition cells satisfy

R_i \subseteq \{x : d(x, y_i) \le d(x, y_j)\ \forall j\}.   (B.25)
This is the same as saying that

Q(x) = y_i \ \text{only if}\ d(x, y_i) \le d(x, y_j)\ \forall j \ne i.   (B.26)
For a given decoder, this condition is also a sufficient condition for optimality of an
encoder for any distortion measure that satisfies the properties of a distance function.
It should be noted that the nearest neighbour rule does not assign boundary points
to a specific region. Heuristics are used to assign a boundary point to one of the
neighbouring regions; a simple way to resolve the ambiguity is to always assign the
boundary point to the cell on its left (or, consistently, on its right).
For the second subproblem, i.e. finding an optimal decoder for a given encoder, it
can be shown [42] that the decoder must meet the centroid condition. In other words,
given a partition \{R_i\} and a distortion measure d(x, y), the optimal codebook for a
random variable x is given by

y_i = \text{centroid}(R_i) = \arg\min_c E[d(x, c) \mid x \in R_i].   (B.27)

For the case of the squared error distortion measure, the centroid is given by the
conditional mean of the random variable x given that x is in the region R_i:

y_i = E[x \mid x \in R_i].   (B.28)
Substituting Eq. (B.28) in Eq. (B.8), the average distortion can be written as

D = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} \left(x - E[x \mid x \in R_i]\right)^2 p(x)\, dx.   (B.29)

It can be seen that an analytical solution for the x_i is extremely difficult, except possibly for
very small N. Lloyd [73] and Max [81] independently discovered the necessary
conditions for optimality for the mean square distortion measure (Max derived the necessary
conditions for a k-th mean absolute error criterion, including k = 2) and came
up with effective algorithms for computing the optimum solution. The algorithms,
known as the Lloyd (Method I) algorithm and the Lloyd (Method II)-Max algorithm,
iteratively compute the boundary points and the output levels while simultaneously
satisfying both necessary conditions. We will describe the Lloyd algorithm here in a
form that has been generalized for non-scalar quantization as well. To design an N-
level scalar quantizer using Lloyd's algorithm, one must start with an initial codebook
of N output levels. Then the Lloyd iteration works in two steps -
1. Given a codebook, find the optimum partition using the nearest neighbour prin-
ciple;
2. Compute new codebook for the newly computed partition using the centroid
rule.
The iteration is continued until the change in average distortion falls below a preset
limit or reaches zero. It can be seen that each of the steps above reduces the average
distortion and the algorithm is guaranteed to converge. Note that the Lloyd itera-
tion produces a codebook satisfying the necessary conditions of optimality but does
not guarantee that an optimal quantizer will be produced. Sufficient conditions for
optimality were derived by Fleischer [37]. He showed that if the probability density
function, p(x), satisfies

\frac{d^2}{dx^2} \log p(x) < 0   (B.30)

for all x, then there exists only one quantizer that can satisfy both the necessary
conditions. This guarantees that quantizers designed using Lloyd's iteration for
distributions satisfying Eq. (B.30) are indeed optimal.
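The two-step Lloyd iteration above can be sketched on empirical data; this is an illustrative scalar version with our own toy samples and stopping rule, not a production design tool.

```python
import numpy as np

# Lloyd's iteration (Method I) on an empirical distribution: alternate the
# nearest-neighbour partition and the centroid update until the average
# distortion stops improving. The same loop generalizes directly to VQ.

def lloyd(samples, codebook, tol=1e-9, max_iter=200):
    samples = np.asarray(samples, dtype=float)
    cb = np.sort(np.asarray(codebook, dtype=float))   # working copy
    prev = np.inf
    for _ in range(max_iter):
        # 1. nearest-neighbour partition of the samples
        idx = np.argmin(np.abs(samples[:, None] - cb[None, :]), axis=1)
        # 2. centroid update (empty cells keep their old level)
        for i in range(len(cb)):
            if np.any(idx == i):
                cb[i] = samples[idx == i].mean()
        d = np.mean((samples - cb[idx]) ** 2)
        if prev - d < tol:
            break
        prev = d
    return np.sort(cb), d

rng = np.random.default_rng(3)
data = rng.standard_normal(5000)
cb0 = np.array([-2.0, -0.5, 0.5, 2.0])
d0 = np.mean(np.min((data[:, None] - cb0[None, :]) ** 2, axis=1))
levels, d = lloyd(data, cb0)
assert d <= d0 + 1e-12        # each step can only reduce the distortion
assert len(levels) == 4
```

Both steps are individually optimal (Eqs. (B.26) and (B.28)), so the distortion sequence is non-increasing and the iteration converges, though only to a codebook satisfying the necessary conditions.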
Another approach to optimal quantizer design is using Bennett's high rate ap-
proximation (Eq. (B.18)) to compute average distortion and minimizing it over all
possible compressor slope functions g(x) having a constant area under them. This yields
the result that the optimum compressor slope function is proportional to the cube root
of the pdf [40]:

g_{opt}(x) = c_1\, [p(x)]^{1/3}.   (B.31)

By integrating, we obtain the compressor characteristic

F_{opt}(x) = c_1 \int_0^x [p(\alpha)]^{1/3}\, d\alpha \quad \text{for } x > 0   (B.32)
where cl is a constant such that F ( V ) = V. It should be noted that Eq. (B.32) gives
the optimum quantizer for a given value of the overload point, V. A separate one-
dimensional minimization can be done for the best value of V. Computation of the
minimum mean squared error obtained from quantizers designed with this approach
leads to values in good agreement with Max's tabulations (for a Gaussian pdf) even
for values of N as small as 6 [40]. Smith [98] studied optimal quantization based on
the Laplacian pdf and found that the optimum compressor has the form

F(x) = \frac{V\left(1 - e^{-mx}\right)}{1 - e^{-mV}} \quad \text{for } x > 0.   (B.33)

This is called the m-law compressor. While an optimum quantizer can give much
higher SNR for a given bit rate, the performance degrades quickly as the input power
level changes. On the other hand, a robust quantizer like the one based on \mu-law
companding maintains a reasonably high SNR over a broader range of input power
levels.
The optimal quantization discussed above was based on the average distortion for
a given number of output levels as the cost function. However, from a user's viewpoint,
a better quantizer is one that uses the lowest number of bits per sample for a given
maximum distortion; or, equivalently, for a given number of bits/sample, obtains the
highest SNR. This is the same as minimizing the output entropy for a given average
distortion. It has been shown [47, 109] that if entropy coding is used after quantization,
then the uniform quantizer performs better than the best non-uniform (Lloyd-Max)
quantizer.

For a given bit rate B (bits/sample), the SNRs for high rate quantization, neglecting
overload conditions, for different quantization strategies are as follows [40].
Uniform quantization followed by entropy coding:
SNR = 6B - 1.50
Best non-uniform quantizer followed by entropy coding:
SNR = 6B - 2.45
Best non-uniform quantizer only:
SNR = 6B - 4.35
Uniform quantizer with loading factor of 4:
SNR = 6B - 7.3
It is evident that a uniform quantizer followed by entropy coding gives the best per-
formance.
So far we have only considered scalar quantization without memory. For a source
with memory, where successive samples are statistically dependent,
better performance can be obtained by quantizing the difference between a sample
and its predicted value. These are called predictive quantizers, and they provide a
performance improvement compared to a simple scalar quantizer by reducing the
variance of the signal at the input to the quantizer. Predictive quantizers have been
studied in great detail (e.g. [42, 11]) and techniques like DPCM and ADPCM have
been very popular: ADPCM has also been adopted by the ITU-T as a toll quality speech
coding standard at 32 kbps in recommendation G.726.
Many of the benefits of scalar predictive quantizers can be obtained by quantizing
a block of samples together. This is not only true for sources with memory but
also holds for a memoryless source; this is a fundamental result of Shannon's
rate distortion theory. The process of quantizing a block of scalars is called vector
quantization (VQ). The vectors quantized with a VQ (the term VQ is used both to
mean the process of vector quantization as well as a vector quantizer) need not be
collections of samples from a scalar process but can be samples from a vector process,
as is the case for the quantization of vocal tract spectral parameters.
B.2 Vector Quantization
A k-dimensional vector quantizer is a mapping Q : X \to C, where X is a k-dimensional
metric space or a subset of it, and C = \{y_1, y_2, \ldots, y_N\} is a finite set of
vectors from the same space. In speech coding we are particularly interested in the
case where X = R^k. A VQ can be decomposed into two mappings: an encoder E
which assigns to each input vector x = (x_0, x_1, \ldots, x_{k-1})^T a channel symbol E(x) in
some channel symbol set M, and a decoder D assigning to each channel symbol \alpha in
M a value in a reproduction alphabet C. The channel symbol set is often assumed to
be a space of binary vectors for convenience; e.g., M may be the set of all 2^R binary
R-dimensional vectors.
If C has N elements, then the quantity R = log_2 N is called the rate of the
quantizer in bits per vector, and r = R/k is known as the resolution or code rate in
bits per vector component.
An N-point vector quantizer partitions R^k into N regions or cells R_i for i ∈ I ≡
{1, 2, ..., N}. The i-th cell, defined by

R_i = {x ∈ R^k : Q(x) = y_i},

is sometimes called the inverse image or pre-image of y_i under the mapping Q and
denoted more concisely by R_i = Q^{-1}(y_i). It follows that

∪_{i ∈ I} R_i = R^k and R_i ∩ R_j = ∅ for i ≠ j,

so that {R_i} form a partition of R^k. For k = 1, a VQ degenerates to a scalar quantizer.
A vector quantizer is called regular if
a) each cell, R_i, is a convex set, and
b) for each i, y_i ∈ R_i.
Just as in scalar quantization, a cell that is unbounded is called an overload cell and
all overload cells together form the overload region. A bounded cell is called a granular
cell and all granular cells together form the granular region.
A VQ is not merely a generalization of scalar quantization. It can be shown that
no coder can do better than a VQ. The following theorem is due to Gersho [42].
Theorem B.1 For any given coding system that maps a signal vector into one of
N binary words and reconstructs the approximate vector from the binary word, there
exists a vector quantizer with codebook size N that gives exactly the same performance,
i.e., for any input vector it produces the same reproduction as the given coding system.
The reason vector quantizers outperform scalar quantizers in jointly quantizing a num-
ber of scalars (vector components) can be attributed to four interrelated properties
of vector components [75]:
a) linear dependency (correlation),
b) nonlinear dependency (statistical dependency),
c) pdf shape, and
d) dimensionality (giving rise to a choice of cell shape for k > 1).
Although linear dependency (correlation) can be removed by a proper choice of the
basis vectors (as in the KLT), nonlinear statistical dependency cannot be removed; it prevents
the joint pdf from factoring into independent pdf's, which would permit independent scalar
quantization of each vector component. Even if scalar quantizers were designed
for the marginal probability density of each component, they could very well spend bits
quantizing regions of zero probability. A vector quantizer can take advantage of the
joint pdf and partition the space accordingly. Figure B.4 clearly shows this point for
k = 2. If the vector components x_1 and x_2 are quantized independently using scalar
Figure B.4: A uniform joint pdf over a rectangular region (shown shaded) along with the marginal pdf's
quantizers designed using their marginal pdf's, the area quantized by the pair of scalar
quantizers is shown as the dotted square. The pdf is assumed uniform over the shaded
rectangle and zero outside. It is clearly seen that bits will be spent unnecessarily in
quantizing a large region of zero probability. The case of two independent scalar
quantizers can be considered as a special case of vector quantization of the vector
x = (x_1, x_2) where the vector quantizer is given by

Q(x) = (Q_1(x_1), Q_2(x_2)),

where Q_1 and Q_2 are the scalar quantizers for x_1 and x_2 respectively. Such a vector
code is called a product code and the VQ is called a product VQ because the
overall VQ is formed as a Cartesian product of lower-dimensional VQs.
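As a concrete sketch of a product VQ (the scalar codebooks below are made up purely for illustration), each component is quantized independently, and the implied vector codebook is the Cartesian product of the two scalar codebooks, whether or not the joint pdf places any mass near some of its entries:

```python
import itertools
import numpy as np

# Hypothetical scalar codebooks for x1 and x2 (illustrative, not from the thesis).
Q1 = np.array([-1.0, 0.0, 1.0])
Q2 = np.array([-0.5, 0.5])

def product_vq(x):
    """Quantize each component with its own scalar quantizer:
    Q(x) = (Q1(x1), Q2(x2))."""
    y1 = Q1[np.argmin(np.abs(Q1 - x[0]))]
    y2 = Q2[np.argmin(np.abs(Q2 - x[1]))]
    return (float(y1), float(y2))

# The implied vector codebook: every pairing of the two scalar codebooks.
codebook = list(itertools.product(Q1, Q2))  # 3 x 2 = 6 code vectors

print(product_vq((0.9, -0.4)))  # -> (1.0, -0.5)
```

All six product code vectors are "spent" even if, as in Figure B.4, the joint pdf is zero over much of the dotted square.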
A particular advantage of VQ that is not always very evident stems from the fact
that k > 1. This gives rise to a wide choice of cell shapes. It can be shown for a
two-dimensional VQ that, for a uniform pdf and neglecting edge effects, hexagonal
cells give a lower average distortion than square cells with the same number
of cells covering any given area [75].
B.2.1 Vector Quantizer Performance
A vector quantizer is always defined over a metric space so that there exists at least
one valid distortion (distance) measure in that space. A distortion measure is essential
in designing a VQ as one goal of such a design effort is minimizing some distortion. If
d(x, y) is the distance between the original vector x and the reconstructed vector y,
then the performance of the VQ may be quantified by the average distortion D defined
as

D = E[d(x, y)]. (B.42)

In practice, the measure of performance is the long-term sample average or time
average

d̄ = lim_{M→∞} (1/M) Σ_{m=1}^{M} d(x_m, y_m). (B.43)
If the vector process X is stationary and ergodic, the sample average in Eq. (B.43)
tends in the limit to the expectation in Eq. (B.42). In particular, if a VQ partitions
the input space into L regions,

D = Σ_{i=1}^{L} P(x ∈ R_i) ∫ d(x, y_i) p(x|i) dx, (B.44)

where P(x ∈ R_i) is the discrete probability that x is in R_i, p(x|i) is the conditional
multidimensional probability density function (joint pdf of the vector components) of x
given that x ∈ R_i, and the integral is taken over all vector components of x.
It is obvious that the distortion depends on the distance measure used. The
most common and widely used measure of distortion is the squared error or the squared
L2 norm. This is defined as

d(x, y) = ||x - y||^2 = (x - y)^T (x - y) = Σ_{i=0}^{k-1} (x_i - y_i)^2. (B.45)

We will only talk about real vectors here; for complex vectors, the transpose operator is
replaced by a conjugate transpose operator. The distortion measure used depends on
the physical interpretation of the vector and the vector components, and a number of
different distortion measures have been explored in designing quantizers for the LPC
coefficients (and their various equivalent representations) [50]. A detailed review of the
relevant distance measures will be presented in a later chapter. Any distortion measure
that takes the above form of a summation over distortions due to individual vector
components is called an additive or single-letter distortion measure. One general form
of an additive distortion measure is the pth power of the popular Lp norm, defined as

d_p(x, y) = Σ_{i=0}^{k-1} |x_i - y_i|^p. (B.46)
Another distortion measure of interest is the quadratic form of the error vector
e = (x - y), or the weighted squared distortion, defined as

d_W(x, y) = (x - y)^T W (x - y) = e^T W e, (B.47)
where W is a symmetric weighting matrix that takes into consideration the weighted
contributions from individual vector components to the total distortion and contribu-
tions from their interactions. In the simplest case, W is a diagonal matrix that helps in
placing different emphasis on each vector component for distortion computation. For
the familiar squared error measure, W = I, where I is the identity matrix. If W is not
symmetric then the distance measure is not symmetric either, and d(x, y) ≠ d(y, x);
such a measure does not strictly qualify as a proper distance function. Sometimes the weighting
matrix W is made a function of the input vector x, and the distortion measure is then no
longer symmetric in x and y. The requirement of symmetry is relaxed to obtain a
more perceptually significant distortion measure. In the general case, the weighted
distortion measure takes the form

d(x, y) = (x - y)^T W(x) (x - y), (B.48)

where W(x) is symmetric and positive definite for all x.
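A minimal sketch of the quadratic-form distortion above, with an illustrative diagonal weighting matrix (the numbers are made up; a diagonal W simply weights each component's squared error):

```python
import numpy as np

def weighted_sq_distortion(x, y, W):
    """Quadratic-form distortion d(x, y) = (x - y)^T W (x - y)."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(e @ W @ e)

x = [1.0, 2.0]
y = [0.0, 0.0]
W = np.diag([2.0, 0.5])  # diagonal W: per-component emphasis (illustrative values)
print(weighted_sq_distortion(x, y, W))          # 2*1^2 + 0.5*2^2 = 4.0
print(weighted_sq_distortion(x, y, np.eye(2)))  # W = I: plain squared error = 5.0
```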
B.2.2 Optimum VQ
An optimal VQ for a given input distribution and a distance measure, d(x, y), is
defined as

Q_opt = {E_opt : R^k → M; D_opt : M → C} (B.49)

such that

D(Q_opt) ≤ D(Q) (B.50)

for all choices of Q(·). Usually, for all signals and distortion functions of interest,
the error surface shows multiple local minima, and no special characteristics can be
associated with a global minimum, if any. So, as in the case of scalar quantization, we
can only specify the necessary condition for an optimal encoder given a decoder and
vice versa.
Necessary Conditions of Optimality
For a given set of output code vectors, C = {c_1, ..., c_N}, an optimal encoder partitions
the input space such that each cell satisfies the nearest neighbour condition

R_i ⊆ {x : d(x, c_i) ≤ d(x, c_j), j = 1, ..., N}. (B.51)

Thus, given a decoder, the encoder is a minimum distortion mapping such that

d(x, Q(x)) = min_{c_i ∈ C} d(x, c_i). (B.52)

In case of a tie, where more than one code vector is equidistant from the input
vector x, the tie is broken with a heuristic. A common heuristic is to choose the code
vector with the minimum index.
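A minimal sketch of the minimum-distortion encoder under the squared-error measure, including the minimum-index tie-break (the toy codebook is made up for illustration):

```python
import numpy as np

def nearest_neighbour_encode(x, codebook):
    """Minimum-distortion encoder for the squared-error measure.
    np.argmin returns the smallest index on a tie, which implements
    the minimum-index tie-breaking heuristic."""
    d = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(d))

C = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # a toy 3-vector codebook
print(nearest_neighbour_encode(np.array([0.9, 0.8]), C))  # -> 1
print(nearest_neighbour_encode(np.array([0.5, 0.5]), C))  # tie between 0 and 1 -> 0
```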
The second necessary condition of optimality is the centroid condition, which de-
fines the decoder given a partition of the input space and a distortion measure d(x, y).
That is, for a given partition {R_i : i = 1, ..., N; ∪_i R_i = X}, the optimal code vec-
tors satisfy

c_i = cent(R_i) (B.53)
= arg min_u E[d(x, u) | x ∈ R_i]. (B.54)

In other words,

E[d(x, c_i) | x ∈ R_i] = min_u E[d(x, u) | x ∈ R_i]. (B.55)
The above definition is valid for a discrete distribution as well, and can be evaluated
from the pmf (probability mass function) of the input distribution. Generally, input
distributions are not known, and VQs are designed from what are called training
sets. A training set is a finite collection of sample vectors generated from the source
distribution in order to represent the statistics of the source with a finite set of data.
Usually, the training vectors are generated independently from the source. This gives
rise to a discrete model for the source where each of the M vectors in the training set
has a probability of 1/M, and the probability P(x ∈ R_i) is estimated from the number
of training set vectors inside R_i.
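Under this discrete training-set model, the centroid and the cell probability are simple to estimate. The sketch below assumes the squared-error measure, for which the centroid of a cell is the arithmetic mean of the training vectors assigned to it (the data are made up):

```python
import numpy as np

# Discrete source model: each of the M training vectors has probability 1/M.
training = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0],
                     [8.0, 8.0], [9.0, 7.0]])
in_cell = np.array([True, True, True, False, False])  # membership in cell R_i

# For the squared-error measure, the centroid of a cell is the arithmetic
# mean of the training vectors that fall in it.
c_i = training[in_cell].mean(axis=0)  # -> [2.0, 3.0]

# P(x in R_i) is estimated by the fraction of training vectors in the cell.
p_i = in_cell.sum() / len(training)   # -> 0.6

print(c_i, p_i)
```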
A third necessary condition for a codebook to be optimal is called the zero proba-
bility boundary condition. Consider the set

R_i' = {x : d(x, c_i) ≤ d(x, c_j), j = 1, ..., N}. (B.56)

From the nearest neighbour condition, R_i ⊆ R_i'. The set B_i = R_i' - R_i is called the
boundary of R_i. The zero probability boundary condition states that

P(x ∈ B_i) = 0 for all i, (B.57)

or equivalently,

P(x : d(x, c_i) = d(x, c_j) for some i ≠ j) = 0. (B.58)

This condition is automatically satisfied when the input distribution is continuous,
but may be violated in the discrete case when one or more of the training vectors are
equidistant from multiple code vectors. The above condition implies that the resulting
quantizer may not be optimal even if the nearest neighbour and centroid conditions
are satisfied; a better quantizer may be found by breaking the tie in a different way
and proceeding with more iterations.
In simple words, a vector quantizer obeying the necessary conditions of optimality
is a mapping that partitions the input space into convex regions and allocates one
vector, the centroid of the region, to all the vectors in that region; for the squared
error measure, the boundary between two neighbouring regions is the perpendicular
bisector of the line joining the corresponding centroids (Fig. B.5).
Sufficiency Conditions
As pointed out earlier, it is impossible to derive sufficient conditions of optimality.
Here we would like to make more explicit what is meant by optimality of a VQ. Since
the nearest neighbour condition is necessary for optimality, let us assume that it is
satisfied for any given codebook C; that is, a VQ is uniquely defined by its codebook,
as the partition always follows the nearest neighbour rule. The average distortion
D is then a function of the codebook only, and the quantizer is locally optimal if no small
perturbation of the code vectors leads to a decrease in D. A quantizer is called
globally optimal if no other codebook exists that produces a lower value of D.
It is widely believed that if a codebook satisfies the Lloyd conditions (the necessary
conditions mentioned above), it is indeed locally optimal although no theoretical
derivation of this result has ever been obtained. For the discrete case such as a
sample distribution produced by a training set, however, it can be shown [48] that
a vector quantizer satisfying the necessary conditions is indeed locally optimal under
mild restrictions. This comes from the fact that in the discrete input case, a slight
perturbation of a code vector will not alter the partitioning of the (countable) set of
input vectors as long as none of the training vectors lies on a partition boundary. Once
the partition stays fixed, the perturbation causes a violation of the centroid condition
Figure B.5: A vector quantizer satisfying the necessary conditions
with an accompanying increase in D. Thus under these conditions, a quantizer that
satisfies the necessary conditions will be locally optimal. It is worth pointing out that
locally optimal quantizers can be very suboptimal in the global sense.
B.2.3 VQ Design
The problem of designing a codebook with N vectors, each of dimension k, for a
given distribution or training set has no general solution, but the Lloyd algorithm
described in the section on scalar quantization can be generalized to the vector case
for iterative improvement of a given codebook. We will describe the algorithm only
for the case of an unknown distribution, as that is the most common situation.

Generalized Lloyd Algorithm:

Step 0 Initialization: Given
a) number of levels = N;
b) distortion threshold ε ≥ 0;
c) initial codebook C_0;
d) training sequence T = {t_i; i = 0, ..., n - 1};
set m = 0 and D_{-1} = ∞.

Step 1 Given C_m = {c_i; i = 1, ..., N}, find the minimum distortion partition
P(C_m) = {S_i; i = 1, ..., N} of the training set:
t_j ∈ S_i if d(t_j, c_i) ≤ d(t_j, c_l) for all l.
Compute the average distortion

D_m = (1/n) Σ_{j=0}^{n-1} min_i d(t_j, c_i).

Step 2 If (D_{m-1} - D_m)/D_m ≤ ε, halt with C_m as the final codebook.

Step 3 Find the optimal codebook cent(P(C_m)) = {cent(S_i); i = 1, ..., N} for
P(C_m). Set C_{m+1} = cent(P(C_m)). Replace m by m + 1 and go to step 1.

Here, cent(·) stands for the centroid of a set.
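The steps above can be sketched as follows, assuming the squared-error measure (so cent(·) is the mean). The algorithm statement leaves the handling of empty cells open; this sketch simply keeps their old code vectors.

```python
import numpy as np

def generalized_lloyd(training, codebook, eps=1e-4, max_iter=100):
    """GLA for the squared-error measure (assumed).
    training: (n, k) training vectors; codebook: (N, k) initial code vectors."""
    codebook = codebook.copy()
    D_prev = np.inf
    for _ in range(max_iter):
        # Step 1: minimum-distortion partition of the training set.
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        D = d.min(axis=1).mean()
        # Step 2: halt when the relative drop in distortion is at most eps.
        if (D_prev - D) / D <= eps:
            break
        D_prev = D
        # Step 3: centroid update (the mean, for squared error);
        # an empty cell keeps its previous code vector.
        for i in range(len(codebook)):
            members = training[assign == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
    return codebook, D
```

For example, on two well-separated clusters a two-vector codebook is driven to the cluster means within a few iterations.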
The most popular technique for obtaining the initial codebook of size N is known as
the splitting algorithm (also known as the LBG algorithm), introduced by Linde et al.
[72]. The algorithm starts with a codebook of size 1 and creates a larger initial
codebook by splitting, as described in the algorithm below.

LBG Algorithm:

Step 0 Initialization: Set M = 1 and define C_0(1) = cent(T), the centroid of the
entire training set.

Step 1 Given the reproduction alphabet C_0(M) containing M vectors {c_i; i = 1, ..., M},
"split" each vector c_i into two close vectors c_i + ε and c_i - ε, where ε is a fixed
perturbation vector. The collection C(2M) = {c_i + ε, c_i - ε; i = 1, ..., M} has 2M
vectors. Replace M by 2M.

Step 2 Is M = N? If so, set C_0 = C(M) and halt. C_0 is then the initial reproduction
alphabet (codebook) for the N-level quantization algorithm. If not, run the
generalized Lloyd algorithm for an M-level quantizer on C(M) to produce a
good reproduction alphabet C_0(M), and return to step 1.

Note that the splitting algorithm always results in a codebook of size N where N is
an integer power of 2.
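The splitting procedure can be sketched as below. The squared-error measure is assumed, and the inner Lloyd refinement runs a fixed number of iterations instead of a threshold test, purely to keep the example short.

```python
import numpy as np

def _lloyd(training, codebook, iters=20):
    """A few Lloyd iterations: nearest-neighbour partition, then mean update
    (squared-error measure assumed; empty cells keep their old vector)."""
    codebook = codebook.copy()
    for _ in range(iters):
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for i in range(len(codebook)):
            members = training[assign == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
    return codebook

def lbg(training, N, delta=1e-3):
    """Splitting initialization: grow the codebook 1 -> 2 -> ... -> N
    (N an integer power of 2), refining after each split."""
    codebook = training.mean(axis=0, keepdims=True)  # centroid of the whole set
    while len(codebook) < N:
        # Split every code vector into c + delta and c - delta.
        codebook = np.vstack([codebook + delta, codebook - delta])
        codebook = _lloyd(training, codebook)
    return codebook
```

On two well-separated clusters, growing from one vector to two places the code vectors at the cluster means.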
Appendix C
Pitch Computation Algorithm
The sequence of tests performed for computing the average pitch is described in the
following list. If any test succeeds, the pitch value computed there is accepted as the
pitch period and subsequent tests are not performed. In the following, C_s and C_e are
the numbers of peaks detected from the speech and residual signals respectively.
The values of the different constants used in the following description were chosen
empirically, and the values used in our 1800 bps coder are given at the end of this
section. All variables used are defined in Chapter 5, Section 5.4.1.
Test if all four conditions below are satisfied and set pitch P = p(p_s) if true.
If σ_r(p_s) > E_l1 is also satisfied, compute the pitch, P, using autocorrelation.
Test if all four conditions below are satisfied and set pitch P = p(p_e) if true.
If σ_r(p_e) > E_l1 is also satisfied, compute the pitch, P, using autocorrelation.
Test if all four conditions below are satisfied and set pitch P = min(p(p_s), p(p_e))
if true.
If min(σ_r(p_s), σ_r(p_e)) > E_l2 is also satisfied, compute the pitch, P, using autocor-
relation.
Test if both conditions below are satisfied and set pitch P = p(p_s) if true.
If σ_r(p_s) > E_l1 is also satisfied, compute the pitch, P, using autocorrelation.
Test if both conditions below are satisfied and set pitch P = p(p_e) if true.
If σ_r(p_e) > E_l2 is also satisfied, compute the pitch, P, using autocorrelation.
Test if all four conditions below are satisfied.
If also |p_s(0) - p_e(0)| < P_tol, compute the pitch, P, using autocorrelation within
the range P_cand ± P_tol, where P_cand = P_{-1} if tracklength > tracklength_min, else
P_cand = p_s(0). Otherwise, set pitch P = 0.
Test the following two conditions and proceed if both are true.
- If |p_s(0) - p_e(0)| ≤ P_lowtol, look for more peaks at intervals of p_s(0) and
compute σ_r(p_s). If σ_r(p_s) < C_min, set P = p(p_s); otherwise set P = 0.
- If |p_s(0) - p_e(0)| > P_lowtol, look for a pitch cycle p_e(k) in the residual such
that |P_{-1} - p_e(k)| ≤ P_tol. If such a pitch cycle p_e(k) is found, compare
it with the pitch cycle p_s(0) found from the speech waveform. If |p_e(k) -
p_s(0)| ≤ P_lowtol, then P = p_s(0). Otherwise, assume all other pitch
cycles p_e(i) that are larger than p_e(k) to be multiple pitch cycles,
and break them into n divisions, where n = ⌊p_e(i)/p_e(k) + 0.5⌋. Assume
all pairs of successive smaller pitch cycles to be broken parts of a single pitch
cycle and merge them to form valid pitch cycles. Compute σ_r(p_e) after all
breaking and merging is done, and if the following conditions are true then
set P = p_s(0); otherwise set P = 0.
- If no pitch cycle satisfying |P_{-1} - p_e(k)| ≤ P_tol was found in the residual
signal, compute the pitch using autocorrelation.
Reverse the roles of the speech and residual signals in the previous procedure
and do exactly the same.
If all previous steps failed and the pitch is not yet determined, but more than
one pitch period was detected in the speech signal, i.e. C_s > 2, then search
for a pitch period satisfying |P_{-1} - p_s(i)| < P_tol. If such a pitch period is found,
examine all other pitch periods and do merging/splitting as required. Compute
σ_r(p_s) and set P = p(p_s) if σ_r(p_s) < C_min; otherwise set P = 0. If the pitch was
not set to 0 here, and tracklength ≤ tracklength_min, check for pitch doubling
using autocorrelation.
If no pitch cycle p_s(i) satisfying |P_{-1} - p_s(i)| ≤ P_tol could be found, the peak
picking algorithm has failed. If the pitch of the last voiced segment was P_v,
then use autocorrelation to check for pitch values in the range P_v ± P_tol. If
P_v > 2P_min, then also check for pitch values in the range P_v/2 ± P_tol using
autocorrelation.
If the pitch could not be determined so far, set P = 0.
If the pitch, P, as determined above is outside the interval [P_min, P_max], it is set to
zero.
If the pitch is finally determined as zero, but the pitch for the previous segment,
P_{-1}, was non-zero and autocorrelation was not used in the computation so far, the
autocorrelation method is used to confirm that P = 0.
The following values of the empirical constants were used in the pitch detector for
our 1800 bps coder implementation.
Constant          Value
x_u               0.17
E_l1              0.1
E_l2              0.08
C_min             2
P_tol             5
P_lowtol          2
P_max             140
P_min             20
tracklength_min   2

Table C.1: Values of empirical constants used in the 1800 bps coder
Appendix D
List of Citations
1. R. Hagen. "Robust LPC Spectrum Quantization - Vector Quantization by a
Linear Mapping of a Block Code," IEEE Trans. Speech and Audio Processing,
Vol. 4, No. 4, pp. 266-280, July, 1996.
2. A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan. "A 2.4
kbits/s MELP Coder Candidate for the New U.S. Federal Standard," Proc.
IEEE Int. Conf. on Acoustics Speech and Signal Processing, pp. I-200 - I-203,
Atlanta, May 7-10, 1996.
3. W. LeBlanc, C. Liu and V. Viswanathan. "An Enhanced Full Rate Speech
Coder for Digital Cellular Applications," Proc. IEEE Int. Conf. on Acoustics
Speech and Signal Processing, pp. I-200 - I-203, Atlanta, May 7-10, 1996.
4. C.F. Barnes, S.A. Rizvi and N.M. Nasrabadi. "Advances in Residual Vector
Quantization: A Review," IEEE Transactions on Image Processing, Vol. 5, No.
2, pp. 226-262, Feb., 1996.
5. P. Lupini and V. Cuperman. "Nonsquare Transform Vector Quantization,"
IEEE Signal Processing Letters, Vol 3, No. 1, pp. 1-3, Jan., 1996.
6. F. Kossentini, M.J.T. Smith and C.F. Barnes. "Necessary Conditions for the
Optimality of Variable-Rate Residual Vector Quantizers," IEEE Trans. Infor-
mation Theory, Vol. 41, No. 6, pp. 1903-1914, Nov., 1995.
7. R. P. Ramachandran, M. M. Sondhi, N. Seshadri, and B. S. Atal. "A Two
Codebook Format for Robust Quantization of Line Spectral Frequencies," IEEE
Trans. Speech and Audio Processing, Vol. 3, No. 3, pp. 157-168, May, 1995.
8. D. Chang, Y. Cho and S. Ann. "Efficient Quantization of LSF Parameters
using Classified SVQ with Conditional Splitting," Proc. IEEE Int. Conf. on
Acoustics Speech and Signal Processing, pp. 736-739, Detroit, May 9 - 12, 1995.
9. H. P. Knagenhjelm and W. B. Kleijn. "Spectral Dynamics is More Important
than Spectral Distortion," Proc. IEEE Int. Conf. on Acoustics Speech and
Signal Processing, pp. 732-735, Detroit, May 9 - 12, 1995.
10. E. Shlomot. "Delayed Decision Switched Prediction Multi-Stage LSF Quanti-
zation," Digest of papers, IEEE workshop on Speech Coding for Telecommuni-
cations, pp. 45-46, 1995.
11. B.F. Johnson and N. Farvardin. "A Finite-State Two-Stage Vector Quantizer
for Coding Speech Line Spectral Parameters," Digest of papers, IEEE workshop
on Speech Coding for Telecommunications, pp. 47-48, 1995.
12. J.S. Collura, A. McCree and T.E. Tremain. "Perceptually Based Distortion
Measurements for Spectrum Quantization," Digest of papers, IEEE workshop
on Speech Coding for Telecommunications, pp. 49-50, 1995.
13. A. McCree, K. Truong, E.B. George and T.P. Barnwell. "An Enhanced 2.4
kbit/s MELP Coder," Digest of papers, IEEE workshop on Speech Coding for
Telecommunications, pp. 101-102, 1995.
14. J.R.B. de Marca. "An LSF Quantizer for the North-American Half-Rate Speech
Coder," IEEE Transactions on Vehicular Technology, Vol. 43, No. 3, pp. 413-
419, 1994.
15. L. Dong, A.R. Kaye and S.A. Mahmoud. "Transmission of compressed voice
over integrated services frame relay networks - priority service and adaptive
buildout delay," IEE Proceedings on Communications, Vol. 141, p. 265, 1994.
16. A. Gersho. "Advances in Speech and Audio Compression," Proceedings of the
IEEE, Vol. 82, No. 6, pp. 900-918, 1994.
17. J. Pan and T.R. Fischer. "Vector Quantization - Lattice Vector Quantization
of Speech LPC Coefficients," Proc. IEEE Int. Conf. on Acoustics Speech and
Signal Processing, pp. I-513 - I-516, 1994.
18. W.Y. Chan and D. Chemla. "Low Complexity Encoding of Speech LSF Param-
eters using Constrained Storage TSVQ," Proc. IEEE Int. Conf. on Acoustics
Speech and Signal Processing, pp. I-521 - I-524, 1994.
References
[1] J. G. Ables. Maximum Entropy Spectral Analysis. Astron. Astrophys. Suppl.
Series, 15:383-393, 1974.
[2] J. P. Adoul, P. Mabilleau, M. Delprat, and S. Morissette. Fast CELP Coding
Based on Algebraic Codes. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 1957-1960, April 1987.
[3] L. B. Almeida and J. M. Tribolet. Harmonic coding: a low bit-rate good-quality
speech coding technique. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 1664-1667, Paris, 1982.
[4] J. Anderson and J. Bodie. Least Squares Quantization in PCM. IEEE Trans.
Info. Theory, IT-21:379-387, July 1975.
[5] B. Atal and J. Remde. A New Model of LPC Excitation for Producing Natural
Sounding Speech at Low Bit Rates. In Proc. IEEE Inter. Conf. Acoust., Speech,
Signal Process., pages 614-617, Paris, 1982.
[6] B. S. Atal. Stochastic Gaussian Model for Low-Bit Rate Coding of LPC Area
Parameters. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages
2404-2407, 1987.
[7] B. S. Atal, R. V. Cox, and P. Kroon. Spectral Quantization and Interpolation
for CELP Coders. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 69-72, Glasgow, Scotland, May 1989.
[8] B. S. Atal and S. L. Hanauer. Speech Analysis and Synthesis by Linear Prediction
of the Speech Wave. J. Acoust. Soc. Amer., 50(2, Part 2):637-655, 1971.
[9] B. S. Atal and M. R. Schroeder. Predictive Coding of the Speech Signals. In
Proc. Conf. Speech Comm. and Processing, pages 360-361, Nov. 1967.
[10] B. S. Atal and M. R. Schroeder. Adaptive Predictive Coding of the Speech
Signals. Bell Syst. Tech. J., 49:1973-1986, Oct. 1970.
[11] B. S. Atal and M. R. Schroeder. Predictive Coding of Speech Signals and Sub-
jective Error Criteria. IEEE Trans. Acoust. Speech Signal Processing, ASSP-
27(3):247-254, June 1979.
[12] B. S. Atal and M. R. Schroeder. Stochastic Coding of Speech Signals at Very
Low Bit Rates. In Proc. Int. Conf. Comm., pages 1610-1613, May 1984.
[13] C. F. Barnes and R. L. Frost. Vector Quantizers with Direct Sum Codebooks.
IEEE Trans. Info. Theory, 39(2):565-580, Mar. 1993.
[14] C. F. Barnes, S. A. Rizvi, and N. M. Nasrabadi. Advances in Residual Vector
Quantization: A Review. IEEE Trans. Image Processing, 5(2):226-262, Feb.
1996.
[15] B. Bhattacharya, W. LeBlanc, S. Mahmoud, and V. Cuperman. Tree Searched
Multi-Stage Vector Quantization for 4 kb/s Speech Coding. In Proc. IEEE Inter.
Conf. Acoust., Speech, Signal Process., pages 1-105 - 1-108, San Francisco, March
1992.
[16] M. S. Brandstein, P. A. Monta, J. C. Hardwick, and J. S. Lim. A Real-Time
Implementation of the Improved MBE Speech Coder. In Proc. IEEE Inter. Conf.
Acoust., Speech, Signal Process., pages 5-8, Albuquerque, April 1990.
[17] A. Buzo, A. H. Gray Jr., R. M. Gray, and J . D. Markel. Speech Coding Based
Upon Vector Quantization. IEEE Trans. Acoust. Speech Signal Processing, ASSP-
28(5):562-574, Oct. 1980.
[18] J. P. Campbell and T. E. Tremain. Voiced/Unvoiced Classification of Speech
with Applications to the U.S. Government LPC-1OE Algorithm. In Proc. IEEE
Inter. Conf. Acoust., Speech, Signal Process., pages 473-476, Tokyo, April 1986.
[19] J . P. Campbell, T . E. Tremain, and V. C. Welch. The Federal Standard 1016
4800 bps CELP Voice Coder. Digital Signal Processing, 1(3):145-155, July 1991.
[20] W. Y. Chan, S. Gupta, and A. Gersho. Enhanced Multistage Quantization by
Joint Codebook Design. IEEE Trans. Comm., 40(11):1693-1697, Nov. 1992.
[21] R. E. Crochiere, S. A. Weber, and J. L. Flanagan. Digital Coding of Speech in
Sub-Bands. Bell Syst. Tech. J., 55:1069-1085, Oct. 1976.
[22] V. Cuperman. On Adaptive Vector Transform Quantization for Speech Coding.
IEEE Trans. Comm., 37(3):261-267, March 1989.
[23] V. Cuperman. Speech coding. Advances in Electronics and Electron Physics,
82:97-196, 1991.
[24] V. Cuperman and A. Gersho. Vector Predictive Coding of Speech at 16 Kbit/s.
IEEE Trans. Comm., COM-33(7):685-696, July 1985.
[25] V. Cuperman, P. Lupini, and B. Bhattacharya. Spectral Excitation Coding of
Speech at 2.4 kb/s. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 496-499, Detroit, May 1995.
[26] A. Das, A.V. Rao, and A. Gersho. Variable-Dimension Vector Quantization of
Speech Spectra for Low-Rate Vocoders. In Proc. Data Compression Conference,
pages 421-429, 1994.
[27] G. Davidson and A. Gersho. Complexity Reduction Methods for Vector Excita-
tion Coding. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages
3055-3058, Tokyo, April 1986.
[28] L. D. Davisson. Rate-Distortion Theory and Application. Proc. IEEE, 60:800-
808, 1972.
[29] A. De and P. Kabal. Cochlear discrimination: An auditory information-theoretic
distortion measure for speech coders. In Proc. 16th Biennial Symp. on Commun.,
pages 419-423, Kingston, Canada, May 1992.
[30] A. De and P. Kabal. Rate-Distortion Function for Speech Coding based on
Perceptual Distortion Measure. In Proc. Globecom, pages 452-456, Orlando,
Florida, Dec. 1992.
[31] H. Dudley. The Vocoder. Bell Labs Rec., 18:122-126, Dec. 1939.
[32] DVSI. INMARSAT M Voice Codec. USA, Feb. 1991. Version 1.3.
[33] N. Farvardin. A Study of Vector Quantization for Noisy Channels. IEEE Trans.
Info. Theory, IT-36:799-809, July 1990.
[34] N. Farvardin and R. Laroia. Efficient Encoding of Speech LSP Parameters Using
the Discrete Cosine Transform. In Proc. IEEE Inter. Conf. Acoust., Speech,
Signal Process., pages 168-171, Glasgow, May 1989.
[35] J. L. Flanagan. Speech Analysis, Synthesis and Perception. Springer-Verlag, New
York, 1972.
[36] J. L. Flanagan, M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S. Jayant, and
J . M. Tribolet. Speech Coding. IEEE Trans. Comm., COM-27(4):710-737, April
1979.
[37] P. Fleischer. Sufficient Conditions for Achieving Minimum Distortion in a Quan-
tizer. In IEEE Int Conv. Rec., pages 104-111, 1964.
[38] J . B. Fraleigh and R. A. Beauregard. Linear Algebra. Addison-Wesley Publishing
Company, second edition, 1990.
[39] S. Furui. Digital Speech Processing, Synthesis, and Recognition. Marcel Dekker
Inc., New York, 1989.
[40] A. Gersho. Principles of Quantization. IEEE Trans. Circuits and Systems, CAS-
25(7):427-436, July 1978.
[41] A. Gersho. Advances in Speech and Audio Compression. Proceedings of the
IEEE, 82(6):900-918, June 1994.
[42] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer
Academic Publishers, 1992.
[43] I. Gerson and M. Jasiuk. Vector Sum Excited Linear Prediction (VSELP) Speech
Coding at 4.8 kbps. In Proc. of Inter. Mob. Sat. Conf., pages 678-683, Ottawa,
1990.
[44] I. Gerson and M. Jasiuk. Vector Sum Excited Linear Prediction (VSELP) Speech
Coding at 8 Kb/s. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 461-464, Albuquerque, April 1990.
[45] O. Ghitza and J. L. Goldstein. Scalar LPC Quantization Based on Formant
JND's. IEEE Trans. Acoust. Speech Signal Processing, ASSP-34(4):697-708, Aug.
1986.
[46] J . D. Gibson and K. Sayood. Lattice Quantization. Advances in Electronics and
Electron Physics, 72:259-330, 1988.
[47] H. Gish and J.N. Pierce. Asymptotically Efficient Quantizing. IEEE Trans. Info.
Theory, IT-14:676-681, Sept. 1968.
[48] R.M. Gray, J.C. Kieffer, and Y. Linde. Locally Optimal Block Quantizer Design.
Inform. and Control, 45:178-198, May 1980.
[49] A.H. Gray Jr., R. M. Gray, and J.D. Markel. Comparison of Optimal Quan-
tizations of Speech Reflection Coefficients. IEEE Trans. Acoust. Speech Signal
Processing, ASSP-25:9-23, Feb. 1977.
[50] A.H. Gray Jr. and J.D. Markel. Distance Measures for Speech Processing. IEEE
Trans. Acoust. Speech Signal Processing, ASSP-24(5):380-391, Oct. 1976.
[51] A.H. Gray Jr. and J.D. Markel. Quantization and Bit Allocation in Speech
Processing. IEEE Trans. Acoust. Speech Signal Processing, ASSP-24(6):459-473,
Dec. 1976.
[52] D. W. Griffin. Multi-Band Excitation Vocoder. PhD thesis, Massachusetts Insti-
tute of Technology, 1987.
[53] D. W. Griffin and J. S. Lim. Signal Estimation from Modified Short Time Fourier
Transform. IEEE Trans. Acoust. Speech Signal Processing, ASSP-32(2):236-243,
April 1984.
[54] D. W. Griffin and J. S. Lim. Multiband Excitation Vocoder. IEEE Trans. Acoust.
Speech Signal Processing, 36(8):1223-1235, August 1988.
[55] P. Hedelin. A Tone Oriented Voice-Excited Vocoder. In Proc. IEEE Inter. Conf.
Acoust., Speech, Signal Process., pages 205-208, 1981.
[56] F. Itakura. Line Spectrum Representation of Linear Predictive Coefficients of
Speech Signals. J. Acoust. Soc. Amer., 57, Supplement No. 1:S35, 1975.
[57] F. I. Itakura and S. Saito. Analysis-Synthesis Telephony Based on the Maximum
Likelihood Method. In Proc. 6th Intern. Congr. Acoust., pages C17-20, Tokyo,
August 21-28 1968.
[58] N. S. Jayant. Digital Coding of Speech Waveforms: PCM, DPCM, and DM
Quantizers. Proceedings of the IEEE, 62:611-632, May 1974.
[59] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice Hall, Englewood
Cliffs, New Jersey, 1984.
[60] B. Juang, D. Y. Wong, and A. H. Gray Jr. Distortion Performance of Vector
Quantization for LPC Voice Coding. IEEE Trans. Acoust. Speech Signal Pro-
cessing, ASSP-30(2):294-303, April 1982.
[61] P. Kabal and R. P. Ramachandran. The Computation of Line Spectral Frequencies
Using Chebyshev Polynomials. IEEE Trans. Acoust. Speech Signal Processing,
ASSP-34(6):1419-1426, Dec. 1986.
[62] G.S. Kang and L.J. Fransen. Low-Bit Rate Speech Encoders Based on Line-
Spectrum Frequencies (LSFs). NRL Report 8857, Naval Research Laboratory,
Washington, D.C., Jan. 1985.
[63] W. B. Kleijn. Continuous Representations in Linear Predictive Coding. In Proc.
IEEE Inter. Conf. Acoust., Speech, Signal Process., pages 201-204, Toronto, May
1991.
[64] W. B. Kleijn and W. Granzow. Methods for Waveform Interpolation in Speech
Coding. Digital Signal Processing, 1(4):215-230, Oct. 1991.
[65] A. M. Kondoz. Digital Speech: Coding for Low Bit Rate Communication Systems.
John Wiley & Sons, Chichester, England, 1994.
[66] F. Kossentini, M. J. T. Smith, and C. F. Barnes. Necessary Conditions for
the Optimality of Variable-Rate Residual Vector Quantizers. IEEE Trans. Info.
Theory, 41(6):1903-1914, Nov. 1995.
[67] P. Kroon, E. Deprettere, and R. Sluyter. Regular-Pulse Excitation, A Novel
Approach to Effective and Efficient Multipulse Coding of Speech. IEEE Trans.
Acoust. Speech Signal Processing, ASSP-34:1054-1063, 1986.
[68] G. Kubin, B. S. Atal, and W. B. Kleijn. Performance of Noise Excitation for
Unvoiced Speech. In Proc. IEEE Workshop on Speech Coding for Telecommuni-
cations, pages 35-36, 1993.
[69] W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman. Efficient
Search and Design Procedures for Robust Multi-Stage VQ of LPC Parameters for
4 kb/s Speech Coding. IEEE Trans. Speech and Audio Processing, 1(4):373-385,
Oct. 1993.
[70] W. P. LeBlanc and S. A. Mahmoud. Structured Codebook Design in CELP. In
Proc. Inter. Mob. Sat. Conf., pages 667-672, Ottawa, June 1990.
[71] D. Lin. New Approaches to Stochastic Coding of Speech Sources at Very Low
Bit Rates. In I.T. Young et al., editors, Signal Processing III: Theories and
Applications, pages 445-447. Elsevier, North-Holland, Amsterdam, 1986.
[72] Y. Linde, A. Buzo, and R.M. Gray. An Algorithm for Vector Quantizer Design.
IEEE Trans. Comm., COM-28(1):84-95, Jan. 1980.
[73] S.P. Lloyd. Least Squares Quantization in PCM. IEEE Trans. Info. Theory,
IT-28:129-137, March 1982. (Originally, unpublished memorandum, Bell Labo-
ratories, 1957).
[74] P. Lupini and V. Cuperman. Non-Square Transform Vector Quantization. IEEE
Signal Processing Letters, 3(1):1-3, Jan. 1996.
[75] J. Makhoul, S. Roucos, and H. Gish. Vector Quantization in Speech Coding.
Proceedings of the IEEE, 73(11):1551-1588, Nov. 1985.
[76] J. D. Markel. The SIFT Algorithm for Fundamental Frequency Estimation. IEEE
Trans. Audio Electroacoust., AU-20:367-377, Dec. 1972.
[77] J. D. Markel and A. H. Gray Jr. A Linear Prediction Vocoder Simulation Based
upon the Autocorrelation Method. IEEE Trans. Acoust. Speech Signal Processing,
ASSP-23(2):124-134, April 1974.
[78] J. D. Markel and A. H. Gray Jr. Linear Prediction of Speech. Springer Verlag,
Berlin, 1976.
[79] J. D. Markel and A. H. Gray Jr. Implementation and Comparison of Two Trans-
formed Reflection Coefficient Scalar Quantization Methods. IEEE Trans. Acoust.
Speech Signal Processing, ASSP-28(5):575-583, Oct. 1980.
[80] J. S. Marques, L. B. Almeida, and J. M. Tribolet. Harmonic Coding at 4.8
Kb/s. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages 17-20,
Albuquerque, April 1990.
[81] J. Max. Quantizing for Minimum Distortion. IEEE Trans. Info. Theory, IT-6:7-
12, March 1960.
[82] R. J. McAulay and T. F. Quatieri. Speech Analysis/Synthesis Based on a Si-
nusoidal Representation. IEEE Trans. Acoust. Speech Signal Processing, ASSP-
34(4):744-754, August 1986.
[83] R. J. McAulay and T. F. Quatieri. Low-Rate Speech Coding Based on the
Sinusoidal Model. In S. Furui and M. Sondhi, editors, Advances in Speech Signal
Processing, chapter 6, pages 165-208. Marcel Dekker Inc., New York, 1992.
[84] D. L. Neuhoff and N. Moayeri. Tree Searched Vector Quantization with Interblock
Noiseless Coding. In Proc. 1988 Conf. Infor. Scien. Sys., pages 781-783, Mar.
1988.
[85] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono. Vector Quantized MBE
with Simplified V/UV Decision at 3.0 kbps. In Proc. IEEE Inter. Conf. Acoust.,
Speech, Signal Process., pages 151-154, Minneapolis, April 1993.
[86] K. K. Paliwal and B. S. Atal. Efficient Vector Quantization of LPC Parameters
at 24 bits/frame. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 661-664, Mar. 1991.
[87] K. K. Paliwal and B. S. Atal. Vector Quantization of LPC Parameters in the
Presence of Channel Errors. In IEEE Workshop on Speech Coding for Telecom-
munications, pages 33-35, Sept. 1991.
[88] P. E. Papamichalis and T. P. Barnwell III. Variable Rate Speech Compression
by Encoding Subsets of the PARCOR Coefficients. IEEE Trans. Acoust. Speech
Signal Processing, ASSP-31(3):704-713, June 1983.
[89] A. Papoulis. Signal Analysis. McGraw-Hill Book Co., Singapore, international
student edition, 1984.
[90] N. Phamdo and N. Farvardin. Coding of Speech LSP Parameters Using TSVQ
with Interblock Noiseless Coding. In Proc. IEEE Inter. Conf. Acoust., Speech,
Signal Process., pages 193-196, 1990.
[91] N. Phamdo, N. Farvardin, and T. Moriya. Combined Source-Channel Coding of
LSP parameters Using Multi-Stage Vector Quantization. In IEEE Workshop on
Speech Coding for Telecommunications, pages 36-38, 1991.
[92] L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Prentice
Hall, Englewood Cliffs, N.J., 1978.
[93] K.R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, and
Applications. Harcourt Brace Jovanovich, Boston, 1990.
[94] M. J. Sabin and R. M. Gray. Product Code Vector Quantizers for Waveform and
Voice Coding. IEEE Trans. Acoust. Speech Signal Processing, ASSP-32(3):474-
488, June 1984.
[95] R. A. Salami, L. Hanzo, and D. G. Appleby. A Fully Vector Quantized Self-
Excited Vocoder. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 124-127, Glasgow, May 1989.
[96] C. E. Shannon. Coding Theorems for a Discrete Source with a Fidelity Criterion.
In IRE Nat. Conv. Rec., Part 4, pages 142-163, Mar. 1959.
[97] Y . Shoham. High Quality Speech Coding at 2.4 to 4.0 Kbps Based on Time-
Frequency Interpolation. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 167-170, Minneapolis, April 1993.
[98] B. Smith. Instantaneous Companding of Quantized Signals. Bell Syst. Tech. J.,
27:446-472, 1948.
[99] F.K. Soong and B.H. Juang. Line Spectrum Pair (LSP) and Speech Data Com-
pression. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages
1.10.1-1.10.4, San Diego, CA, March 1984.
[100] N. Sugamura and N. Farvardin. Quantizer Design in LSP Speech Analysis-
Synthesis. IEEE J. Selected Areas in Comm., 6(2):432-440, Feb. 1988.
[101] Y. Tohkura and F. Itakura. Spectral Sensitivity Analysis of PARCOR Parameters
for Speech Data Compression. IEEE Trans. Acoust. Speech Signal Processing,
ASSP-27(3):273-280, June 1979.
[102] Y. Tohkura, F. Itakura, and S. Hashimoto. Spectral Smoothing Technique in
PARCOR Speech Analysis-Synthesis. IEEE Trans. Acoust. Speech Signal Pro-
cessing, ASSP-26(6):587-596, Dec. 1978.
[103] T. E. Tremain. The Government Standard Linear Predictive Coding Algorithm:
LPC-10. Speech Technology, pages 40-49, April 1982.
[104] F. F. Tzeng. Analysis-by-Synthesis Linear Predictive Speech Coding at 2.4 kbit/s.
In Proc. Globecom, pages 1253-1257, 1989.
[105] T. Umezaki and F. Itakura. Analysis of Time Fluctuating Characteristics of
Linear Predictive Coefficients. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 1257-1261, 1986.
[106] C. K. Un and D. T. Magill. The Residual-Excited Linear Prediction Vocoder with
Transmission Rate Below 9.6 kbits/s. IEEE Trans. Comm., COM-23(12):1466-
1474, Dec. 1975.
[107] R. Viswanathan and J. Makhoul. Quantization Properties of Transmission Pa-
rameters in Linear Predictive Systems. IEEE Trans. Acoust. Speech Signal Pro-
cessing, ASSP-23:309-321, June 1975.
[108] D. Wong, B. Juang, and A. H. Gray Jr. An 800 bit/s Vector Quantization LPC
Vocoder. IEEE Trans. Acoust. Speech Signal Processing, ASSP-30(5):770-780,
Oct. 1982.
[109] R.C. Wood. On Optimum Quantization. IEEE Trans. Info. Theory, IT-15:248-
252, March 1969.
[110] S. Yeldener, A.M. Kondoz, and B.G. Evans. High Quality Multiband LPC Coding
of Speech at 2.4 kbit/s. Electronics Letters, 27(14):1287-1289, July 4 1991.
[111] M. Young, G. Davidson, and A. Gersho. Encoding of LPC Spectral Parameters
Using Switched-Adaptive Interframe Vector Prediction. In Proc. IEEE Inter.
Conf. Acoust., Speech, Signal Process., pages 402-405, 1988.
[112] K. A. Zeger and A. Gersho. Zero Redundancy Channel Coding in Vector Quan-
tization. Electronics Letters, 23:654-656, May 1987.
[113] R. Zelinski and P. Noll. Adaptive Transform Coding of Speech Signals. IEEE
Trans. Acoust. Speech Signal Processing, ASSP-25:299-309, Aug. 1977.