EFFICIENT VECTOR QUANTIZATION OF LPC
PARAMETERS FOR HARMONIC SPEECH CODING
by
Bhaskar Bhattacharya
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in the School
of
Engineering Science
© Bhaskar Bhattacharya 1996
SIMON FRASER UNIVERSITY
October, 1996
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
APPROVAL

Name: Bhaskar Bhattacharya

Degree: Doctor of Philosophy

Title of thesis: Efficient Vector Quantization of LPC Parameters for Harmonic Speech Coding

Examining Committee: Dr. John Jones, Chairman

Dr. Vladimir Cuperman, Senior Supervisor
Professor, Engineering Science, SFU

Dr. Paul Ho, Supervisor
Associate Professor, Engineering Science, SFU

Dr. Jacques Vaisey, Supervisor
Assistant Professor, Engineering Science, SFU

Dr. Jim Cavers, Internal Examiner
Professor, Engineering Science, SFU

Dr. Sanjit K. Mitra, External Examiner
Professor, Electrical and Computer Engineering, University of California, Santa Barbara

Date Approved: October 11, 1996
PARTIAL COPYRIGHT LICENSE
I hereby grant to Simon Fraser University the right to lend my thesis, project or extended essay (the title of which is shown below) to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. I further agree that permission for multiple copying of this work for scholarly purposes may be granted by me or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without my written permission.
Title of Thesis/Project/Extended Essay
"Efficient Vector Quantization of LPC Parameters for Harmonic Speech Coding"
Author:
(signature)
(name)
October 11, 1996 (date)
Abstract
The present thesis deals with the problem of efficient (in bit rate and computational
complexity) quantization of Linear Prediction Coding (LPC) parameters for low bit
rate speech coding. The thesis introduces a new LPC quantization technique based on
Multi-Stage Vector Quantization (MSVQ) combined with a multi-candidate M-L
search. The resulting procedure is assessed by evaluating the quantization spectral
distortion on a speech data-base and by evaluating the subjective speech quality of a
low-rate speech coder which employs the MSVQ LPC quantization.
The general structure of MSVQ is described along with a geometrical interpreta-
tion to provide insight into the structure of the reproduction alphabet in MSVQ. In
particular, it is shown that MSVQ codevectors provide a tiling of the sample space
with repetitive patterns. Two tree-search techniques are suggested and one of them,
the M-L search technique, is studied in more detail.
The experimental results obtained with MSVQ indicate that transparent quan-
tization of LSFs (Line Spectral Frequencies - an efficient LPC representation) can
be achieved with just 22 bits/vector with computational complexity comparable to
the Split VQ at 24 bits/vector. Alternatively, transparent quantization of LSFs can
be done using 24 bits/vector (as is done using Split VQ) at a much lower computa-
tional complexity. Several results relating performance and complexity trade-offs are
reported showing that MSVQ is a very flexible approach which provides a wide range
of performance-complexity trade-offs and good robustness.
The performance of MSVQ codes has been studied under channel error condi-
tions, with codebook index ordering by pseudo-Gray coding. It is shown that while VQ
based systems have lower average spectral distortion and a lower percentage of 2-4
dB outliers even with transmission errors, scalar quantization may lead to a lower
percentage of 4 dB outliers particularly at high error rates.
The performance of the MSVQ codes has also been studied for the effects of language
and input spectral shape. It has been shown that MSVQ codes become more robust as
the number of stages is increased.
Finally, one of the MSVQ codes developed here has been used to implement a
1800 bps speech coder using a harmonic coding of excitation and a very coarse 0-bit
quantization of harmonic spectral shape. The speech quality of the 1800 bps coder
was better than that of the 2400 bps LPC-10e coder.
Acknowledgements
I would like to thank Prof. Vladimir Cuperman for all his guidance and patience
throughout this work. His suggestions were very helpful during the course of this research.
I also thank Dr. Jacques Vaisey and Dr. Paul Ho for being on my advisory committee
and making constructive criticism of the work.
I wish to express my heartfelt gratitude to my wife Roma for all her encourage-
ment and tolerance, and to all my friends, particularly Peter Lupini, Aamir Husain,
and Yingbo Jiang, for the exciting discussions that made research a lively occupation.
I also obtained a lot of help in keeping my spirits up from my friends Hong Shi and
Jacqueline Duffy, my sincere thanks to them.
Contents
Abstract ................................................................... iii
Acknowledgements ........................................................ v
List of Tables .............................................................. x
List of Figures ............................................................. xi
1 Introduction ........................................................... 1
1.1 Speech Coding Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Waveform Coders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Parametric Coders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Speech Coding Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Motivation and Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Original Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 A Brief Review of Speech Coding Literature ........................ 13
2.1 Source Coding and Rate Distortion Theory . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Analysis-by-synthesis Speech Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Transform Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Sinusoidal Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Relative Merits and Demerits of Different Coding Strategies . . . . . . . . . 26
3 Quantization of LPC parameters ..................................... 28
3.1 Choosing an Appropriate Spectral Representation . . . . . . . . . . . . . . . . . . 29
3.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Pre-emphasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Bandwidth Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 High Frequency Compensation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Vector Quantization of LPC Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Stochastic VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Techniques Exploiting Interframe Correlations . . . . . . . . . . . . . . 40
3.4 Constrained (suboptimal) VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Tree Structured VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Classified VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.3 Product Code VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.4 Basis Vector VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.5 Multi-Stage VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.6 Partitioned VQ (Split VQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Multi-Stage VQ of LPC Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Suboptimality of Sequential Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.1 Optimality conditions for sequential search . . . . . . . . . . . . . . . . . 62
4.2 Search Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.1 Search Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.2 Detailed Analysis of The Search Complexity . . . . . . . . . . . . . . . . 71
4.3 Codebook Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Centroid Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Outlier Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Choice of Parameter Representation and Distance Measure . . . . . . . . . . 75
4.5 Performance and Complexity Trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.6 Robustness Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.1 Effect of Language and Input Spectral Shape . . . . . . . . . . . . . . . 80
4.6.2 Performance in the presence of channel errors . . . . . . . . . . . . . . . 82
4.7 Improved Codebook Designs for Multi-Stage VQ . . . . . . . . . . . . . . . . . . . 85
4.7.1 Iterative Sequential Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.7.2 Simultaneous Joint Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.8 Recent Developments in MSVQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5 A Low Rate Spectral Excitation Coder .............................. 89
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Architecture of a Very-Low Rate Spectral Excitation Coder . . . . . . . . . 91
5.2.1 Treatment of Unvoiced Segments . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Computation of the Unquantized Residual . . . . . . . . . . . . . . . . . . . 93
5.4 Estimation and Quantization of Harmonic Parameters . . . . . . . . . . . . 94
5.4.1 Pitch Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Modelling of Harmonic Phases . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.3 Estimation and Quantization of Harmonic Magnitudes . . . . . . . . . 105
5.5 An 1800 bps Spectral Excitation Coder . . . . . . . . . . . . . . . . . . . . . 111
5.5.1 Evaluation of Coder Performance . . . . . . . . . . . . . . . . . . . . . 113
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6 Conclusion and Future Directions .................................... 115
A Linear Prediction ...................................................... 117
A.1 Conceptual Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.2 Equivalent Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.2.1 Computation of Line Spectral Frequencies . . . . . . . . . . . . . . . . 125
A.3 Maximum Entropy Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
B Quantization ........................................................... 131
B.1 Scalar Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
B.1.1 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.2 Robust Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
B.1.3 Optimum Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
B.2 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
B.2.1 Vector Quantizer Performance . . . . . . . . . . . . . . . . . . . . . . . 144
B.2.2 Optimum VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
B.2.3 VQ Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
C Pitch Computation Algorithm ........................................ 152
D List of Citations ....................................................... 156
References ................................................................ 159
List of Tables
1.1 Digital Speech Coding Standards . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Some important ITU-T recommendations . . . . . . . . . . . . . . . . . . . 10
3.1 Some early scalar quantization results . . . . . . . . . . . . . . . . . . . . . 31
3.2 Channel error performance of Basis Vector VQ . . . . . . . . . . . . . . . . 50
4.1 MSVQ Configurations and Rates Producing an Average Spectral Distortion of 1 dB . . . . 80
4.2 Spectral Distortion Performance over Different Languages and Input Spectral Shapes . . . . 81
4.3 Percentage of Outliers (2-4 dB) for Different Languages and Input Spectral Shapes . . . . 81
4.4 Average Spectral Distortion for Different Error Rates and Codes . . . . . . 84
4.5 Percentages of Outliers for Different Error Rates and Codes . . . . . . . . . 84
5.1 Bit Allocation for the 1800 bps coder . . . . . . . . . . . . . . . . . . . . . . 113
5.2 MOS results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
C.1 Values of empirical constants used in 1800 bps coder . . . . . . . . . . . . . 155
List of Figures
1.1 A classification of speech coders . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A generalized predictive coder . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The Source-Filter Parametric Coder . . . . . . . . . . . . . . . . . . . . . . 4
1.4 LPC-10 Speech Synthesis Model . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 A schematic diagram of the CELP coder . . . . . . . . . . . . . . . . . . . . 7
1.6 Historical bit rates of toll quality coders . . . . . . . . . . . . . . . . . . . . 8
2.1 The primary parameters of R-D theory . . . . . . . . . . . . . . . . . . . . 15
2.2 A Generalized Analysis-by-Synthesis System . . . . . . . . . . . . . . . . . 18
2.3 Computational structure of the CELP coder . . . . . . . . . . . . . . . . . 19
2.4 Schematic diagram of a Transform Coder . . . . . . . . . . . . . . . . . . . 22
3.1 Spectral envelope of speech without (solid line) and with (dash line) high frequency compensation . . . . 35
3.2 SIVP coding system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 A tree-searched VQ for m = 3 . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 A tree structured VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Classified VQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 The Split VQ Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Structure of a two-stage two dimensional VQ . . . . . . . . . . . . . . . . . 58
4.2 A sequentially searched multi-stage VQ . . . . . . . . . . . . . . . . . . . . 59
4.3 Voronoi regions for a two-stage MSVQ . . . . . . . . . . . . . . . . . . . . . 61
4.4 Growing Tree search of a three stage VQ . . . . . . . . . . . . . . . . . . . 65
4.5 M-L Tree search of a three stage VQ . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Failure of multi-candidate search in a 2-stage VQ . . . . . . . . . . . . . . . 67
4.7 Failure of M-L search in a 3-stage VQ . . . . . . . . . . . . . . . . . . . . . 68
4.8 Performance of LSF-6+6 MSVQ with M-L search . . . . . . . . . . . . . . 71
4.9 Performance comparison of LAR and LSF codebooks with M-L search . . . 77
4.10 Spectral distortion of M-L Tree searched MSVQ at 24 bits/vector . . . . . 78
4.11 M-L search performance versus search complexity for different rates . . . . 79
4.12 Performance over different languages and input spectral shapes . . . . . . . 82
5.1 Magnitude spectrum of a voiced speech segment and corresponding LPC residual . . . . 90
5.2 A conceptual schematic of a spectral excitation coder . . . . . . . . . . . . 92
5.3 Analysis of SEC parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Performance of the geometric pitch detector . . . . . . . . . . . . . . . . . . 100
5.5 Pitch pulses marked by the pitch detector . . . . . . . . . . . . . . . . . . . 101
5.6 Difference between measured and predicted phase changes for a voiced frame . . . . 103
5.7 Difference between measured and predicted phase changes for an unvoiced frame . . . . 104
5.8 Frequency sampling points for a P-point DFT . . . . . . . . . . . . . . . . 106
5.9 Log magnitude spectrum templates for voiced and unvoiced speech . . . . . 109
5.10 A Low bit rate Spectral Excitation Coder . . . . . . . . . . . . . . . . . . . 112
A.1 Linear Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.2 Stepped cylinder model of the vocal tract . . . . . . . . . . . . . . . . . . . 123
A.3 Transformation of Predictor coefficients to LSFs . . . . . . . . . . . . . . . 124
A.4 Plots showing relationships between LSFs and other parameters . . . . . . 128
B.1 A typical mid-tread scalar quantizer . . . . . . . . . . . . . . . . . . . . . . 132
B.2 Additive noise model of quantization . . . . . . . . . . . . . . . . . . . . . . 133
B.3 Compander model of nonuniform quantization . . . . . . . . . . . . . . . . 136
B.4 A uniform joint pdf over a rectangular region (shown shaded) along with the marginal pdf's . . . . 143
B.5 A vector quantizer satisfying the necessary conditions . . . . . . . . . . . . 149
Chapter 1
Introduction
Recent advances in Multimedia Communication and the real possibility of an impend-
ing integrated services network have generated a lot of interest in digital coding of
speech. With increasing demand on the bandwidth, more and more emphasis is being
placed on low bit rate speech coders. The present thesis addresses an important prob-
lem in low bit rate speech coding - that of efficient quantization of LPC parameters.
A low bit rate coder based on harmonic excitation is also presented that produces
good speech quality at rates below 2 kb/s.
1.1 Speech Coding Techniques
Detailed reviews of different speech coding techniques can be found in [36, 23, 41].
A brief overview is presented below. Speech coding algorithms can be categorized in
different ways depending on the criterion used. The most common classification of
coding systems divides them into two main categories: waveform coders and parametric
coders. The waveform coders, as the name implies, try to preserve the waveform being
coded and pay no attention to the fact that the signal being coded is speech. The
parametric coders, on the other hand, depend upon a parsimonious description of
speech using a priori knowledge about how the signal was generated at the source. The
idea is that certain physical constraints of the signal generation can be quantified, and
turned to advantage in efficiently describing the signal. This implies that the signal
must be fitted into a specific mold and parameterized accordingly. These coding
techniques which exploit constraints of signal generation are also called source coders
or vocoders (VOice CODERS).
Some coders use a mixture (or hybrid) of these two approaches. They use a synthe-
sis filter that models the vocal tract but attempt to quantize the excitation sequence
through a waveform matching procedure. We have put these coders under the cat-
egory of parametric coders in our classification. A broad classification of different
speech coders is shown in Fig. 1.1.
[Figure: a tree classifying Speech Coding Systems into Waveform Coders (Time Domain: DM, DPCM, ADPCM, APC, VPC; Frequency Domain: SBC, ATC) and Parametric Coders (Direct Speech Encoding: STC, MBE; Excitation Encoding: Open Loop: LPC-10, RELP, SEC; Closed Loop: MP-LPC, CELP, VSELP; Mixed: PWI, TFI).]
Figure 1.1: A classification of speech coders
1.1.1 Waveform Coders
The waveform coders operate either in the time domain or in the frequency domain
and can be classified accordingly.
1.1.1.1 Time Domain Waveform Coders
The time domain waveform coders are all predictive coders, in that they code infor-
mation that cannot be predicted from already reconstructed speech signals. They
Figure 1.2: A generalized predictive coder
evolved from DM (Delta Modulation) [58] which uses a first order fixed predictor
and a one-bit adaptive quantizer, to VPC (Vector Predictive Coding) [24] which uses
a vector predictor and a vector quantizer for the error sequence. APC (Adaptive
Predictive Coding) [9, 10, 11] is a technique that uses a scalar, higher (> 1) order,
predictor to predict both short-term and long-term structures of speech signal and
optionally uses a filtered quantization error feedback to control noise spectrum. A
schematic diagram of a generalized APC coder is shown in Fig. 1.2.
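The predictive principle shared by these coders, quantize only what the predictor cannot anticipate, can be sketched with the simplest member of the family, delta modulation. The following is a toy illustration, not code from any of the cited coders; the step-size adaptation rule and its constants are our own assumptions:

```python
def dm_encode(x, step=0.1, alpha=1.5):
    """One-bit adaptive delta modulation of a sample sequence x.

    Uses a first order fixed predictor (the previous reconstructed sample)
    and a one-bit quantizer on the prediction error; the step size grows on
    runs of equal bits and shrinks on alternations (constants illustrative).
    """
    bits, recon = [], []
    pred = 0.0
    for s in x:
        b = 1 if s >= pred else 0        # sign of the prediction error
        pred += step if b else -step     # reconstruction = prediction + quantized error
        bits.append(b)
        recon.append(pred)
        if len(bits) >= 2:               # adapt the quantizer step size
            step = step * alpha if bits[-1] == bits[-2] else step / alpha
    return bits, recon
```

A decoder needs only the bit stream and the same adaptation rule, which is what makes the scheme a one-bit-per-sample waveform coder.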
1.1.1.2 Frequency Domain Waveform Coders
Sub Band Coding (SBC) [21] divides the speech spectrum into four or five sub-bands
using a bank of bandpass filters. Each sub-band is translated to base-band by a
single-sideband modulation process, resampled at its Nyquist rate, and encoded by
adaptive quantization or ADPCM. In the receiver, the sub-bands are decoded, mod-
ulated back to their original position in the frequency domain, and summed to give a
reconstruction of the original signal. The spectral shape of the quantization noise is
controlled by bit-allocation.
In Adaptive Transform Coding (ATC) [113], the speech signal is subdivided into
blocks and a transform is applied to each block. The transform coefficients are adap-
tively quantized and transmitted to the receiver where they are decoded and inverse
transformed to obtain the waveform.
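The analyze-quantize-synthesize round trip of a transform coder can be sketched as follows. This is an illustrative example using an orthonormal DCT-II/DCT-III pair and a single uniform quantizer; a real ATC system adapts the bit allocation per coefficient, which is not shown here:

```python
import math

def dct(x):
    """Orthonormal DCT-II of one block (a typical transform for ATC)."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(xn * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n, xn in enumerate(x))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

def idct(X):
    """Inverse transform (DCT-III) reconstructing the block."""
    N = len(X)
    return [sum((math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) * Xk
                * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k, Xk in enumerate(X))
            for n in range(N)]

def quantize(X, step=0.5):
    """Uniform quantization of the transform coefficients."""
    return [step * round(Xk / step) for Xk in X]
```

Because the transform is orthonormal, the quantization error in the reconstructed waveform is bounded by the coefficient quantization error.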
1.1.2 Parametric Coders
Right from its introduction [57, 8, 77], linear prediction has been very successful in
coding speech. A very popular model used for speech production is the source-filter
model. The sound generating mechanism (the source) is assumed to be linearly sep-
arable from the intelligence-modulating vocal tract (the filter) (Fig. 1.3). The speech
signal, s(n), is analyzed to compute a set of excitation control parameters, J(n), and a
set of synthesis filter control parameters, a(n). The output of the excitation generator,
e(n), when passed through the synthesis filter produces reconstructed speech, ŝ(n).
[Figure: an excitation generator driving a synthesis filter]
Figure 1.3: The Source-Filter Parametric Coder
Despite the success of the source-filter model, some coders do not use it, and
attempt to model the speech signal as a whole. Thus, the class of parametric coders
can be further subdivided into those that attempt to model the speech directly, and
those that attempt to model the excitation sequence and the synthesis filter separately.
1.1.2.1 Direct Speech Encoding
A powerful speech modelling technique uses a sum of sinusoids model to represent
speech signals. This is represented by
    s(n) = Σ_m A_m(n) cos(θ_m(n))        (1.1)

where m is the harmonic number and the summation is taken over the number of
harmonics, which varies with time.
This was first introduced by Hedelin [55] and later developed by Almeida and
Tribolet [3], McAulay and Quatieri [82, 83], and Marques, Almeida and Tribolet [80].
This technique has been called Harmonic Coding and Sinusoidal Transform Coding
(STC) by different authors.
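The synthesis side of Eq. (1.1) can be rendered for a single frame as the following sketch. It assumes constant amplitudes and linearly evolving phases within the frame; actual sinusoidal coders interpolate A_m(n) and θ_m(n) smoothly across frame boundaries, and the function name and arguments here are our own:

```python
import math

def synthesize_frame(amps, freqs, phases, n_samples):
    """Sum-of-sinusoids synthesis, s(n) = sum_m A_m cos(theta_m(n)).

    Each harmonic m has constant amplitude A_m and a linear phase track
    theta_m(n) = 2*pi*f_m*n + phi_m (f_m is a normalized frequency).
    """
    return [sum(A * math.cos(2.0 * math.pi * f * n + p)
                for A, f, p in zip(amps, freqs, phases))
            for n in range(n_samples)]
```

The number of entries in `amps` (the number of harmonics) typically changes from frame to frame with the pitch, as the text notes.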
A slightly different form of sinusoidal speech modelling was done by Griffin and
Lim [54]. A closed loop estimation was done for pitch and harmonic magnitudes. The
speech spectrum was divided into voiced and unvoiced bands and voiced and unvoiced
components of a speech frame were synthesized differently. The voiced component
was synthesized in the time domain using Eq. (1.1) and the unvoiced component was
computed from a synthetic DFT using the overlap-add method [53]. They were added
together to form the synthetic speech signal. This technique, although performed
directly on the speech signal is called Multi Band Excitation (MBE). One version of
MBE, called improved MBE (IMBE) [16], was subsequently adopted by INMARSAT
as a standard for satellite voice communication. Another version [85] is currently
under consideration for the TIA half-rate TDMA digital cellular standard. Typical
bit rates for sinusoidal coders range from 4.1 kb/s to 9.6 kb/s.
1.1.2.2 Excitation Encoding
The oldest parametric coder is the Channel Vocoder by Dudley [31]. It exploits the
insensitivity of the aural mechanism to phase, and only attempts to reproduce the
short time power spectrum of the speech waveform. The spectral envelope of the
speech is measured with a bank of filters and ascribed wholly to the vocal tract filter,
while the excitation is estimated to be either a quasi-periodic pulse train, or noise.
In recent coders, that use excitation modelling, the synthesis filter is computed
from a linear prediction analysis of segments of speech and uses what are called LPC
parameters. A variety of techniques are used to represent the excitation signal. So,
the problem in this class of coders is how to quantize the LPC parameters and the
excitation most efficiently. In some coders the excitation is chosen in a closed loop
fashion so as to minimize a perceptually significant distortion between the original and
synthetic speech, and some others use an open loop approach without any reference
to the synthetic speech. There are also some mixed approaches where a classifier is
used and different classes are dealt with in an open or closed loop manner (Fig. 1.1).
[Figure: an excitation generator driving a synthesis filter to produce speech]
Figure 1.4: LPC-10 Speech Synthesis Model
Open loop techniques
The oldest speech coding standard, LPC-10 (U.S. Government Federal Standard 1015)
[103, 18], uses a 10th order synthesis filter, and pulses and random sequences as the
excitation (Fig. 1.4). The LPC parameters are represented as reflection coefficients
and are scalar quantized. Regular pulses at pitch intervals are used as excitation for
voiced portions and a white random sequence is used for unvoiced portions of the
speech being coded. The energy distribution is maintained by a gain parameter.
A modification of the LPC-10 called RELP (Residual Excited Linear Prediction)
[106] uses a quantized low-pass filtered version of the residual as the excitation and
avoids the problem of classification and computation of pitch.
The Spectral Excitation Coder (SEC) [25] uses a sum-of-sinusoids model to syn-
thesize the excitation signal which is passed through an LPC based synthesis filter
to produce speech. Since the residual is more spectrally flat than speech itself, it
offers advantages in quantizing the harmonic magnitudes over conventional sinusoidal
coders.
Closed loop techniques
The hybrid coders CELP (Code Excited Linear Prediction) [12] and VSELP (Vector
Sum Excited Linear Prediction) [44] employ the same source-filter model (Fig. 1.3)
but the excitation is selected from a fixed and an adaptive codebook in a closed loop
fashion known as analysis by synthesis. A schematic structure of the CELP coder is
shown in Fig 1.5.
VSELP models the excitation sequence as a linear combination of a fixed set of
[Figure: CELP schematic showing the input speech and adaptive codebook paths]
Figure 1.5: A schematic diagram of the CELP coder
M basis vectors:

    u_i(n) = Σ_{m=1}^{M} θ_im v_m(n)

where 0 ≤ i ≤ 2^M − 1 and 0 ≤ n ≤ N − 1. The linear combination coefficients θ_im
are restricted to either +1 or −1. This simplifies the procedure of codebook search
for optimum innovation and also makes the system comparatively robust to bit errors
as a single bit error only affects one component. Computational complexity is also
reduced for a joint optimal search of the VSELP codebook and the adaptive codebook
as it requires orthogonalization of a small (typically 10) number of basis vectors only.
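The ±1 basis-vector combination described above can be sketched in a few lines. This is a toy illustration with our own function and variable names; in real VSELP the basis vectors are trained and have the excitation frame length:

```python
def vselp_codevector(i, basis):
    """Build codevector u_i from M basis vectors with +/-1 weights.

    Bit m of the index i selects the sign of basis vector m, so M basis
    vectors generate 2**M codevectors, and flipping one bit of i flips
    the sign of exactly one basis vector's contribution.
    """
    M = len(basis)
    signs = [1.0 if (i >> m) & 1 else -1.0 for m in range(M)]
    return [sum(s * v[n] for s, v in zip(signs, basis))
            for n in range(len(basis[0]))]
```

The single-sign-flip property is exactly the bit-error robustness argument made in the text.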
MP-LPC (Multi Pulse LPC) [5] and RPE (Regular Pulse Excitation) [67] are pre-
cursors to CELP that use codebooks of pulse trains whose positions and amplitudes
are determined in a closed loop fashion.
Mixed techniques
Different approaches may be applied in modelling different segments of the exci-
tation. In particular, advantage can be taken of the apparent periodicity of
the voiced portions of speech. The techniques Prototype Waveform Interpolation
(PWI) [63, 64] and Time Frequency Interpolation (TFI) [97] use open loop frequency
domain interpolation techniques to model the gradually changing pitch cycles of a
voiced excitation while using closed loop techniques like CELP for unvoiced segments
which are difficult to model parametrically due to lack of specific spectral structures.
1.1.3 Speech Coding Standards
A summary of different speech coding standards currently in use is shown in Table 1.1.
The ITU-T (formerly CCITT) has also passed some recommendations (Table 1.2) for
digital coding of speech. The progression of toll/near-toll quality speech coding can
be seen in Fig. 1.6 where bit rates of toll quality coders have been plotted with the
year of their introduction.
Figure 1.6: Historical bit rates of toll quality coders
1.2 Motivation and Original Contributions
1.2.1 Motivation
For low bit rate speech coders that employ the source-filter model, a large portion of
the bit rate is invested in coding synthesis filter parameters. Obviously, one way to
improve synthetic speech quality at low bit rates will be to minimize the number of
Rate (kb/s)  Application                              Coding Algorithm                                            Year Adopted
64           PSTN (1st Generation)                    Pulse Code Modulation (PCM)                                 1972
32           PSTN (2nd Generation)                    Adaptive Differential PCM (ADPCM)                           1984
16           PSTN (3rd Generation)                    Low Delay Code Excited Linear Predictive Coding (LDCELP)    1992
16           INMARSAT Standard B (Maritime)           Adaptive Predictive Coding (APC)                            1985
13           Pan European Digital Mobile Radio (DMR)  Regular Pulse Excitation Long Term Prediction (RPE-LTP)     1991
             Cellular System (GSM)
9.6          Skyphone (Aeronautical)                  Multi-Pulse Linear Predictive Coding (MPLPC)                1990
8            North American DMR (Mobile)              Vector Sum Excited Linear Predictive Coding (VSELP)         1992
6.7          Japanese DMR (Mobile)                    VSELP                                                       1993
6.4          INMARSAT Standard M (Land-Mobile)        Multi-Band Excitation (MBE)                                 1993
4.8          U.S. Government Federal Standard 1016    CELP                                                        1991
4.8          NASA MSAT-X (Mobile Satellite)           Vector Adaptive Predictive Coding (VAPC)                    1991
2.4          U.S. Government Federal Standard 1015    Linear Predictive Coding (LPC-10)                           1977

Table 1.1: Digital Speech Coding Standards
Recommendation | Code Rate (kb/s) | Algorithm
G.711          | 64               | PCM
G.726          | 16, 24, 32, 40   | ADPCM
G.728          | 16               | LD-CELP
G.729          | 8                | ACELP

Table 1.2: Some important ITU-T recommendations
bits used to represent LPC parameters while keeping the spectral distortion within
acceptable limits. The bits thus saved can be used for a better representation of the
excitation.
It has already been reported [86] that a spectral distortion of less than 1 dB
is required for transparent quantization of LPC parameters. Paliwal and Atal [86]
achieved transparent quantization of LPC parameters using a highly constrained VQ
structure, split VQ. This was somewhat surprising, since multi-stage VQs had previously
failed to achieve transparent quantization of spectral parameters, and split VQ is
clearly a constrained version of multi-stage VQ (MSVQ).
An analysis of the multi-stage VQ showed that a sequentially searched MSVQ has
to be severely constrained for the sequential search to be optimal. Clearly, a better
search could be performed, and as we show in this thesis, the M-L search provided the
best performance-complexity trade-off in obtaining transparent quantization of LPC
parameters.
An 1800 bps spectral excitation coder was also implemented to show the effectiveness
of the new efficient LPC quantizer in achieving a moderate-quality coder (better than
LPC-10e) at a low bit rate.
1.2.2 Original Contributions
The original contributions reported in this thesis are as follows.
For the first time, M-L search was combined with MSVQ, resulting in a very
efficient, low complexity, suboptimal VQ (section 4.5).
It was demonstrated for the first time that transparent LPC quantization could
be done using an MSVQ with a large number of small stages. In fact, the memory
complexity was reduced to a total of only 60 codevectors for a quantizer that
achieved transparent quantization at 30 bits/vector (Fig. 4.11).
A method for designing LPC quantizers with very low computational complexity
was indicated, resulting in complexity lower than that of the only transparent quantizer
known at that time (split VQ by Paliwal and Atal) (sections 4.2-4.4).
A transparent LPC quantizer was designed at 22 bits/vector which was the
lowest rate transparent LPC quantizer at that time (Table 4.1).
MSVQ with M-L search was shown to result in a VQ that is robust with respect to
transmission errors, speakers, and languages. It was the first time that a VQ
with proven robustness was obtained (section 4.6).
In designing the harmonic coder, a 0-bit harmonic magnitude shape quantizer
was used, which helped to achieve a low bit rate for the coder (section 5.4.3).
A new geometric pitch detector with low computational complexity was designed.
The pitch detector can provide the locations of individual pitch pulses, which
is useful in pitch synchronous algorithms (section 5.4.1).
After publication of the first set of results at ICASSP 1992, our work has been very
widely referenced (a partial list of citations is given in Appendix D). Several companies,
such as Rockwell and Texas Instruments, have integrated our LPC quantizer into their
products. Also, to the best of our knowledge, the new DoD standard 2400 bps coder
uses our LPC quantizer.
The following is the list of publications that resulted from the work reported in
this thesis.
1. B. Bhattacharya, W. LeBlanc, S. Mahmoud, and V. Cuperman. Tree Searched
Multi-Stage Vector Quantization for 4kb/s Speech Coding. ICASSP, pp. 1-105
- 1-108, San Francisco, March 1992.
2. W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman. Efficient
Search and Design Procedures for Robust Multi-Stage VQ of LPC Parameters
for 4 kb/s Speech Coding. IEEE Trans. Speech and Audio Processing, Vol. 1,
No. 4, pp. 373-385, Oct. 1993.
3. V. Cuperman, P. Lupini, and B. Bhattacharya. Spectral Excitation Coding of
Speech at 2.4 kb/s. ICASSP, pp. 496-499, Detroit, May 1995.
Chapter 2
A Brief Review of Speech Coding Literature
There is a vast literature on information theoretic aspects of coding but, as
we point out, not much of it is directly relevant in the context of speech coding. Three
major speech coding techniques are also reviewed here, and an attempt is made to
identify their shortcomings in order to obtain pointers to a successful design of a low
bit rate speech coder.
2.1 Source Coding and Rate Distortion Theory
The main concern of source coding theory is how best to map source symbols to
channel symbols assuming a perfect channel. This involves assigning channel symbols
to source symbols such that the average symbol length is minimum. Consider a
discrete memoryless source with symbols {x_1, x_2, ..., x_M} and corresponding symbol
probabilities {P(x_1), P(x_2), ..., P(x_M)}. The entropy (average information per symbol)
of this source is given by

H(x) = -\sum_{i=1}^{M} P(x_i) \log P(x_i)    (2.1)

Usually, the base of the logarithm is 2 and hence entropy is measured in bits/symbol.
If the channel alphabet is binary {0, 1}, then we need a minimum average of H(x)
bits per symbol to encode this source. If the source is correlated and modelled as a
stationary, ergodic, Markov process, the entropy is lower than that given by Eq. (2.1),
and can be written (for a first order process) as

H(x) = -\sum_{j} \sum_{k} P(c_k, c_j) \log a_{kj}    (2.2)

where c_j and c_k are successive states of the Markov process, P(c_k, c_j) is the joint
probability of occurrence of the state pair (c_k, c_j), and a_{kj} is the transition probability
from state c_j to state c_k, i.e. a_{kj} = P(c_k, c_j)/P(c_j). It should be borne in mind that
an m-th order Markov process can be reduced to a first order process by considering
an m-th extension of the source alphabet, and hence Eq. (2.2) applies to all stationary,
ergodic, Markov processes. The bit rate indicated by H(x) is the minimum bit rate
required to represent x without any distortion. Often, due to system constraints, the
information needs to be represented at a lower bit rate and the data must be compressed.
This is the problem of source coding, where a set of source symbols is
to be mapped to a set of reproduction symbols with lower entropy. The rate-distortion
function R(D) gives the minimum bit rate at which information can be coded with
an average distortion of D or less. The distortion-rate function D(R) defines the
minimum distortion achievable for a given coding rate R.
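As a concrete illustration of Eqs. (2.1) and (2.2), the following minimal sketch (the function names are my own, not the thesis') estimates both the memoryless entropy and the first-order Markov entropy of a symbol sequence; for a correlated source the latter is lower, as predicted:

```python
import math
from collections import Counter

def entropy(seq):
    """H(x) = -sum_i P(x_i) log2 P(x_i) for a memoryless source (Eq. 2.1)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def markov_entropy(seq):
    """First-order Markov entropy (Eq. 2.2):
    H = -sum_{j,k} P(c_k, c_j) log2 a_kj, with a_kj = P(c_k, c_j)/P(c_j)."""
    pairs = Counter(zip(seq, seq[1:]))
    states = Counter(seq[:-1])
    n = len(seq) - 1
    h = 0.0
    for (cj, ck), c in pairs.items():
        p_joint = c / n
        a_kj = c / states[cj]          # transition probability from cj to ck
        h -= p_joint * math.log2(a_kj)
    return h

# A correlated binary source: long runs make the first-order Markov
# entropy much lower than the memoryless entropy.
seq = "0000111100001111000011110000"
print(entropy(seq))        # close to 1 bit/symbol
print(markov_entropy(seq)) # well below 1 bit/symbol
```

The gap between the two values is exactly the redundancy a predictive coder can exploit.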
Assume a discrete message source x with alphabet size M and a reproduction
alphabet y with alphabet size N. A deterministic mapping {x} -> {y} of source
symbols to reproduction symbols can be completely specified by an assignment matrix,
or by a table with entries denoted by Q(j | i) to indicate that the source symbol x_i is
mapped to the reproduction symbol y_j. The probability that the symbol y_j occurs is
given by

P(y_j) = \sum_{i=1}^{M} Q(j | i) P(x_i)    (2.3)

where Q(j | i) represents the deterministic assignment {x} -> {y}. Note that if we
require the Q(j | i) to have the normalization

\sum_{j=1}^{N} Q(j | i) = 1    (2.4)

then the function Q(j | i) behaves just like a conditional probability and, in fact,
is mathematically indistinguishable from a conditional probability even though the
process considered is deterministic. The function Q(j | i) is called a conditional
assignment function.
A single-letter distortion measure [96] is given by an M x N matrix with elements
d(i, j) which reflect the cost if symbol x_i is reproduced as symbol y_j. The average
distortion, D, over all possible source and reproduction symbols can then be written
as

D = \sum_{i=1}^{M} \sum_{j=1}^{N} P(i, j) d(i, j)    (2.5)

where P(i, j) = Q(j | i) P(x_i) is the joint probability of occurrence of source symbol x_i
and reproduction symbol y_j. When D is given a numerical value, it is called a fidelity
criterion. The primary parameters of rate distortion theory are shown in Fig. 2.1.
Figure 2.1: The primary parameters of R-D theory
Given the input probability distribution p(x) and the distortion measure, the average
distortion is a function of the conditional assignment function Q(j | i). A
conditional assignment Q(j | i) is called D-admissible if it results in an average
distortion that is upper bounded by D. We define the set of D-admissible assignments,
Q_D, as

Q_D = { Q(j | i) : D(Q) <= D }    (2.6)

where D(Q) denotes the average distortion of Eq. (2.5) resulting from the assignment Q.
The mutual information between source messages and reproduced messages is

I(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} P(i, j) \log \frac{P(i, j)}{P(x_i) P(y_j)}    (2.7)

Using the relationship

P(i, j) = Q(j | i) P(x_i)    (2.8)

and Eq. (2.3), the mutual information can be written as

I(x, y) = \sum_{i=1}^{M} \sum_{j=1}^{N} Q(j | i) P(x_i) \log \frac{Q(j | i)}{P(y_j)}    (2.9)

Thus I(x, y) is dependent on the conditional assignment function Q(j | i) and the input
probability distribution p(x).
The rate distortion function, R(D), is defined as the minimum of I(x, y) over the
set of D-admissible conditional assignments Q_D that produce an average distortion
less than or equal to D, i.e.

R(D) = \min_{Q \in Q_D} I(x, y)    (2.10)
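For a toy discrete source, the quantities entering this definition, the average distortion of Eq. (2.5) and the mutual information of Eq. (2.9), can be computed directly. The sketch below (helper names and the binary example are mine):

```python
import numpy as np

def avg_distortion(P_x, Q, d):
    """Average distortion D = sum_{i,j} Q(j|i) P(x_i) d(i,j)  (Eq. 2.5)."""
    return float(np.sum(P_x[:, None] * Q * d))

def mutual_information(P_x, Q):
    """I(x, y) = sum_{i,j} Q(j|i) P(x_i) log2( Q(j|i) / P(y_j) )  (Eq. 2.9),
    with P(y_j) = sum_i Q(j|i) P(x_i)  (Eq. 2.3)."""
    P_y = P_x @ Q
    P_joint = P_x[:, None] * Q
    mask = P_joint > 0                      # skip zero-probability pairs
    P_y_grid = np.broadcast_to(P_y, Q.shape)
    return float(np.sum(P_joint[mask] * np.log2(Q[mask] / P_y_grid[mask])))

# Toy example: binary source, binary reproduction, Hamming distortion.
P_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
Q_identity = np.eye(2)   # lossless assignment: D = 0, I = H(x) = 1 bit
print(avg_distortion(P_x, Q_identity, d), mutual_information(P_x, Q_identity))
```

Sweeping over assignments Q with D(Q) <= D and taking the minimum of I(x, y) traces out the R(D) curve for this source.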
It is evident that in order to apply rate distortion theory to determine performance
limits of speech coders, the major difficulty encountered is in defining the terms in-
volved. It is not clear how to define what constitutes the source and reproduction
alphabets, and hence it is not possible to talk about the source probability distribu-
tion, entropy of the source, or a distortion measure between source and reproduction
symbols. It is generally observed that unvoiced speech can be coded in a perceptually
accurate manner at a very low bit rate [68] and that required for voiced speech is
usually relatively high. This shows that most probably the voiced portions of speech
carry more information compared to the unvoiced portions contrary to one's first im-
pression that unvoiced segments have a high information rate because of their lack of
an obvious structure. In fact, if one attempts to compute entropy of different speech
segments using quantized PCM speech and a Markov model, it is highly probable that
the results will show unvoiced segments as the main carriers of information.
De and Kabal [30] made some attempts to apply rate distortion theory to speech
coding using cochlear models and a perceptual distortion measure called cochlear
discrimination [29]. The performances of four different speech coders - 4.8 kb/s CELP,
8 kb/s VSELP, 16 kb/s wide-band CELP, and 32 kb/s ADPCM, were studied and
compared with their rate-distortion performance limits. The results [30] showed that
the perceptual quality obtained by the 4.8 kb/s, 8 kb/s, 16 kb/s, and 32 kb/s coders
can be achieved at 1.5 kb/s, 4 kb/s, 5.4 kb/s, and 20 kb/s respectively according to
the rate distortion curve computed by them. Considering present-day research
goals, these results seem quite reasonable.
A review is presented below of three different speech coding philosophies along
with their merits and demerits.
2.2 Analysis-by-Synthesis Speech Coding
Analysis-by-synthesis, as the name implies, involves analysis and synthesis. This
further implies that these coders are parametric coders that require an analysis to
compute model parameters. Analysis-by-synthesis is a general approach in which some
or all of the model parameters are estimated by systematically searching a parameter
space for a close match between synthesized and original speech. The search is carried
out by starting with speech being synthesized using an initial set of parameters and
then changing the parameter set and resynthesizing the same segment of speech until
all points in the parameter space have been visited. The set of parameters that
produced synthetic speech closest to the original speech according to some chosen
distortion measure is transmitted to the decoder.
Essentially, the analysis-by-synthesis technique can be applied to any parametric
speech coder that satisfies (or can be constrained to satisfy, without making the
synthetic speech of unacceptable quality) the following two conditions.
1. The parameter space should be finite.
2. It should be possible to quantize the parameter space into a finite set of points.
It should be noted that since all parameters need to be quantized anyway before
transmission to achieve finite bit-rate, essentially all parametric coders can be imple-
mented in the analysis-by-synthesis fashion. The actual choice of which parameters
(if any) to estimate using analysis-by-synthesis technique depends on the ease with
which the parameter space can be searched and is determined by a complexity vs.
quality tradeoff. Also, for some parameters (e.g. synthesis filter parameters for a
source-filter model), direct computation techniques may exist obviating the need to
do a search.
It should be borne in mind that the distance computation (using a chosen distor-
tion measure) between original and synthetic speech during the search can be made
either in time domain or in a transform domain (e.g. frequency domain). The general
structure of an analysis-by-synthesis system is shown in Fig. 2.2.
Figure 2.2: A Generalized Analysis-by-Synthesis System
Although analysis-by-synthesis coders belong to a general class as defined above,
the term usually refers to the more specific class of parametric coders that employ
a linear prediction based synthesis filter in a source-filter configuration. The first
practical A-by-S system was the multi-pulse LPC (MP-LPC) [5] where the excitation
sequence was modelled as a sequence of pulses whose positions were determined in an
A-by-S manner. After the optimal positions are determined, the pulse magnitudes are
computed. In Regular Pulse Excitation (RPE-LPC) [67], the excitation sequence is a
sequence of regularly spaced pulses where the position of the first pulse and the pulse
amplitudes are encoded. The most important and popular form of A-by-S coding is
known as Code Excited Linear Prediction (CELP) coding.

Figure 2.3: Computational structure of the CELP coder
The rudimentary structure of a CELP coder has already been discussed in Chapter 1
along with a schematic diagram (Fig. 1.5). A computationally efficient structure
is obtained by pushing the perceptual weighting filter W(z) through the summation
sign, giving rise to the weighted input speech signal and the weighted short term
synthesis filter W(z)/A(z). This structure is shown in Fig. 2.3.
The computation of the synthetic vector is simplified in this model by separating
the zero-input response (ZIR) and zero-state response (ZSR) of the synthesis filter.
As shown in Fig. 2.3, only the filtered codevector ŷ(n) depends on the code vector
being filtered, while u(n) and r(n) depend only on the filter parameters. Therefore, a
target vector y(n) is calculated as y(n) = s_w(n) - u(n) - r(n), which is matched with
ŷ(n) to search for an appropriate code vector. Since ŷ(n) constitutes only the ZSR of
the synthesis filter, it can be computed as a matrix-vector multiplication of a code
vector c with a fixed (for the duration of a subframe) lower triangular Toeplitz
impulse response matrix H:

\hat{y} = H c, \qquad H = \begin{bmatrix} h(0) & 0 & \cdots & 0 \\ h(1) & h(0) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ h(N-1) & h(N-2) & \cdots & h(0) \end{bmatrix}

where h(n) is the impulse response of the weighted synthesis filter and N is the
dimension of the subframe vector. The special structure of the impulse
response matrix facilitates a low complexity computation of the filtered code vectors.
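As an illustrative sketch (not the thesis' implementation), the ZSR computation and the resulting codebook search can be written as follows; the filter coefficients and codebook used are hypothetical:

```python
import numpy as np

def impulse_response(a, N):
    """Impulse response h(0..N-1) of an all-pole weighted synthesis filter
    with denominator coefficients a = [1, a_1, ..., a_p]."""
    p = len(a) - 1
    h = np.zeros(N)
    h[0] = 1.0
    for n in range(1, N):
        for k in range(1, min(n, p) + 1):
            h[n] -= a[k] * h[n - k]
    return h

def filtered_codevector(h, c):
    """Zero-state response y_hat = H c, with H the lower triangular
    Toeplitz matrix built from the impulse response h."""
    N = len(c)
    H = np.zeros((N, N))
    for i in range(N):
        H[i, :i + 1] = h[:i + 1][::-1]   # H[i, j] = h(i - j), j <= i
    return H @ c

def celp_search(target, h, codebook):
    """Select the codevector (and optimal gain) maximizing the normalized
    correlation with the target y(n) = s_w(n) - u(n) - r(n)."""
    best_idx, best_gain, best_score = None, 0.0, -np.inf
    for idx, c in enumerate(codebook):
        y = filtered_codevector(h, c)
        corr, energy = float(target @ y), float(y @ y)
        if energy > 0.0 and corr * corr / energy > best_score:
            best_idx, best_gain, best_score = idx, corr / energy, corr * corr / energy
    return best_idx, best_gain
```

Because H is lower triangular Toeplitz, ŷ can equivalently be obtained by ordinary convolution truncated to N samples, which is what makes the search cheap in practice.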
Most of the research in CELP has been directed towards complexity reduction,
and many different techniques have been investigated to that end. Computational
complexity can be reduced by introducing some structure in the stochastic codebook,
albeit with some loss in performance due to the suboptimality introduced by the structural
constraint. Several suboptimal codebook structures have been studied in an attempt
to reduce complexity. A widely used technique is the use of sparse codebooks where
most of the elements are zeros. Sparse codebooks were first independently proposed
by Davidson and Gersho [27] and Lin [71]. Lin [71] also suggested an overlapped
codebook technique where each code vector is a subsequence derived from a longer
sequence of random numbers. Each code vector is obtained by shifting a fixed length
selection window over the longer sequence by one or more samples. Substantial savings
in computation and storage can be obtained by this technique. Sparse codebooks can
also be combined with overlapped codebooks and elements of the codevectors can be
restricted to take on only binary or ternary values. The DoD FS-1016 4.8 kb/s CELP
coder is an example where a sparse, ternary, overlapped codebook is used [19].
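A minimal sketch of such a codebook follows; the sizes and sparsity are illustrative choices loosely modelled on an FS-1016-style design, not taken from the standard itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def overlapped_ternary_codebook(size, dim, shift=2, sparsity=0.77):
    """Overlapped, sparse, ternary stochastic codebook: codevectors are
    shifted windows over one long ternary sequence (values -1, 0, +1), so
    storage is roughly size*shift samples instead of size*dim."""
    base = rng.standard_normal(size * shift + dim)
    # Keep only the largest ~(1 - sparsity) fraction of samples, sign-quantized.
    keep = np.abs(base) > np.quantile(np.abs(base), sparsity)
    ternary = np.sign(base) * keep
    return [ternary[k * shift : k * shift + dim] for k in range(size)]

cb = overlapped_ternary_codebook(size=512, dim=60)
print(len(cb), len(cb[0]))   # 512 codevectors of dimension 60
```

Filtering a shifted codevector reuses most of the previous result, which is where the computational saving of the overlapped structure comes from.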
Other structured codebooks leading to some complexity reduction are lattice codebooks
[46] and algebraic codebooks [2]. In these structures, regularly spaced arrays are
used as codebooks obviating the need to store them. Since these codevectors can
be generated in an orderly fashion, there is a predetermined correspondence between
lattice points and binary words.
A different complexity reduction technique is used in VSELP [44, 43]. The coder,
adopted as the North American standard (IS-54) for digital cellular communication,
contains two VSELP excitation codebooks, with 2^M and 2^N codevectors respectively.
These are constructed from sets of M and N basis vectors. In IS-54, both M and N
are 7, giving rise to 128 codevectors in each codebook. Gerson [43] also reported a 4.8
kb/s VSELP coder using a single excitation codebook with M = 10 basis vectors. The
following description assumes a single excitation codebook for brevity; the extension
to multiple codebooks is straightforward. Defining v_m(n) as the m-th basis vector and
u_i(n) as the i-th codevector, each from the VSELP codebook, then:

u_i(n) = \sum_{m=1}^{M} \theta_{im} v_m(n)    (2.13)

where 0 <= i <= 2^M - 1 and 0 <= n <= N - 1.
Thus, each codevector in the codebook is a linear combination of the M basis
vectors. The coefficients \theta_{im} are equal to +1 if bit m of codeword i is a 1 and
equal to -1 if the corresponding bit in the codeword is 0. This special structure of
the VSELP codebook lends itself to a fast search, as only the basis vectors need to be
filtered: the filtered codevectors are formed as sums and differences of the filtered
basis vectors, since the linear combination coefficients are restricted to either +1 or -1.
The VSELP coder also uses an adaptive codebook as in CELP (Fig. 1.5). The adaptive
codebook is a sequence of past excitation, a suitable segment of which is used to form
the current excitation along with the contribution(s) from the excitation codebook(s).
The adaptive codebook and the VSELP codebook(s) are jointly searched by searching
the adaptive codebook first and orthogonalizing the filtered basis vectors with respect
to the chosen adaptive codevector. In general this will be a highly computationally
intensive operation but is feasible for the VSELP structure.
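Eq. (2.13) and the linearity property that enables the fast search can be sketched as follows (function names are my own):

```python
import numpy as np

def vselp_codevector(i, basis):
    """u_i = sum_m theta_im v_m (Eq. 2.13); theta_im = +1 if bit m of
    codeword i is 1, else -1. basis is an (M, N) array of basis vectors."""
    M = basis.shape[0]
    theta = np.array([1.0 if (i >> m) & 1 else -1.0 for m in range(M)])
    return theta @ basis

# Fast-search property: filtering is linear, so filtering the M basis
# vectors once gives every filtered codevector as sums and differences.
rng = np.random.default_rng(0)
M, N = 3, 8
basis = rng.standard_normal((M, N))
H = np.tril(rng.standard_normal((N, N)))     # any linear (filtering) operator
i = 5
direct = H @ vselp_codevector(i, basis)      # filter the codevector
fast = vselp_codevector(i, basis @ H.T)      # combine pre-filtered basis vectors
print(np.allclose(direct, fast))             # -> True
```

Only M filtering operations are thus needed for a codebook of 2^M vectors; in addition, consecutive codewords in Gray-code order differ in one bit, so each filtered codevector can be updated from the previous one by a single add or subtract.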
There are many different ways codebooks have been structured (e.g. multi-stage
VQ, split VQ, etc.). Various structured codebook design techniques have been dis-
cussed in detail by LeBlanc and Mahmoud [70].
Figure 2.4: Schematic diagram of a Transform Coder
2.3 Transform Coding
Transform coding, as the name implies, deals with the problem in a transform domain.
Speech is first transformed into a suitable set of parameters which are quantized and
inverse transformed to obtain decoded speech. In general, it is not necessary that
the speech signal itself be transformed and quantized but a parametric description of
the signal may be obtained first and it may be useful to quantize the parameters in
a transform domain instead of using a straightforward quantization. The parametric
representation would in general be derived from an appropriate model of the speech
signal. The general block diagram of a transform coder is shown in Fig. 2.4. Blocks
T1 to Tn are the transforms and Q1 to Qn are the respective quantizers. The analysis
block may be just an identity operator. Transform coders are useful when the elements
of the input vector are highly correlated to each other and a transform can achieve
decorrelation and energy compaction such that most of the signal energy is contained
in a subset of the transform coefficients. An adaptive bit allocation technique can
then be used for efficient coding. The quantizers Q1, Q2, ..., Qn may be scalar [113]
or vector [22] quantizers.
An insight into the transform coding process can be obtained by considering the
simple case where each quantizer Q j is a scalar quantizer, and only one transform A
is applied to a block of input samples of length N. The transform equation is written
as
y = A x (2.14)
The minimization of the average distortion in a transform coder involves (i) the choice
of an optimum bit assignment rule, and (ii) the choice of an optimum transform A.
The variances of the transform coefficients are different in general, and the bit rate
R_i (bits/sample) required to quantize the coefficient y_i of variance \sigma_i^2 such that the
average mean squared distortion is upper bounded by D_i can be written as

R_i = \delta + \frac{1}{2} \log_2 \frac{\sigma_i^2}{D_i}    (2.15)

The second term in the above equation is the rate distortion bound for i.i.d. Gaussian
variables, and \delta is a correction term that takes into account the performance of
practical quantizers and any deviation from a Gaussian distribution. It is easy to
show that the optimum bit assignment for quantizing the transform coefficients for
minimum average distortion is given by

R_i = \bar{R} + \frac{1}{2} \log_2 \frac{\sigma_i^2}{\left( \prod_{j=1}^{N} \sigma_j^2 \right)^{1/N}}    (2.16)

where \bar{R} is the average bit rate in bits/sample. With an optimum bit assignment, the
average distortion can be written as

D = 2^{2\delta} \left( \prod_{j=1}^{N} \sigma_j^2 \right)^{1/N} 2^{-2\bar{R}}    (2.17)
Let R_xx and R_yy be the covariance matrices of the input signal and the transform
coefficients respectively. Then,

\det R_{yy} \le \prod_{j=1}^{N} \sigma_j^2    (2.18)

for any transform A, and

\det R_{xx} = \det R_{yy}    (2.19)

for any unitary transform A. The variances \sigma_j^2 are the diagonal elements of R_yy. We
also have

\det R_{xx} = \prod_{j=1}^{N} \lambda_j    (2.20)

where \lambda_j are the eigenvalues of R_xx. Observing equations (2.17)-(2.20), it can be seen
that minimum distortion is achieved if the variances \sigma_j^2 are equal to the eigenvalues
\lambda_j. The Karhunen-Loeve transform (KLT) has the desired property.
Assuming that the quantizer parameter \delta in Eq. (2.15) remains the same whether
time domain or transform domain samples are quantized, the coding gain of transform
coding over PCM can be written as

G_{TC} = \frac{\sigma^2}{\left( \prod_{j=1}^{N} \sigma_j^2 \right)^{1/N}}    (2.21)

For unitary transforms, the signal variance \sigma^2 is equal to the average of the variances
of the transform coefficients:

\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} \sigma_j^2    (2.22)

Thus, the gain of transform coding over PCM is the ratio of the arithmetic and
geometric means of the variances of the transform coefficients. The maximum gain is
achieved if the transform is the KLT, and G_KLT is equal to one only if all eigenvalues are
equal, i.e. if the signal process is white noise.
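Eqs. (2.21)-(2.22) can be checked numerically. The sketch below (my own, using an assumed AR(1) source) computes G_TC from the eigenvalues of R_xx, which are the KLT coefficient variances:

```python
import numpy as np

def ar1_covariance(rho, N):
    """Covariance matrix of a unit-variance AR(1) process: R[i,j] = rho^|i-j|."""
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def transform_coding_gain(R):
    """G_TC = arithmetic mean / geometric mean of the transform-coefficient
    variances (Eq. 2.21); the KLT diagonalizes R, so those variances are
    the eigenvalues of R_xx."""
    lam = np.linalg.eigvalsh(R)
    return float(np.mean(lam) / np.exp(np.mean(np.log(lam))))

print(transform_coding_gain(ar1_covariance(0.9, 16)))  # > 1 for a correlated source
# For white noise (rho = 0) the gain collapses to unity:
print(transform_coding_gain(ar1_covariance(0.0, 16)))  # -> 1.0
```

The gain grows with the block length N, approaching the inverse spectral flatness of the source.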
Comparing transform coding with predictive coding, it can be shown [28] that

G_{KLT} = \left( \prod_{j=1}^{N} G_p(j) \right)^{1/N}    (2.23)

where G_p(j) is the prediction gain of an optimal j-th order predictor. The maximum
transform coding gain G_KLT is thus the geometric mean of the predictor gains. The
predictor gains increase monotonically with the order of the predictor. Hence,
the transform coding gain is always smaller than the predictor gain if a transform
coder with a block length of N is compared with a predictive coder employing an N-th
order predictor. The asymptotic coding gain for transform coding is the same as that
for DPCM [23] and is equal to the spectral flatness measure of the given signal [59].
This means that transform coding may achieve the same degree of signal decorrelation
as linear prediction.
The speech signal is essentially non-stationary. Therefore, one needs to compute the
KLT matrix for every block of samples being coded and transmit the transform matrix
to the decoder for optimal transform coding. This is a highly expensive operation
considering the computational complexity and resulting bit rate of the coder. It has
been shown [93, 113] that the Discrete Cosine Transform (DCT) performs almost as
well as the KLT, enabling one to use a fixed transform matrix. Zelinski and Noll [113]
have shown that adapting the bit assignment to local signal statistics gives an extra
SNR gain of 4 to 6 dB compared to a fixed bit assignment transform coder.
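A sketch of a fixed-transform coder's bit allocation (Eq. (2.16)), using an orthonormal DCT in place of the signal-dependent KLT; the AR(1) parameters and block length are illustrative:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix, a fixed stand-in for the KLT."""
    n, k = np.meshgrid(np.arange(N), np.arange(N))
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return C

def bit_allocation(variances, R_avg):
    """Optimal rule R_i = R_avg + 0.5*log2(sigma_i^2 / geometric mean)
    (Eq. 2.16); negative allocations are clipped to zero in practice."""
    log_gm = np.mean(np.log2(variances))
    R = R_avg + 0.5 * (np.log2(variances) - log_gm)
    return np.maximum(R, 0.0)

# DCT-coefficient variances of an AR(1) block, then the bit assignment.
rho, N = 0.9, 8
idx = np.arange(N)
Rxx = rho ** np.abs(idx[:, None] - idx[None, :])
C = dct_matrix(N)
var = np.diag(C @ Rxx @ C.T)
bits = bit_allocation(var, R_avg=2.0)
print(np.round(bits, 2))   # more bits go to high-variance (low-frequency) coefficients
```

Recomputing `var` from short-term statistics of each block and reallocating the bits is the adaptive scheme that yields the reported 4 to 6 dB gain over a fixed assignment.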
2.4 Sinusoidal Coding
An important class of coders, generally called sinusoidal coders, has emerged in recent
years as a promising choice for bit rates below 4 kb/s. These coders use a sinusoidal
representation of speech by expressing the synthetic speech as a sum of sinusoids:

\hat{s}(n) = \sum_{m=1}^{M(n)} A_m(n) \cos \theta_m(n)    (2.24)
The best known systems based on sinusoidal coding are Sinusoidal Transform Coder
(STC), and Multi-Band Excitation (MBE) Coder. While STC uses a sinusoidal model
to synthesize both voiced and unvoiced speech, MBE uses the sinusoidal representa-
tion only for the voiced part of the speech. The unvoiced segments are synthesized in
the frequency domain for MBE while the voiced segments are synthesized in the time
domain using a sinusoidal model as in Eq. (2.24). The major difference between the
two techniques is the computation of the harmonic frequencies and their magnitudes.
In MBE, the frequencies and magnitudes are evaluated in a closed loop fashion in
the frequency domain as the solution to an optimization problem. The cost function
is defined as the squared error between the windowed speech spectrum and the synthetic
spectrum. If S_w(\omega) is the windowed speech spectrum, A_k are the harmonic
magnitudes, \omega_0 is the fundamental frequency, and W(\omega) is the window spectrum,
then the error to be minimized is given as

E = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| S_w(\omega) - \sum_k A_k W(\omega - k\omega_0) \right|^2 d\omega    (2.25)

Minimizing the above expression under the assumption of an orthogonal window (i.e.
\frac{1}{2\pi} \int_{-\infty}^{\infty} W^*(\omega - k\omega_0) W(\omega - l\omega_0) d\omega = 0 for k \ne l), the harmonic magnitudes
can be written as

A_k = \frac{\int S_w(\omega) W^*(\omega - k\omega_0) d\omega}{\int |W(\omega - k\omega_0)|^2 d\omega}    (2.26)

where the asterisk (*) indicates the complex conjugate. The optimal values of A_k and
\omega_0 are jointly searched for to obtain the minimum error. This procedure yields a very fine
estimate of pitch as well as spectral magnitudes.
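In the DFT domain, Eq. (2.26) amounts to correlating the speech spectrum with shifted copies of the window spectrum in each harmonic band. The sketch below is my own discretized approximation, not the MBE reference implementation:

```python
import numpy as np

def harmonic_magnitudes(s_windowed, w, omega0_bins, num_harmonics, Nfft=1024):
    """Discrete approximation of Eq. (2.26): in the band around each harmonic
    of the fundamental (omega0_bins, in DFT bins), A_k is the correlation of
    the windowed-speech spectrum with the shifted window spectrum, normalized
    by the window-spectrum energy in that band."""
    S = np.fft.fft(s_windowed, Nfft)
    W = np.fft.fft(w, Nfft)              # window spectrum, centred at bin 0
    half = int(round(omega0_bins / 2))
    A = []
    for k in range(1, num_harmonics + 1):
        center = int(round(k * omega0_bins))
        offsets = np.arange(-half, half + 1)
        Wk = W[offsets % Nfft]           # window spectrum shifted to the harmonic
        Sk = S[(center + offsets) % Nfft]
        num = np.sum(Sk * np.conj(Wk))
        den = np.sum(np.abs(Wk) ** 2)
        A.append(np.abs(num) / den)
    return np.array(A)

# A synthetic harmonic signal: the recovered magnitudes track the
# 1 : 0.5 : 0.25 amplitudes (up to the one-sided-spectrum factor of 1/2).
Nfft, L = 1024, 256
n = np.arange(L)
w = np.hanning(L)
f0 = 32.0                                # fundamental, in bins of the 1024-point DFT
s = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * n / Nfft)
        for k, a in enumerate([1.0, 0.5, 0.25]))
A = harmonic_magnitudes(w * s, w, f0, 3, Nfft)
print(np.round(A, 3))
```

Repeating this for a grid of candidate omega0 values and keeping the one with the smallest residual error of Eq. (2.25) is what gives MBE its fine joint pitch/magnitude estimate.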
In STC, on the other hand, pitch and harmonic magnitudes are estimated in an
independent fashion. This reduces the computational complexity of the algorithm but
also makes it error prone.
The other difference between the two approaches is the way the harmonic phases
are treated. From Eq. (2.24) it can be seen that the angle \theta_m(n) is a function of
time n. In general this allows for arbitrary variation in the harmonic frequencies
(being derivatives of the phases), and it is not necessary that a harmonic relationship be
maintained between them at all times. MBE allows a linear change in fundamental
frequency over an analysis frame, i.e. it allows for a piecewise linear change in
pitch with time. This implies that in the MBE model the harmonic phases change
quadratically with time. The STC approach, however, allows for a piecewise quadratic
change in pitch, thereby allowing a cubic change in the phase values.
The main difficulty in harmonic coders arises from the fact that the number of
harmonics within the band of interest varies with time, thereby requiring techniques to
deal with the problem of quantizing a variable number of parameters with a constant
number of bits (at least in the case of fixed bit rate systems). Several solutions
have emerged recently that handle the issue with adequate quantization performance
[26, 74]. A different approach to this problem has been to obtain a residual signal
with a relatively flat spectrum through LP modeling. In this case the residual signal is
modelled as a sum of sinusoids, and the problem of quantizing the harmonic magnitudes
is reduced to a simple scalar quantization [110] or can be handled with a very simple
quantizer structure [25].
2.5 Relative Merits and Demerits of Different Coding Strategies
The three coding strategies discussed above have been the major focus of research in
speech coding. The huge success of CELP lies in the closed loop nature of its struc-
ture. The analysis-by-synthesis method successfully searches the parameter space and
brings the power of vector quantization to coding excitation sequences in a source-filter
parametric coder. Although use of the perceptual weighting filter takes advantage of
the human auditory perception characteristics and a source-filter model is used, CELP
coders attempt to match original and synthetic waveforms. This is dictated by the
Euclidean distortion measure used to select the optimal excitation in the absence of
any known perceptual distortion measure. Nevertheless, the structure is very flexi-
ble in the sense that it can accommodate any meaningful distortion criterion as they
become available.
The main demerit of CELP is that it places a large emphasis on time domain
behaviour of the synthetic speech (through selection of squared error as the distortion
measure) while our ears are not very sensitive to phase information which is inherently
retained by a time domain description at the expense of many bits. This is the reason
why CELP coders do not work very well below a bit rate of about 4.8 kb/s without
seriously degrading the quality of speech.
The other two coding structures, Transform Coding and Harmonic Coding,
attempt to address the problem of obtaining good perceptual quality by modelling
speech in a perceptually meaningful way. Classical Transform Coding (ATC and
VTC) does not address the question of perception in a direct manner; instead it
focuses on finding a suitable transform that will concentrate all the energy in a specific
number of bins, and quantizes those bins using an adaptive bit-allocation procedure.
Harmonic coders have been the main focus of research in recent years because
they address the issue of perceptual significance in a direct manner. This allows them
a parsimonious parametric description of the speech sound rather than the speech
waveform. The main difficulty encountered is, of course, the lack of detailed knowledge
about what is perceptually important. That such knowledge can reduce the number
of parameters to be considered for quantization can be demonstrated with the simple
example of a single tone. Viewed as a waveform, a proper description of the signal
must include three parameters: (a) amplitude, (b) frequency, and (c) phase at a
given time instant. Perceptually, only the first two parameters suffice as a complete
description of the sound of the tone. This does not apply, however, when a large
number of tones is involved. A classical example is a single pulse. It can
be described as the sum of infinitely many tones added with very definite relative
phases, and the sound produced by this waveform is percussive. When the same
tones are added with random phases, the sound produced is like the hissing sound of
white noise. This brings us to the importance of correct phase modelling in sinusoidal
coders. Currently, phase modelling is the main problem in obtaining toll quality
speech with a low bit rate sinusoidal coder.
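The tone example can be verified numerically. This sketch (mine) builds the same magnitude spectrum with aligned and with random phases and compares the peak-to-RMS (crest) factors:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 512, 64                      # samples, number of equal-amplitude tones
n = np.arange(N)
k = np.arange(1, M + 1)[:, None]

# Aligned (zero) phases yield an impulsive, pulse-like waveform;
# random phases yield a noise-like waveform with identical magnitudes.
aligned = np.sum(np.cos(2 * np.pi * k * n / N), axis=0)
random_ph = np.sum(np.cos(2 * np.pi * k * n / N
                          + rng.uniform(0, 2 * np.pi, (M, 1))), axis=0)

def crest_factor(x):
    """Peak-to-RMS ratio: large for the percussive signal, small for the
    hiss-like one, even though both share the same magnitude spectrum."""
    return float(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

print(crest_factor(aligned), crest_factor(random_ph))
```

The aligned case peaks at M with total power M/2, so its crest factor is M/sqrt(M/2); the random-phase case is several times smaller, which is precisely the perceptual difference a sinusoidal coder loses if it discards phase carelessly.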
Chapter 3
Quantization of LPC parameters
It has already been pointed out in Chapter 1, section 1.2, that a significant number of
bits in a low bit rate speech coder goes to quantizing the LPC parameters, also called
the spectral parameters. The objective of designing a good quantizer is to minimize
the average distortion between unquantized and quantized sets of parameters with a
reasonable amount of complexity. An introductory presentation of scalar and vector
quantization is given in Appendix B.
Given many equivalent representations of spectral parameters (an introduction
to Linear Prediction and equivalent representations is given in Appendix A), and
different quantization techniques, there are several issues that need to be addressed
for designing an efficient quantizer for LPC parameters. Typically, the following
questions need to be addressed.
1. Which parameter representation should be used? Some of the popular choices
have been - (a) reflection coefficients, (b) log area ratios, (c) cepstral coefficients,
(d) arc sine coefficients, and (e) line spectral frequencies.
2. What quantization strategy to use - scalar, vector, or matrix quantization?
Should a hybrid strategy like partly scalar and partly vector quantization be
used?
3. What distortion measure should be used in designing the quantizer?
4. How should the bit allocation be done for scalar quantized parameters?
5. Should any orthogonalizing transform be applied to the parameters before they
are quantized?
6. What should be the structure of the codebook if vector quantization is used?
Should it be a full search or a tree search codebook? Should it be a trained
codebook or a stochastic codebook? Should it be a single stage or a multi-stage
codebook?
7. Should the signal be classified before quantization and different strategies be
adopted for different classes?
Generally, the answers to these questions depend on the particular scenario in
which the quantizer needs to operate and the performance requirements it needs to
satisfy. A brief review of past research in this area is presented below.
3.1 Choosing an Appropriate Spectral Representation
Over the past several years, it has become quite clear that Line Spectral Frequencies
(LSFs) are probably the best candidates for quantizing the speech spectral enve-
lope. Before the introduction of LSFs by Itakura [56] in 1975, it was widely believed
that Log Area Ratios and Reflection Coefficients are the best candidates for spectral
quantization. In particular, Viswanathan and Makhoul [107] studied many equivalent
representations, including (a) linear predictor coefficients a_i, (b) autocorrelation
coefficients of {a_i}, (c) cepstral coefficients of A(z), (d) poles of 1/A(z), and (e)
reflection coefficients k_i; and showed that a transformation of the reflection
coefficients, Log Area Ratios (LARs), was the best choice for quantization among the
parameter sets considered.
Tohkura and Itakura [101] studied the spectral sensitivities of PARCOR (reflection)
coefficients and their transforms for a 10th order linear prediction model. Specifically,
the study compared the spectral sensitivities of Reflection Coefficients
(k_i's), Log Area Ratios (LARs), and Arc Sine Parameters (ASRCs) for efficient scalar
quantization of these parameters. The study showed a monotonically decreasing
sensitivity for the k_i's, with k_1 having the highest sensitivity, twice that of k_2. The
spectral sensitivity of the first reflection coefficient was also speaker dependent whereas
those of the higher order coefficients were less dependent on the speaker. This is an
important observation and poses a problem for the robustness of trained quantizers.
Unlike the k_i's, ASRCs and LARs show much less variation in spectral sensitivity,
with lower order coefficients showing only marginally higher sensitivity than higher
order coefficients. The first order coefficient still showed a high dependence on
speakers, and female speech was seen to have a higher sensitivity for first order
parameters than male speech.
A study of the effects of preprocessing showed that speaker dependence is decreased
by using short analysis windows (about 10 ms) and by spectral smoothing through
autocorrelation windowing.
A compilation of different scalar quantization results from the literature is shown
in Table 3.1, where the quantization distortion is measured as spectral distortion in
dB, defined as

d_{SD} = \sqrt{ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 10\log_{10}|A(e^{j\omega})|^2 - 10\log_{10}|A_Q(e^{j\omega})|^2 \right]^2 d\omega }    (3.1)

where A_Q(e^{j\omega}) is the quantized form of the inverse filter A(e^{j\omega}). These results clearly
show that transformed coefficients have better quantization properties when compared
to actual reflection coefficients. For example, while a 3 dB average distortion was
achieved using 24 bits for reflection coefficients [60], a 2 dB average distortion was
obtained with 25 bits using line spectrum frequencies [34].
The main advantage of using LSFs for quantization was pointed out by Kang and
Fransen [62]. They showed that the effects of quantization error in LSFs are localized
around the respective LSFs in question, i.e., if a single LSF is disturbed, the spectral
error at far removed frequencies is practically zero. This is unlike reflection coefficients
where an error in one coefficient produces error in the entire spectrum. However, as
seen from Table 3.1, LARs and LSFs both appear to be suitable representations of
spectral parameters for quantization.
Another issue in choosing an appropriate representation of LPC parameters is
the interpolation properties of the particular representation used. In practical speech
coding systems it is not possible to have spectral information transmitted in a contin-
uous manner. Usually, the interval between two successive transmissions of spectral
information is of the order of tens of milliseconds. This means that while synthesizing
Parameter    SD (dB)                          Comments
RC [51]      3.0 (max)                        Lower bound on number of bits required
LAR [51]     3.0 (max)                        for a maximum spectral distortion
ASRC [51]    3.0 (max)
ASRC [79]    1.0 (avg)                        Minimum deviation method [49], based on
                                              experimentally derived probability densities
ASRC [17]    1.8 (avg)                        Uniform sensitivity quantization [107]
RC [60]      3.0 (avg)                        Minimum deviation method [49], open test
LSF [100]    1.0 (avg)                        Nonuniform quantization, designed to
                                              minimize mean squared error
RC [45]      Not measured                     Only voiced segments; imperceptible
                                              difference
LSFD [7]     1.5 / 1.1 / 0.8 (avg)            Bandwidth expansion is used; no
ASRC [7]     1.4 / 1.1 / 0.8 (avg)            significant difference observed between
LAR [7]      1.4 / 1.1 / 0.8 (avg)            different parameter choices
LSF [34]     3.7 / 2.7 / 2.0 / 1.3 /          Adaptive quantization with backward
             1.0 / 0.7 (avg)                  adaptation

Table 3.1: Some early scalar quantization results
speech, the spectral information needs to be interpolated for good results in order
to avoid steep changes across frame boundaries. Umezaki and Itakura [105] studied
the temporal variations of LPC parameters and found that LSFs are particularly
suitable for interpolation. Atal, Cox and Kroon [7] did a detailed study of the
interpolation properties of different LPC representations and could not find much
difference between LSFs, LARs, and ASRCs. Interpolation in all these domains always produces
a stable synthesis filter.
It is widely believed [86] that a spectral distortion of less than 1 dB (in quantizing
LPC parameters) is required to achieve perceptually transparent spectral quantiza-
tion. From Table 3.1 it is clear that at least 33 bits are necessary to achieve this goal
using scalar quantization.
3.2 Preprocessing
It was observed by Viswanathan and Makhoul [107] that the short-time spectral dynamic
range of speech signals is the single most important factor affecting quantization
properties. There are two popular techniques that address this issue in completely
different ways. Atal and Schroeder [11] describe another technique to improve
the stability of the LPC parameters that is particularly important for finite precision
arithmetic. These techniques are -
i) pre-emphasis,
ii) bandwidth expansion,
iii) high frequency compensation.
3.2.1 Pre-emphasis

Pre-emphasis reduces the spectral dynamic range by decreasing the general slope of
the spectrum [107]. This is done by passing the speech signal through a single zero
filter of the form 1 - \alpha z^{-1}. An optimal value for \alpha is obtained by solving for the
pre-emphasis filter that "whitens" the signal. This is given by the first order linear
predictor,

\alpha = \frac{r_1}{r_0},
r_1 and r_0 being the autocorrelation coefficients of the speech signal at lags 1 and 0
respectively. In practice, researchers have used values of \alpha = 0.8 to 1.0 [88, 6].
3.2.2 Bandwidth Expansion
In bandwidth expansion, reduction in spectral dynamic range is obtained by increasing
the pole bandwidths of the linear predictor [107]. Often the pitch frequency F_0 is
very close to the first formant frequency F_1, causing an underestimation of the
predictor pole bandwidths [102]. This produces extremely high sensitivity to parameter
perturbation; a slight change in parameters produces a big change in the spectral
envelope. An underestimated bandwidth also causes unnatural speech at the decoder.
Tohkura et al. [102] showed that applying bandwidth expansion before quantization
produces lower spectral distortion than expanding the bandwidth of the quantized
spectra at the decoder.
Let the i-th root of A(z) = \sum_{i=0}^{p} a_i z^{-i} be

z_i = e^{(-\pi B_i + j 2\pi F_i) T}.

Then the i-th formant frequency F_i and the corresponding bandwidth B_i are given
by [51, 102]

F_i = \frac{1}{2\pi T} \arg(z_i), \qquad B_i = -\frac{1}{\pi T} \ln |z_i|,

where T is the sampling interval. Bandwidth expansion is achieved by replacing a_i
in A(z) by a_i \gamma^i, where \gamma = e^{-\pi \Delta B T} and \Delta B is the desired bandwidth increase.
The modified polynomial becomes

A'(z) = \sum_{i=0}^{p} a_i \gamma^i z^{-i} = A(z/\gamma).

If the poles of 1/A(z) are at z_i, then the poles of 1/A'(z) are at \gamma z_i.
Hence the new bandwidth is given by

B_i' = -\frac{1}{\pi T} \ln |\gamma z_i| = B_i + \Delta B.

For a sampling rate of 8 kHz and a bandwidth expansion of \Delta B = 10 Hz, the expansion
factor is \gamma = e^{-\pi \cdot 10 / 8000} \approx 0.9961.
Common values for bandwidth expansion are 10-15 Hz [86, 7].
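In the coefficient domain, the expansion amounts to a simple geometric weighting of the predictor coefficients. A Python sketch (our own illustration under the conventions above):

```python
import math

def bandwidth_expand(a, delta_b_hz, fs_hz):
    """Replace a_i by a_i * gamma**i, with gamma = exp(-pi*dB/fs).
    This scales every pole radius |z_i| by gamma and so widens each
    formant bandwidth by delta_b_hz."""
    gamma = math.exp(-math.pi * delta_b_hz / fs_hz)
    return [ai * gamma**i for i, ai in enumerate(a)], gamma
```

Evaluating gamma for a 10 Hz expansion at 8 kHz reproduces the expansion factor of about 0.9961 quoted above.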
3.2.3 High Frequency Compensation
The technique of high frequency compensation (HFC), introduced by Atal and Schroeder
[11], compensates for the loss of high frequency components in sampled speech due to
the use of non-ideal anti-aliasing filters before sampling. The autocorrelation matrix
of the low-pass filtered speech is nearly singular (a typical value of the LINPACK
reciprocal condition estimate is 5.0 x ...). This results in a non-unique solution for the
prediction coefficients, since all practical computations are done using finite precision
arithmetic. That is, different sets of predictor coefficients can approximate the speech
spectrum equally well in the passband of the low-pass filter.
The ill-conditioning of the autocorrelation matrix can be avoided by adding to the
autocorrelation matrix another matrix proportional to the autocorrelation matrix of
high-pass filtered white noise. If the autocorrelation matrix of the speech segment
being analyzed is R, the modified autocorrelation matrix, \hat{R}, is obtained as

\hat{R} = R + \lambda \alpha_M R_n,

where \lambda is a small constant between 0.01 and 0.10, R_n is the autocorrelation matrix of
high-pass filtered noise, and \alpha_M is the minimum mean squared value of the prediction
error. High frequency compensation and bandwidth expansion by 10 Hz have been
used in all our computations of prediction parameters,
where we used the same values as Atal [11] for p_k, i.e., p_0 = 3/8, p_1 = -1/4, p_2 = 1/16,
and p_k = 0 for k > 2. The minimum residual energy \alpha_M is computed from the LPC
coefficients a_i, obtained from the uncorrected signal autocorrelation coefficients r_x(i),
as

\alpha_M = \sum_{i=0}^{l} a_i r_x(i)    (3.11)

where l, the order of this predictor, is usually smaller than the actual order, p, of the
predictors being corrected. Examples of spectral envelopes of speech computed for
\lambda = 0 (uncorrected) and \lambda = 0.05 are shown in Fig. 3.1.
Figure 3.1: Spectral envelope of speech without (solid line) and with (dash line) high frequency compensation
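In terms of autocorrelation sequences, the correction simply adds \lambda \alpha_M times the noise autocorrelation p_k to each lag. A Python sketch of this step (our own illustration, using the p_k values quoted above):

```python
def min_residual_energy(a, r):
    """alpha_M = sum_i a_i r_x(i), Eq. (3.11), for a low-order predictor a."""
    return sum(ai * r[i] for i, ai in enumerate(a))

def hf_compensate(r, alpha_m, lam=0.05):
    """Modified autocorrelation r_hat(k) = r(k) + lam*alpha_m*p_k, with
    p_0 = 3/8, p_1 = -1/4, p_2 = 1/16, and p_k = 0 for k > 2 (Atal [11])."""
    p = [3.0 / 8.0, -1.0 / 4.0, 1.0 / 16.0]
    return [rk + lam * alpha_m * (p[k] if k < len(p) else 0.0)
            for k, rk in enumerate(r)]
```

Only the first three lags are modified; higher lags pass through unchanged.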
3.3 Vector Quantization of LPC Parameters
Theoretically, vector quantization should yield better performance than scalar quan-
tization. One of the major problems in applying VQ to encode spectral parameters
is the very large codebook required to achieve a spectral distortion of less than 1 dB.
Codebooks of such sizes are impossible to implement in real time with the present
day technology. Practically, in most applications, it is very difficult to handle a sin-
gle stage codebook larger than 12 bits. Other problems related to the application of
VQ are possible talker dependence of the performance obtained with trained code-
books and sensitivity to transmission errors. Still, VQ has become the quantization
technique of choice for spectral parameters because of its efficiency at low bit rates.
In this section, we present a summary of previous results reported in the literature
on the application of VQ to spectral parameter quantization. Careful judgement is needed
while comparing different VQ simulation results because some of them are closed
test, i.e., in training, and some are open test, i.e., out of training. Another difficulty
is introduced by the fact that different researchers do not report their results in terms
of the same distortion measure and that makes a comparative analysis particularly
difficult. In the present discussion, performances are compared in terms of spectral
distortion (d_SD) expressed in dB, as defined earlier in Eq. (3.1).
Other popular distortion measures (for two unity gain spectra)
used in the literature are
1. Likelihood ratio (with equivalent representations) [60]:

   d_{LR} = \frac{\alpha}{\alpha_p} - 1, \qquad \alpha = \hat{a}^T R_p \hat{a}, \qquad \alpha_p = r_x(0) \prod_{i=1}^{p} (1 - k_i^2),

   where the hat symbol indicates a quantized variable, \alpha is the residual energy for
   the vector of quantized predictor parameters, and \alpha_p is the minimum residual
   energy obtainable using a predictor of order p. The k_i's denote reflection
   coefficients and R_p is the autocorrelation matrix of the speech signal of order p + 1.
   \alpha can also be computed as

   \alpha = \sum_{n=-p}^{p} r_a(n) r_x(n),

   where r_a(n) and r_x(n) are given by

   r_a(n) = \sum_{i=0}^{p-|n|} \hat{a}_i \hat{a}_{i+|n|}, \qquad r_x(n) = \sum_{m} x(m) x(m + |n|).
2. Modified Itakura-Saito distortion [42]:

   d_{IS} = \frac{\alpha}{\alpha_p} - \ln \frac{\alpha}{\alpha_p} - 1.
3. Squared error (Euclidean distance):

   d(x, \hat{x}) = \| x - \hat{x} \|^2 = (x - \hat{x})^T (x - \hat{x}),

   where x and \hat{x} are the unquantized and quantized parameter vectors respectively.
4. Weighted squared distortion:

   d_W(x, \hat{x}) = (x - \hat{x})^T W(x) (x - \hat{x}),

   where x and \hat{x} are the unquantized and quantized parameter vectors as before and
   W(x) is an appropriate weighting function.
It can be shown [50] that for small distortions, the spectral distortion can be
approximated from the likelihood ratio as

d_{SD} \approx 10 \log_{10} e \cdot \sqrt{2 d_{LR}}    (3.24)
       = 6.14 \sqrt{d_{LR}}.    (3.25)
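In practice, d_SD is evaluated by sampling the two log power spectra on a dense frequency grid. The following Python sketch approximates Eq. (3.1) numerically (an illustration under our own grid choice, not the thesis' implementation):

```python
import math

def _log_power(a, w):
    """10*log10 |A(e^{jw})|^2 for the polynomial A(z) = sum_i a_i z^-i."""
    re = sum(ai * math.cos(w * i) for i, ai in enumerate(a))
    im = sum(-ai * math.sin(w * i) for i, ai in enumerate(a))
    return 10.0 * math.log10(re * re + im * im)

def spectral_distortion(a, a_q, ngrid=256):
    """RMS difference, in dB, between the log spectra of 1/|A|^2 and
    1/|A_q|^2, sampled at ngrid frequencies on (0, pi)."""
    acc = 0.0
    for m in range(ngrid):
        w = math.pi * (m + 0.5) / ngrid   # midpoint grid avoids w = 0, pi
        d = _log_power(a_q, w) - _log_power(a, w)
        acc += d * d
    return math.sqrt(acc / ngrid)
```

A gain mismatch of a factor of 2 between otherwise identical filters yields a constant offset of 20 log10(2), or about 6 dB, at every frequency.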
Buzo et al. [17] attempted VQ of spectral parameters by designing a codebook for
r_a, the autocorrelation of the prediction parameters, at the encoder and a corresponding
codebook of reflection coefficients at the decoder. They obtained a spectral distortion
of 1.8 dB (closed test) with a 10-bit full search codebook. It should be borne in mind
that a closed test performance measure can be very misleading and is not the right
way to test codebook performance. Based on a similar approach but using d_LR as the
distortion measure, Juang et al. [60] obtained a spectral distortion of 2.37 dB (closed
test) and 3.35 dB (open test) with a 10-bit codebook. The difference between the
open and closed tests is rather large and indicates that one has to be very careful in
choosing an appropriate training set. Wong et al. [108] reported a spectral distortion
of 2.56 dB (closed test) using a 10-bit codebook of autocorrelation coefficients.
In assessing the advantages of vector quantization over scalar quantization, Juang
et al. [60] made the important observation that vector quantization produces smoother
error spectral transitions than scalar quantization. This comparison helps in
understanding the difference in speech quality obtained using different quantization
techniques. They concluded that since the error spectrum changes smoothly for VQ,
sustained sounds such as vowels will generally be distorted in a similar manner over
consecutive frames. From a perceptual standpoint, such consistent distortion from
vector quantization does not introduce serious extra effects such as warbles due
to frame transitions.
Other than straightforward VQ of spectral parameters various innovations have
been tried by different researchers. Some of these techniques are described below.
3.3.1 Stochastic VQ
In stochastic VQ [6, 95], the LPC parameter vector is quantized using a Gaussian
codebook by transforming the uncorrelated codebook entries into vectors having
correlations similar to those of the LPC parameter vectors. An important advantage of
using a random codebook is that it provides robust performance across different
speakers and speech recording conditions. In practice, the vector to be quantized is
transformed to a vector with uncorrelated components and quantized using a random
Gaussian codebook.
Let a vector x of dimension N be transformed into a vector u with uncorrelated
components by using an orthogonal rotation with an N x N matrix A,

u = A x.    (3.26)

For x with jointly Gaussian components, the optimal rotation A is given by a matrix
whose rows are the normalized eigenvectors of \Gamma_x, the covariance matrix of x. This
is usually referred to as the Karhunen-Loeve Transform (KLT), and it can be applied to
some extent to non-Gaussian sources [75]. The covariance matrix is given by

\Gamma_x = E[(x - \bar{x})(x - \bar{x})^T],    (3.27)

where E(.) is the expectation operator and \bar{x} = E(x). \Gamma_x can be decomposed into

\Gamma_x = V D V^T,    (3.28)

where V is a matrix whose columns are the normalized eigenvectors of \Gamma_x and D is
a diagonal matrix whose elements are the eigenvalues of \Gamma_x. Therefore, the rotated
vector u is given by

u = V^T x.    (3.29)

It can be shown that the covariance matrix of u is the diagonal matrix D, which
means that u has uncorrelated components. In order for the transformed vector u to
have unity covariance matrix and zero mean, the following transformation is used:

u = D^{-1/2} V^T (x - \bar{x}).    (3.30)
In the stochastic vector quantization method, a vector x is quantized using codevectors
chosen from a codebook of zero mean, unity variance Gaussian entries through the
transformation

\hat{x} = \bar{x} + \beta V D^{1/2} u_k.    (3.31)

The scalar \beta is introduced to allow flexibility in matching the powers of x and \hat{x}. The
mean squared error minimized during the codebook search is given by

E_k = \| x - \bar{x} - \beta V D^{1/2} u_k \|^2    (3.32)
    = \| V^T (x - \bar{x}) - \beta D^{1/2} u_k \|^2.    (3.33)

The simplification in the last step is afforded by the fact that V is unitary. The
optimum codebook gain \beta is computed from Eq. (3.33) by setting dE_k / d\beta = 0,
which, after some simplifications, becomes

\beta = \frac{u_k^T D^{1/2} V^T (x - \bar{x})}{u_k^T D u_k}.    (3.34)
Salami et al. [95] computed a long-term covariance matrix \Gamma_x from a large database
of LPC vectors and used the same covariance matrix for all speech signals. They
reported no improvement when \Gamma_x was updated every LPC analysis frame. The
stochastic VQ technique was tried with LSF difference vectors (difference between
the present unquantized and previous quantized vectors) and LAR vectors. They
reported an average d_SD of 0.8 dB for LSFD using 23 bits/vector, and an average d_SD
of 1.0 dB for LAR using 28 bits/vector.
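The codebook search of Eqs. (3.31)-(3.34) can be sketched as below in Python. We assume the KLT factors V (eigenvector matrix, stored here with V[i][j] the j-th eigenvector's i-th component) and D (eigenvalues) have been precomputed offline from the long-term covariance; the function name and data layout are our own illustration:

```python
import math

def stochastic_vq(x, xbar, V, D, codebook):
    """Pick the Gaussian codevector u_k and gain beta minimizing
    ||V^T (x - xbar) - beta * D^(1/2) u_k||^2  (Eqs. 3.32-3.34),
    then reconstruct x_hat = xbar + beta * V * D^(1/2) * u_k (Eq. 3.31)."""
    n = len(x)
    diff = [x[i] - xbar[i] for i in range(n)]
    w = [sum(V[i][j] * diff[i] for i in range(n)) for j in range(n)]  # V^T(x-xbar)
    sqrtD = [math.sqrt(d) for d in D]
    best = None
    for k, u in enumerate(codebook):
        num = sum(w[i] * sqrtD[i] * u[i] for i in range(n))
        den = sum(D[i] * u[i] * u[i] for i in range(n))
        beta = num / den if den > 0 else 0.0
        err = sum((w[i] - beta * sqrtD[i] * u[i]) ** 2 for i in range(n))
        if best is None or err < best[0]:
            best = (err, k, beta)
    _, k, beta = best
    u = codebook[k]
    xhat = [xbar[i] + beta * sum(V[i][j] * sqrtD[j] * u[j] for j in range(n))
            for i in range(n)]
    return xhat, k, beta
```

Because the optimal gain is solved in closed form per candidate, only the shape of each Gaussian codevector is searched.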
3.3.2 Techniques Exploiting Interframe Correlations
Selective encoding of sub-vectors
Papamichalis and Barnwell [88] considered a variable rate quantization of PARCOR
coefficients where parameter subvectors are transmitted depending on the change in
the analyzed parameter vector. They utilized interframe correlation of parameters
and observed that for many sounds, parameters do not change significantly from one
frame to the next. Several consecutive frames were analyzed at once and all possible
sequences of PARCOR coefficient vectors were examined before a selection was made.
They also observed that the leading coefficients were more perceptually significant
than the trailing ones and must be updated with a higher priority. Three different
distortion measures were investigated - (i) spectral distortion, (ii) mean square log
area ratio distance, and (iii) mean square inverse sine distance. In their study, mean
square LAR distortion performed better than spectral distortion. Up to a maximum
of 16 consecutive vectors were examined with a dynamic programming algorithm
in deciding which subvector should be quantized. It was noted that no significant
perceptual improvement was obtained beyond a depth of 6 stages.
Switched-Adaptive Interframe Vector Prediction
Switched-Adaptive Interframe Vector Prediction (SIVP) [111] considers the time
sequence of LPC parameter vectors as a realization of a stochastic vector process. The
correlation between successive time indexed random vectors is modelled using a first
order predictor, whereby an estimate of the n-th parameter vector is written as

\hat{x}_n = A x_{n-1},    (3.35)

where A is a p x p prediction matrix and x_{n-1} is a zero mean vector at time index
(n - 1). If the vector components have non-zero mean, the mean is subtracted
before prediction. The prediction error vector e_n is given by

e_n = x_n - \hat{x}_n = x_n - A x_{n-1}.    (3.36)

The optimum prediction matrix which minimizes the mean squared prediction error
is given by [24]

A = C_{01} C_{11}^{-1},    (3.37)

where

C_{01} = \frac{1}{N} \sum_{n} x_n x_{n-1}^T, \qquad C_{11} = \frac{1}{N} \sum_{n} x_{n-1} x_{n-1}^T,    (3.38)

E(.) is the expectation operator and N is the number of vectors in the training set.
A schematic diagram of the SIVP technique is shown in Fig. 3.2.

Figure 3.2: SIVP coding system

The classifier works on a codebook of instantaneous correlation vectors r_n. For every
input parameter vector, r_n is computed and an appropriate predictor P_i is chosen.
The index of the chosen predictor matrix is transmitted to the receiver
as side information. The error vector e_n can be quantized using scalar or vector
quantization. Young et al. [111] noted that synthetic speech almost indistinguishable
from the original could be achieved with 26 bits/vector using SIVP combined with
scalar quantization. The same speech quality was obtained with 20 bits/vector when VQ
was used following SIVP.
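The estimation of the prediction matrix A and the resulting predictor can be sketched as below, using a 2-dimensional toy case in Python so the 2x2 inverse can be written out explicitly (a real system would use, e.g., 10-dimensional LSF vectors and a switched set of predictors; the function names are ours):

```python
def sivp_predictor(frames):
    """Estimate the first-order interframe predictor A = C01 * C11^(-1)
    (Eqs. 3.37-3.38) from a training sequence of 2-D parameter vectors;
    the common 1/N factor cancels in the product."""
    def outer_sum(xs, ys):
        return [[sum(x[i] * y[j] for x, y in zip(xs, ys)) for j in range(2)]
                for i in range(2)]
    cur, prev = frames[1:], frames[:-1]
    c01 = outer_sum(cur, prev)       # sum of x_n x_{n-1}^T
    c11 = outer_sum(prev, prev)      # sum of x_{n-1} x_{n-1}^T
    det = c11[0][0] * c11[1][1] - c11[0][1] * c11[1][0]
    inv = [[ c11[1][1] / det, -c11[0][1] / det],
           [-c11[1][0] / det,  c11[0][0] / det]]
    return [[sum(c01[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def predict(A, x_prev):
    """x_hat_n = A x_{n-1}, Eq. (3.35)."""
    return [sum(A[i][j] * x_prev[j] for j in range(2)) for i in range(2)]
```

On a sequence that actually obeys a first-order vector recursion, the estimator recovers the generating matrix and the prediction error e_n vanishes.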
Tree-Searched VQ with Interblock Noiseless Coding
Tree-Searched VQ (a constrained VQ technique described in the next section) with
interblock noiseless coding (TSVQ-IBNC) [84, 90] uses a tree search VQ to exploit
the correlation between vector components (intraframe correlation) and interblock
noiseless coding to exploit the correlation between successive vectors (interframe
correlation). Phamdo and Farvardin [90] designed a tree search VQ where the encoder is
implemented by a tree-searched algorithm, as shown in Fig. 3.3 for a 3-level codebook.
Here, c_i^j is the codevector associated with the i-th node at the j-th level of the tree.
Figure 3.3: A tree-searched VQ for m = 3
Initially the encoder compares the source vector, x, with the two codevectors at
the first level of the tree and, depending on the outcome, advances to one of the
two nodes at that level. It then compares x with the two codevectors accessible
from the present node and advances to the nearest (lower distortion) node. The
process continues till the last level is reached, where the final selection of the
codevector is made. For an m-level TSVQ, the m-bit binary codeword is formed by
the path taken through the tree.

For vectors with little difference, as is the case for successive LSF vectors (because
of interframe correlation), the paths taken through the encoding tree are very similar.
In fact, the path map associated with the codevector of a frame has a sizeable prefix
in common with the path map of its predecessor. Let the length of the greatest
common prefix between two adjacent frames be represented by a random variable K.
Then K = k implies that the codeword of the present frame has k consecutive
bits (in the most significant places) in common with the codeword of the previous
frame, 0 <= k <= m. When k = m, the two codewords coincide exactly. In IBNC, the
value of k is provided to the decoder along with the remaining (suffix) bits, since the
first k bits can be obtained from the previous decoded codeword. Moreover, the
(k + 1)-th bit can be obtained by taking the complement of the (k + 1)-th bit in
the previous codeword. Hence only m - k - 1 bits are required for encoding the suffix
(except when k = m, when no bits are required).
Phamdo and Farvardin [90] used a Huffman code to encode the value of k and
a 13-level TSVQ to encode the LSF vectors. For a frame rate of 100 Hz
(10 ms period), an average spectral distortion of about 4 dB was obtained with 9.34
bits/frame; a spectral distortion of about 3.5 dB was obtained with about 11.4
bits/frame for a 22.5 ms frame period. Scalar quantization of the TSVQ-IBNC error
was done to achieve a spectral distortion of less than 1 dB; about 24.5 bits/frame
were required for 10 ms frames and about 26 bits/frame were necessary for a 22.5 ms
frame.
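The prefix bookkeeping itself is simple; what costs bits is signalling k, which is why a Huffman code over k is used. A Python sketch of the prefix/suffix logic (our own illustration, with path maps as lists of bits):

```python
def ibnc_encode(path, prev_path):
    """Interblock noiseless coding of an m-bit TSVQ path map: send k
    (length of the common prefix with the previous frame's path) plus
    the m-k-1 suffix bits after the first differing bit, which is
    implied (it is the complement of the previous frame's bit)."""
    m = len(path)
    k = 0
    while k < m and path[k] == prev_path[k]:
        k += 1
    suffix = path[k + 1:] if k < m else []
    return k, suffix

def ibnc_decode(k, suffix, prev_path, m):
    """Rebuild the current path map from k, the suffix bits, and the
    previously decoded path map."""
    if k == m:
        return list(prev_path)
    return list(prev_path[:k]) + [1 - prev_path[k]] + list(suffix)
```

When two successive frames share a long prefix, only a handful of suffix bits need to be transmitted.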
3.4 Constrained (suboptimal) VQ
Vector quantization is a very powerful quantization technique and we have already
mentioned (see Appendix B) that no quantizer can outperform a VQ. However, there are
significant computational and storage costs associated with a VQ. For a quantizer of
dimension k with resolution r bits per vector component, the codebook has

N = 2^{rk}

code vectors. The memory required to store the codebook, as well as the computation
required to search the codebook with an additive distortion measure, is proportional
to

kN = k 2^{rk},

and grows exponentially with both r and k.
In the case of quantization of LPC parameters for speech coding, using a 10-th
order LPC model and 24 bits per vector, the codebook needs to have 2^24, or roughly
16.7 million, vectors. The computational and storage complexity involved is very
large. Since this is the order of resolution required in most practical applications
of VQ to LPC quantization, it is imperative that a suboptimal solution be used.
Traditionally, the suboptimal solution is a VQ that is constrained structurally or
otherwise to reduce the computation and storage requirements to a tractable size.
There are many suboptimal VQ techniques one might consider for quantizing LPC
parameters. Only those that lead to a fixed rate coder are considered here. Variable
rate VQ techniques like pruned tree-structured VQ and entropy constrained VQ were
not considered in this work. Also, only memoryless VQ techniques are considered in
this study.
Some of the suboptimal VQ techniques are
i) tree structured VQ [17, 75, 42],

ii) classified VQ [42],

iii) product code VQ [94, 17, 42],

iv) basis vector VQ [44, 43],

v) multi-stage VQ [17, 42], and

vi) partitioned VQ (split VQ) [42, 86].
A brief description of the above techniques is presented below. A good review of
different constrained VQ techniques can be found in [42].
3.4.1 Tree Structured VQ
Tree structured VQ is a very effective way to reduce search complexity in vector
quantization, but the price is paid in terms of a large storage complexity and some
performance degradation. The search is performed in stages, and in each stage a large
number of codevectors is eliminated from the search. The structure of a tree structured
codebook is shown in Fig. 3.4.

The encoder first searches the root codebook C* and finds the minimum distortion
test vector (code vector). For a balanced m-ary tree, the resulting index i indicates
the codebook C_i to search in the next stage. The search of codebook C_i
yields the next stage index j. Assuming the next stage is the last stage (as shown
in Fig. 3.4), a search of codebook C_{i,j} produces the code vector that is the quantizer
output.
The decoder does not need to have the test vectors and is identical to a conventional
VQ. However, if a progressively better approximation is sought where the quantized
vector is desired to be updated after each stage is searched, the complete sequence
of indices may be transmitted and in this case the decoder will need to have all the
stage codebooks as well.
An m-ary tree with d stages is said to have breadth m and depth d. If the
codebook size is N = m^d, then only md distance computations are required, instead
of the m^d distance computations needed with an unstructured codebook.

Although the search complexity is quite low, the storage complexity is high compared
to an unstructured codebook. In addition to storing the m^d code vectors, the
test vectors for each stage of the tree must be stored (at least in the encoder). The
number of nodes in stage k is m^{k-1}, hence the total number of nodes is

\sum_{k=1}^{d} m^{k-1} = \frac{m^d - 1}{m - 1}.

Since each node stores m test or code vectors, the total number of vectors to be stored
is

m \cdot \frac{m^d - 1}{m - 1}.

For a binary tree, the search complexity is reduced by a factor of 2^d / (2d), while the
storage complexity is 2(2^d - 1) vectors - slightly less than double the storage required
Figure 3.4: A tree structured VQ
for an unstructured VQ of the same size. We already know that a large number (> 2^20)
of code vectors is needed to quantize LPC vectors using an unstructured codebook.
Use of a tree structured codebook would lead to a further increase in the storage
complexity of the VQ, and is not an attractive alternative to consider.
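The staged search described above can be sketched as a greedy walk down a binary tree. In the Python illustration below (our own data layout, not that of any cited coder), each internal node stores two test vectors and two children, and each leaf is a code vector:

```python
def tsvq_encode(x, tree):
    """Greedy tree-searched VQ: descend the tree, at each node taking the
    branch whose test vector is nearer to x. Internal nodes are dicts
    {"test": [t0, t1], "child": [c0, c1]}; leaves are code vectors.
    Returns the binary path map and the selected code vector."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    path, node = [], tree
    while isinstance(node, dict):
        i = 0 if dist(x, node["test"][0]) <= dist(x, node["test"][1]) else 1
        path.append(i)
        node = node["child"][i]
    return path, node

# Example: a depth-2 binary tree over 1-D "vectors" with four code vectors.
example_tree = {
    "test": [[-1.0], [1.0]],
    "child": [
        {"test": [[-1.5], [-0.5]], "child": [[-1.5], [-0.5]]},
        {"test": [[0.5], [1.5]], "child": [[0.5], [1.5]]},
    ],
}
```

Note that the greedy descent need not find the globally nearest code vector, which is the performance degradation mentioned above.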
3.4.2 Classified VQ
Classified VQ (Fig. 3.5) is similar to a two stage tree structured VQ where the first
stage of the tree structured VQ is replaced by a classifier. The classifier produces a
codebook index for the codebook to be used and a search of the selected codebook
produces a codevector index. Both of these indices are transmitted to the decoder
where the code vector representing the quantized vector is retrieved from the indicated
codebook.
Figure 3.5: Classified VQ
The classifier is designed to partition the input space according to some statistic
of the signal being quantized, and it may not always be easy to determine the best
way to design a classified VQ. The codebooks can have different sizes depending on
which class they represent and the tolerable distortion for that class.

The storage complexity of classified VQ is at best the same as that of an
unstructured VQ, since the codebooks C_1 to C_L must include the same code vectors at
least once to have the same reproduction alphabet as an unstructured VQ. Search
complexity is reduced by a factor of L only, and there are no good ways of classifying
LPC vectors into L classes when L is larger than about 3 to 6.
3.4.3 Product Code VQ
A product code VQ is a collection of quantizers each of which is applied to a feature
vector derived from the vector to be quantized. The features are defined in such a
way that the collection of all the features completely defines the vector.
Given a vector x of dimension k > 1, let f_i denote the function that extracts the
i-th feature. That is,

\phi_i = f_i(x), \quad i = 1, 2, \ldots, N_f,    (3.42)

where \phi_i is the i-th feature and N_f is the number of features being extracted. Then
it should be possible to define a reconstruction function r(.) such that

x = r(\phi_1, \phi_2, \ldots, \phi_{N_f}).    (3.43)

In product code VQ, a separate quantizer is designed for each of the \phi_i's.
Each feature vector could be easier to quantize because it takes on values in a more
compact region of k-dimensional space or has a lower dimensionality. If the features
could be defined such that they are independent of each other, the coding complexity
can be greatly reduced without any performance penalty.
The quantizers are in general dependent on each other, i.e., the reproduction value
for one feature vector depends on the reproduction values of other feature vectors. If
so desired, independent quantizers can be used for some (or all) of the feature vectors
with some degradation in performance.
Two common product codes are shape-gain VQ [94, 42] and mean-removed VQ
(or mean-residual VQ) [42]. Shape-gain VQ was first used by Sabin and Gray [94],
where the gain and spectral shape of the short term filter were jointly quantized using
one scalar and one vector quantizer. They reported an improvement in performance
compared to the gain separated VQ (another product code with independent
quantizers) introduced by Buzo et al. [17]. In mean-removed VQ, the mean of the vector
elements is subtracted from each element and the mean and the residual vector are
quantized together. This is equivalent to removing the DC bias from a signal and
making it zero-mean. Both mean-removed VQ and shape-gain VQ decompose the
input vector into one scalar and one vector component.
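A mean-removed product code can be sketched as below; this is an illustrative sketch under assumed names (the scalar quantizer levels and toy shape codebook are not from the thesis):

```python
import numpy as np

def mean_removed_encode(x, mean_levels, shape_codebook):
    # Product code: a scalar quantizer for the vector mean and a
    # vector quantizer for the zero-mean residual ("shape").
    m = x.mean()
    im = int(np.argmin((mean_levels - m) ** 2))      # scalar feature
    r = x - m                                        # zero-mean residual
    ir = int(np.argmin(np.sum((shape_codebook - r) ** 2, axis=1)))
    return im, ir

def mean_removed_decode(im, ir, mean_levels, shape_codebook):
    # Reconstruction function r(.): add the quantized mean back.
    return mean_levels[im] + shape_codebook[ir]

mean_levels = np.linspace(-1.0, 1.0, 16)             # 4-bit mean quantizer
shape_codebook = np.array([[0.5, -0.5, 0.0], [0.0, 0.3, -0.3]])
im, ir = mean_removed_encode(np.array([1.1, 0.2, 0.5]),
                             mean_levels, shape_codebook)
xhat = mean_removed_decode(im, ir, mean_levels, shape_codebook)
```

The two features (mean and shape) are quantized independently here; a jointly searched version, as in shape-gain VQ, would pick the pair minimizing the overall distortion.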
For quantizing LPC coefficients, one still needs a large codebook of vectors even
after removing the gain or the mean from the LPC vector. So, it is not expected to
lead to significant reduction in search and storage complexity to make it an attractive
technique for LPC quantization.
3.4.4 Basis Vector VQ
A basis vector VQ has a codebook where all vectors in the codebook are linear
combinations of a smaller number of basis vectors:

    c_j = Σ_{i=1}^{M} b_i^{(j)} v_i,   j = 1, ..., N,

where the v_i's are the basis vectors and the b_i^{(j)}'s are scalar coefficients.
A particularly interesting basis vector codebook is used in the VSELP coder (see
Chapter 2, page 21 for a brief description) where the linear combination coefficients
(the b_i's) are restricted to the values -1 and +1. This leads to an enormous reduction
in the search complexity of the codebook. The storage complexity is also very low as
only the basis vectors need to be stored.
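The storage saving can be illustrated with a brute-force sketch of such a codebook (illustrative only; VSELP itself uses a much faster update-based search, and the toy basis below is an assumption):

```python
import numpy as np
from itertools import product

def basis_vq_encode(x, basis):
    # All 2**M code vectors are sums sum_i b_i v_i with b_i in {-1, +1};
    # only the M basis vectors are stored.  Brute-force search sketch:
    # the point here is the storage saving, not the search itself.
    best_d, best_signs = None, None
    for signs in product((-1.0, 1.0), repeat=len(basis)):
        c = np.dot(signs, basis)            # candidate code vector
        d = float(np.sum((x - c) ** 2))
        if best_d is None or d < best_d:
            best_d, best_signs = d, signs
    return best_signs, best_d

basis = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # M = 3 -> 8 code vectors
signs, d = basis_vq_encode(np.array([1.4, -0.6]), basis)
```

Storing M basis vectors in place of 2^M code vectors is what makes the scheme attractive despite the constrained reproduction alphabet.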
Basis Vector Design
The error function minimized in a basis vector VQ can be written as

    D = Σ_n (x_n - c_n)^T (x_n - c_n),

where M is the number of basis vectors and c_n = Σ_{i=1}^{M} b_i^{(n)} v_i is the
chosen code vector. Minimizing with respect to each v_m leads to a set of coupled
linear equations, which are solved compactly by writing each code vector in terms of
a selection matrix, B_n, and a column vector of stacked basis vectors V:

    c_n = B_n V.

When this is substituted in the expression for distortion over the entire training set
and the total distortion is minimized with respect to V, the optimal basis vectors for
the given training set are obtained as

    V = ( Σ_n B_n^T B_n )^{-1} ( Σ_n B_n^T x_n ).
Performance of Basis Vector VQ

To determine the suitability of basis vector VQ for LPC quantization, a basis vector
VQ with a resolution of 10 bits/vector was designed using 20000 LAR vectors obtained
from recordings from FM radio stations.
As mentioned earlier, the coefficients b_i are restricted to take values from
the set {-1, +1} in VSELP. In general, if the coefficients are restricted to two values,
they need not be equal to -1 and +1 but could be any set of two scalars known to
the decoder. It may be pointed out here that only the ratio of the two numbers in the
set is significant. It is easy to show that the sets {a, b} and {ka, kb} are equivalent in
the sense that the constant k can be absorbed in the gain term.
The performance was measured on 3987 test vectors outside the training set, and a
minimum average spectral distortion of 3.86 dB was obtained with the coefficient set
{-0.7, 1}, compared to 3.61 dB for a full search unstructured VQ.
The performance of basis vector VQ was also measured under channel transmission
errors by simulating uniformly distributed random bit errors at error rates of 1% and
5%. The results are given in Table 3.2. The performance of basis vector VQ under
channel error conditions is seen to be very poor and it was not studied further.

    Error Probability p_e | Full Search VQ SD (dB) | Basis Vector VQ SD (dB)
    0.00                  | 3.61                   | 3.86

Table 3.2: Channel error performance of Basis Vector VQ
3.4.5 Multi-Stage VQ
Multi-Stage VQ is the main subject of study in this thesis and is discussed in detail in
Chapter 4. We just mention here that in multi-stage VQ, each reproduction vector y_i
is obtained by summing up one code vector from each stage of a multi-stage codebook:

    y_i = Σ_{k=1}^{L} c_{j_k}^{(k)},

where L is the number of stages and c_j^{(k)} is the j-th code vector from the k-th
stage codebook.
3.4.6 Partitioned VQ (Split VQ)
In partitioned VQ, the parameter vector to be quantized is partitioned into a number
of subvectors of fixed, predetermined dimensions, and each subvector is coded with
an independent VQ. Let x = [x_1, x_2, ..., x_p]^T be a parameter vector of dimension p
to be quantized using a split VQ scheme. Then x is partitioned as

    x = [x_1^T, x_2^T, ..., x_L^T]^T,

where x_i is a subvector of dimension l_i such that

    Σ_{i=1}^{L} l_i = p.

The scheme is shown in Fig. 3.6. The partitioning of the vector is equivalent to using
a product VQ where each split VQ codeword is a vector in R^{l_1} × R^{l_2} × ... × R^{l_L}.

Paliwal and Atal [86] used split VQ on LSF vectors and obtained a spectral
distortion of 1.03 dB using 24 bits/vector and a weighted Euclidean distortion measure.
The spectral distortion was 1.19 dB for the same code rate when a simple Euclidean
distortion measure was used. It is easy to show that split VQ is a particular case of
Figure 3.6: The Split VQ Scheme
Multi-Stage VQ. To demonstrate it by example, let the vector x of dimension p be
partitioned into two sub-vectors, x_1 of dimension l_1 and x_2 of dimension l_2, such
that l_1 + l_2 = p:

    x = [x_1^T, x_2^T]^T.

The vector x_1 is quantized to x̂_1 with a codebook C_1 of dimension l_1 and the vector
x_2 is quantized to x̂_2 with a codebook C_2 of dimension l_2. If c_j^1 is the chosen code
vector from codebook C_1 and c_k^2 is the code vector chosen from codebook C_2, the
quantized vector x̂ is given by

    x̂ = [ (c_j^1)^T, (c_k^2)^T ]^T.
The same reproduction alphabet can be achieved by a multi-stage codebook in
which each stage codebook is derived from the corresponding split VQ codebook by
extending each stage code-vector to dimension p by adding 0 elements at appropriate
positions. For the current example, the equivalent multi-stage codebooks can be
derived as

    C'_1 = { [ (c_i^1)^T, 0^T ]^T ; i = 1, ..., N_1 }
    C'_2 = { [ 0^T, (c_i^2)^T ]^T ; i = 1, ..., N_2 }

where 0 denotes an all-zero vector of the appropriate dimension, and N_1 and N_2 are
the number of codevectors in codebooks C_1 and C_2 respectively. Now, the quantized
vector x̂ can be written as

    x̂ = [ (c_j^1)^T, 0^T ]^T + [ 0^T, (c_k^2)^T ]^T.
This shows that partitioned VQ is merely a constrained version of multi-stage VQ
where some elements of each stage code-vector are forced to be zeros.

Although partitioned VQ is a constrained version of the already suboptimal
multi-stage VQ, an important advantage of partitioned VQ is that each codebook is
constrained into subspaces that are orthogonal to each other. This makes the
quantization error (measured in squared Euclidean distance) additive with respect to the
quantization errors from each partitioned vector:

    ||x - x̂||^2 = Σ_{i=1}^{L} ||x_i - x̂_i||^2.
This important property allows each partitioned codebook to be searched
independently, and it is guaranteed that the best reproduction vector within the reproduction
alphabet will be found irrespective of the order in which the search is made. In this
sense, the search procedure for a partitioned VQ is always optimal, since a sequential
search of all partition codebooks is equivalent to an exhaustive search of the codebook.
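The independent, additive-error search of the partitions can be sketched as follows (a minimal illustrative sketch with assumed function names and toy sub-codebooks, not the thesis's code):

```python
import numpy as np

def split_vq_encode(x, sub_codebooks):
    # Each sub-codebook is searched independently; since the partitions
    # occupy orthogonal subspaces, the squared errors simply add, and
    # searching the parts in any order yields the overall best code word.
    x = np.asarray(x, dtype=float)
    indices, err, start = [], 0.0, 0
    for cb in sub_codebooks:                     # cb is an (N_i, l_i) array
        li = cb.shape[1]
        d = np.sum((cb - x[start:start + li]) ** 2, axis=1)
        j = int(np.argmin(d))
        indices.append(j)
        err += float(d[j])                       # additive error property
        start += li
    return indices, err

cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])         # quantizes x_1 (dim 2)
cb2 = np.array([[2.0], [4.0]])                   # quantizes x_2 (dim 1)
idx, err = split_vq_encode([0.9, 1.2, 3.5], [cb1, cb2])
```

The total search cost is the sum of the sub-codebook sizes rather than their product, which is the practical appeal of the split structure.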
Since multi-stage VQ is less suboptimal than split VQ, it is expected that a lower
distortion would be obtained with a multi-stage VQ compared to a split VQ at the
same code rate.
Chapter 4
Multi-Stage VQ of LPC
Parameters
There are several forms of constrained vector quantization techniques, as mentioned
in Chapter 3, but multi-stage VQ (MSVQ) has several advantages. MSVQ is simple
to implement, and efficient search strategies can be found, as will be described here,
that reduce the search complexity appreciably without sacrificing much performance
compared to a full search, unstructured VQ.
A multi-stage VQ consists of a set of triples

    (C_i, Q_i, P_i),   i = 1, ..., L,

where L is the number of stages,

    C_i = { c_{j_i}^{(i)} ; j_i = 1, ..., N_i }

is the i-th stage codebook, Q_i is the mapping used with the i-th stage codebook, and

    P_i = { S_{j_i} ; j_i = 1, ..., N_i }

is the corresponding partition of R^n such that Q_i(x) = c_{j_i}^{(i)} if and only if
x ∈ S_{j_i}. The number of code vectors in C_i, which equals the number of cells in P_i,
is denoted by N_i. The code vectors comprising the codebook C_i and the cells
comprising the partition P_i are indexed with the subscript j_i, where j_i is a member of
the i-th index set J_i = {1, 2, ..., N_i}. In practice, each Q_i(·) is realized as a
composition of an encoder mapping E_i(·) and a decoder mapping D_i(·), viz.,
Q_i(x) = D_i(E_i(x)). The i-th encoder mapping E_i : R^n → J_i is defined as
E_i(x) = j_i if and only if x ∈ S_{j_i}.

For each source vector, the indices produced by the encoder mappings of each stage
are concatenated to form an index L-tuple

    j^L = (j_1, j_2, ..., j_L).

Each L-tuple j^L is a product code word and is an element of the Cartesian product of
the stage-wise index sets,

    j^L ∈ J_1 × J_2 × ... × J_L.

The decoder parses the received L-tuple code word and the decoder mappings
D_i : J_i → C_i recover from each stage-wise index j_i the corresponding code vector
c_{j_i}^{(i)}. The quantized representation x̂ of the input source vector x is formed by
summing up exactly one vector from each codebook,

    x̂ = Σ_{k=1}^{L} c_{j_k}^{(k)}.                        (4.1)

Here c_n^{(k)} is the n-th code vector from the k-th stage. The size of the reproduction
alphabet in an MSVQ is

    N = Π_{i=1}^{L} N_i.                                  (4.2)
Usually, the number of code vectors in each stage is an integral power of 2,

    N_i = 2^{r_i},

where r_i is the resolution of the i-th stage in bits/vector. We will indicate the structure
of an MSVQ by mentioning the resolution of each stage starting from the lowest order
stage to the highest order stage. The parameter representation used will also be
mentioned. Hence a codebook will be named as

    ⟨parameter⟩-r_1 × k_1 + r_2 × k_2 + ... + r_m × k_m,

where k_1 + k_2 + ... + k_m = L, to indicate that it has k_1 stages having 2^{r_1} code
vectors each, followed by k_2 stages having 2^{r_2} code vectors each and so on, the
total number of stages being L. If any
of the k_i's are equal to 1, it will be omitted for brevity. For example, the codebook
LSF-12+12 is a two stage codebook, each stage having a resolution of 12 bits/vector,
i.e., each stage has 4096 code vectors, and the vectors are Line Spectral Frequencies (LSF).
Similarly, the codebook LAR-4+6+2x4 is a 6-stage codebook of LAR vectors
with the first stage having 16 code vectors, the second stage having 64 code vectors and
the subsequent four stages having 4 code vectors each. The total number of bits, R,
required to quantize a vector using a given codebook can easily be found by computing
the sum indicated in the name of the codebook:

    R = r_1 × k_1 + r_2 × k_2 + ... + r_m × k_m.
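The naming convention can be decoded mechanically; the following helper is an illustrative sketch (the function name is an assumption, not the thesis's notation):

```python
def codebook_bits(name):
    # Parse a codebook name of the form <rep>-r1xk1 + r2xk2 + ...,
    # where a bare term r stands for a single stage of r bits (k = 1).
    terms = name.split('-', 1)[1].split('+')
    R = stages = 0
    for t in terms:
        r, _, k = t.partition('x')
        r, k = int(r), int(k) if k else 1
        R += r * k              # R = r1*k1 + r2*k2 + ...
        stages += k             # total number of stages L
    return R, stages
```

For the two examples in the text, LSF-12+12 gives R = 24 bits over 2 stages, and LAR-4+6+2x4 gives R = 4 + 6 + 2×4 = 18 bits over 6 stages.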
In an unstructured VQ, each reproduction vector can be placed anywhere in the
input sample space independent of other reproduction vectors, but the MSVQ is
structurally constrained and the reproduction vectors are not all independent of each
other. It is worthwhile to take a look at the MSVQ structure as it provides insight for
choosing a proper search algorithm. For a two stage MSVQ, the 2nd stage vectors are
added to each of the first stage vectors (Fig. 4.1) to form the reproduction vectors.
Hence, the pattern of reproduction vectors around each of the vectors in the first stage
is the same and the input space is filled with a repeating pattern of a set of vectors.
In other words, the higher order codebook defines a tile, and placement of the tile is
governed by the next lower order codebook. This is also evident if we write the set of
reproduction vectors Y = { y_i ; i = 1, ..., N } as

    Y = { c_j^{(1)} + c_k^{(2)} ; j = 1, ..., N_1, k = 1, ..., N_2 },

where N = N_1 N_2. In the case of codebooks with more than two stages, the codebook
for the highest stage defines the smallest tile. When these tiles are placed according to
the vectors in the next lower order codebook, a larger tile is obtained which is placed
according to the vectors of the next lower order codebook and so on. The difference
between this tiling of input space by the reconstruction vectors and traditional tiling
of space is that in case of MSVQ, the tiles can overlap each other and they do not fill
up the entire space.
Traditionally, a multi-stage VQ is searched in a sequential manner, the basic idea
being that the sum at each stage provides a closer approximation to the input vector
over the sum at the previous stage. The quantization process using a multi-stage
(a) First stage codevectors (b) Second stage codevectors
(c) The final reproduction alphabet
Figure 4.1: Structure of a two-stage two dimensional VQ
Figure 4.2: A sequentially searched multi-stage VQ
VQ with sequential search is shown in Fig. 4.2. The quantizer (C_1, Q_1, P_1) quantizes
the source vector x^1 = x, and (C_{p+1}, Q_{p+1}, P_{p+1}) quantizes the error vector
(also called the residual vector)

    x^{p+1} = x^p - Q_p(x^p)

from the preceding stage (C_p, Q_p, P_p), for 1 ≤ p < L. The vectors x^p and the
quantizer mappings Q_p(·) are related according to

    x = Σ_{p=1}^{L} Q_p(x^p) + x^{L+1},

where x^{L+1} is the error from the last stage as well as the total error from all stages.
The quantized vector, x̂, is given by

    x̂ = Σ_{p=1}^{L} Q_p(x^p).
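The sequential quantization process described above can be sketched as follows (an illustrative sketch with assumed function names and toy two-stage codebooks):

```python
import numpy as np

def msvq_sequential_encode(x, stage_codebooks):
    # Sequential search: stage p quantizes the residual x^p left by the
    # previous stages, x^{p+1} = x^p - Q_p(x^p).
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in stage_codebooks:                         # cb is an (N_i, p) array
        j = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(j)
        residual = residual - cb[j]                    # error passed on
    return indices, residual                           # residual = x^{L+1}

def msvq_decode(indices, stage_codebooks):
    # xhat is the sum of exactly one code vector from each stage.
    return sum(cb[j] for j, cb in zip(indices, stage_codebooks))

stage1 = np.array([[0.0, 0.0], [4.0, 4.0]])            # coarse placement
stage2 = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]])
idx, err = msvq_sequential_encode([3.8, 4.4], [stage1, stage2])
xhat = msvq_decode(idx, [stage1, stage2])
```

By construction, x̂ plus the final residual reproduces the input exactly, which mirrors the relation between x, the Q_p mappings, and x^{L+1} above.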
It is quite obvious that this procedure is suboptimal and that the best reconstruction
vector will not be chosen all the time. Several examples of this can be found
in Fig. 4.1.
The performance of a sequentially searched MSVQ is suboptimal due to three
reasons.
i) The codebook structure is constrained and is not flexible enough to pro-
duce an optimal set of reproduction vectors for a given input pdf (or
training set). This restriction comes solely from the structure and does
not depend on the availability of a suitable technique to compute optimal
reconstruction vectors. In other words, if a suitable technique was avail-
able to compute a set of optimum reconstruction vectors for a given input
distribution and output alphabet size, we would not, in general, be able
to design a set of MSVQ codebooks that will result in the same set of
reconstruction vectors.
ii) The sequential search algorithm is suboptimal and in many cases fails to
find the best reproduction vector for a given input vector and a multi-
stage codebook. This is particularly true if the number of stages is large
(≥ 3).
iii) The sequential design procedure that is commonly used in designing an
MSVQ is suboptimal. In other words, most of the time there exists
an MSVQ with the same alphabet size that will perform better than a
sequentially designed one for the same input distribution.
It is believed that the following requirements need to be met to achieve transparent
quantization of LPC parameters [86]:
1. the average spectral distortion should be less than 1 dB,
2. the number of outliers with spectral distortion of 2 dB or more should be less
than 2 percent,
3. there should be no outlier with spectral distortion above 4 dB.
The fact that the split VQ was the first multi-stage VQ to achieve this goal using the
smallest number of bits (24 bits) at that time clearly shows a lack of understanding of
multi-stage structures, since a less constrained multi-stage structure should perform
better than a split VQ. The main reason why split VQ outperformed a less constrained
MSVQ is the use of sequential search, which is inadequate for the MSVQ structure.
4.1 Suboptimality of Sequential Search
The suboptimality of sequential search can be readily seen in Fig. 4.3 where partitions
of the input space has been shown for a two-stage MSVQ. The first stage codevectors
¹A parameter is said to be quantized transparently when the speech produced by a coder using quantized parameters is perceptually indistinguishable from that produced using unquantized parameters.
Figure 4.3: Voronoi regions for a two-stage MSVQ
are marked with a solid square, the second stage codevectors are shown as arrows, and
the reproduction alphabet is shown with solid circles (•) and crosses (×). Reproduction
vectors with a common parent or predecessor are marked with the same symbol.
The two first stage codevectors partition the space into two regions marked R_1^{(1)}
and R_2^{(1)}, and the final reproduction vectors partition the space into regions marked
R_{ij}^{(2)}, where i, j = 1, 2. The regions corresponding to the reproduction vectors
having a common predecessor are also shown with the same shade. All regions have
been drawn assuming the nearest neighbour rule. It can be seen that for a sequential
search, all input vectors in R_1^{(1)} will be quantized to one of the reproduction vectors
in the dark shaded regions. Therefore, all input vectors in the white shaded region
of R_1^{(1)} will be mapped to a suboptimal reproduction vector. It is easy to see that all
input vectors in the white shaded regions in R_1^{(1)} and all input vectors in the dark
shaded regions of R_2^{(1)} will be quantized to a suboptimal reproduction vector. In fact
the conditions under which sequential search is optimal are quite severe, as shown in
the next section.
4.1.1 Optimality conditions for sequential search
Before stating the optimality conditions for a sequential search, we define a quantity
called the predecessor of a reproduction vector.

Definition 4.1 Let y_j be a reproduction vector for an L-stage VQ, such that

    y_j = c_{l_1}^{(1)} + c_{l_2}^{(2)} + ... + c_{l_L}^{(L)},

where l_i is the index of the codevector from the i-th codebook C^{(i)}. The k-th
predecessor, y_j^k, of y_j is defined by

    y_j^k = c_{l_1}^{(1)} + c_{l_2}^{(2)} + ... + c_{l_k}^{(k)}.

The null vector 0 can be considered the 0-th predecessor of all reproduction vectors.
It should be noted that more than one reproduction vector can have the same set of
lower order predecessors and the highest order predecessor is the reproduction vector
itself. Now we can state the optimality conditions for sequential search.
Theorem 4.1 The necessary and sufficient condition that a multi-stage codebook is
optimally searched using a sequential search procedure is that for all reproduction
vectors y_j, j = 1, ..., N, and any input vector x,

    d(x, y_j) < d(x, y_i)  ⟹  d(x, y_j^k) < d(x, y_i^k),   ∀k, ∀y_i^k ≠ y_j^k,

where y_j^k is the k-th predecessor of y_j.
Proof: We prove the sufficiency condition first. Given an L-stage MSVQ,
C = { C^{(i)} ; i = 1, ..., L }, for which

    d(x, y_j) < d(x, y_i)  ⟹  d(x, y_j^k) < d(x, y_i^k),   ∀k, ∀y_i^k ≠ y_j^k   (4.10)

for any input vector x, suppose that sequential search is not optimal.
That is, there exists a y_m such that

    d(x, y_m) < d(x, y_n),                                                        (4.11)

where y_n is the reproduction vector found through sequential search. Let the vectors
y_m and y_n have common predecessors up to stage p - 1, where p = 1 for no common
predecessor. Since y_n is found through a sequential search,

    d(x, y_n^p) ≤ d(x, y_m^p).                                                    (4.12)

But from Eq. (4.10) and Eq. (4.11),

    d(x, y_m^k) < d(x, y_n^k),   ∀k, ∀y_n^k ≠ y_m^k.

This clearly contradicts Eq. (4.12) for k = p. Hence, no such y_m exists that
satisfies Eq. (4.11); in other words, sequential search is optimal for this codebook.
This proves the sufficiency condition.
Now we prove the necessary condition. Given an L-stage MSVQ, C = { C^{(i)} ; i =
1, ..., L }, and the fact that sequential search is optimal for this codebook, let y_n be the
nearest reproduction vector for a given input vector x. That is, Q(x) = y_n, and hence

    d(x, y_n) ≤ d(x, y_j),   ∀j.

Assume that the condition of the theorem does not hold for this codebook.
Therefore, there exists a y_m such that

    d(x, y_m^k) < d(x, y_n^k)                                                     (4.17)

for some k, with y_m^k ≠ y_n^k.

From the fact that the codebook was searched sequentially, and Eq. (4.17), the
k-th predecessor of y_m will be chosen over the k-th predecessor of y_n while searching
stage k. Since the k-th predecessor y_n^k will be discarded, none of its successors can be
chosen in a sequential search. This contradicts our original assumption that y_n was
found through sequential search. This proves the necessary condition.
4.2 Search Strategy
The performance of a multi-stage VQ can be improved by using a multi-candidate
search procedure. The basic idea of this procedure is to retain multiple candidates
instead of one best candidate in the search of each stage. Two different approaches
are possible - (a) Growing Tree search, and (b) M-L Tree search [4].
In the Growing Tree search, M_1 candidates corresponding to the lowest M_1
distortions, d(x, c_{j_i}^1), i = 1, ..., M_1, are retained from searching the first stage. For each
of these vectors in the first stage, the second stage is searched and M_2 candidates are
retained from each search. Thus at the end of searching the second stage we have
M_1 × M_2 code vector pairs as possible candidates for the final reproduction vector.
The search is continued till the last stage, with M_j candidates retained from each
search of the j-th stage. After having searched the last stage, we are left with
Π_{i=1}^{L} M_i candidate reproduction vectors for the input vector. The one having the
lowest distortion among all these candidates is chosen as the final reproduction vector. This is
shown in Fig. 4.4, where each rectangle represents a codebook search and each small
circle represents a code vector retained.
In the M-L Tree search method, M_1 candidates are retained from the first stage
and the second stage is searched M_1 times, once for each candidate from the previous
stage, so that we have M_1 × N_2 distortion values for as many vector pairs searched.
Out of these M_1 × N_2 vector pairs, the M_2 vector pairs corresponding to the lowest M_2
Figure 4.4: Growing Tree search of a three stage VQ (M_1 = 5, M_2 = 3, and M_3 = 2)
distortions,

    d(x, c_{j_1}^1 + c_{j_2}^2),   i = 1, ..., M_2,

are retained for searching the next stage. For each of these M_2 vector pairs, the third
stage with N_3 code vectors is searched, giving rise to M_2 × N_3 distortion values,

    d(x, c_{j_1}^1 + c_{j_2}^2 + c_{j_3}^3).

Out of these, the M_3 vector triplets corresponding to the lowest M_3 distortion values
are retained for searching the next stage, and so on. That is, M_j vector j-tuples are
retained at the end of searching the j-th stage, and when all the L stages have been
searched, the best reproduction vector is chosen as the one with the lowest distortion
from the M_L vector L-tuples. This is shown in Fig. 4.5 where, as before, each rectangle
represents a codebook search and each small circle represents a code vector retained.
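The M-L Tree procedure just described can be sketched as follows (an illustrative sketch; the function name and the toy codebooks are assumptions for the example):

```python
import numpy as np

def msvq_ml_search(x, stage_codebooks, M):
    # M-L Tree search: after each stage keep only the M partial sums
    # (index j-tuples) with the lowest distortion, then extend each
    # survivor with every code vector of the next stage.
    x = np.asarray(x, dtype=float)
    survivors = [(float(np.dot(x, x)), np.zeros_like(x), [])]
    for cb in stage_codebooks:
        cands = []
        for _, partial, path in survivors:
            sums = partial + cb                      # all extensions
            dists = np.sum((x - sums) ** 2, axis=1)
            cands += [(float(dists[j]), sums[j], path + [j])
                      for j in range(len(cb))]
        cands.sort(key=lambda t: t[0])
        survivors = cands[:M]                        # keep the M best
    return survivors[0][2], survivors[0][0]          # best L-tuple, distortion

stage1 = np.array([[0.0, 0.0], [4.0, 4.0]])
stage2 = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]])
path, d = msvq_ml_search([3.8, 4.4], [stage1, stage2], M=2)
```

With M = 1 this degenerates to the sequential search; with M large enough it becomes exhaustive over the product codebook.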
It is not difficult to see that neither of these methods is guaranteed to find the
best possible reproduction vector for every input vector, but on average, these
methods find a better reproduction vector most of the time compared to a sequential
search. An example of failure of the Growing Tree search and the M-L Tree search
is shown in Fig. 4.6. In this figure, the vectors chosen in the Growing Tree search
method are shown with thicker lines, with the first stage code vectors in solid and the
second stage code vectors in dashed lines. Both M_1 and M_2 were equal to 2 in this
example. For the Growing Tree search, the final reproduction vector is chosen from
Figure 4.5: M-L Tree search of a three stage VQ
Figure 4.6: Failure of multi-candidate search in a 2-stage VQ (M_1 = M_2 = 2)

{y_1, y_2, y_3, y_4} and for M-L Tree search the choice is made from {y_1, y_2}. In either
case, the best reconstruction vector y_b is not found.
The Growing Tree search is guaranteed to perform no worse than a sequential
search for every input vector, because the sequential search is a special case of a
Growing Tree search in which the number of candidates retained at every stage, M_j,
is equal to 1. The M-L Tree search, on the other hand, is not guaranteed to do so,
because some paths are pruned early in the search and are eliminated from the choice.
An example with a 3-stage codebook where the M-L Tree search fails to select the
best reconstruction vector y_b is shown in Fig. 4.7. The candidates selected by the M-L
search method at each stage are shown with thick lines, and different stages of the
codebook are shown with different line styles. Note that in this example, a sequential
search would have found the best reconstruction vector.
Figure 4.7: Failure of M-L search in a 3-stage VQ (M_1 = M_2 = M_3 = 2)
4.2.1 Search Complexity

The computational complexity of the search for a reproduction vector corresponding
to a given input vector is called the search complexity. The actual search complexity
for a given search technique is a function of the distortion measure used, but as a first
step towards computing the actual search complexity, we will compute the number of
distance computations involved for both the Growing Tree and the M-L Tree search
methods. We consider an L-stage MSVQ with N_i code vectors in stage i, each code
vector having a dimension p, and M_j the number of candidates we wish to retain
at stage j.
Growing Tree Search Complexity

For the Growing Tree search, the number of candidates retained at any stage must
be less than the number of code vectors available at that stage. The first stage is
searched only once, involving a distance computation for N_1 vectors. The second
stage is searched M_1 times and the number of distance computations involved is M_1 N_2.
Thus, the total number of distance computations, N_d^G, for the entire search is given
by

    N_d^G = N_1 + M_1 N_2 + M_1 M_2 N_3 + ... = Σ_{i=0}^{L-1} ( Π_{j=0}^{i} M_j ) N_{i+1},

where M_0 = 1.
M-L Tree Search Complexity

For the M-L Tree search, the number of candidates available at each stage depends
on the number of candidates retained at the previous stage, and it is sometimes possible
to specify an M_j for which M_j candidates are not available (particularly when a single
M is specified for all stages). In that case, all the available candidates are retained
for consideration during the search of the following stage. While searching stage j, the
number of distance computations done is K_{j-1} N_j, where K_{j-1} is the actual number of
candidates retained from the previous stage. So, if M_j > K_{j-1} N_j, the actual number
of candidates retained at the j-th stage is K_j = K_{j-1} N_j. The total number of distance
computations, N_d^{ML}, for the entire search is given by

    N_d^{ML} = N_1 + K_1 N_2 + K_2 N_3 + ... + K_{L-1} N_L = Σ_{i=0}^{L-1} K_i N_{i+1}   (4.19)

where K_0 = 1 and

    K_i = min (M_i, K_{i-1} N_i).                                                         (4.20)
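The two distance-computation counts can be evaluated directly (an illustrative sketch; function names are assumptions):

```python
def growing_tree_distances(N, M):
    # N_d^G = N_1 + M_1 N_2 + M_1 M_2 N_3 + ... (M_0 = 1): the number of
    # searches of stage i+1 is the product of candidates kept so far.
    Nd, kept = 0, 1
    for Ni, Mi in zip(N, M):
        Nd += kept * Ni
        kept *= Mi
    return Nd

def ml_tree_distances(N, M):
    # Eq. (4.19)-(4.20): N_d^ML = sum K_i N_{i+1}, K_0 = 1,
    # K_i = min(M_i, K_{i-1} N_i) since at most K_{i-1} N_i tuples exist.
    Nd, K = 0, 1
    for Ni, Mi in zip(N, M):
        Nd += K * Ni
        K = min(Mi, K * Ni)
    return Nd

# Three stages of 16 code vectors, 4 candidates kept per stage:
g = growing_tree_distances([16, 16, 16], [4, 4, 4])    # 16 + 64 + 256
ml = ml_tree_distances([16, 16, 16], [4, 4, 4])        # 16 + 64 + 64
```

For this toy configuration the Growing Tree count is 336 against 144 for the M-L Tree, and the gap widens rapidly with more stages, which illustrates the exponential versus linear growth discussed next.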
It can be seen that the search complexity increases only linearly with M_j and L
for the M-L Tree search, but increases exponentially for the Growing Tree search.
Hence, the M-L Tree search is preferable in most practical applications. It should be
remembered, though, that the Growing Tree search performs better than the M-L
Tree search: for every input vector, if the M-L Tree search can find the best
reproduction vector, so can the Growing Tree search, whereas the converse is not
true. In this thesis, we explore the properties of the M-L Tree search only, since the
Growing Tree search is impractical in most cases due to its high search complexity.
We define the search complexity, C_S, on a logarithmic scale, as

    C_S = log_2 (N_d N_MAC)                                                       (4.21)

where N_d is the number of distance computations and N_MAC is the number of
arithmetic operations for a single distance computation. In counting arithmetic operations,
a multiply-accumulate is considered a single operation and separate additions and
subtractions are neglected.
The MSVQ codebook structure is sufficiently complex that no search algorithm of
lower complexity than a full search of all reproduction vectors exists today that will
always find the best reproduction vector. A full search of course defeats the purpose
of having an MSVQ in the first place, as the search complexity becomes equal to that
of an unstructured codebook with the same resolution in bits/vector as the MSVQ.
Our experiments show that M-L search very quickly approaches the performance of
a full search with relatively small values of M_i. Since there are many possible ways
the M_i's can be chosen, we have kept M constant over all stages to limit the choices
to a reasonable number. Fig. 4.8 shows the result of M-L search on LSF vectors
using an LSF-6+6 codebook. The full search performance is shown as the
Figure 4.8: Performance of LSF-6+6 MSVQ with M-L search
horizontal line. It can be seen that performance very close to full search is obtained
with M = 8.
Based on the above observations, we chose M-L Tree search for further investiga-
tion in this thesis.
4.2.2 Detailed Analysis of The Search Complexity
Weighted Mean Squared Distortion
Let y_i^k be one of the reproduction vectors selected as a candidate at the k-th stage.
That is,

    y_i^k = c_{l_1}^1 + ... + c_{l_k}^k,

where c_{l_p}^p is a selected code vector from the p-th stage. While searching the
(k + 1)-th stage, a candidate reproduction vector is formed by adding a code vector
c_j^{k+1} from the (k + 1)-th stage to the candidate vector y_i^k from the previous stage.
Thus, an approximation to the input vector while searching the (k + 1)-th stage is

    x̂^{k+1} = y_i^k + c_j^{k+1}.
The weighted mean squared distortion (WMSE) at stage k + 1 is given by

    d(x, x̂^{k+1}) = (x - y_i^k - c_j^{k+1})^T W (x - y_i^k - c_j^{k+1}),

where W is a symmetric weighting matrix. Expanding,

    d(x, x̂^{k+1}) = [ (x - y_i^k)^T - (c_j^{k+1})^T ] W [ (x - y_i^k) - c_j^{k+1} ]
                  = (u_i^k)^T W (u_i^k) - 2 (u_i^k)^T W c_j^{k+1} + (c_j^{k+1})^T W c_j^{k+1}
                  = (u_i^k)^T W (u_i^k) + (c_j^{k+1} - 2 u_i^k)^T W c_j^{k+1},

where u_i^k = x - y_i^k.

The first term is d(x, y_i^k) and is already known from computations in the previous
stage, except for the first stage. While searching the first stage, y_i^0 = 0 and u_i^0 = x,
and it takes (p^2 + p) multiply-adds to compute the first term once. The second term
requires (p^2 + p) multiply-adds for every stage. Using Eq. (4.19) for the number of
distance computations and the definition of search complexity (Eq. (4.21)), we can write

    C_S = log_2 [ (p^2 + p) ( 1 + Σ_{i=0}^{L-1} K_i N_{i+1} ) ],

where K_i is given by Eq. (4.20).
In most cases, the weighting matrix W is a diagonal matrix. In this case, a
matrix-vector multiplication requires p multiply-adds instead of p^2 multiply-adds, and
consequently the factor (p^2 + p) in the above equations gets replaced by 2p. Hence, for
a diagonal weighting matrix, the search complexity is given by

    C_S = log_2 [ 2p ( 1 + Σ_{i=0}^{L-1} K_i N_{i+1} ) ].                         (4.28)

The above estimate of complexity is correct only for a fixed weighting matrix W, but
usually, for perceptually significant distortion measures, W may be a function of the
input vector x and has to be computed once for each x. However, this is not very
significant in comparison to the computation required for the distortion computations.
We will neglect this extra computation and use Eq. (4.28) as our estimate of
search complexity.

It should be noted that the search complexity for split VQ with a weighted mean
squared distortion measure is

    C_S = log_2 ( Σ_{i=1}^{L} 2 p_i N_i ),

where p_i is the dimension of the i-th codebook.
Mean Squared Distortion

For mean squared distortion, the weighting matrix W = I, and we have

    d(x, x̂^{k+1}) = (u_i^k)^T (u_i^k) + (c_j^{k+1} - 2 u_i^k)^T c_j^{k+1}.

The first term again is known from computations for the previous stage, except for
the first stage, where it takes p multiply-adds. The second term also takes p
multiply-adds, neglecting the subtraction; if the code vector energies, (c_j^k)^T c_j^k, are
precomputed and stored along with the codebook, then there is no subtraction involved
and it takes exactly p multiply-adds for the second term. Recalling equations (4.19)
and (4.21) again, we can write the search complexity as

    C_S = log_2 [ p ( 1 + Σ_{i=0}^{L-1} K_i N_{i+1} ) ].
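The precomputed-energy trick for one stage of MSE search can be sketched as follows (illustrative sketch; the function name and toy codebook are assumptions):

```python
import numpy as np

def mse_stage_search(u, cb, cb_energy):
    # d = u^T u + (c - 2u)^T c.  The u^T u term is common to all code
    # vectors, so ranking needs only c^T c - 2 u^T c; with the energies
    # c^T c stored, each candidate costs the p multiply-adds of u^T c.
    scores = cb_energy - 2.0 * (cb @ u)
    j = int(np.argmin(scores))
    return j, float(np.dot(u, u) + scores[j])    # actual squared error

cb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
cb_energy = np.sum(cb ** 2, axis=1)              # stored with the codebook
j, d = mse_stage_search(np.array([0.8, 0.6]), cb, cb_energy)
```

Because u^T u is common to every candidate, it can even be omitted during the argmin and added back only when the winning distortion is needed.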
4.3 Codebook Design

The codebooks were designed using the generalized Lloyd algorithm as outlined in
Appendix B. For designing stage k, the training vectors were the error vectors from
the previous stages,

    x_n^{(k)} = x_n - Σ_{p=1}^{k-1} c_{j_p}^p,

where c_{j_p}^p is the codevector selected from the p-th stage while quantizing x_n.
Centroid computation for the weighted mean squared error is done as explained in the
next subsection.
4.3.1 Centroid Computation
Let x_n be the vectors in a partition whose centroid c is to be computed. The total
distortion, D, over all vectors in the partition is given by

D = Σ_n (x_n − c)^T W_n (x_n − c).     (4.34)

In most cases W_n is a symmetric matrix.
The vector c yielding the minimum value of D can be found by a variational technique,
for example by computing the derivative of Eq. (4.34) with respect to c and setting
it to zero. This gives

c = (Σ_n W_n)^{-1} (Σ_n W_n x_n).
If W_n is also diagonal, then let

Σ_n W_n = diag(a_1, a_2, ..., a_p)

and

Σ_n (W_n x_n) = [b_1, b_2, ..., b_p]^T.

Then the components of the centroid are simply c_i = b_i / a_i, i = 1, ..., p.
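For the diagonal case, the centroid reduces to a componentwise ratio of weighted sums, which can be sketched as follows (a NumPy sketch with illustrative data; the diagonal weights are stored as row vectors):

```python
import numpy as np

def wmse_centroid(vectors, weights):
    """Centroid minimizing sum_n (x_n - c)^T W_n (x_n - c) when every
    W_n is diagonal; `weights` holds the diagonals as row vectors."""
    a = np.sum(weights, axis=0)            # a_i: diagonal of sum_n W_n
    b = np.sum(weights * vectors, axis=0)  # b_i: components of sum_n W_n x_n
    return b / a                           # c_i = b_i / a_i

# With all weights equal, the centroid reduces to the ordinary mean.
x = np.array([[1.0, 2.0], [3.0, 6.0]])
print(wmse_centroid(x, np.ones_like(x)))  # -> [2. 4.]
```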
4.3.2 Outlier Weighting
One of the problems in VQ of LPC parameters is that some input vectors are poorly
represented and result in a spectral distortion much larger than the average. These
are called outliers. The outlier performance can be significantly improved by adding a
number of copies of each outlier to the training sequence. Equivalently, the weight of
the outlier may be increased during centroid computation by multiplying the weights
by (SD)^r, r > 1, and retraining the codebook. It was found that this approach does
not increase the average SD significantly. A value of r = 2 was used in designing
the codebooks. This resulted in an increase of average SD by about 0.01 dB and a
reduction of the 2-4 dB outliers by about 1% for the LSF-4x6 codebook.
4.4 Choice of Parameter Representation and Distance Measure
It has already been pointed out in the previous chapter that LSFs are the best param-
eter representation to use for vector quantization. However, LARs give performance
close to LSFs and may result in lower implementation complexity. We will investigate
both LARs and LSFs and choose one of the parameter sets for more detailed
investigation. For measuring codebook performance we will use a perceptually significant
distortion measure such as Spectral Distortion, but this is a very expensive distortion
measure to use for codebook search in implementation. For this reason, we will use
mean squared error (MSE) and weighted mean squared error (WMSE) for the code-
book search. For the M-L search procedure, the final selection of the reproduction
vector can be made using SD without incurring too much computational cost.
Various weighting matrices for LSFs [86,91,62,104] were evaluated in this work. It
was found that the weighting in [62, 104] performed slightly better than the weighting
in [86] and significantly better than the weighting described in [91]. The weighting
matrix entries are given by [62, 104]

w_i = u(f_i) g(τ_i),     (4.39)

where g(τ_i) = (τ_i/τ_max)^{1/2} for τ_i ≥ τ_crit, g(τ_i) = (τ_i/τ_crit)(τ_crit/τ_max)^{1/2} for τ_i < τ_crit, and

u(f_i) = 1,                                      f_i < f_crit,
u(f_i) = 1 − (f_i − f_crit)/(f_s/2 − f_crit),    f_crit ≤ f_i ≤ f_s/2,     (4.40)

where τ_i = τ(f_i) is the group delay of the ratio filter (Eq. (A.49)) at a frequency corresponding
to the frequency f_i of the i-th LSF, f_crit = 1000 Hz, τ_crit = 1.375 ms, τ_max = 20
ms, and the sampling rate f_s = 8000 Hz.
The weighting function (Eq. (4.39) and (4.40)) originally proposed by Kang and
Fransen [62] is derived from their study of spectral sensitivities of LSFs and perceptual
considerations. It has already been shown (Fig. A.4) that the group delay of the
ratio filter is large near spectral peaks. Kang and Fransen showed that the spectral
sensitivities of LSFs are proportional to the square root of the corresponding group
delays, hence the weighting coefficients can be written as

w_i = k [τ(f_i)]^{1/2},     (4.41)

where k is the constant of proportionality. The weighting coefficients are generally
normalized to values between 0 and 1, and the normalized coefficients would then be

ŵ_i = [τ(f_i)/τ_max]^{1/2}.     (4.42)

This weighting function (Eq. (4.42)) does not take into account our hearing sensitivity
which is high at spectral peaks and low at spectral valleys [35]. The group delay is low
for flat portions of the LPC spectrum (Fig. A.4) and particularly for an absolutely
flat spectrum where all LSFs are equally spaced, the group delay equals 11/8000
s or 1.375 ms assuming a 10th order LPC and a sampling rate of 8000 Hz. The
weighting function is assumed to be linear for group delays below 1.375 ms. The final
modification to the weighting function comes from a model, u(f_i), of our gradual loss
in hearing resolution with increasing frequency.
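As a rough illustration, the weighting just described might be computed as below. This is only a sketch: the exact roll-off of u(f) and the handling of group delays below τ_crit are assumptions here, since only the constants f_crit = 1000 Hz, τ_crit = 1.375 ms, and τ_max = 20 ms are fixed by the text.

```python
import numpy as np

F_CRIT, F_S = 1000.0, 8000.0          # Hz
TAU_CRIT, TAU_MAX = 1.375e-3, 20e-3   # s

def u(f):
    """Hearing-resolution roll-off: unity below f_crit, assumed to fall
    linearly towards the Nyquist frequency above it (the exact slope is
    an assumption of this sketch)."""
    if f < F_CRIT:
        return 1.0
    return 1.0 - (f - F_CRIT) / (F_S / 2.0 - F_CRIT)

def lsf_weight(f, tau):
    """w_i = u(f_i) g(tau_i): square-root law above tau_crit, and an
    assumed linear segment below it (continuous at tau_crit)."""
    if tau >= TAU_CRIT:
        g = np.sqrt(tau / TAU_MAX)
    else:
        g = (tau / TAU_CRIT) * np.sqrt(TAU_CRIT / TAU_MAX)
    return u(f) * g
```

A spectral peak (large group delay) at a low frequency thus receives a weight near 1, while flat, high-frequency regions are de-emphasized.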
Various weightings for LARs were attempted. Static weightings proportional to the
spectral sensitivities [101] of LARs were tried, with no consistent decrease in spectral
distortion or number of outliers.
The performance of all codes was evaluated using the root mean square spectral
distortion (SD) between x and x̂ (implemented as in [7] and [86]), given by

SD = [ (1/(n_1 − n_0 + 1)) Σ_{n=n_0}^{n_1} ( 10 log_10 ( |A_q(e^{j2πn/N})|² / |A(e^{j2πn/N})|² ) )² ]^{1/2} dB,
Figure 4.9: Performance comparison of LAR-6x4 and LSF-6x4 codebooks with M-L search
where n_0 and n_1 correspond to 125 Hz and 3.1 kHz, respectively. In practice, an
N = 256 point FFT was utilized to compute A(e^{j2πn/N}) and A_q(e^{j2πn/N}), and n_0 and
n_1 were 4 and 100, respectively.
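The SD computation just described can be sketched as follows (a NumPy sketch; `a` and `a_q` are the direct-form LPC coefficient vectors [1, a_1, ..., a_p], and the bin limits follow the text):

```python
import numpy as np

def spectral_distortion(a, a_q, n0=4, n1=100, N=256):
    """RMS log spectral distortion in dB between the LPC spectra 1/|A|^2
    and 1/|A_q|^2, evaluated on FFT bins n0..n1 (4 and 100 correspond to
    125 Hz and 3.1 kHz for N = 256 at a sampling rate of 8 kHz)."""
    A = np.fft.fft(a, N)       # A(e^{j 2 pi n / N}); a = [1, a_1, ..., a_p]
    Aq = np.fft.fft(a_q, N)
    d = 10.0 * np.log10(np.abs(Aq[n0:n1 + 1]) ** 2 /
                        np.abs(A[n0:n1 + 1]) ** 2)
    return np.sqrt(np.mean(d ** 2))

# Identical coefficient sets give zero distortion.
print(spectral_distortion([1.0, -0.9], [1.0, -0.9]))  # -> 0.0
```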
Log area ratios and line spectral frequencies were compared for different multi-
stage VQ configurations using spectral distortion and search complexity as the criteria.
The training database consisted of a total of 374,317 vectors and the test database
consisted of 121,200 vectors unless otherwise noted, all extracted from English speech.
Fig. 4.9 shows a typical result for a configuration having four stages of 6-bit codebooks.
The test vectors for this plot were a subset (FM-train) of the training set. For the
curves marked LAR-SD and LSF-SD, the final choice of the reproduction vector was
based on SD, and for the curve marked LSF-WMSE, weighted squared distortion was
used for all the choices. It can be seen from Fig. 4.9 that a system based on LSFs achieves
a spectral distortion of 1 dB at a search complexity about 4 to 8 times lower than
the LAR based system. However, in a real time system, computation of LSFs is more
difficult than the computation of LARs. In Fig. 4.9 and the subsequent figures, the
points marked on the SD versus complexity curves correspond to M = 1, 2, 4, 8, etc.
Figure 4.10: Spectral distortion of M-L tree searched MSVQ at 24 bits/vector for various configurations
The number of candidates retained at each stage were the same for all stages. It is
evident that the performance of the LSF codebook is better than the LAR codebook,
hence only LSF codebooks were used for further study.
4.5 Performance and Complexity Trade-offs
In the traditional sequentially searched MSVQ, the design has been oriented toward
the largest implementable codebooks and the smallest number of stages. For exam-
ple, a quantizer using 24 bits per vector would typically be implemented using two
12-bit codebooks each having 4096 code vectors. Increasing the number of stages for
sequentially searched codebooks leads to a quick degradation of performance. The
introduction of tree search for multi-stage VQ leads to a significant increase in per-
formance, particularly for configurations having a relatively large number of small
codebooks.
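The M-L tree search itself is compact: at each stage, every surviving candidate is extended by every codevector of the stage, and only the M best partial paths are kept. A minimal unweighted-MSE sketch (the toy codebooks and names are illustrative):

```python
import numpy as np

def ml_search(x, codebooks, M):
    """M-L tree search of a multi-stage VQ: extend every surviving
    candidate by every codevector of the current stage and keep only
    the M partial paths with the smallest squared error."""
    cands = [(x, [])]                       # (residual, index path)
    for cb in codebooks:
        nxt = [(u - c, path + [j])
               for u, path in cands
               for j, c in enumerate(cb)]
        nxt.sort(key=lambda t: float(np.dot(t[0], t[0])))
        cands = nxt[:M]                     # M best survivors
    return cands[0][1]                      # index path of the winner

# Toy two-stage example with illustrative 1-bit codebooks.
cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])
cb2 = np.array([[0.0, 0.1], [0.1, 0.0]])
print(ml_search(np.array([1.05, 1.0]), [cb1, cb2], M=2))  # -> [1, 1]
```

With M = 1 this reduces to the sequential search; with M equal to the stage codebook size times the number of survivors it approaches a full search.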
Fig. 4.10 shows spectral distortion, SD, versus search complexity, Cs, for four
multi-stage VQ configurations and one split VQ configuration using LSFs. One of the
Figure 4.11: M-L tree search performance versus search complexity for rates 22-30 bits/vector
best configurations in terms of the trade-off between complexity and performance in
Fig. 4.10 is LSF-6x4, which achieves a spectral distortion of about 1 dB at a complexity
more than 8 times lower than LSF-12x2. Moreover LSF-6x4 requires storing only 256
codevectors as compared to 8192 codevectors required by the LSF-12x2 configuration.
In all configurations shown in Fig. 4.10, there are no 4 dB outliers and the percentage
of outliers greater than 2 dB is under 1% for SD < 1 dB. The LSF-12x2 (split)
VQ used the same partitioning of the LSF vector (4 LSFs in the first part and 6 in the
second) as in [86] and obtained virtually equivalent performance to [86]. The
LSF-4x6, LSF-6x4, and LSF-8x3 codes all obtained superior performance compared
to split VQ at lower computational complexity and much lower memory requirements.
It is interesting to note that using 6 stages with only 16 codevectors in each stage (96
vectors total), imposes a structural constraint which degrades the performance less
than partitioning the vector into two sub-vectors and coding each with a 4096 level
full search code.
Fig. 4.11 shows spectral distortion versus search complexity at rates of 22-30
bits/vector. Note that the 28 bits/vector system (LSF-4x7) has very low complexity
at a spectral distortion of 1 dB and a memory requirement of only 112 code vectors.
The following conclusions can be drawn from Fig. 4.11.
1. As the number of stages increases and the number of codewords per stage decreases,
more is gained from the M-L algorithm as M is increased.

2. The 22-bit (LSF-11x2) code at M = 2 obtains virtually identical spectral distortion
compared to the 24-bit split VQ code, albeit with a larger search complexity.
Table 4.1: Different MSVQ configurations obtaining an average SD performance of approximately 1 dB. The bit rate is given in bits/vector.

                                        % Outliers
Code               Bit Rate  SD (dB)   2-4 dB   > 4 dB
LSF-11x2              22      1.04      0.67     0.00
LSF-12x2 (split)      24      1.04      0.53     0.00
LSF-6x4               24      1.04      0.47     0.00
LSF-4x6               24      1.04      0.59     0.00
LSF-2x13              26      1.03      1.49     0.01
LSF-4x7               28      1.05      0.80     0.00

The complexities and rates required to obtain near 1 dB average SD for various
codes are shown in Table 4.1. The LSF-4x7 code offers relatively low memory
and computational complexity, with the possibility of obtaining lower SD by using a
larger value of M. Virtually every MSVQ code considered obtains lower memory and
computational complexity than split VQ, as expected.

4.6 Robustness Issues

There are good intuitive reasons to believe that increasing the number of stages will
lead to improved robustness on noisy channels and across different talkers and languages.
In this section we present the results related to robustness issues.

4.6.1 Effect of Language and Input Spectral Shape

Vector quantization has often suffered from robustness problems whereby the performance
of the VQ may be quite poor on data not represented in the training sequence.
Table 4.2: Spectral Distortion Performance over Different Languages and Input Spectral Shapes: (a) German, (b) Italian, (c) Norwegian, (d) Noisy English, and (e) TIMIT-test speech data base

Table 4.3: Percentage of Outliers (2-4 dB) for Different Languages and Input Spectral Shapes: (a) German, (b) Italian, (c) Norwegian, (d) Noisy English, and (e) TIMIT-test speech data base

                                              % outliers (2-4 dB)
Code               Bits/vector   M     Cs     (a)   (b)   (c)   (d)   (e)
LSF-12x2 (split)       24        -   16.32   1.40  0.56  1.70  2.69  0.53
LSF-6x4                24       32   16.92   1.14  1.07  1.28  4.35  0.29
LSF-8+6x3              26        8   15.13   0.26  0.47  0.78  2.51  0.11
LSF-3x9                27        8   13.35   0.35  0.43  0.85  1.52  0.23

Tables 4.2 and 4.3 show the average SD performance and percentage of outliers respectively
for test sets of (a) German, (b) Italian, (c) Norwegian, (d) Noisy English,
and (e) TIMIT-test speech data bases. The foreign language database includes IRS
weighted speech which was used for testing codecs in the CCITT 16 kb/s low-delay
competition.

It can be seen that higher rate codes involving smaller codebooks at each stage and
a larger number of stages are more robust than lower rate codes using a smaller
number of relatively larger codebooks at each stage.

A plot of SD versus Cs is shown in Fig. 4.12 for the LSF-6x4 and LSF-3x9 codes.
Note that the spread in SD around the 1 dB distortion region is much smaller for the
LSF-3x9 code compared to the LSF-6x4 code.

Figure 4.12: Performance of two codes over different languages and input spectral shapes

It is apparent that robust VQ can be accomplished by adding suitable structure
to the code while impairing average performance only slightly. Possible explanations
for the improved robustness of the structured codebooks are the weak dependence
between code vectors and the training set, and the ability of structured codebooks
to produce spectra not present in the training set. These properties are particularly
important given the fact that both the training and the test set may not be representative
of outliers present in natural speech.
4.6.2 Performance in the presence of channel errors
Good performance in the presence of channel errors is critical for a robust codec.
Distortions due to channel errors can be reduced without using redundancy bits by
appropriately assigning binary code words to each vector in the reproduction alphabet.
For scalar quantization, this can be achieved by assigning a Gray coded binary number
to each of the output levels sorted by magnitude. This concept was extended by Zeger
and Gersho [112] to vector quantization and was called pseudo-Gray coding. Pseudo-Gray
coding is a locally optimal algorithm which effectively reduces the expected
distortion due to channel noise.
If c_i and c_j are the transmitted and received code vectors, the average distortion
due to channel error can be written as

D = Σ_{m=1}^{b} q_m Σ_i p(c_i) (1/C(b, m)) Σ_{c_j ∈ N_m(i)} d(c_i, c_j),

where b is the resolution of the codebook in bits/vector, q_m is the probability that
exactly m bits are in error during transmission, p(c_i) is the probability mass function
of the codevectors, C(b, m) is the number of indices at Hamming distance m, and N_m(i)
is the m-th neighbourhood of i, defined as

N_m(i) = { c_j : H(i, j) = m },

where H(i, j) is the Hamming distance between indices i and j. The total cost, F(c_i),
of a given codevector, measuring its contribution to the total distortion when it is
selected by the encoder, is defined as

F(c_i) = Σ_{m=1}^{b} q_m (1/C(b, m)) Σ_{c_j ∈ N_m(i)} d(c_i, c_j).
The total cost D of the codebook is minimized by first ordering the codevectors
according to decreasing individual cost F(c_i) and then switching the position of each
codevector in sequence with the one that yields the greatest reduction in the total
cost.
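A sketch of this reordering, restricted for brevity to single-bit index errors (m = 1) with a uniform bit-error cost, and using the squared Euclidean distance for d(c_i, c_j); the full algorithm in [112] weights every neighbourhood N_m(i) by q_m:

```python
import numpy as np

def one_bit_cost(codebook, p, b):
    """Per-codevector cost counting only single-bit index errors:
    F(c_i) ~ p(c_i) * sum over j with H(i, j) = 1 of ||c_i - c_j||^2
    (the constant q_1 factor is dropped)."""
    F = np.zeros(len(codebook))
    for i, ci in enumerate(codebook):
        for k in range(b):
            j = i ^ (1 << k)              # flip bit k: Hamming distance 1
            F[i] += p[i] * np.sum((ci - codebook[j]) ** 2)
    return F

def pseudo_gray_pass(codebook, p, b):
    """One greedy pass: visit indices in decreasing cost order and swap
    each with whichever index position most reduces the total cost."""
    cb = codebook.copy()
    order = np.argsort(-one_bit_cost(cb, p, b))
    for i in order:
        best_j, best_cost = i, one_bit_cost(cb, p, b).sum()
        for j in range(len(cb)):
            trial = cb.copy()
            trial[[i, j]] = trial[[j, i]]  # tentative position switch
            c = one_bit_cost(trial, p, b).sum()
            if c < best_cost:
                best_j, best_cost = j, c
        cb[[i, best_j]] = cb[[best_j, i]]
    return cb
```

Because a swap is committed only when it lowers the total cost, each pass is non-increasing in cost, matching the locally optimal character of the algorithm.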
The performance of multi-stage VQ with pseudo-Gray coding is studied in this
section. The scalar quantizer in DoD CELP (FS-1016) was sorted and encoded ac-
cording to Gray code while each stage of the multi-stage VQ was pseudo-Gray coded
(based on mean squared error) using the algorithm by Zeger and Gersho [112].
The error performance in terms of average spectral distortion, percentage of out-
liers within 2 and 4 dB, and percentage of outliers above 4 dB is shown in Tables 4.4
and 4.5.
The increase in the number of outliers is a much better indication of degradation in
performance in the presence of channel errors than average spectral distortion since
errors occur relatively infrequently but may cause a very large (and very audible)
spectral error. The results for the split VQ code are comparable (although slightly
better in outlier performance) to those reported by Paliwal and Atal [87], although the
scalar quantizer performance (especially at the higher error rates) was much better
Table 4.4: Average Spectral Distortion for Different Error Rates and Codes

Code                  Rate (bits/vector)
FS-1016 (scalar)             34
LSF-12x2 (split)             24
LSF-12x2                     24
LSF-3x9                      27
LSF-12+10                    22
LSF-6x4                      24
LSF-6x4 (unsorted)           24

Table 4.5: Percentages of Outliers for Different Error Rates and Codes

Code                 Rate   2-4 dB  > 4 dB   2-4 dB  > 4 dB   2-4 dB  > 4 dB   2-4 dB  > 4 dB
FS-1016 (scalar)      34     11.4    0.01     11.4    0.02     11.8    0.19     15.3    1.7
than that obtained in [87]. Possibly the scalar quantizer indices in [87] were not Gray coded.
Tables 4.4 and 4.5 show that while VQ based systems have lower average spectral
distortion and lower 2-4 dB outliers even with transmission errors, scalar quantization
may lead to lower 4 dB outliers particularly at high error rates.
Performance was only marginally better for the LSF-6x4 pseudo-Gray coded code
compared to the LSF-6x4 unsorted code. Although the first stage had inherent robustness,
since it was initialized using a splitting procedure [33], the subsequent stages
were designed randomly and had no such structure.
4.7 Improved Codebook Designs for Multi-Stage VQ
Although the codebooks reported in this chapter were designed in a sequential manner,
other design strategies exist that lead to a better codebook design. In sequential
design, the error minimized while designing each stage is not the final quantization
error but a partial reconstruction error up to that stage, since it assumes all subsequent
stages to be populated by zero vectors. A stage-(k + 1) code vector is computed as

c^{k+1} = (1/N) Σ_n u_n^k,

where the summation is carried over all N training vectors in the cell being considered
and u_n^k is the partial reconstruction error

u_n^k = x_n − Σ_{p=1}^{k} c_{j_p}^p.

Here c_{j_p}^p is the code vector selected from the p-th stage while quantizing x_n.
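Generating the training set for the next stage of a sequential design can be sketched as follows (a NumPy sketch with unweighted squared error and illustrative data):

```python
import numpy as np

def stage_training_set(X, codebooks):
    """Training vectors for the next stage of a sequentially designed
    MSVQ: partial reconstruction errors u_n^k = x_n - sum_p c_{j_p}^p
    after sequentially encoding through the stages designed so far."""
    U = X.copy()
    for cb in codebooks:
        # nearest codevector at this stage for every current residual
        d = ((U[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        U = U - cb[np.argmin(d, axis=1)]
    return U

# Each stage is then designed by running the generalized Lloyd
# algorithm on these residuals (data below is illustrative).
X = np.array([[2.0, 0.0], [4.0, 0.0]])
cb1 = np.array([[3.0, 0.0]])
print(stage_training_set(X, [cb1]))  # residuals [[-1, 0], [1, 0]]
```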
4.7.1 Iterative Sequential Design
In iterative sequential design [69], called joint design by some authors [20, 13], each
stage is designed with all other stages fixed. The error function minimized is the total
quantization error, and the stage-(k + 1) error u_n^k is now obtained as

u_n^k = x_n − Σ_{p=1, p≠k+1}^{K} c_{j_p}^p,

where K is the total number of stages.
Design of each stage is iterated till convergence, and then all stages are redesigned till
all stages satisfy a preset distortion improvement limit. The starting codebooks for
iterative sequential design are usually derived from a sequential design. The performance
improvement obtained through iterative sequential design is very small compared to
the improvement obtained through M-L search. For codebooks with many stages, the
convergence of the iterative sequential design procedure can be extremely slow as only
a single codebook is optimized after each pass over the training sequence.
4.7.2 Simultaneous Joint Design
Simultaneous joint design of multi-stage codebooks updates all code vectors in each
stage simultaneously [69]. Ideally, a full search of the codebook is done for each
training vector and the error vectors at each stage (computed from codevectors cho-
sen from previous and subsequent stages as in iterative sequential design) are used
to compute new centroids for that stage. The sequential nature of search is lost
and whereas sequentially designed codebooks have monotonically decreasing average
energy, this may not be the case after a few iterations of simultaneous joint design.
Monotonic convergence of the simultaneous joint design procedure is guaranteed if a full
search procedure is used. It has been shown experimentally [69] that convergence
is achieved even if M-L search is used instead of a full search. Since M-L search is
sequential in nature, the codebooks need to be re-ordered according to average energy
before repartitioning of the training sequence in each iteration.
4.8 Recent Developments in MSVQ
In a recent paper [13], Barnes and Frost derived necessary conditions for optimality
of full search Direct Sum Codebooks2. They also presented conditions that need to be
satisfied by an optimal sequential encoder using a direct sum codebook and noted that
it was impractical to do optimal sequential encoding in general since the complexity of
such a coder generally exceeds that of an exhaustive search of the direct sum codebook.
A design algorithm for jointly optimizing all codebook stages (similar to the iterative
sequential design mentioned earlier) was also presented. M-L search was suggested
as an attractive alternative to full search and experimental results were obtained for
memoryless Gaussian and Laplacian sources as well as second order Gauss-Markov
sources using 2-stage and 4-stage MSVQs and a squared error distortion function.
The results showed that performances for codebooks jointly optimized with M-L
search approached those for codebooks that are jointly optimized using exhaustive
search. They also noted that the memory complexity of an MSVQ for Gauss-Markov
sources was lower than that of a full search single stage VQ with equivalent perfor-
mance. Performance of MSVQ with a large number of stages was not investigated. This
work was done independently of our work as evidenced by the date of publication of
the paper.
Some recent work reported on entropy constrained MSVQs [66] shows that entropy
constrained MSVQs can obtain small improvements over fixed rate MSVQs. The
interested reader is directed to a recent publication [14] that presents a detailed review
of the state of the art in multi-stage vector quantization (not necessarily related to
quantization of LPC parameters).
2Historically, the term Multi-Stage VQ always implied a sequential search representing a successive approximation approach. More recently, the term Direct Sum Codebook is being used to acknowledge the fact that the codebooks may not be sequentially searched and the whole reproduction alphabet may be available for search at any time. It is easy to see that for an exhaustive search, the order of the codebooks is irrelevant.
4.9 Summary
The salient features of this chapter have been the following:
The suboptimality of sequential search has been made evident by deriving the
conditions under which a codebook can be optimally searched in a sequential
manner. A multi-candidate search technique has been proposed to mitigate the
problem and it has been shown that performance close to full search can be
obtained with much lower complexity.
The strength of the M-L search technique has been demonstrated by designing
codebooks with a large number of stages, which many researchers did not believe
to be possible. For a 30 bits/vector coder, the storage complexity was only 60
vectors compared to more than 2^20 vectors required in a full search codebook.
The computational complexity was extremely low as well.
A transparent quantizer at 22 bits/vector has been designed for 10th order LSF
vectors. It was the lowest bit rate transparent quantizer for linear prediction
parameters at that time.
It has been shown that multi-stage codebooks with a larger number of stages tend
to be more robust against variations in language and input spectral shape.
Multi-Stage VQs have also been demonstrated to be robust against random
channel errors.
We have shown here that MSVQ with M-L search performs better than split
VQ at lower complexity.
Chapter 5
A Low Rate Spectral Excitation
Coder
In order to test the proposed Multi-Stage Vector Quantization of LPC parameters in
a speech coding application, a low rate speech coder was developed. We have already
discussed quantization of LPC parameters in the previous chapters. To build a speech
coder, the next step is quantization of the excitation.
Past experience has shown that sinusoidal modelling performs very well in low
bit rate applications. Sinusoidal modelling of the LPC residual has several advan-
tages over sinusoidal modelling of speech itself. As pointed out in Chapter 2, a
major problem of sinusoidal coding is quantization of harmonic magnitudes. Since
the LPC residual has a relatively flat spectral envelope compared to the speech signal
(Fig. 5.1), the harmonic magnitudes for modelling the LPC residual may be quantized
very efficiently.
In this chapter we discuss the development of a sinusoidal synthesis model for the
excitation which is used along with the multi-stage VQ of LSFs in implementing a
low bit rate speech coder.
We also introduce a novel 0-bit harmonic magnitude quantization technique that
has been demonstrated to work well giving good quality synthesized speech at 1800
bps.
As will be discussed below, determination of the correct value of pitch is very
important for harmonic coders. We present a new geometric pitch determination
technique that also determines the positions of pitch pulses and can be quite useful
in pitch synchronous algorithms.

Figure 5.1: Magnitude spectrum of a voiced speech segment and corresponding LPC residual
5.1 Introduction
Sinusoidal coders attempt to synthesize speech or prediction residuals as a sum of
sinusoids produced by a bank of harmonic oscillators,
ŝ(n) = Σ_{m=1}^{M} A_m(n) cos θ_m(n),     (5.1)

where M is the number of sinusoids, and A_m(n) and θ_m(n) are the amplitude and
phase functions for the m-th sinusoid. Usually, the sinusoids considered are harmonics
of a fundamental, the pitch frequency. Different models allow birth and death of some
harmonic frequencies in time and they also allow small deviations of the frequencies
from integer multiples of the fundamental. The number of harmonics, M, is a function
of the pitch frequency ω_0 and is given by

M = ⌊π/ω_0⌋ = ⌊P/2⌋,

where P is the pitch period in number of samples.
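A minimal sketch of the oscillator-bank synthesis of Eq. (5.1), with amplitudes and phases held constant over the frame (the actual coder interpolates these parameters; that is omitted here):

```python
import numpy as np

def synthesize_frame(amps, phases, w0, L):
    """Oscillator-bank synthesis:
    s_hat(n) = sum_{m=1}^{M} A_m cos(m * w0 * n + phi_m),
    with amplitudes and phases held constant over the frame."""
    n = np.arange(L)
    s = np.zeros(L)
    for m in range(1, len(amps) + 1):
        s += amps[m - 1] * np.cos(m * w0 * n + phases[m - 1])
    return s

P = 80                          # pitch period in samples
M = P // 2                      # number of harmonics
s = synthesize_frame(np.ones(M), np.zeros(M), 2 * np.pi / P, 2 * P)
# s is periodic with period P by construction.
```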
The problem of estimating and coding the parameters of the sinusoidal model can
be broken into two subproblems:
i) estimation and coding of the harmonic magnitudes, and
ii) estimation and coding of the phase functions.
Since each harmonic is a pure sinusoid, each of the phase functions can be derived from
the corresponding frequency function by simple integration, and thus only the initial
phases are sufficient along with the frequency functions to completely specify the
phases at all times. The amplitude functions, however, are completely independent
and must be estimated and coded for proper reconstruction.
5.2 Architecture of a Very-Low Rate Spectral Ex-
citation Coder
The general conceptual structure of a spectral excitation coder is shown in Fig. 5.2.
The speech signal s(n) is filtered through an analysis filter A(z), whose coefficients are
obtained through linear prediction analysis of the signal, to generate a residual e(n).
The residual is analyzed for harmonic components, and the harmonic magnitudes,
A_m(n), the harmonic phases, φ_m(n), and the harmonic frequencies, ω_m(n), are derived.
These harmonic parameters, along with the LPC parameters a(n), are passed on to
the decoder.
The decoder reconstructs an estimate, ê(n), of the residual by summing up sinusoids
of the given magnitudes and phases. The estimated residual is then passed
through a synthesis filter whose coefficients are obtained from the LPC parameters
received by the decoder.
In a real speech coder, all parameters required to be transmitted to the decoder
need to be quantized. This leaves us with the problems of quantizing harmonic mag-
nitudes, phases, and frequencies along with quantization of LPC parameters. The
Figure 5.2: A conceptual schematic of a spectral excitation coder
MSVQ presented in Chapter 4 was used to quantize the LSFs (transformed LPC parameters),
and quantized LSFs were used to obtain the filter coefficients for both the
analysis and synthesis filters. The analysis and quantization of the sinusoidal model
parameters are presented in the following sections.
5.2.1 Treatment of Unvoiced Segments
It is obvious from our presentation so far that the sinusoidal model can suitably
represent all signals that are periodic in nature. The voiced speech signal, being
quasi-periodic in nature, is a perfect candidate for sinusoidal modelling. The unvoiced
segments have no such periodicity.
The problem is solved by considering each unvoiced segment as a single period of
a periodic signal. As a consequence, all unvoiced segments are synthesized using the
same fundamental frequency. As we will see later, synthesis is done one frame at a
time and the pitch period is set equal to the synthesis frame length for all unvoiced
segments.
5.3 Computation of the Unquantized Residual
The sinusoidal model parameters are computed from a harmonic analysis of the un-
quantized residual. This section outlines the procedure used to generate the unquan-
tized residual.
To generate the unquantized residual e(n), s(n) is passed through A(z) (Fig. 5.2),
which is a time varying filter. Usually, an assumption is made on the quasi-stationarity
of speech and the filter parameters are determined once every L samples. For a low
bit rate coder, the analysis interval typically varies from 20 ms to 40 ms. Since our
coder was targeted to operate at less than 2000 bps, we chose an analysis interval of
40 ms which corresponds to L = 320 for a sampling rate of 8000 Hz.
The analysis window needs to be small enough not to violate the assumption of
local stationarity and at the same time every speech sample needs to be included at
least in one analysis window. Usually, overlapping analysis windows are used to main-
tain smooth transition between LSFs computed for successive analysis frames. Since
the analysis frame size chosen here was already quite large (40 ms), non-overlapping
analysis windows with length equal to that of the analysis frame were chosen.
The large analysis frame length can give rise to an abrupt change in the filter
response, and therefore the filter coefficients need to be interpolated between measurement
points. It is already known [7] that LSFs are an excellent choice for interpolation
of the short term filter parameters. Ideally best results are obtained if filter parame-
ters are updated at every sample but this gives rise to a large complexity in the coder.
Subjective quality tests were done to choose an appropriate interpolation interval. It
was found that no quality difference could be perceived when the interpolation inter-
val was shortened below 2 ms. The LSFs were linearly interpolated every 16 samples
in our coder and held constant between interpolation points, as shown in Eq. (5.2):

f_i(n) = (1 − α_n) f_i^{(k−1)} + α_n f_i^{(k)},    α_n = ⌊n/I⌋ I / L,     (5.2)
where L is the analysis frame length, I is the interpolation interval, and n, 0 ≤ n ≤ L,
is an index within an analysis frame. The organization of the analysis frames, windows,
and interpolation points is shown in Fig. 5.3.
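The piecewise-constant linear interpolation described above can be sketched as follows (a sketch under the assumption that the LSFs run linearly from the previous frame's values to the current frame's across the L-sample frame):

```python
import numpy as np

def interpolate_lsfs(lsf_prev, lsf_curr, L=320, I=16):
    """Linear interpolation of the LSF vector across an L-sample frame,
    updated every I samples and held constant between update points."""
    lsf_prev, lsf_curr = np.asarray(lsf_prev), np.asarray(lsf_curr)
    out = np.empty((L, lsf_prev.size))
    for n in range(L):
        alpha = (n // I) * I / L          # piecewise-constant fraction
        out[n] = (1.0 - alpha) * lsf_prev + alpha * lsf_curr
    return out

f = interpolate_lsfs([0.2], [0.4])
# f[0..15] stay at 0.2; f[16] steps to 0.2 + 0.2 * 16/320 = 0.21
```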
Figure 5.3: Analysis of SEC parameters
5.4 Estimation and Quantization of Harmonic Parameters
The harmonic parameters are derived from an analysis of the unquantized residual.
The unquantized residual is analyzed over synthesis frames. It should be noted that
the analysis frame used to compute LSFs and the unquantized residual is not related to
the synthesis frame as quantization of the residual through sinusoidal modelling is an
independent problem. The harmonic parameters and the short term filter parameters
can be updated at different rates as long as they are updated at the same instants of
time as they were computed during analysis.
The choice of synthesis frame size depends on the rate at which harmonic param-
eters vary in speech since each harmonic parameter is estimated at synthesis frame
boundaries and is interpolated linearly within the frame. The synthesis frame length
was chosen to be a submultiple of the LSF analysis frame for convenience. Preliminary
experiments showed that reducing synthesis frame size below 10 ms (80 samples) does
not produce significant improvement in speech quality. For low pitch talkers (P ≥ 80),
this implies determination of harmonic parameters more than once per pitch period.
For the average male or female talker, an 80 sample synthesis frame implies that
harmonic parameters are estimated once every 2-3 pitch periods. In our coder, a
synthesis frame size L_s of 80 samples, corresponding to 10 ms, was chosen.
5.4.1 Pitch Estimation
Correct pitch estimation is very important in harmonic coders as the fundamental
frequency is determined by the pitch frequency. MBE employs a closed loop esti-
mation of pitch by jointly estimating the harmonic magnitudes and pitch through
minimization of a spectral estimation error function. Closed loop pitch estimation is
very expensive in terms of computation and therefore an open loop pitch estimation
is used here.
The more popular open loop pitch estimation procedures (Normalized Autocorrelation Method, SIFT) [65, 76, 77, 78] use autocorrelation of the speech or residual signal to measure similarity between the original and shifted versions of a speech/residual segment. The distance function minimized in these procedures is

$$E(\tau, \beta) = \sum_{n=0}^{N-1} \left[ s(n) - \beta s(n-\tau) \right]^2 \qquad (5.3)$$

where N is the analysis frame length, $\tau$ is the shift, and $\beta$ is a scaling factor to take into account changes in signal energy with time. The optimum value for $\beta$ is found by setting the partial derivative $\partial E(\tau, \beta)/\partial \beta$ to 0 and solving for $\beta$. This gives

$$\beta_{opt} = \frac{\sum_{n} s(n)\, s(n-\tau)}{\sum_{n} s^2(n-\tau)}.$$
Substituting the optimum $\beta$ in Eq. (5.3), the distance function to be minimized becomes

$$E(\tau) = \sum_{n} s^2(n) - \frac{\left[ \sum_{n} s(n)\, s(n-\tau) \right]^2}{\sum_{n} s^2(n-\tau)}.$$

This is equivalent to maximizing the square of the normalized autocorrelation function

$$R_n^2(\tau) = \frac{\left[ \sum_{n} s(n)\, s(n-\tau) \right]^2}{\sum_{n} s^2(n) \sum_{n} s^2(n-\tau)}.$$

Maximizing the square of the autocorrelation can give wrong results as the correlation may be negative. Since only positive correlations are of interest, the function maximized should be

$$R_n(\tau) = \frac{\sum_{n} s(n)\, s(n-\tau)}{\sqrt{\sum_{n} s^2(n) \sum_{n} s^2(n-\tau)}}.$$

$R_n(\tau)$ is computed for $P_{min} \le \tau \le P_{max}$, where $P_{min}$ and $P_{max}$ are the minimum and maximum pitch values of interest, and the value of $\tau$ that maximizes the function is the value of pitch.
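The open loop estimator above can be sketched as a direct search over candidate lags. This is a minimal illustration of maximizing $R_n(\tau)$, not the thesis implementation; the function and parameter names are ours:

```python
import numpy as np

def autocorr_pitch(s, p_min=20, p_max=140):
    """Open-loop pitch estimate: maximize the normalized autocorrelation
    R_n(tau) over the candidate lag range [p_min, p_max]. Sketch only."""
    s = np.asarray(s, dtype=float)
    best_tau, best_rn = 0, -np.inf
    for tau in range(p_min, p_max + 1):
        a, b = s[tau:], s[:-tau]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom == 0.0:
            continue
        rn = np.dot(a, b) / denom      # keep the sign: only positive correlations qualify
        if rn > best_rn:
            best_rn, best_tau = rn, tau
    return best_tau, best_rn
```

Note that, exactly as the text warns, a plain maximization like this is prone to pitch doubling: every multiple of the true period also scores highly.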
In a real implementation, given a pitch estimation window of length N,, the func-
if no sample outside the estimation window is to be used. An alternative formulation
with negative shifts gives the function to be maximized as
Autocorrelation based pitch estimators have a high computational complexity and
frequent occurrences of pitch doubling/halving. A geometric pitch detector was devel-
oped that works on peak detection on both speech and the residual signal. Subsequent
pitch intervals are marked, and one is able to obtain estimates of subsequent pitch
cycles. These pitch periods may be averaged if one desires to obtain an average pitch
period. The algorithm also maintains a track length of pitch values that fall within a
threshold around the previous pitch value. Usually, the track length increases through-
out a voiced segment and is reset to zero at the beginning of an unvoiced segment. A
default pitch value of 0 is returned for unvoiced frames. The decoder, upon receiving
a value of 0 for pitch, replaces it by the synthesis frame length (80 in our coder) for
synthesizing an aperiodic sequence.
The heart of the pitch estimator is a peak detector that detects peaks in the
speech/residual signal within a pitch estimation window. These peaks are subse-
quently examined and those that mark a pitch cycle are retained. Autocorrelation
computations are performed when peak detection fails or is deemed to be unreliable.
A tracking procedure is used to minimize the occurrence of pitch doubling/halving.
The Peak Detector
Different segments of speech/residual can have larger positive or larger negative peaks.
Usually, one voiced segment has large peaks of one polarity and the peak polarity can
change for different voiced segments (phonemes). Generally, detecting peaks that
represent pitch pulses is a non-trivial problem, but it becomes easier if one is able to
detect one pitch pulse successfully and then search around that peak. In general, the
sample with the largest magnitude within the pitch estimation window represents a
pitch pulse if the sample in question is not a boundary sample. In case the sample
with the largest magnitude is on the boundary, the search for the maximum can be
restricted to a smaller part of the window till a qualifying peak is found.
The peak detection procedure is described below where all operations are carried
out within a pitch estimation window of length $N_w$.
1. Find the maximum sample value, $s_{max}$, and the minimum sample value, $s_{min}$, for the signal s(n) and save the indices.

2. If $|s_{min}| > |s_{max}|$, multiply all samples by -1 (reverse the polarity of all samples so that the largest peak is positive) to obtain a sign corrected signal x(n). Choose the appropriate index from the two indices retained in the previous step as the index, $i_{max}$, to the largest value in the sign corrected signal.
All subsequent references to the signal will mean the sign corrected signal x(n).
3. If the maximum sample is an end sample ($i_{max} = 0$ or $N_w - 1$), then do the search again ignoring $\alpha N_w$ samples from the end where the maximum sample was found. For example, if $i_{max} = 0$, repeat the search for the maximum within samples $\alpha N_w$ to $N_w - 1$. $\alpha$ is a suitable constant less than 1. In the pitch detector implemented, $\alpha$ was 1/3. Iterate this step till a valid peak is found.
4. Once the largest peak in the signal is found, search for other peaks to the left
and right of the largest peak making sure that no successive peaks are separated
by less than $N_g$ samples, where $N_g$, called the guard period, is a function of the
previous value of average pitch, $P_{-1}$, and the number of consecutive voiced
segments, called the tracklength. If two peaks are found within $N_g$ samples,
retain the larger of the two.
5. If the next peak is separated by more than $N_g$ samples from the previously retained peak, keep it if it is larger than or exceeds $T_{low}$ times the previous peak value, where $T_{low}$ is a preset threshold for qualifying lower peaks.
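Steps 1-5 above can be sketched as follows. This is a simplified illustration, not the implemented detector: the leftward scan from the main peak and the adaptive guard period are omitted, and all names are ours.

```python
import numpy as np

def detect_pitch_peaks(s, n_guard, t_low=0.5):
    """Sketch of steps 1-5 of the peak detector, with alpha = 1/3.
    Returns indices of retained peaks in the sign-corrected signal;
    the scan to the left of the main peak is omitted for brevity."""
    x = np.asarray(s, dtype=float)
    if abs(x.min()) > abs(x.max()):          # step 2: make the largest peak positive
        x = -x
    n = len(x)
    lo, hi = 0, n
    i_max = int(np.argmax(x))
    while i_max in (lo, hi - 1) and hi - lo > n // 3:   # step 3: skip boundary maxima
        if i_max == lo:
            lo += n // 3                     # ignore alpha*N_w samples from that end
        else:
            hi -= n // 3
        i_max = lo + int(np.argmax(x[lo:hi]))
    peaks = [i_max]
    for i in range(i_max + 1, n - 1):        # steps 4-5: scan to the right
        is_peak = x[i] > x[i - 1] and x[i] >= x[i + 1]
        if is_peak and x[i] > t_low * x[peaks[-1]]:
            if i - peaks[-1] < n_guard:      # within the guard period: keep the larger
                if x[i] > x[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks
```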
The peak detection algorithm is applied to both the speech and the corresponding
residual signal and the number of pitch cycles obtained from both signals are counted.
If the number of pitch cycles obtained is less than an expected minimum number of
pitch periods (determined from the previous average pitch value), the value of the
threshold, Tlow, for qualifying lower peaks, is reduced and the peak detection procedure
is applied again. If $T_{low}$ goes below a lower limit $T_{min}$ and still the required minimum
number of cycles have not been found, the segment is declared unvoiced and the pitch,
P, is set to 0. It should be noted that the number of pitch cycles obtained from the
residual and the speech signal may not be the same.
Pitch Computation
Once the values of possible pitch periods, p(n), (differences between successive peak indices) within a pitch estimation window have been found for both the residual and the speech signal, the periodicity of the peaks is checked by computing a mean normalized sample standard deviation defined as

$$\sigma_n(\mathbf{p}) = \frac{\sigma(\mathbf{p})}{\mu(\mathbf{p})}$$

where

$$\mu(\mathbf{p}) = \frac{1}{N_p} \sum_{n=1}^{N_p} p(n), \qquad \sigma(\mathbf{p}) = \sqrt{\frac{1}{N_p - 1} \sum_{n=1}^{N_p} \left[ p(n) - \mu(\mathbf{p}) \right]^2}.$$

$N_p$ is the number of possible pitch periods and $\mathbf{p}$ is the vector of possible pitch periods.
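The periodicity check amounts to a coefficient of variation of the candidate periods; a small value means evenly spaced pitch pulses. A minimal sketch (the use of the $N_p - 1$ sample deviation is our assumption):

```python
import numpy as np

def mean_normalized_std(p):
    """Periodicity check: sample standard deviation of the candidate
    pitch periods p(n), normalized by their mean (a coefficient of
    variation). Small values indicate evenly spaced pitch pulses."""
    p = np.asarray(p, dtype=float)
    sigma = p.std(ddof=1) if len(p) > 1 else 0.0   # (N_p - 1) sample deviation
    return sigma / p.mean()
```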
We will use the subscript s to indicate parameters related to the speech signal,
and the subscript e to indicate parameters related to the residual signal.
The normalized standard deviations - $\sigma_n(\mathbf{p}_s)$ for the speech signal, and $\sigma_n(\mathbf{p}_e)$ for the residual signal - are computed along with the respective mean values $\mu(\mathbf{p}_s)$ and $\mu(\mathbf{p}_e)$.

Parameters derived from the speech signal are given priority over those determined from the residual for pitch estimation, and an average pitch is computed through a complex heuristic using a sequence of tests. The sequence of tests performed for
computing average pitch is described in Appendix C.
It has been found that the use of both speech and residual waveform is very helpful
as peaks are sometimes more easily discernible in one compared to the other. Pitch
values from 20 to 140 are considered in this algorithm and the pitch analysis is done
over an analysis window of 240 samples (30 ms).
The algorithm has been found to be very robust under normal recording condi-
tions. No characterization of the algorithm was done under noisy recording conditions.
Figure 5.4 shows typical results from an autocorrelation based pitch detector (labelled
p) with no pitch tracking and the pitch contour obtained from the geometric pitch
detector (labelled p_g) described here which uses pitch tracking. Although the geo-
metric pitch detector uses both speech and residual signals, this does not lead to an
increase in complexity for our coder as the residual is already available and does not
need to be specially computed just for pitch determination.
Figure 5.5 shows a typical plot for a voiced segment where all peaks in the speech
and the residual signal have been marked as detected in this algorithm.
5.4.2 Modelling of Harmonic Phases
There has been two basic approaches to phase modelling in the past. In one approach
[54], the harmonic frequency is assumed to be varying linearly with time between
measurement points on the speech signal, and in another [82, 831, the variation in
frequency is assumed to be quadratic. In the original work by Griffin [54], fundamental
frequency and phases of all harmonics were measured every 20 ms and linear frequency
interpolation was used to satisfy four boundary conditions - two at the start and two
Figure 5.4: Performance of the geometric pitch detector.
at the end of a measurement interval. If $\psi_m$ and $\phi_m$ indicate the measured and predicted phases for the m-th harmonic, and $\omega_m$ indicates the measured frequency of the m-th harmonic, then the boundary conditions are

$$\theta_m(0) = \psi_m(0), \qquad \dot{\theta}_m(0) = \omega_m(0) \qquad (5.11), (5.12)$$

$$\theta_m(L) = \psi_m(L), \qquad \dot{\theta}_m(L) = \omega_m(L) \qquad (5.13), (5.14)$$

where L is the interval between measurements and $\theta_m(t)$ is the model phase.
Obviously, a linear change in frequency, which gives rise to a quadratic interpola-
tion model for phase, cannot satisfy all four conditions as a quadratic has only three
coefficients. Griffin [54] allowed the model frequency track to be raised or lowered
by a small amount Awm to force the phase matching conditions at the expense of
violating the frequency matching conditions. He proposed a model where the phase
Figure 5.5: Pitch pulses marked by the pitch detector.
is given in the continuous time domain by

$$\theta_m(t) = \theta_m(0) + \int_0^t \omega_m(\tau)\, d\tau$$

and

$$\omega_m(t) = \omega_m(0) + \left[ \omega_m(L) - \omega_m(0) \right] \frac{t}{L} + \Delta\omega_m. \qquad (5.16)$$
In the second basic approach, McAulay [82] used a cubic phase interpolation model which could actually satisfy all four boundary conditions. His model can be written as

$$\theta_m(t) = \psi_m(0) + a_m t + b_m t^2 + c_m t^3,$$

giving rise to a quadratic frequency change over the interpolation interval

$$\omega_m(t) = a_m + 2 b_m t + 3 c_m t^2.$$

The parameters $a_m$, $b_m$, and $c_m$ can be solved for the given boundary conditions (Eq. (5.11) - Eq. (5.14)) and the following results are obtained:

$$a_m = \omega_m(0)$$

$$b_m = \frac{3}{L^2} \left[ \psi_m(L) - \psi_m(0) - \omega_m(0) L \right] - \frac{\omega_m(L) - \omega_m(0)}{L}$$

$$c_m = -\frac{2}{L^3} \left[ \psi_m(L) - \psi_m(0) - \omega_m(0) L \right] + \frac{\omega_m(L) - \omega_m(0)}{L^2}.$$
Since the measured phase $\psi_m(t)$ has an uncertainty of $2\pi M_m$, where $M_m$ is an integer, the term $\psi_m(L)$ is replaced by $\psi_m(L) + 2\pi M_m$ in the above equations. The value of $M_m$ is determined from a frequency smoothness criterion such that the functional

$$f(M_m) = \int_0^L \left[ \ddot{\theta}_m(t; M_m) \right]^2 dt$$

is minimized. The functional $f(\cdot)$ measures the deviation of the frequency track from a constant frequency (which would result in $d^2\theta_m/dt^2$ being zero).
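The cubic model and the smoothness criterion can be illustrated as follows. This sketch brute-forces the unwrapping integer $M_m$ over a small range instead of using a closed-form estimate; the function and variable names are ours.

```python
import numpy as np

def cubic_phase_params(psi0, w0, psiL, wL, L):
    """McAulay-Quatieri style cubic phase interpolation: solve
    theta(t) = psi0 + w0*t + b*t**2 + c*t**3 so that all four boundary
    conditions hold, picking the 2*pi*M phase ambiguity that minimizes
    the integral of theta''(t)**2 (the smoothest frequency track)."""
    best = None
    for M in range(-10, 11):                 # brute-force the unwrapping integer
        eta = psiL + 2 * np.pi * M - psi0 - w0 * L
        b = 3 * eta / L**2 - (wL - w0) / L
        c = -2 * eta / L**3 + (wL - w0) / L**2
        # integral of (2b + 6ct)^2 over [0, L], in closed form
        smooth = 4*b*b*L + 12*b*c*L**2 + 12*c*c*L**3
        if best is None or smooth < best[0]:
            best = (smooth, b, c, M)
    return best[1], best[2], best[3]
```

For a constant-frequency track the smoothest choice of M makes b and c vanish, recovering a purely linear phase.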
Although the quadratic interpolation model fails to satisfy all boundary conditions, it is more popular because it needs a smaller number of parameters and can be simplified further, requiring no transmission of parameters other than the pitch period. This is achieved by replacing $\Delta\omega_m$ in equation 5.16 by a suitable random number.
In the discrete time domain, if $\psi_m(n)$ denotes the measured phase of the m-th harmonic, and $\theta_m(n)$ denotes the model phase used in the synthesis equation, the quadratic phase interpolation gives rise to the following equations:

$$\theta_m(n) = \theta_m(0) + \omega_m(0)\, n + \left[ \omega_m(N) - \omega_m(0) \right] \frac{n^2}{2N} + \Delta\omega_m\, n$$

$$\Delta\omega_m = \frac{\psi_m(N) - \phi_m(N)}{N} \qquad (5.24)$$

where n = N is the end of the current frame or, equivalently, the beginning of the next frame. If $\phi_m(n)$ denotes the predicted phase from an assumption of linear frequency change over the frame, then

$$\phi_m(N) = \theta_m(0) + \left[ \omega_m(0) + \omega_m(N) \right] \frac{N}{2}. \qquad (5.25)$$
For a coder completely based on predicted phase, the measured phase $\psi_m(N)$ in equation 5.24 is replaced by the predicted phase $\phi_m(N)$ from equation 5.25, making $\Delta\omega_m = 0$ as expected. In Griffin's original MBE coder [54, 52], the difference, $\psi_m(N) - \phi_m(N)$, between measured phase and predicted phase was quantized and transmitted for every harmonic. In the IMBE (INMARSAT-M) [32] system, the quantity $\psi_m(N) - \phi_m(N)$ is modelled as

$$\psi_m(N) - \phi_m(N) = \lambda\, r_m \qquad (5.26)$$
where $r_m$ is a random number uniformly distributed between $[-\pi, \pi]$, and $\lambda$ is the fraction of unvoiced energy in the frame being synthesized, estimated as the ratio of the number of unvoiced bands to the total number of frequency bands (usually between 6 and 12) used. Thus, encoding of phases is avoided in the IMBE system.
Numerous studies have been made using the sinusoidal synthesis model for speech [3, 16, 54, 55, 80, 82, 83, 85]. All of them use a phase model based on a predicted value
of phase and add a random component to it depending on a voicing probability that
measures how close the speech frame is to being voiced. The voicing probability is
measured based on a goodness of fit of the sinusoidal model [83] or from a normalized
autocorrelation coefficient at the pitch lag [25].
Figure 5.6: Difference between measured and predicted phase changes for a voiced frame.
The difference between the measured phase changes, $\Delta\psi_m = \psi_m(N) - \psi_m(0)$,
for each harmonic and the predicted phase changes, $\Delta\phi_m = \phi_m(N) - \phi_m(0)$, is plotted
against the harmonic number m in Figure 5.6 for a voiced frame. It can be noted that
the deviation from the predicted phase increases with frequency. Phase changes for
harmonic components can be measured over an unvoiced segment if an assumption
is made about the fundamental frequency (pitch period). As already pointed out
earlier, our coder uses a pitch period equal to the synthesis frame length in order
i r CHAPTER 5. A LOW RATE SPECTRAL EXCITATION CODER
to synthesize an unvoiced segment. Under this assumption, the phase changes for each harmonic over an unvoiced frame are shown in Fig. 5.7. The difference between
Figure 5.7: Difference between measured and predicted phase changes for an unvoiced frame.
measured and predicted phase changes for unvoiced speech is random and does not
exhibit any pattern.
In modelling phase changes for each harmonic, the measured phase change, $\Delta\psi_m(N)$, in Eq. (5.24) is replaced by the predicted phase change, $\Delta\phi_m(N)$, plus a random phase $r_m$:

$$\Delta\psi_m(N) = \Delta\phi_m(N) + r_m \qquad (5.27)$$

where

$$r_m = \begin{cases} s_g \dfrac{m}{M} \lambda \pi & \text{for voiced speech} \\ s_g \lambda \pi & \text{for unvoiced speech.} \end{cases} \qquad (5.28)$$

Here $s_g$ is an empirical constant called the phase-scatter gain, $\lambda$ is a uniformly distributed random number between $[-1, 1]$, and M is the number of harmonics for the frame being synthesized. In our coder $s_g$ was set to 1.0.
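A sketch of the dispersion model follows. The exact voiced scatter law of Eq. (5.28) is partly illegible in the source, so the $m/M$ scaling below is our assumption, motivated by the observation in Fig. 5.6 that the deviation from the predicted phase grows with harmonic number; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)   # encoder and decoder would share this seed

def dispersed_phase_change(pred_dphi, m, M, voiced, s_g=1.0):
    """Phase model of Eq. (5.27): the measured phase change is replaced
    by the predicted one plus a random scatter r_m. Voiced harmonics get
    a deviation growing with harmonic number m (an assumption here);
    unvoiced harmonics get a fully random phase in [-pi, pi]."""
    lam = rng.uniform(-1.0, 1.0)
    if voiced:
        r_m = s_g * (m / M) * lam * np.pi
    else:
        r_m = s_g * lam * np.pi
    return pred_dphi + r_m
```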
5.4.3 Estimation and Quantization of Harmonic Magnitudes
The harmonic magnitudes can be estimated in several ways. In MBE [54], the magnitudes are derived from the spectrum of windowed speech. Since the windowed spectrum depends both on the magnitudes and the pitch, both harmonic magnitudes and pitch are estimated as a solution to a joint optimization problem. In STC [82, 83], the harmonic magnitudes are measured from a periodogram obtained with a Hamming window at least 2.5 times the average pitch period and suitably normalized.
In the TFI technique [97], the harmonic magnitudes are measured from a pitch sized
DFT of the speech segment with a rectangular window.
The use of a pitch sized DFT to compute harmonic magnitudes is particularly attractive for sinusoidal synthesis systems because it can provide an exact synthesis
of the frame if unquantized harmonic magnitudes are used along with unquantized
DFT phases. This is shown below starting from the definition of DFT.
Let x(n) be the speech signal and P be the pitch period in number of samples. Also let X(k) be the P-point DFT of x(n) obtained as

$$X(k) = \sum_{n=0}^{P-1} x(n)\, e^{-j \frac{2\pi}{P} k n}, \qquad k = 0, 1, \ldots, P-1.$$

Then the signal x(n) can be written as

$$x(n) = \frac{1}{P} \sum_{k=0}^{P-1} X(k)\, e^{j \frac{2\pi}{P} k n}.$$
If P is odd, the frequency sampling points are as shown in Fig. 5.8(a) and all samples can be paired with their complex conjugate terms except for the one at $\omega = 0$. The case when P is even is shown in Fig. 5.8(b) and similar pairs can be formed except for the samples at $\omega = 0$ and $\omega = \pi$. We can represent x(n) for these two cases as follows.
Case P odd:

$$x(n) = \frac{X(0)}{P} + \sum_{k=1}^{(P-1)/2} \frac{2 \|X(k)\|}{P} \cos\left( \frac{2\pi}{P} k n + \phi_k \right)$$

Case P even:

$$x(n) = \frac{X(0)}{P} + \sum_{k=1}^{P/2 - 1} \frac{2 \|X(k)\|}{P} \cos\left( \frac{2\pi}{P} k n + \phi_k \right) + \frac{X(P/2)}{P} \cos(\pi n)$$

where $\phi_k = \arg[X(k)]$.

(a) P odd (b) P even
Figure 5.8: Frequency sampling points for a P-point DFT
Assuming that the signal energy is practically zero at $\omega = 0$, it is evident that the signal x(n) can be exactly represented as a sum of harmonics

$$x(n) = \sum_{k=1}^{M} A_k \cos\left( \frac{2\pi}{P} k n + \phi_k \right)$$

where $M = \lfloor P/2 \rfloor$ and each harmonic magnitude is given by

$$A_k = \frac{2 \|X(k)\|}{P},$$

except for k = P/2 when P is even, in which case

$$A_{P/2} = \frac{\|X(P/2)\|}{P}.$$
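The exactness of the pitch-sized DFT representation is easy to verify numerically. The sketch below retains the DC term, so the reconstruction is exact even when the energy at $\omega = 0$ is not negligible; the function name is ours.

```python
import numpy as np

def harmonic_analysis(x_cycle):
    """Measure harmonic magnitudes A_k and phases phi_k from a pitch-sized
    DFT of one cycle, then resynthesize the cycle as a sum of harmonics.
    With unquantized magnitudes and phases the reconstruction is exact."""
    P = len(x_cycle)
    X = np.fft.fft(x_cycle)
    M = P // 2
    A = 2.0 * np.abs(X[1:M + 1]) / P
    if P % 2 == 0:
        A[-1] = np.abs(X[M]) / P          # the Nyquist harmonic is not paired
    phi = np.angle(X[1:M + 1])
    n = np.arange(P)
    synth = X[0].real / P + sum(          # DC term plus the M harmonics
        A[k - 1] * np.cos(2 * np.pi * k * n / P + phi[k - 1])
        for k in range(1, M + 1))
    return A, phi, synth
```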
Quantization of harmonic magnitudes poses a special problem in sinusoidal syn-
thesis systems since the number of harmonic magnitudes varies with pitch. Therefore,
the number of magnitudes to be quantized changes with the talker and also within a
sentence spoken by one person. IMBE uses a very elaborate coding scheme to code
this variable dimension magnitude vector with the same number of bits. Recently,
the problem has been addressed by two techniques known as the variable dimension
vector quantization (VDVQ) [26] and the Non-Square Transform Vector Quantization
(NSTVQ) [74].
As has been shown earlier in this chapter (Fig. 5.1), the residual signal has a
relatively flat spectral magnitude and an elaborate quantization scheme may not be
required for reasonable speech quality at a low bit rate. A novel 0-bit quantization of
harmonic spectral shape is used here along with scalar quantization of the energy of
the magnitude vector in quantizing the harmonic spectral magnitude vector.
The coder measures and transmits the value of pitch using 7 bits that can encode
128 pitch values. The pitch value in our coder is allowed to vary between 20 and 140
giving rise to 121 different values. Thus there are 7 unused codes that can be used to
transmit other information whenever a pitch value is not transmitted. We transmit
a pitch code of 0 for unvoiced speech segments. The decoder, upon receiving a zero,
sets the pitch to the synthesis frame length and also takes note of the fact that the
segment was unvoiced. If a non-zero pitch code was received, then the segment was
voiced.
Once the V/UV classification of the segment to be synthesized is known, the
decoder uses two harmonic spectral shape templates - one for voiced and another
for unvoiced speech - to obtain harmonic magnitudes for synthesis by sampling the
templates at appropriate sampling points given by

$$f_i = \frac{i F_s}{P} \text{ Hz}, \qquad i = 1, \ldots, M,$$

where $F_s$ is the sampling frequency in Hz and P is the pitch period in number of samples.
The templates were created from two sets of training vectors, one for voiced and
another for unvoiced speech. Each vector in the training set was 257 points long
corresponding to the frequency range of $[0, \pi]$. Each training vector was created as
follows:
i) Obtain one pitch cycle of voiced segment or synthesis frame size (80 sam-
ples in our coder) of unvoiced segment.
ii) Zero pad to N(= 512) samples and compute FFT magnitudes.
iii) Take the first N/2 + 1 (= 257) elements of the FFT magnitudes to form the
FFT magnitude vector.
iv) Compute norm of the FFT magnitude vector and normalize the vector by
dividing by the norm.
v) Compute logarithm to the base 10 for each element of the normalized
vector.
The template is then computed as the centroid of all normalized log FFT magnitude vectors:

$$\mathbf{c} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{S}_t$$

where $\mathbf{S}_t$ is a training vector of FFT magnitudes processed as above and T is the number of training vectors.
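Steps i)-v) and the centroid computation can be sketched as follows. The small floor inside the logarithm is our addition, to guard against exactly zero spectral bins; the names are ours.

```python
import numpy as np

N = 512  # FFT size; templates have N//2 + 1 = 257 points over [0, pi]

def training_vector(segment):
    """Steps i)-v): one normalized log-magnitude training vector from one
    pitch cycle (voiced) or one synthesis frame (unvoiced) of samples."""
    X = np.abs(np.fft.fft(segment, n=N))[:N // 2 + 1]   # zero-pad, keep [0, pi]
    X = X / np.linalg.norm(X)                           # unit-norm magnitude vector
    return np.log10(X + 1e-12)                          # floor avoids log10(0)

def make_template(segments):
    """Template = centroid of all normalized log FFT magnitude vectors."""
    return np.mean([training_vector(s) for s in segments], axis=0)
```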
Now, the quantization of harmonic magnitudes can be described as follows:
i) Given the harmonic magnitude vector A and the quantized pitch $\hat{P}$ (that has embedded V/UV information), choose the voiced or unvoiced template of log spectral magnitudes, c.

ii) If the value of $\hat{P}$ was less than $P_{min}$, replace the value by the length of the synthesis frame.

iii) Compute the log harmonic magnitude shape vector $x = \{x_i : i = 1, \ldots, M\}$ as follows:
Figure 5.9: Log magnitude spectrum templates for voiced and unvoiced speech
• $f_i = Ni/P$
• $k_i = \lfloor f_i \rfloor$
• $k_j = k_i + 1$
• if $k_j > N/2 + 1$, $k_j = N/2 + 1$
• $x_i = c_{k_i} + (c_{k_j} - c_{k_i})(f_i - k_i) + r_g \lambda$, where $\{c_l : l = 1, \ldots, N/2 + 1\}$ are elements of the template vector c, $r_g$ is the magnitude randomization gain and $\lambda$ is a uniformly distributed random number in the range $[-1, 1]$. The random component is added to avoid excessive similarity of spectra for consecutive speech segments.

iv) Compute the harmonic magnitude shape vector y as $y_i = 10^{x_i}$, $i = 1, \ldots, M$.

v) Compute the gain, $g = \|A\| / \|y\|$.

vi) Scalar quantize g using $b_g$ bits to $\hat{g}$.

vii) Compute the quantized harmonic magnitude vector, $\hat{A} = \hat{g}\, y$.
A value of 0.1 to 0.2 for $r_g$ gave good results. The gain g was quantized by using a uniform quantizer on $\log_{10} g$. It should be noted that the quantization scheme presented above needs to have identical random number sequences generated both at the encoder and the decoder. This can be easily achieved by using identical random
number generators at the encoder and decoder, and initializing them with the same
seed.
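The encoder-side steps i)-vii) can be sketched as follows. Gain quantization is left out, and the 0-based array indexing of the template is a simplification of the 1-based indexing in the text; the function and parameter names are ours.

```python
import numpy as np

def quantize_magnitudes(A, P, template, N=512, r_g=0.2, rng=None):
    """Steps i)-vii) of the 0-bit shape quantizer: sample the stored
    log-magnitude template at the harmonic frequencies f_i = N*i/P,
    optionally add randomization, and carry only a gain. Sketch only."""
    M = len(A)
    i = np.arange(1, M + 1)
    f = N * i / P
    ki = np.floor(f).astype(int)
    kj = np.minimum(ki + 1, N // 2)               # clamp at the template edge
    ki = np.minimum(ki, N // 2)
    x = template[ki] + (template[kj] - template[ki]) * (f - ki)
    if rng is not None:
        x = x + r_g * rng.uniform(-1.0, 1.0, M)   # decorrelate consecutive frames
    y = 10.0 ** x
    g = np.linalg.norm(A) / np.linalg.norm(y)     # in the coder, g is quantized
    return g * y
```

With a flat (all-zero) log template and no randomization, the gain alone recovers a flat magnitude vector exactly, which is the sense in which only the energy is coded.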
An alternate quantization scheme that works as well and does not need synchro-
nized random number generators is as follows.
Encoder:
i) Given the harmonic magnitude vector A and the quantized pitch P (that
has embedded V/UV information), choose the voiced or unvoiced template
of log spectral magnitudes, c.
ii) If $\hat{P} < P_{min}$, replace the value by the length of the synthesis frame.

iii) Compute the log harmonic magnitude shape vector $x = \{x_i : i = 1, \ldots, M\}$ as follows:

• $f_i = Ni/P$
• $k_i = \lfloor f_i \rfloor$
• $k_j = k_i + 1$
• if $k_j > N/2 + 1$, $k_j = N/2 + 1$
• $x_i = c_{k_i} + (c_{k_j} - c_{k_i})(f_i - k_i)$, where $\{c_l : l = 1, \ldots, N/2 + 1\}$ are elements of the template vector c.

iv) Compute the harmonic magnitude shape vector y as $y_i = 10^{x_i}$, $i = 1, \ldots, M$.

v) Compute the gain, $g = \|A\| / \|y\|$.

vi) Scalar quantize g using $b_g$ bits to $\hat{g}$.
Decoder:
i) Given the value of $\hat{P}$ (that has embedded V/UV information), choose the voiced or unvoiced template of log spectral magnitudes, c.

ii) If $\hat{P} < P_{min}$, replace the value by the length of the synthesis frame.

iii) Compute the log harmonic magnitude shape vector $x = \{x_i : i = 1, \ldots, M\}$ as follows:
• $f_i = Ni/P$
• $k_i = \lfloor f_i \rfloor$
• $k_j = k_i + 1$
• if $k_j > N/2 + 1$, $k_j = N/2 + 1$
• $x_i = c_{k_i} + (c_{k_j} - c_{k_i})(f_i - k_i)$, where $\{c_l : l = 1, \ldots, N/2 + 1\}$ are elements of the template vector c.

iv) Compute the harmonic magnitude shape vector y as $y_i = 10^{x_i}$, $i = 1, \ldots, M$.

v) Add random components to the log magnitude shape vector x to generate the randomized shape vector x':

$$x_i' = x_i + r_g \lambda$$

where $r_g$ is the magnitude randomization gain and $\lambda$ is a uniformly distributed random number in the range $[-1, 1]$.

vi) Compute the randomized magnitude shape vector y' as $y_i' = 10^{x_i'}$, $i = 1, \ldots, M$.

vii) Compute the quantized harmonic magnitude vector, $\hat{A} = \hat{g}\, y'$.
5.5 An 1800 bps Spectral Excitation Coder
The schematic diagram of an 1800 bps Spectral Excitation Coder is shown in Fig. 5.10. At the encoder, the speech spectral envelope is estimated every 40 ms using 10th order LPC analysis. The LPC coefficients are transformed to LSFs and quantized using an MSVQ with M-L search [15, 69] at 24 bits/vector. An eight stage MSVQ with eight vectors per stage is used with M = 29 to provide robust transparent quantization of LSFs. The quantized LSFs are interpolated and the analysis filter A(z) is updated once every 2 ms (16 samples) in computing the residual signal e(n). The original speech and the computed residual are used to determine the pitch P using the geometric pitch detector described in subsection 5.4.1. The pitch P is quantized to $\hat{P}$ using 7 bits as follows:

$$\hat{P} = P_{min} - 1 + P_i \qquad (5.42)$$
Figure 5.10: A Low bit rate Spectral Excitation Coder
where $P_i$ is the pitch code transmitted to the decoder. $P_{min} = 20$ in our implementation. The value $P_i = 0$ is reserved to indicate an unvoiced frame.
The harmonic magnitude vector is computed once every 10 ms (80 samples) by
applying a pitch sized rectangular window on the unquantized residual and applying
DFT on the windowed signal. The harmonic magnitude gain g is computed from the
harmonic magnitude vector and sampled voiced or unvoiced log magnitude template
depending on the value of $\hat{P}$ as described earlier. The gain g is quantized to $\hat{g}$ using a scalar logarithmic quantizer with 5 bits (a uniform quantizer on $\log_{10} g$ over the range $[\log_{10} g_{min}, \log_{10} g_{max}]$). $g_{max}$ was 20,000 and $g_{min}$ was 1.0 in our implementation.
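A sketch of the 5-bit logarithmic gain quantizer follows. The exact index mapping was not recoverable from the source, so the uniform mapping over $[\log_{10} g_{min}, \log_{10} g_{max}]$ below is an assumption; the names are ours.

```python
import numpy as np

def quantize_gain(g, bits=5, g_min=1.0, g_max=20000.0):
    """Uniform quantization on log10(g) over [log10(g_min), log10(g_max)],
    as used for the harmonic magnitude gain. Returns (index, ghat)."""
    levels = 2 ** bits
    lo, hi = np.log10(g_min), np.log10(g_max)
    t = (np.log10(np.clip(g, g_min, g_max)) - lo) / (hi - lo)
    idx = int(round(t * (levels - 1)))            # 5-bit code to transmit
    ghat = 10.0 ** (lo + idx * (hi - lo) / (levels - 1))
    return idx, ghat
```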
The information transmitted to the decoder consists of the quantized LSFs, the quantized pitch (which includes V/UV information) and the quantized harmonic magnitude gain.
The decoder computes the harmonic magnitudes from the quantized magnitude
gain, the magnitude randomization gain, and the voiced/unvoiced magnitude tem-
plates stored at the decoder. It also computes the phases for all harmonics using the
phase prediction and dispersion model described earlier.
A magnitude randomization gain of $r_g = 0.2$ was found to give good results and was used in our implementation. The phase scatter gain $s_g$ (see Eq. (5.28)) was set to 1.0.
The quantized residual produced by a sinusoidal oscillator bank using the quan-
tized harmonic magnitudes and phases is passed through a synthesis filter l/A(z) to
produce synthesized speech. The coefficients of the synthesis filter are updated from
the quantized LSFs which are interpolated every 2 ms in synchronization with the
interpolation at the encoder.
The bit allocation for the 1800 bps coder is shown in Table 5.1.

Parameter        Bits   Updates/frame   Rate (bps)
LSFs              24          1             600
Pitch              7          4             700
Exc. Gain          5          4             500
Harmonic Mags      0          0               0
Total                                      1800

Table 5.1: Bit Allocation for the 1800 bps coder

5.5.1 Evaluation of Coder Performance

The performance of the 1800 bps coder was evaluated using an informal Mean Opinion Score (MOS) test. Three codecs - IMBE at 4150 bps, LPC-10e at 2400 bps and our harmonic coder at 1800 bps - were used in this test. Each codec was used to encode 8
sentences - 4 male speakers and 4 female speakers. Coded sentences were played in random order following the original 16-bit PCM sentence through stereo headphones.
Seven participants took part in the informal MOS test which provided 56 ratings
for each codec. The results of the test are shown in Table 5.2. The 1800 bps SEC
performed better for female speech compared to male speech. This is expected since
the number of harmonics in male speech is generally larger than in female speech and
the higher harmonics display a random phase change that is difficult to model. The
1800 bps SEC performed significantly better than 2400 bps LPC-10e and scored more
than 0.3 MOS points higher. IMBE was used as an anchor to validate the MOS scores.
IMBE obtained a MOS score of about 3.5 which is a generally accepted score for IMBE
indicating that the test scores are reliable. The major distortion experienced in the
SEC coder was in nasal sounds. Significant improvement of quality for nasal sounds
should be possible by using a nasal harmonic magnitude template and a classifier.
This, however, was not investigated in this work.
Coder     Rate (bps)   Mean Opinion Score        Variance
                       Male  Female  All     Male  Female  All
IMBE        4150       3.33   3.58  3.46    0.44   0.27   0.36
LPC-10e     2400       2.41   2.88  2.65    0.48   0.27   0.43
SEC         1800       2.84   3.13  2.98    0.33   0.22   0.29

Table 5.2: MOS results

5.6 Conclusions

In this chapter we presented a harmonic coder at 1800 bps that used an MSVQ with M-L search for quantization of LPC parameters developed earlier. The MOS obtained for this coder was significantly higher than that of LPC-10e at 2400 bps. The nasal sounds were the most audibly distorted, showing that the harmonic magnitude shapes for nasal sounds are different from those of non-nasal voiced sounds. This shows that a possible improvement in speech quality can be made by using a separate template for nasal sounds.
Chapter 6
Conclusion and Future Directions
The major focus of this thesis has been efficient quantization of LPC parameters and
a low bit rate (1800 bps) coder has been implemented using the LPC quantization
technique developed here. The structure of a multi-stage vector quantizer has been
analyzed and various search strategies for a multi-stage vector quantizer codebook
along with their inherent problems have been presented.
It has been shown that a multi-candidate search with an appropriate distortion
measure not only provides a lower quantization distortion but it does so at a lower
computational complexity. Several multi-stage structures have been studied and their
performances have been presented. It has been shown that as the number of stages
increases and the number of codewords per stage decreases, more is gained from the
M-L algorithm as M is increased. It has also been shown that transparent coding
of LPC parameters can be done using 22 bits per frame at the same computational
complexity as the 24-bit split VQ [86] which has been considered to be the lowest rate
transparent LPC quantizer so far. At 24 bits/frame, the M-L search technique can achieve transparent quantization at much lower search complexity.
The performance of MSVQ codes has been studied under channel error conditions, with codebook ordering using pseudo-Gray coding. It is shown that while VQ
based systems have lower average spectral distortion and a lower percentage of 2-4
dB outliers even with transmission errors, scalar quantization may lead to a lower
percentage of 4 dB outliers particularly at high error rates.
The robustness of each multi-stage structure has been studied and it has been
found that robust VQ can be achieved by adding suitable structure to the code while
CHAPTER 6. CONCLUSION AND FUTURE DIRECTIONS 116
impairing average performance only slightly. Possible explanation for the improved
robustness of the structured codebooks are weak dependence between code vectors
and the training set, and the ability of structured codebooks to produce spectra not
present in the training set. These properties are particularly important given the fact
that both the training and the test set may not be representative of outliers present
in natural speech.
The low rate spectral excitation coder at 1800 bps uses a novel technique for 0-bit
spectral shape quantization for the excitation signal by hiding the V/UV information
in pitch values. This was achieved because the range of quantized pitch values ($20 \le P \le 140$) does not require the full 7 bits used to represent them.
A new geometric pitch detector was also developed that has much lower compu-
tational complexity compared to an autocorrelation based pitch detector. It can also
provide estimates of individual pitch periods making it suitable for use with pitch
synchronous algorithms.
There are several possibilities for improvement in both the LPC quantizer and
the Spectral Excitation Coder presented here. First, the value of M used in the
M-L search need not be the same for each stage of the codebook. In fact M for
each stage can be chosen from the average quantization distortion from the previous
stage. This should provide some reduction in computational complexity for the same
quantization performance. Next, techniques for selecting the MSVQ structure can be
studied. The codebook size at each stage can be estimated during the design of the
stage by studying the distribution of training vectors at that stage. This of course
applies only to a sequential design of the codebook.
The spectral excitation coder presented in this thesis does not work very well for
nasal sounds. This can be improved by using more spectral templates at the expense
of a small increase in bit rate.
Appendix A
Linear Prediction
Linear Prediction (LP) has become the most widely used method of speech signal
analysis and synthesis since its introduction to speech [9, 78]. The success is usually
attributed to the fact that most sounds from the vocal tract can be modeled by an
all pole structure. Another reason for its success is the nature of human perception
that we perceive spectral peaks rather than spectral nulls. This explains why LP
models work even in the case of nasal and other sounds with spectral zeros. The
LP modeling of speech factorizes the problem of speech coding into two independent
coding problems - coding of the spectral parameters representing the vocal tract and
coding of the excitation to the vocal tract. Out of many possible candidates for
spectral estimation, LP based estimates have been most useful because it addresses
the spectral estimation problem from a deeper point of view, the Maximum Entropy
Principle.
A.1 Conceptual Formulation
Let x be a stochastic process and let X(t) be a column vector of L independent
measurements from L realizations of x,

X(t) = [x_0(t), x_1(t), \ldots, x_{L-1}(t)]^T.   (A.1)

Linear Prediction involves predicting this vector from a set of p previous measurements

X(t-m) = [x_0(t-m), x_1(t-m), \ldots, x_{L-1}(t-m)]^T, \quad 1 \le m \le p.   (A.2)
The forward (in time) predicted value is then written as

\hat{X}_f(t) = \sum_{m=1}^{p} a_m(t)\, X(t-m).   (A.3)
Writing the past observations in matrix form,

\mathcal{X}(t) = [X(t-1), X(t-2), \ldots, X(t-p)],   (A.4)

Eq. (A.3) can be written as

\hat{X}_f(t) = \mathcal{X}(t)\, a(t)   (A.5)

where a(t) = [a_1(t), a_2(t), \ldots, a_p(t)]^T.
The prediction error vector, e(t), can then be written as

e(t) = X(t) - \hat{X}_f(t) = X(t) - \mathcal{X}(t)\, a(t).   (A.6)

The prediction error energy is then given by

\varepsilon(t) = e^T(t)\, e(t).   (A.7)

Let \hat{a}(t) be the set of parameters that minimizes the prediction error, i.e.

\hat{a}(t) = \arg\min_{a(t)} \varepsilon(t),   (A.8)

\hat{\varepsilon}(t) = \min_{a(t)} \varepsilon(t).   (A.9)
From Eqs. (A.7) and (A.8),

\varepsilon(t) = \left[X(t) - \mathcal{X}(t)a(t)\right]^T \left[X(t) - \mathcal{X}(t)a(t)\right].   (A.10)

Setting the partial derivatives with respect to a^T(t) equal to zero,

-2\,\mathcal{X}^T(t)\left[X(t) - \mathcal{X}(t)\hat{a}(t)\right] = 0,   (A.11)

\mathcal{X}^T(t)\mathcal{X}(t)\, \hat{a}(t) = \mathcal{X}^T(t)\, X(t).   (A.12)
Equation (A.12) is known as the normal equation. Now, we can write the forward
predicted value as

\hat{X}_f(t) = \mathcal{X}(t)\hat{a}(t)
             = \mathcal{X}(t)\left(\mathcal{X}^T(t)\mathcal{X}(t)\right)^{-1}\mathcal{X}^T(t)\, X(t)   (A.13)
             = P(t)\, X(t)   (A.14)

where

P(t) = \mathcal{X}(t)\left(\mathcal{X}^T(t)\mathcal{X}(t)\right)^{-1}\mathcal{X}^T(t), \quad P(t) \in \mathbb{R}^{L \times L}.   (A.15)
The following theorem [38] shows that P(t) is a projection operator and \hat{X}_f(t) is the
projection of X(t) on the space spanned by the columns of \mathcal{X}(t), which is the subspace
of past observations.
Theorem A.1 Let W be a subspace of R^n. There is a unique n × n matrix P such
that for each column vector b in R^n, the vector Pb is the projection of b on W. The
projection matrix P can be found by selecting any basis \{a_1, a_2, \ldots, a_k\} for W and
computing P = A(A^T A)^{-1} A^T, where A is the n × k matrix having column vectors
a_1, a_2, \ldots, a_k.
The corresponding minimum prediction error vector is given by

\hat{e}(t) = X(t) - P(t)X(t) = [I - P(t)]\, X(t) = P^{\perp}(t)\, X(t)   (A.16)

where P^{\perp}(t) is the orthogonal complement of P(t).
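The projection interpretation is easy to verify numerically. The sketch below uses toy random data (the matrix shapes and values are our illustration, not the thesis's): the normal-equation solution, the projection operator, and the orthogonality of the error to the past subspace are checked directly.

```python
import numpy as np

# Numerical illustration of Eqs. (A.12)-(A.16): the forward prediction is the
# projection of X(t) onto the span of the past-observation matrix.

rng = np.random.default_rng(1)
L, p = 50, 3
Xmat = rng.standard_normal((L, p))   # columns play the role of X(t-1)...X(t-p)
x = rng.standard_normal(L)           # plays the role of X(t)

a_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ x)   # normal equation (A.12)
P = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T     # projection operator (A.15)

x_f = Xmat @ a_hat                   # forward prediction
assert np.allclose(P @ x, x_f)       # Eq. (A.14): P applied to x gives the prediction
assert np.allclose(P @ P, P)         # P is idempotent, hence a projector
e = x - x_f
assert np.allclose(Xmat.T @ e, 0)    # error is orthogonal to the past subspace
```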
The term \mathcal{X}^T(t)\mathcal{X}(t) in Eq. (A.12) is an estimate of the autocorrelation matrix
of the stochastic process x, and the term \mathcal{X}^T(t)X(t) is an estimate of a vector of
autocorrelation coefficients, since each element in the matrix \mathcal{X}^T(t)\mathcal{X}(t) and the vector
\mathcal{X}^T(t)X(t) is an ensemble average over L realizations. Extending the averaging over
all realizations of x and invoking the properties of stationarity, the normal equation
(A.12) can be written as

R\, a = v   (A.17)
where R is the p × p autocorrelation matrix with elements R_{ij} = r(|i-j|), and
v = [r(1), r(2), \ldots, r(p)]^T is the vector of autocorrelation coefficients.   (A.18)

The autocorrelation matrix, being positive definite, always has an inverse; hence the
predictor coefficients can be computed from the above equations as

a = R^{-1}\, v.   (A.19)
In speech coding, the speech waveform is assumed to be a realization of a stationary
stochastic process. Further, the autocorrelation matrix is estimated from this single
realization under the assumption of ergodicity.
From a discrete signal processing point of view, defining

\hat{x}(n) = \sum_{m=1}^{p} a_m\, x(n-m),

the prediction equation can be written as

\hat{X}(z) = \left(\sum_{m=1}^{p} a_m z^{-m}\right) X(z),

and the prediction error can be written as

E(z) = X(z) - \hat{X}(z) = A(z)\, X(z)

where

A(z) = 1 - \sum_{m=1}^{p} a_m z^{-m}.   (A.24)
Figure A.l: Linear Prediction Model
A(z) is called the inverse filter. The speech analysis and synthesis models can then
be shown as in Fig. A.1.
Comparing with the source-filter model of speech production (Fig. 1.3), the
synthesis filter can easily be identified as the 1/A(z) block. It also gives us a way to
compute an excitation E(z), given the synthesis filter and the speech, X(z), to be
synthesized. This clearly shows the utility of LP analysis: it breaks up a signal x(n)
into a signal e(n) and a filter A(z) which can be quantized independently of each
other.
A.2 Equivalent Representations
There are many different equivalent representations of the filter A(z). Some very useful
ones are Reflection Coefficients, Log Area Ratios, and Line Spectral Frequencies.
They are all derived by considering a backward predictor along with the forward
predictor. For ease of notation, let us write \alpha_i = -a_i, i = 1, 2, \ldots, p. Then Eq. (A.24)
can be written as

A(z) = \sum_{i=0}^{p} \alpha_i z^{-i}   (A.25)

where \alpha_0 = 1. The forward prediction error for a p-th order predictor is then written
as

e_f(n) = \sum_{i=0}^{p} \alpha_i\, x(n-i).   (A.26)
The backward prediction error results from predicting the past from the future and is
written as

e_b(n) = \sum_{i=1}^{p+1} \beta_i\, x(n-i)   (A.27)

where \beta_{p+1} = 1. It can be shown that a relationship exists between the coefficients
\alpha_i and the coefficients \beta_i: one is just the time-reversed sequence of the other [39]. The
reflection coefficients (also known as PARCOR coefficients) are then defined as

k_m = \frac{E\left[e_f^{(m-1)}(n)\, e_b^{(m-1)}(n)\right]}{\sqrt{E\left[\left(e_f^{(m-1)}(n)\right)^2\right] E\left[\left(e_b^{(m-1)}(n)\right)^2\right]}},   (A.28)

i.e., the normalized (partial) correlation between the forward and backward prediction
errors of the (m-1)-th order predictor.
A useful property of the reflection coefficients (RCs) is that the set of reflection coefficients
\{k_1, \ldots, k_p\} for a p-th order predictor is a subset of the coefficients \{k_1, \ldots, k_p, k_{p+1}\}
for the (p+1)-th order predictor. This is not true for the linear prediction coefficients
\alpha_i. However, a recursive relation may be derived between the coefficients of different
orders of prediction:

A_{m+1}(z) = A_m(z) - k_{m+1}\, B_m(z)   (A.29)

where B_m(z) is given by

B_m(z) = z^{-(m+1)}\, A_m(z^{-1}).   (A.30)
These reflection coefficients are closely related to the acoustic reflection coefficients for
the stepped cylinder model [92] of the vocal tract (Fig. A.2). One important property
of the reflection coefficients is that they always lie within the interval [-1, +1]. Thus
the synthesis filter described by the reflection coefficients can easily be checked for
stability when the coefficients are quantized. This cannot be accomplished by a simple
observation of the Linear Prediction coefficients.
Figure A.2: Stepped cylinder model of the vocal tract
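The order-recursive relation above is realized by the classical Levinson-Durbin algorithm, which solves R a = v order by order and yields the reflection coefficients as a by-product. The sketch below is a standard textbook implementation, not the thesis code; the toy autocorrelation sequence is computed from random data.

```python
import numpy as np

# Levinson-Durbin recursion: solves the Toeplitz normal equations R a = v
# order by order, producing the reflection coefficients k_m along the way.

def levinson(r, p):
    """r: autocorrelation values r[0..p].
    Returns (alpha, k, E): inverse-filter coefficients [1, alpha_1..alpha_p],
    reflection coefficients k[1..p], and the final prediction error energy."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]
    k = np.zeros(p)
    for m in range(1, p + 1):
        acc = r[m] + np.dot(a[1:m], r[m-1:0:-1])
        km = -acc / E
        k[m-1] = km
        a[1:m+1] = a[1:m+1] + km * a[m-1::-1]   # order-update of the coefficients
        E *= (1.0 - km * km)                     # error energy shrinks each order
    return a, k, E

rng = np.random.default_rng(2)
x = rng.standard_normal(400)
p = 4
r = np.array([np.dot(x[:len(x)-j], x[j:]) for j in range(p + 1)])
a, k, E = levinson(r, p)

# Cross-check against a direct solve of the normal equations (A.17):
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
assert np.allclose(np.linalg.solve(R, r[1:p+1]), -a[1:])
assert np.all(np.abs(k) < 1) and E > 0   # |k_m| < 1 for a positive definite R
```

Note the sign convention: the returned `a` are the coefficients α_i of the inverse filter A(z) (α_0 = 1), so the predictor coefficients of Eq. (A.19) are −a[1:].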
Although RCs are restricted to values between ±1, the spectral sensitivity for
quantization differs greatly between the regions near ±1 and the region around 0. Therefore
some equivalent representations are often used that have better quantization
properties. These are Log Area Ratios (LARs) and Arc Sine coefficients (ASRCs),
defined as follows:

LAR_i = \log \frac{1 + k_i}{1 - k_i}   (A.31)

and

ASRC_i = \sin^{-1}(k_i).   (A.32)
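The mappings of Eqs. (A.31) and (A.32), together with the LAR inverse, can be written directly; this is a small illustrative sketch (the function names are ours).

```python
import math

# Reflection-coefficient transformations of Eqs. (A.31) and (A.32).

def rc_to_lar(k):
    """Log Area Ratio of a reflection coefficient, |k| < 1."""
    return math.log((1 + k) / (1 - k))

def lar_to_rc(g):
    """Inverse mapping; algebraically equal to tanh(g/2)."""
    return (math.exp(g) - 1) / (math.exp(g) + 1)

def rc_to_asrc(k):
    """Arc Sine coefficient of a reflection coefficient."""
    return math.asin(k)

k = 0.95                      # near the sensitive edge of [-1, +1]
assert abs(lar_to_rc(rc_to_lar(k)) - k) < 1e-12
assert abs(math.sin(rc_to_asrc(k)) - k) < 1e-12
```

Both mappings expand the sensitive regions near ±1, which is why a uniform quantizer in the LAR or ASRC domain behaves like a well-matched nonuniform quantizer in the RC domain.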
Log Area Ratios and Arc Sine Coefficients have more uniform sensitivity properties,
making them more suitable for simple uniform quantization; being derived
from Reflection Coefficients, they can easily be checked for instabilities in the synthesis filter.
Even so, their relationship with the formant structure of the vocal tract transfer
function 1/A(z) is not very straightforward. In fact, a quantization error in only one
coefficient affects the whole spectral envelope. Also, being derived from Reflection
Coefficients, these are essentially parameters operating in the time domain, like the
autocorrelation coefficients. A different representation of the all-pole filter called Line
Spectral Frequencies (LSFs) (i.e., resonant frequencies with an infinite Q, or discrete
frequencies) was introduced in 1975 by F. Itakura [56]. It is worthwhile to note that in
this representation, the vocal-tract-filter parameters are frequency domain parameters,
as they are in the channel vocoder and the formant vocoder.
The LSFs are obtained from the LP coefficients through a transformation de-
scribed below. The main advantage of using LSFs for quantization is the fact that
quantization error in one coefficient results in spectral distortion only around the
neighbourhood of that frequency. Other advantages are their better behaviour under
linear interpolation, and the fact that they may be more readily quantized in accordance
with the properties of auditory perception to save bits (i.e., coarser quantization of
the higher frequency spectral components).
Prediction coefficients may be transformed into LSFs through the decomposition
of the impulse response of the LPC-analysis filter into even and odd time sequences
(Fig. A.3) [62]. This decomposition is reversible because the original impulse response
can be obtained as half the sum of the even and odd time sequences. It is easy to
show that both the even and odd time sequences have roots along the unit circle in
the complex plane. These roots, being on the unit circle, denote resonant frequencies
with infinite Q. Hence, these sequences may be expressed by LSFs.
Figure A.3: Transformation of predictor coefficients to LSFs. (a) Impulse response of a
10th-order LPC analysis filter; (b) time-shifted and time-reversed waveform of (a);
(c) sum of waveforms (a) and (b), an even-symmetric sequence; (d) difference of (a)
and (b), an odd-symmetric sequence.
Let us assume that A_p(z) is given, and construct two (p+1)-th order predictors
P(z) and Q(z) under the conditions k_{p+1} = 1 and k_{p+1} = -1 respectively, i.e.

P(z) = A_{p+1}(z)\big|_{k_{p+1}=1},   (A.33)
Q(z) = A_{p+1}(z)\big|_{k_{p+1}=-1}.   (A.34)

Then, from the recurrence relations (Eq. (A.29)),

P(z) = A_p(z) - z^{-(p+1)}\, A_p(z^{-1}),   (A.35)
Q(z) = A_p(z) + z^{-(p+1)}\, A_p(z^{-1}).   (A.36)
The relationship of these equations with Figure A.3 is easily recognized since B_p(z) =
z^{-(p+1)} A_p(z^{-1}) [42]. The arguments of the complex roots of the difference filter P(z)
and the sum filter Q(z) are called the Line Spectral Frequencies (LSFs); therefore
conversion of LP coefficients to LSFs is, in essence, finding the roots of these two filters.
A.2.1 Computation of Line Spectral Frequencies
The polynomials Q(z) and P(z), being symmetrical and anti-symmetrical respectively,
have roots at z = +1 and/or z = -1 which can be removed by polynomial division:

i) p even:

Q(z) = (1 + z^{-1})\, G_1(z),   (A.37)
P(z) = (1 - z^{-1})\, G_2(z);   (A.38)

ii) p odd:

Q(z) = G_1(z),   (A.39)
P(z) = (1 - z^{-2})\, G_2(z).   (A.40)
Now, G_1(z) and G_2(z) are symmetric and of even order. Since the roots of P(z) and Q(z)
alternate in position on the unit circle, and P(z) always has a root at z = +1 (\omega = 0),
the lowest LSF corresponds to a root of G_1(z).
Let the order of G_1(z) be 2M_1 and the order of G_2(z) be 2M_2. Then M_1 = p/2,
M_2 = p/2 for p even, and M_1 = (p+1)/2, M_2 = (p-1)/2 for p odd. Explicitly
showing the symmetry of the polynomial coefficients,

G_i(z) = \sum_{m=0}^{2M_i} g_i(m)\, z^{-m}, \qquad g_i(m) = g_i(2M_i - m).   (A.41, A.42)

Evaluating on the unit circle, z = e^{j\omega},

G_i(e^{j\omega}) = 2\, e^{-jM_i\omega}\, G_i'(\omega)   (A.43)

where

G_i'(\omega) = \tfrac{1}{2} g_i(M_i) + \sum_{m=1}^{M_i} g_i(M_i - m) \cos m\omega.

The roots of G_1'(\omega) and G_2'(\omega) are the LSFs.
There are several ways the LSFs may be computed. The most common method
is that of Kabal and Ramachandran [61]. In this method, the mapping
x = \cos\omega allows one to express these polynomials in terms of Chebyshev polynomials:

\cos m\omega = T_m(x)   (A.44)

where T_m(x) is an m-th order Chebyshev polynomial in x. Thus, the equations
involving cosines (Eq. (A.43)) can be expressed in terms of Chebyshev polynomials.
This series lends itself to an efficient evaluation [61] which bypasses an expansion in
powers of x. The mapping x = \cos\omega maps the upper semicircle in the z-plane to the
real interval [-1, +1]. Therefore all the roots x_i lie between -1 and +1, with the root
corresponding to the lowest frequency LSF being the one nearest to +1. The series
G_1'(x) is evaluated first on a grid close to +1 for the lowest frequency LSF, and the
search then proceeds alternately on G_1'(x) and G_2'(x), looking for all the LSFs. Once
the roots \{x_i\} of G_1'(x) and G_2'(x) are determined, the corresponding LSFs are given by

\omega_i = \cos^{-1}(x_i).   (A.45)
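For clarity, the sketch below computes LSFs by rooting P(z) and Q(z) directly with numpy instead of the Chebyshev grid search described above; this is an illustrative shortcut, not the efficient method of [61]. The test filter is built from hypothetical stable poles of our own choosing.

```python
import numpy as np

# Illustrative LSF computation: form the sum and difference filters of
# Eqs. (A.35)-(A.36) and take the angles of their upper-half-plane roots.

def lsf(a):
    """a: inverse-filter coefficients [1, alpha_1, ..., alpha_p].
    Returns the p Line Spectral Frequencies in (0, pi), sorted."""
    aa = np.concatenate([a, [0.0]])
    P = aa - aa[::-1]            # difference filter, antisymmetric
    Q = aa + aa[::-1]            # sum filter, symmetric
    w = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # discard the trivial real roots at z = +1 and z = -1 and keep one
        # member of each conjugate pair (positive angle)
        w.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
    return np.sort(np.array(w))

# A stable 4th-order test filter from two conjugate pole pairs (our choice).
poles = np.array([0.9 * np.exp(1j * 0.5), 0.9 * np.exp(-1j * 0.5),
                  0.8 * np.exp(1j * 2.0), 0.8 * np.exp(-1j * 2.0)])
a = np.real(np.poly(poles))
w = lsf(a)
assert len(w) == 4 and np.all(np.diff(w) > 0)
assert np.all((w > 0) & (w < np.pi))
```

For a stable (minimum-phase) A(z), the p angles returned are distinct, strictly increasing, and alternate between roots of the two filters, which is exactly the ordering property the Chebyshev search exploits.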
Other techniques of solving for LSFs starting from LP coefficients are given by
Soong and Juang [99], and Kang and Fransen [62].
Some important relationships of the Line Spectral Frequencies with speech formants,
and one of the root finding techniques, can be understood by rewriting the
expressions for P(z) and Q(z) as follows:

P(z) = A_p(z)\,\left[1 - R_p(z)\right],   (A.46)
Q(z) = A_p(z)\,\left[1 + R_p(z)\right],   (A.47)

where

R_p(z) = \frac{z^{-(p+1)}\, A_p(z^{-1})}{A_p(z)}.   (A.48)

The filter R_p(z) is called the ratio filter.
The ratio filter is an all-pass filter. It can readily be seen that when the phase
angle of the ratio filter is a multiple of 2\pi radians, the amplitude response of the
difference filter, P(z), is zero. On the other hand, when the phase angle of the
ratio filter is an odd multiple of \pi, the amplitude response of the sum filter, Q(z),
is zero. The relationships between the LPC spectrum, the zeros of P(z) and Q(z),
the phase function of the ratio filter, and the group delay of the ratio filter can be
seen in Figure A.4. The x-axis is frequency in Hz in all the plots. The real root of
the difference filter at +1 and that of the sum filter at -1 can also be seen in their
amplitude responses as zeros at 0 Hz and 4000 Hz respectively. The horizontal line
in (c) represents \pi radians. The sampling rate was 8000 Hz for this plot.
It can be seen that the group delay of the ratio filter is large near speech formant
frequencies, and LSFs are close together. Since the phase of the ratio filter has to
Figure A.4: Plots showing relationships between LSFs and other parameters.
(a) Amplitude response of the LPC synthesis filter; (b) amplitude response of P(z)
and Q(z); (c) phase response of the ratio filter; (d) group delay of the ratio filter.
change by \pi from one LSF to the next, it is obvious that the group delays are going
to be larger whenever LSFs are closer together. These also define the location and
bandwidth of the formants.
Now we show that the spectral estimate for a finite data sequence must be an
all-pole spectrum from a maximum entropy point of view.
A.3 Maximum Entropy Principle
Given a set of autocorrelation coefficients for a stationary process, the best estimate
for the underlying spectrum is given by the Maximum Entropy Principle, which says
that the spectrum maximizing the entropy is the best estimate. Fundamentally, it
gives a procedure for constructing the spectrum from a finite set of data points without
making any constraining assumptions about the signal. It can be shown that spectral
estimates based on maximizing entropy provide maximum spectral resolution with
minimum spectral splatter [1].
Let S(\omega) be a power spectral density function. As with continuous probability
distributions, the entropy can be properly defined only as a difference with respect to
some other spectrum S_0(\omega). However, assuming a reference white power spectrum of
unit power density, S_0(\omega) = 1, we can write

H = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega)\, d\omega.   (A.50)
The constraints satisfied by S(\omega) are the given autocorrelation coefficients, i.e.,

r(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega)\, e^{j\omega k}\, d\omega, \qquad k = 0, 1, \ldots, M,   (A.51)

where z = e^{j\omega}. Using Lagrange multipliers and the calculus of variations, we obtain the
Euler equation

\frac{\partial F}{\partial S} = 0   (A.52)

where

F = -\log S(\omega) + \sum_{k=0}^{M} \lambda_k\, S(\omega)\left(e^{j\omega k} + e^{-j\omega k}\right).   (A.53)

This gives

\frac{1}{S(\omega)} = \sum_{k=0}^{M} \lambda_k \left(e^{j\omega k} + e^{-j\omega k}\right).   (A.54)
Using the Fejér-Riesz theorem [89, page 231], the finite trigonometric series in
Eq. (A.54) can be factored, giving

S(\omega) = \frac{\sigma^2}{\left| \sum_{k=0}^{M} \alpha_k\, e^{-j\omega k} \right|^2}.   (A.55)

This clearly shows S(\omega) to be an all-pole spectrum, as obtained by linear prediction.
This coincidence of Linear Prediction and Maximum Entropy spectral estimation is
of course not accidental. Two key factors contribute to their similarity.

1. The definition of entropy contains a logarithm, which is required so that the
entropies of independent events add.

2. Maximizing the entropy requires differentiation with respect to S, leading to the
1/S(\omega) term in Eq. (A.54) and a finite Fourier series. Thus the reciprocal power
spectrum as a function of frequency must be a band-limited function, which is
precisely the property of an all-pole function.

Thus linear prediction actually provides the best spectral estimate in the absence of
any prior information about the data sequence.
Appendix B
Quantization
Quantization is the process of mapping a large (possibly infinite) set of points in a
metric space to a smaller and finite set of points in the same space. An N-point
quantizer is defined as a mapping Q : X \to C, where X is the input set and

C = \{y_1, y_2, \ldots, y_N\}   (B.1)

is the output set or codebook, with size |C| = N. For the special case of X \subseteq R, where
R is the real line, Q is called a scalar quantizer and the output points are simple
scalars, also referred to as output levels or reproduction values. When X is non-scalar,
the quantizer is called a vector quantizer.
B.1 Scalar Quantization
An N-point scalar quantizer partitions the real line (or a subset of it) into N segments
or cells, R_i, i = 1, 2, \ldots, N. The i-th cell is given by

R_i = \{x \in X : Q(x) = y_i\}   (B.2)

where the y_i are the output levels of the quantizer. It follows from this definition that
\bigcup_i R_i = X and R_i \cap R_j = \emptyset for i \ne j. The output values being scalar, we assume that
they are indexed such that

y_1 < y_2 < \cdots < y_N.   (B.3)

The cells R_1 and/or R_N may then be unbounded, depending on whether the input
space X is unbounded or not. The unbounded cells in a quantizer are called overload
cells and the bounded cells are called granular cells. All the overload cells together
form the overload region and all the granular cells together form the granular region.
A scalar quantizer is regular if

a) each cell R_i is an interval of the form (x_{i-1}, x_i) together with one or both
of the endpoints, and

b) y_i \in R_i for each i.

The values x_i are called boundary points, and for a regular quantizer they satisfy the
inequality

x_0 < y_1 < x_1 < y_2 < x_2 < \cdots < y_N < x_N.   (B.4)

A typical symmetric regular quantizer, Q(x), is shown in Fig. B.1. The horizontal
segments in Q(x) are called treads and the vertical discontinuities are called risers.
Figure B.1: A typical mid-tread scalar quantizer
A quantizer can be decomposed into two independent operators working in succession:
an encoder, E, and a decoder, D. The encoder is a mapping E : X \to Z, where
Z is the set of positive integers, and the decoder is the mapping D : Z \to C. Thus if
Q(x) = y_i then E(x) = i and D(i) = y_i. This is the same as saying Q(x) = D(E(x)).
Sometimes a quantizer is assumed to generate both the index i and the output value
y_i, and a decoder is sometimes referred to as an inverse quantizer. In a communication
system, only the index i is transmitted, and the actual output value y_i can be
obtained through a table lookup procedure at the receiving end.
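The encoder/decoder decomposition can be sketched for a uniform mid-tread quantizer; the parameters below (step size, level count) are our illustrative choices.

```python
# A minimal uniform mid-tread scalar quantizer split into an encoder E and
# a decoder D, so that Q(x) = D(E(x)).

def make_uniform_quantizer(step, n_levels):
    """Return (encode, decode) for a symmetric mid-tread quantizer with an
    odd number of levels; indices outside the granular region are clamped."""
    half = (n_levels - 1) // 2
    def encode(x):                        # E: R -> index
        i = round(x / step)
        return max(-half, min(half, i))   # overload cells clamp the index
    def decode(i):                        # D: index -> output level
        return i * step
    return encode, decode

enc, dec = make_uniform_quantizer(0.25, 9)    # levels -1.0, -0.75, ..., +1.0
assert dec(enc(0.6)) == 0.5                   # nearest output level
assert dec(enc(100.0)) == 1.0                 # overload cell: largest level
assert dec(enc(-0.1)) == 0.0                  # mid-tread: zero is a level
```

In a transmission system only `enc(x)` is sent; `dec` is the table lookup at the receiver.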
B.1.1 Performance Measures
The quantization process can be modeled as the addition of a random noise component
e = Q(x) - x to the input sample as indicated in figure B.2. Since the quantization
Figure B.2: Additive noise model of quantization
error is modeled as a random variable, a measure of the performance of a quantizer
must be based on a statistical average of some function of the error. The most common
is the mean squared distortion measure, defined as the expectation of e^2:

D = E[e^2] = \int \left(Q(x) - x\right)^2 p(x)\, dx   (B.5)

where p(x) is the probability density function of the input variable x. Frequently, the
performance of a quantizer is specified by a signal to noise ratio (SNR) defined as

SNR = 10 \log_{10}(\sigma^2 / D)   (B.6)

where \sigma^2 = E[x^2] is the variance of the input signal. For high resolution quantization,
it is useful to write the average distortion as a sum over all quantization cells. For a
given input random variable, x, and a quantizer, Q = \{y_i, R_i;\ i = 1, 2, \ldots, N\}, the
average distortion can be written as

D = \sum_{i=1}^{N} \int_{R_i} (y_i - x)^2\, p(x)\, dx.   (B.7)
For a regular scalar quantizer, this can be written as

D = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} (y_i - x)^2\, p(x)\, dx.   (B.8)

For large N, each interval R_i can be made quite small (with the exception of the
overload intervals R_1 and R_N) and it is reasonable to approximate the pdf p(x) as
being constant within each interval R_i. Approximating p(x) by p(y_i) when x is in R_i,
and p(x) = 0 in the overload regions, the above equation can be simplified to

D \approx \frac{1}{12} \sum_{i=2}^{N-1} p(y_i)\, \Delta_i^3   (B.9)

where \Delta_i = x_i - x_{i-1} is the length of the interval R_i and y_i = (x_i + x_{i-1})/2. The
reason for choosing y_i as the centroid of R_i is given in subsection B.1.3, page 138. For
the special case of uniform quantization, where the decision boundaries are equally
spaced so that \Delta_i = \Delta, the step size of the quantizer, the mean squared error can be
further simplified to

D \approx \frac{\Delta^2}{12} \sum_{i=2}^{N-1} p(y_i)\, \Delta   (B.10)

so that

D \approx \frac{\Delta^2}{12}.   (B.11)
It can also be shown that in this case

E[e\,x] = -D \ne 0,   (B.12)

so the quantization error and the input signal are correlated. The average granular
distortion can also be written as

D_g = \frac{1}{3}\, \gamma^2 \sigma^2\, 2^{-2r}   (B.13)

where \gamma is the loading factor defined by \Delta = 2\sigma\gamma/N, and r = \log_2 N is the quantizer
resolution in bits. This expression shows that the distortion goes to 0 exponentially
as r \to \infty. Now, the SNR can also be written as

SNR = 10 \log_{10}(\sigma^2 / D_g)
    = K + 6.02\, r   (B.14)
where \beta = 1/\gamma is the loading fraction and K = 10 \log_{10}(3\beta^2). This shows that for
the high resolution case, the SNR increases by about 6 dB for each additional bit
used for quantization. The above expression for SNR assumed negligible overload
distortion, which is not true for a loading factor less than 2 or 3. The total average
distortion, D, can then be written as a sum of the granular distortion, D_g, and the
overload distortion, D_o:

D = D_g + D_o.   (B.15)
It can be shown that for a symmetric quantizer, satisfying

Q(-x) = -Q(x),   (B.16)

D_g is a function only of the loading factor \gamma and does not depend on \sigma for a fixed \gamma.
For the high resolution case, the granular distortion for a general class of regular
nonuniform scalar quantizers is given by Bennett's integral,

D_g \approx \frac{1}{12 N^2} \int \frac{p(x)}{\lambda^2(x)}\, dx   (B.17)

where \lambda(x) is the point density function of the nonuniform quantizer.
B.1.2 Robust Quantization
An important concept in quantization is robust quantization, i.e., designing quantizers
whose performance is independent of the input signal pdf, p(x). To discuss robust
quantization, we have to introduce a model for treating nonuniform quantization,
known as the compander model. It can be shown [42] that any regular nonuniform
quantizer can be represented as a nonlinearity F(x), called the compressor, followed
by a regular uniform quantizer and an inverse nonlinearity F^{-1}(x), called the expandor
(Fig. B.3).
Figure B.3: Compander model of nonuniform quantization
The characteristic F(x) is a monotonically increasing odd function of x, ranging
from -V to +V, where V is the overload level of the quantizer. Every nonuniform
quantizer can be modeled in this way with a suitable choice of F(x). It can
be shown that for large N, the average distortion of a nonuniform quantizer can be
written as

D \approx \frac{V^2}{3N^2} \int \frac{p(x)}{g^2(x)}\, dx   (B.18)

where g(x) = F'(x) is the compressor slope function. If the slope function is chosen as

g(x) = \frac{V}{b\,|x|}   (B.19)

then Eq. (B.18) reduces to

D \approx \frac{b^2 \sigma^2}{3N^2}   (B.20)

so that the SNR \sigma^2/D reduces to a constant, 3N^2/b^2, which is independent of p(x).
Integrating Eq. (B.19) gives the compressor function as

F(x) = \frac{V}{b} \log x + c   (B.21)

for x > 0, where c is a constant. This shows that a logarithmic compressor would give
robust performance. It should be borne in mind that Eq. (B.18) neglects overload
noise and SNR will begin to drop when the input power level becomes large enough.
Also, the curve just computed (Eq. (B.21)) is not in fact realizable, since F(0) is
not defined. To circumvent this problem, a modified compressor curve is used which
behaves well for small values of x and retains the logarithmic behaviour elsewhere. A
compressor curve widely used in speech quantization is the \mu-law curve, given by

F(x) = \frac{V \log(1 + \mu x / V)}{\log(1 + \mu)}   (B.22)
for x > 0. For \mu \gg 1 and \mu x \gg V, F(x) approximates Eq. (B.21). \mu-law
companding is used in PCM systems in the United States, Canada, and Japan. Another
robust logarithmic characteristic is the A-law, given by

F(x) = \frac{A x}{1 + \log A} \ \ \text{for } 0 \le x \le V/A, \qquad
F(x) = \frac{V\left(1 + \log(A x / V)\right)}{1 + \log A} \ \ \text{for } V/A \le x \le V.   (B.23)

The A-law characteristic is used in European PCM telephone systems. The parameters
\mu and A control the degree of compression, measured by the companding advantage,
which is the slope of the compressor curve at the origin; the typical values in use
are \mu = 255 and A = 87.6. F(x) being an odd function, its value for x < 0 is given by
F(x) = -F(-x). \mu-law and A-law companding have been adopted as standards for
PCM coding of speech by the ITU-T (formerly known as CCITT) in its G.711 recommendation,
where the logarithmic characteristics are approximated by piecewise linear
functions.
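The μ-law compressor of Eq. (B.22) and its expandor can be sketched directly (the exact G.711 standard uses a piecewise-linear approximation of this curve, not these formulas):

```python
import math

# Continuous mu-law compressor/expandor pair, with V and mu as in the text.

V, MU = 1.0, 255.0

def compress(x):
    """F(x) of Eq. (B.22), extended to x < 0 by odd symmetry."""
    return math.copysign(V * math.log(1 + MU * abs(x) / V) / math.log(1 + MU), x)

def expand(y):
    """Inverse mapping F^{-1}(y)."""
    return math.copysign((V / MU) * ((1 + MU) ** (abs(y) / V) - 1), y)

x = -0.3
assert abs(expand(compress(x)) - x) < 1e-12   # exact inverse pair
assert abs(compress(V)) == V                  # F(V) = V
assert compress(-0.5) == -compress(0.5)       # odd symmetry
```

Small inputs are expanded (slope μ/ln(1+μ) ≈ 46 at the origin, the companding advantage), while large inputs are compressed, which is what flattens the SNR across input levels.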
B.1.3 Optimum Quantization
For a given input pdf and a fixed number of quantization levels, N, an optimal
quantizer is one that produces minimum average distortion. In other words, a quantizer
Q_{opt} is optimal if and only if

E[d(Q_{opt}(x), x)] \le E[d(Q(x), x)] \quad \forall\ N\text{-level quantizers } Q.   (B.24)
The problem of finding an optimum quantizer does not, in general, have a closed form
solution. However, necessary conditions for optimality can be derived that allow us
to use iterative algorithms to design optimal quantizers in some cases. The necessary
conditions are found by factoring the problem into two interdependent problems and
solving them independently. The subproblems that we address are -
1. for a given decoder, what is the optimal encoder; and
2. for a given encoder, what is the optimal decoder.
It can be shown that for a given decoder, the optimal encoder must satisfy the nearest
neighbour condition. That is, for a given codebook (set of output levels), C, the
partition cells satisfy

R_i \subseteq \{x : d(x, y_i) \le d(x, y_j)\ \forall j\}.   (B.25)
This is the same as saying that

Q(x) = y_i \ \text{only if}\ d(x, y_i) \le d(x, y_j)\ \forall j \ne i.   (B.26)
For a given decoder, this condition is also a sufficient condition for optimality of an
encoder for any distortion measure that satisfies the properties of a distance function.
It should be noted that the nearest neighbour rule does not assign boundary points
to a specific region. Heuristics are used to assign a boundary point to one of the
neighbouring regions; a simple way to resolve the ambiguity is to always assign the
boundary point to the cell on its left (or, consistently, on its right).
For the second subproblem, i.e. finding an optimal decoder for a given encoder, it
can be shown [42] that the decoder must meet the centroid condition. In other words,
given a partition \{R_i\} and a distortion measure d(x, y), the optimal codebook for a
random variable x is given by

y_i = \text{centroid}(R_i) = \arg\min_c E[d(x, c) \mid x \in R_i].   (B.27)

For the case of the squared error distortion measure, the centroid is given by the
conditional mean of the random variable x given that x is in the region R_i:

y_i = E[x \mid x \in R_i].   (B.28)
Substituting Eq. (B.28) in Eq. (B.8), the average distortion can be written as

D = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} \left(x - E[x \mid x \in R_i]\right)^2 p(x)\, dx.   (B.29)

It can be seen that an analytical solution for the x_i is extremely difficult, except possibly for
very small N. Lloyd [73] and Max [81] independently discovered the necessary
conditions for optimality for the mean square distortion measure (Max derived the necessary
conditions for a k-th mean absolute error criterion, including k = 2) and came
up with effective algorithms for computing the optimum solution. The algorithms,
known as the Lloyd (Method I) algorithm and the Lloyd (Method II)-Max algorithm,
iteratively compute the boundary points and the output levels while simultaneously
satisfying both necessary conditions. We will describe the Lloyd algorithm here in a
form that has been generalized for non-scalar quantization as well. To design an N-
level scalar quantizer using Lloyd's algorithm, one must start with an initial codebook
of N output levels. Then the Lloyd iteration works in two steps -
1. Given a codebook, find the optimum partition using the nearest neighbour prin-
ciple;
2. Compute new codebook for the newly computed partition using the centroid
rule.
The iteration is continued until the change in average distortion falls below a preset
limit or reaches zero. It can be seen that each of the steps above reduces the average
distortion and the algorithm is guaranteed to converge. Note that the Lloyd itera-
tion produces a codebook satisfying the necessary conditions of optimality but does
not guarantee that an optimal quantizer will be produced. Sufficient conditions for
optimality were derived by Fleischer [37]. He showed that if the probability density
function, p(x), satisfies

\frac{d^2}{dx^2} \log p(x) < 0   (B.30)

for all x, then there exists only one quantizer that can satisfy both the necessary
conditions. This guarantees that quantizers designed using Lloyd's iteration for
distributions satisfying Eq. (B.30) are indeed optimal.
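The two-step Lloyd iteration above can be sketched on empirical data; this is an illustrative scalar version with our own toy samples and stopping rule, not a production design tool.

```python
import numpy as np

# Lloyd's iteration (Method I) on an empirical distribution: alternate the
# nearest-neighbour partition and the centroid update until the average
# distortion stops improving. The same loop generalizes directly to VQ.

def lloyd(samples, codebook, tol=1e-9, max_iter=200):
    samples = np.asarray(samples, dtype=float)
    cb = np.sort(np.asarray(codebook, dtype=float))   # working copy
    prev = np.inf
    for _ in range(max_iter):
        # 1. nearest-neighbour partition of the samples
        idx = np.argmin(np.abs(samples[:, None] - cb[None, :]), axis=1)
        # 2. centroid update (empty cells keep their old level)
        for i in range(len(cb)):
            if np.any(idx == i):
                cb[i] = samples[idx == i].mean()
        d = np.mean((samples - cb[idx]) ** 2)
        if prev - d < tol:
            break
        prev = d
    return np.sort(cb), d

rng = np.random.default_rng(3)
data = rng.standard_normal(5000)
cb0 = np.array([-2.0, -0.5, 0.5, 2.0])
d0 = np.mean(np.min((data[:, None] - cb0[None, :]) ** 2, axis=1))
levels, d = lloyd(data, cb0)
assert d <= d0 + 1e-12        # each step can only reduce the distortion
assert len(levels) == 4
```

Both steps are individually optimal (Eqs. (B.26) and (B.28)), so the distortion sequence is non-increasing and the iteration converges, though only to a codebook satisfying the necessary conditions.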
Another approach to optimal quantizer design is using Bennett's high rate ap-
proximation (Eq. (B.18)) to compute average distortion and minimizing it over all
possible compressor slope functions g(x) having a constant area under them. This yields
the result that the optimum compressor slope function is proportional to the cube root
of the pdf [40]:

g_{opt}(x) = c_1\, [p(x)]^{1/3}.   (B.31)

By integrating, we obtain the compressor characteristic

F_{opt}(x) = c_1 \int_0^x [p(\alpha)]^{1/3}\, d\alpha \quad \text{for } x > 0   (B.32)
where cl is a constant such that F ( V ) = V. It should be noted that Eq. (B.32) gives
the optimum quantizer for a given value of the overload point, V. A separate one-
dimensional minimization can be done for the best value of V. Computation of the
minimum mean squared error obtained from quantizers designed with this approach
leads to values in good agreement with Max's tabulations (for a Gaussian pdf) even
for values of N as small as 6 [40]. Smith [98] studied optimal quantization based on
the Laplacian pdf and found that the optimum compressor has the form

F(x) = \frac{V\left(1 - e^{-mx}\right)}{1 - e^{-mV}} \quad \text{for } x > 0.   (B.33)

This is called the m-law compressor. While an optimum quantizer can give much
higher SNR for a given bit rate, the performance degrades quickly as the input power
level changes. On the other hand, a robust quantizer like the one based on \mu-law
companding maintains a reasonably high SNR over a broader range of input power
levels.
The optimal quantization discussed above was based on the average distortion for
a given number of output levels as the cost function. However, from a user's viewpoint,
a better quantizer is one that uses the lowest number of bits per sample for a given
maximum distortion; or, equivalently, for a given number of bits/sample, obtains the
highest SNR. This is the same as minimizing the output entropy for a given average
distortion. It has been shown [47, 109] that if entropy coding is used after quantization,
then the uniform quantizer performs better than the best non-uniform (Lloyd-Max)
quantizer.

For a given bit rate B (bits/sample), the SNRs for high rate quantization, neglecting
overload conditions, for different quantization strategies are as follows [40].
Uniform quantization followed by entropy coding:
SNR = 6B - 1.50
Best non-uniform quantizer followed by entropy coding:
SNR = 6B - 2.45
Best non-uniform quantizer only:
SNR = 6B - 4.35
Uniform quantizer with loading factor of 4:
SNR = 6B - 7.3
It is evident that a uniform quantizer followed by entropy coding gives the best per-
formance.
So far we have only considered scalar quantization without memory. For a source
with memory, where successive samples are statistically dependent,
better performance can be obtained by quantizing the difference between a sample
and its predicted value. These are called predictive quantizers, and they provide a
performance improvement compared to a simple scalar quantizer by reducing the
variance of the signal at the input to the quantizer. Predictive quantizers have been
studied in great detail (e.g. [42, 11]) and techniques like DPCM and ADPCM have
been very popular: ADPCM has also been adopted by the ITU-T as a toll quality speech
coding standard at 32 kbps in recommendation G.726.
Many of the benefits of scalar predictive quantizers can be obtained by quantizing
a block of samples together. This is not only true for sources with memory but
also holds for a memoryless source; this is a fundamental result of Shannon's
rate distortion theory. The process of quantizing a block of scalars is called vector
quantization (VQ). The vectors quantized with a VQ (the term VQ is used both to
mean the process of vector quantization as well as a vector quantizer) need not be
collections of samples from a scalar process but can be samples from a vector process,
as is the case for the quantization of vocal tract spectral parameters.
B.2 Vector Quantization
A k-dimensional vector quantizer is a mapping Q : X \to C, where X is a k-dimensional
metric space or a subset of it, and C = \{y_1, y_2, \ldots, y_N\} is a finite set of
vectors from the same space. In speech coding we are particularly interested in the
case where X = R^k. A VQ can be decomposed into two mappings: an encoder E
which assigns to each input vector x = (x_0, x_1, \ldots, x_{k-1})^T a channel symbol E(x) in
some channel symbol set M, and a decoder D assigning to each channel symbol \alpha in
M a value in a reproduction alphabet C. The channel symbol set is often assumed to
be a space of binary vectors for convenience; e.g., M may be the set of all 2^R binary
R-dimensional vectors.
If C has N elements, then the quantity R = log_2 N is called the rate of the
quantizer in bits per vector, and r = R/k is known as the resolution or code rate in
bits per vector component.
An N-point vector quantizer partitions R^k into N regions or cells R_i for i ∈ I ≡
{1, 2, ..., N}. The i-th cell, defined by

R_i = {x ∈ R^k : Q(x) = y_i},

is sometimes called the inverse image or pre-image of y_i under the mapping Q and
denoted more concisely by R_i = Q^{-1}(y_i). It follows that

∪_{i ∈ I} R_i = R^k and R_i ∩ R_j = ∅ for i ≠ j,

so that {R_i} form a partition of R^k. For k = 1, a VQ degenerates to a scalar quantizer.
A vector quantizer is called regular if
a) each cell, R_i, is a convex set, and
b) for each i, y_i ∈ R_i.
Just as in scalar quantization, a cell that is unbounded is called an overload cell and
all overload cells together form the overload region. A bounded cell is called a granular
cell and all granular cells together form the granular region.
A VQ is not merely a generalization of scalar quantization. It can be shown that
no coder can do better than a VQ. The following theorem is due to Gersho [42].
Theorem B.1 For any given coding system that maps a signal vector into one of
N binary words and reconstructs the approximate vector from the binary word, there
exists a vector quantizer with codebook size N that gives exactly the same performance,
i.e., for any input vector it produces the same reproduction as the given coding system.
The reason vector quantizers outperform scalar quantizers in jointly quantizing a num-
ber of scalars (vector components) can be attributed to four interrelated properties
of vector components [75]:
a) linear dependency (correlation),
b) nonlinear dependency (statistical dependency),
c) pdf shape, and
d) dimensionality (giving rise to a choice of cell shape for k > 1).
Although linear dependency (correlation) can be removed by a proper choice of the
basis vectors (as in the KLT), nonlinear statistical dependency cannot be removed; it prevents
the joint pdf from factoring into independent pdf's, which would permit independent scalar
quantization of each vector component. Even if scalar quantizers were designed
for the marginal probability density of each component, they could very well spend bits
quantizing regions of zero probability. A vector quantizer can take advantage of the
joint pdf and partition the space accordingly. Figure B.4 clearly shows this point for
k = 2. If the vector components x_1 and x_2 are quantized independently using scalar
Figure B.4: A uniform joint pdf over a rectangular region (shown shaded) along with the marginal pdf's
quantizers designed using their marginal pdf's, the area quantized by the pair of scalar
quantizers is shown as the dotted square. The pdf is assumed uniform over the shaded
rectangle and zero outside. It is clearly seen that bits will be spent unnecessarily in
quantizing a large region of zero probability. The case of two independent scalar
quantizers can be considered as a special case of vector quantization of the vector
x = (x_1, x_2) where the vector quantizer is given by

Q(x) = (Q_1(x_1), Q_2(x_2)),

where Q_1 and Q_2 are the scalar quantizers for x_1 and x_2 respectively. Such a vector
code is called a product code and the VQ is called a product VQ because the
overall VQ is formed as a Cartesian product of lower-dimensional VQs.
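As a concrete sketch of a product VQ (the scalar codebooks below are made up purely for illustration), each component is quantized independently, and the implied vector codebook is the Cartesian product of the two scalar codebooks, whether or not the joint pdf places any mass near some of its entries:

```python
import itertools
import numpy as np

# Hypothetical scalar codebooks for x1 and x2 (illustrative, not from the thesis).
Q1 = np.array([-1.0, 0.0, 1.0])
Q2 = np.array([-0.5, 0.5])

def product_vq(x):
    """Quantize each component with its own scalar quantizer:
    Q(x) = (Q1(x1), Q2(x2))."""
    y1 = Q1[np.argmin(np.abs(Q1 - x[0]))]
    y2 = Q2[np.argmin(np.abs(Q2 - x[1]))]
    return (float(y1), float(y2))

# The implied vector codebook: every pairing of the two scalar codebooks.
codebook = list(itertools.product(Q1, Q2))  # 3 x 2 = 6 code vectors

print(product_vq((0.9, -0.4)))  # -> (1.0, -0.5)
```

All six product code vectors are "spent" even if, as in Figure B.4, the joint pdf is zero over much of the dotted square.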
A particular advantage of VQ that is not always very evident stems from the fact
that k > 1. This gives rise to a wide choice of cell shapes. It can be shown for a
two-dimensional VQ that, for a uniform pdf and neglecting edge effects, hexagonal
cells give a lower average distortion than square cells with the same number
of cells covering any given area [75].
B.2.1 Vector Quantizer Performance
A vector quantizer is always defined over a metric space so that there exists at least
one valid distortion (distance) measure in that space. A distortion measure is essential
in designing a VQ as one goal of such a design effort is minimizing some distortion. If
d(x, y) is the distance between the original vector x and the reconstructed vector y,
then the performance of the VQ may be quantified by the average distortion D defined
as

D = E[d(x, y)]. (B.42)

In practice, the measure of performance is the long-term sample average or time
average

d̄ = lim_{M→∞} (1/M) Σ_{m=1}^{M} d(x_m, y_m). (B.43)
If the vector process X is stationary and ergodic, the sample average in Eq. (B.43)
tends in the limit to the expectation in Eq. (B.42). In particular, if a VQ partitions
the input space into L regions,

D = Σ_{i=1}^{L} P(x ∈ R_i) ∫ d(x, y_i) p(x|i) dx, (B.44)

where P(x ∈ R_i) is the discrete probability that x is in R_i, p(x|i) is the conditional
multidimensional probability density function (joint pdf of the vector components) of x
given that x ∈ R_i, and the integral is taken over all vector components of x.
It is obvious that the distortion depends on the distance measure used. The
most common and widely used measure of distortion is the squared error or the squared
L2 norm. This is defined as

d(x, y) = ||x - y||^2 = (x - y)^T (x - y) = Σ_{i=0}^{k-1} (x_i - y_i)^2. (B.45)

We will only talk about real vectors here; for complex vectors, the transpose operator is
replaced by a conjugate transpose operator. The distortion measure used depends on
the physical interpretation of the vector and the vector components, and a number of
different distortion measures have been explored in designing quantizers for the LPC
coefficients (and their various equivalent representations) [50]. A detailed review of the
relevant distance measures will be presented in a later chapter. Any distortion measure
that takes the above form of a summation over distortions due to individual vector
components is called an additive or single-letter distortion measure. One general form
of an additive distortion measure is the pth power of the popular Lp norm, defined as

d_p(x, y) = Σ_{i=0}^{k-1} |x_i - y_i|^p. (B.46)
Another distortion measure of interest is the quadratic form of the error vector
e = (x - y), or the weighted squared distortion, defined as

d_W(x, y) = (x - y)^T W (x - y) = e^T W e, (B.47)
where W is a symmetric weighting matrix that takes into consideration the weighted
contributions from individual vector components to the total distortion and contribu-
tions from their interactions. In the simplest case, W is a diagonal matrix that helps in
placing different emphasis on each vector component for distortion computation. For
the familiar squared error measure, W = I, where I is the identity matrix. If W is not
symmetric then the distance measure is not symmetric either, and d(x, y) ≠ d(y, x);
such a measure does not strictly qualify as a proper distance function. Sometimes the weighting
matrix W is made a function of the input vector x, and the distortion measure is then no
longer symmetric in x and y. The requirement of symmetry is relaxed to obtain a
more perceptually significant distortion measure. In the general case, the weighted
distortion measure takes the form

d(x, y) = (x - y)^T W(x) (x - y), (B.48)

where W(x) is symmetric and positive definite for all x.
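A minimal sketch of the quadratic-form distortion above, with an illustrative diagonal weighting matrix (the numbers are made up; a diagonal W simply weights each component's squared error):

```python
import numpy as np

def weighted_sq_distortion(x, y, W):
    """Quadratic-form distortion d(x, y) = (x - y)^T W (x - y)."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(e @ W @ e)

x = [1.0, 2.0]
y = [0.0, 0.0]
W = np.diag([2.0, 0.5])  # diagonal W: per-component emphasis (illustrative values)
print(weighted_sq_distortion(x, y, W))          # 2*1^2 + 0.5*2^2 = 4.0
print(weighted_sq_distortion(x, y, np.eye(2)))  # W = I: plain squared error = 5.0
```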
B.2.2 Optimum VQ
An optimal VQ for a given input distribution and a distance measure, d(x, y), is
defined as

Q_opt = {E_opt : R^k → M; D_opt : M → C} (B.49)

such that

D(Q_opt) ≤ D(Q) (B.50)

for all choices of Q(·). Usually, for all signals and distortion functions of interest,
the error surface shows multiple local minima, and no special characteristics can be
associated with a global minimum, if any. So, as in the case of scalar quantization, we
can only specify the necessary condition for an optimal encoder given a decoder and
vice versa.
Necessary Conditions of Optimality
For a given set of output code vectors, C = {c_1, ..., c_N}, an optimal encoder partitions
the input space such that each cell satisfies the nearest neighbour condition

R_i ⊆ {x : d(x, c_i) ≤ d(x, c_j), j = 1, ..., N}. (B.51)

Thus, given a decoder, the encoder is a minimum distortion mapping such that

d(x, Q(x)) = min_{c_i ∈ C} d(x, c_i). (B.52)

In case of a tie, where more than one code vector is equidistant from the input
vector x, the tie is broken with a heuristic. A common heuristic is to choose the code
vector with the minimum index.
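A minimal sketch of the minimum-distortion encoder under the squared-error measure, including the minimum-index tie-break (the toy codebook is made up for illustration):

```python
import numpy as np

def nearest_neighbour_encode(x, codebook):
    """Minimum-distortion encoder for the squared-error measure.
    np.argmin returns the smallest index on a tie, which implements
    the minimum-index tie-breaking heuristic."""
    d = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(d))

C = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])  # a toy 3-vector codebook
print(nearest_neighbour_encode(np.array([0.9, 0.8]), C))  # -> 1
print(nearest_neighbour_encode(np.array([0.5, 0.5]), C))  # tie between 0 and 1 -> 0
```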
The second necessary condition of optimality is the centroid condition, which de-
fines the decoder given a partition of the input space and a distortion measure d(x, y).
That is, for a given partition {R_i : i = 1, ..., N; ∪_i R_i = X}, the optimal code vec-
tors satisfy

c_i = cent(R_i) (B.53)
= arg min_u E[d(x, u) | x ∈ R_i]. (B.54)

In other words,

E[d(x, c_i) | x ∈ R_i] = min_u E[d(x, u) | x ∈ R_i]. (B.55)
The above definition is valid for a discrete distribution as well, and can be evaluated
from the pmf (probability mass function) of the input distribution. Generally, input
distributions are not known, and VQs are designed from what are called training
sets. A training set is a finite collection of sample vectors generated from the source
distribution in order to represent the statistics of the source with a finite set of data.
Usually, the training vectors are generated independently from the source. This gives
rise to a discrete model for the source where each of the M vectors in the training set
has a probability of 1/M, and the probability P(x ∈ R_i) is estimated from the number
of training set vectors inside R_i.
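Under this discrete training-set model, the centroid and the cell probability are simple to estimate. The sketch below assumes the squared-error measure, for which the centroid of a cell is the arithmetic mean of the training vectors assigned to it (the data are made up):

```python
import numpy as np

# Discrete source model: each of the M training vectors has probability 1/M.
training = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0],
                     [8.0, 8.0], [9.0, 7.0]])
in_cell = np.array([True, True, True, False, False])  # membership in cell R_i

# For the squared-error measure, the centroid of a cell is the arithmetic
# mean of the training vectors that fall in it.
c_i = training[in_cell].mean(axis=0)  # -> [2.0, 3.0]

# P(x in R_i) is estimated by the fraction of training vectors in the cell.
p_i = in_cell.sum() / len(training)   # -> 0.6

print(c_i, p_i)
```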
A third necessary condition for a codebook to be optimal is called the zero proba-
bility boundary condition. Consider the set

R_i' = {x : d(x, c_i) ≤ d(x, c_j), j = 1, ..., N}. (B.56)

From the nearest neighbour condition, R_i ⊆ R_i'. The set B_i = R_i' - R_i is called the
boundary of R_i. The zero probability boundary condition states that

P(x ∈ B_i) = 0 for all i, (B.57)

or equivalently,

P(x : d(x, c_i) = d(x, c_j) for some i ≠ j) = 0. (B.58)

This condition is automatically satisfied when the input distribution is continuous,
but may be violated in the discrete case when one or more of the training vectors are
equidistant from multiple code vectors. The above condition implies that the resulting
quantizer may not be optimal even if the nearest neighbour and centroid conditions
are satisfied; a better quantizer may be found by breaking the tie in a different way
and proceeding with more iterations.
In simple words, a vector quantizer obeying the necessary conditions of optimality
is a mapping that partitions the input space into convex regions and allocates one
vector, the centroid of the region, to all the vectors in that region; for the squared
error measure, the boundary between two neighbouring regions is the perpendicular
bisector of the line joining the corresponding centroids (Fig. B.5).
Sufficiency Conditions
As pointed out earlier, it is impossible to derive sufficient conditions of optimality.
Here we would like to make more explicit what is meant by optimality of a VQ. Since
the nearest neighbour condition is necessary for optimality, let us assume that it is
satisfied for any given codebook C; that is, a VQ is uniquely defined by its codebook,
as the partition always follows the nearest neighbour rule. The average distortion
D is then a function of the codebook only, and the quantizer is locally optimal if no small
perturbation of the code vectors leads to a decrease in D. A quantizer is called
globally optimal if no other codebook exists that produces a lower value of D.
It is widely believed that if a codebook satisfies the Lloyd conditions (the necessary
conditions mentioned above), it is indeed locally optimal although no theoretical
derivation of this result has ever been obtained. For the discrete case such as a
sample distribution produced by a training set, however, it can be shown [48] that
a vector quantizer satisfying the necessary conditions is indeed locally optimal under
mild restrictions. This comes from the fact that in the discrete input case, a slight
perturbation of a code vector will not alter the partitioning of the (countable) set of
input vectors as long as none of the training vectors lies on a partition boundary. Once
the partition stays fixed, the perturbation causes a violation of the centroid condition
Figure B.5: A vector quantizer satisfying the necessary conditions
with an accompanying increase in D. Thus under these conditions, a quantizer that
satisfies the necessary conditions will be locally optimal. It is worth pointing out that
locally optimal quantizers can be very suboptimal in the global sense.
B.2.3 VQ Design
The problem of designing a codebook with N vectors, each of dimension k, for a
given distribution or training set has no general solution, but the Lloyd algorithm
described in the section on scalar quantization can be generalized to the vector case
for iterative improvement of a given codebook. We will describe the algorithm only
for the case of an unknown distribution, as that is the most common situation.

Generalized Lloyd Algorithm:

Step 0 Initialization: Given
a) number of levels = N;
b) distortion threshold ε ≥ 0;
c) initial codebook C_0;
d) training sequence T = {t_i; i = 0, ..., n - 1};
set m = 0 and D_{-1} = ∞.

Step 1 Given C_m = {c_i; i = 1, ..., N}, find the minimum distortion partition
P(C_m) = {S_i; i = 1, ..., N} of the training set:
t_j ∈ S_i if d(t_j, c_i) ≤ d(t_j, c_l) for all l.
Compute the average distortion

D_m = (1/n) Σ_{j=0}^{n-1} min_i d(t_j, c_i).

Step 2 If (D_{m-1} - D_m)/D_m ≤ ε, halt with C_m as the final codebook.

Step 3 Find the optimal codebook cent(P(C_m)) = {cent(S_i); i = 1, ..., N} for
P(C_m). Set C_{m+1} = cent(P(C_m)). Replace m by m + 1 and go to step 1.

Here, cent(·) stands for the centroid of a set.
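The steps above can be sketched as follows, assuming the squared-error measure (so cent(·) is the mean). The algorithm statement leaves the handling of empty cells open; this sketch simply keeps their old code vectors.

```python
import numpy as np

def generalized_lloyd(training, codebook, eps=1e-4, max_iter=100):
    """GLA for the squared-error measure (assumed).
    training: (n, k) training vectors; codebook: (N, k) initial code vectors."""
    codebook = codebook.copy()
    D_prev = np.inf
    for _ in range(max_iter):
        # Step 1: minimum-distortion partition of the training set.
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        D = d.min(axis=1).mean()
        # Step 2: halt when the relative drop in distortion is at most eps.
        if (D_prev - D) / D <= eps:
            break
        D_prev = D
        # Step 3: centroid update (the mean, for squared error);
        # an empty cell keeps its previous code vector.
        for i in range(len(codebook)):
            members = training[assign == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
    return codebook, D
```

For example, on two well-separated clusters a two-vector codebook is driven to the cluster means within a few iterations.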
The most popular technique for obtaining the initial codebook of size N is known as
the splitting algorithm (also known as the LBG algorithm), introduced by Linde et al.
[72]. The algorithm starts with a codebook of size 1 and creates a larger initial
codebook by splitting, as described in the algorithm below.

LBG Algorithm:

Step 0 Initialization: Set M = 1 and define C_0(1) = cent(T), the centroid of the
entire training set.

Step 1 Given the reproduction alphabet C_0(M) containing M vectors {c_i; i = 1, ..., M},
"split" each vector c_i into two close vectors c_i + ε and c_i - ε, where ε is a fixed
perturbation vector. The collection C(2M) = {c_i + ε, c_i - ε; i = 1, ..., M} has 2M
vectors. Replace M by 2M.

Step 2 Is M = N? If so, set C_0 = C(M) and halt. C_0 is then the initial reproduction
alphabet (codebook) for the N-level quantization algorithm. If not, run the
generalized Lloyd algorithm for an M-level quantizer on C(M) to produce a
good reproduction alphabet C_0(M), and return to step 1.

Note that the splitting algorithm always results in a codebook of size N where N is
an integer power of 2.
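The splitting procedure can be sketched as below. The squared-error measure is assumed, and the inner Lloyd refinement runs a fixed number of iterations instead of a threshold test, purely to keep the example short.

```python
import numpy as np

def _lloyd(training, codebook, iters=20):
    """A few Lloyd iterations: nearest-neighbour partition, then mean update
    (squared-error measure assumed; empty cells keep their old vector)."""
    codebook = codebook.copy()
    for _ in range(iters):
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for i in range(len(codebook)):
            members = training[assign == i]
            if len(members):
                codebook[i] = members.mean(axis=0)
    return codebook

def lbg(training, N, delta=1e-3):
    """Splitting initialization: grow the codebook 1 -> 2 -> ... -> N
    (N an integer power of 2), refining after each split."""
    codebook = training.mean(axis=0, keepdims=True)  # centroid of the whole set
    while len(codebook) < N:
        # Split every code vector into c + delta and c - delta.
        codebook = np.vstack([codebook + delta, codebook - delta])
        codebook = _lloyd(training, codebook)
    return codebook
```

On two well-separated clusters, growing from one vector to two places the code vectors at the cluster means.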
Appendix C
Pitch Computation Algorithm
The sequence of tests performed for computing the average pitch is described in the
following list. If any test succeeds, the pitch value computed there is accepted as the
pitch period and subsequent tests are not performed. In the following, C_s and C_e are
the numbers of peaks detected from the speech and residual signals respectively.
The values of the different constants used in the following description were chosen
empirically, and the values used in our 1800 bps coder are given at the end of this
section. All variables used are defined in Chapter 5, Section 5.4.1.
Test if all four conditions below are satisfied and set pitch P = p(p_s) if true.
If σ_r(p_s) > E_l1 is also satisfied, compute the pitch, P, using autocorrelation.
Test if all four conditions below are satisfied and set pitch P = p(p_e) if true.
If σ_r(p_e) > E_l1 is also satisfied, compute the pitch, P, using autocorrelation.
Test if all four conditions below are satisfied and set pitch P = min(p(p_s), p(p_e))
if true.
If min(σ_r(p_s), σ_r(p_e)) > E_l2 is also satisfied, compute the pitch, P, using autocor-
relation.
Test if both conditions below are satisfied and set pitch P = p(p_s) if true.
If σ_r(p_s) > E_l1 is also satisfied, compute the pitch, P, using autocorrelation.
Test if both conditions below are satisfied and set pitch P = p(p_e) if true.
If σ_r(p_e) > E_l2 is also satisfied, compute the pitch, P, using autocorrelation.
Test if all four conditions below are satisfied.
If also |p_s(0) - p_e(0)| < P_tol, compute the pitch, P, using autocorrelation within
the range P_cand ± P_tol, where P_cand = P_{-1} if tracklength > tracklength_min, else
P_cand = p_s(0). Otherwise, set pitch P = 0.
Test the following two conditions and proceed if both are true.
- If |p_s(0) - p_e(0)| ≤ P_lowtol, look for more peaks at intervals of p_s(0) and
compute σ_r(p_s). If σ_r(p_s) < C_min, set P = p(p_s); otherwise set P = 0.
- If |p_s(0) - p_e(0)| > P_lowtol, look for a pitch cycle p_e(k) in the residual such
that |P_{-1} - p_e(k)| ≤ P_tol. If such a pitch cycle p_e(k) is found, compare
it with the pitch cycle p_s(0) found from the speech waveform. If |p_e(k) -
p_s(0)| ≤ P_lowtol, then P = p_s(0). Otherwise, assume all other pitch
cycles p_e(i) that are larger than p_e(k) to be multiple pitch cycles,
and break them into n divisions, where n = ⌊p_e(i)/p_e(k) + 0.5⌋. Assume
all pairs of successive smaller pitch cycles to be broken parts of a single pitch
cycle and merge them to form valid pitch cycles. Compute σ_r(p_e) after all
breaking and merging is done, and if the following conditions are true then
set P = p_s(0); otherwise set P = 0.
- If no pitch cycle satisfying |P_{-1} - p_e(k)| ≤ P_tol was found in the residual
signal, compute the pitch using autocorrelation.
Reverse the roles of the speech and residual signals in the previous procedure
and do exactly the same.
If all previous steps failed and the pitch is not yet determined, but more than
one pitch period was detected in the speech signal, i.e. C_s > 2, then search
for a pitch period satisfying |P_{-1} - p_s(i)| < P_tol. If such a pitch period is found,
examine all other pitch periods and do merging/splitting as required. Compute
σ_r(p_s) and set P = p(p_s) if σ_r(p_s) < C_min; otherwise set P = 0. If the pitch was
not set to 0 here, and tracklength ≤ tracklength_min, check for pitch doubling
using autocorrelation.
If no pitch cycle p_s(i) satisfying |P_{-1} - p_s(i)| ≤ P_tol could be found, the peak
picking algorithm has failed. If the pitch of the last voiced segment was P_v,
then use autocorrelation to check for pitch values in the range P_v ± P_tol. If
P_v > 2P_min, then also check for pitch values in the range P_v/2 ± P_tol using
autocorrelation.
If the pitch could not be determined so far, set P = 0.
If the pitch, P, as determined above is outside the interval [P_min, P_max], it is set to
zero.
If the pitch is finally determined as zero, but the pitch for the previous segment,
P_{-1}, was non-zero and autocorrelation was not used in the computation so far, the
autocorrelation method is used to confirm that P = 0.
The following values of the empirical constants were used in the pitch detector for
our 1800 bps coder implementation.
Constant          Value
x_u               0.17
E_l1              0.1
E_l2              0.08
C_min             2
P_tol             5
P_lowtol          2
P_max             140
P_min             20
tracklength_min   2

Table C.1: Values of empirical constants used in the 1800 bps coder
Appendix D
List of Citations
1. R. Hagen. "Robust LPC Spectrum Quantization - Vector Quantization by a
Linear Mapping of a Block Code," IEEE Trans. Speech and Audio Processing,
Vol. 4, No. 4, pp. 266-280, July, 1996.
2. A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan. "A 2.4
kbits/s MELP Coder Candidate for the New U.S. Federal Standard," Proc.
IEEE Int. Conf. on Acoustics Speech and Signal Processing, pp. I-200 - I-203,
Atlanta, May 7-10, 1996.
3. W. LeBlanc, C. Liu and V. Viswanathan. "An Enhanced Full Rate Speech
Coder for Digital Cellular Applications," Proc. IEEE Int. Conf. on Acoustics
Speech and Signal Processing, pp. I-200 - I-203, Atlanta, May 7-10, 1996.
4. C.F. Barnes, S.A. Rizvi and N.M. Nasrabadi. "Advances in Residual Vector
Quantization: A Review," IEEE Transactions on Image Processing, Vol. 5, No.
2, pp. 226-262, Feb., 1996.
5. P. Lupini and V. Cuperman. "Nonsquare Transform Vector Quantization,"
IEEE Signal Processing Letters, Vol 3, No. 1, pp. 1-3, Jan., 1996.
6. F. Kossentini, M.J.T. Smith and C.F. Barnes. "Necessary Conditions for the
Optimality of Variable-Rate Residual Vector Quantizers," IEEE Trans. Infor-
mation Theory, Vol. 41, No. 6, pp. 1903-1914, Nov., 1995.
7. R. P. Ramachandran, M. M. Sondhi, N. Seshadri, and B. S. Atal. "A Two
Codebook Format for Robust Quantization of Line Spectral Frequencies," IEEE
Trans. Speech and Audio Processing, Vol. 3, No. 3, pp. 157-168, May, 1995.
8. D. Chang, Y. Cho and S. Ann. "Efficient Quantization of LSF Parameters
using Classified SVQ with Conditional Splitting," Proc. IEEE Int. Conf. on
Acoustics Speech and Signal Processing, pp. 736-739, Detroit, May 9 - 12, 1995.
9. H. P. Knagenhjelm and W. B. Kleijn. "Spectral Dynamics is More Important
than Spectral Distortion," Proc. IEEE Int. Conf. on Acoustics Speech and
Signal Processing, pp. 732-735, Detroit, May 9 - 12, 1995.
10. E. Shlomot. "Delayed Decision Switched Prediction Multi-Stage LSF Quanti-
zation," Digest of papers, IEEE workshop on Speech Coding for Telecommuni-
cations, pp. 45-46, 1995.
11. B.F. Johnson and N. Farvardin. "A Finite-State Two-Stage Vector Quantizer
for Coding Speech Line Spectral Parameters," Digest of papers, IEEE workshop
on Speech Coding for Telecommunications, pp. 47-48, 1995.
12. J.S. Collura, A. McCree and T.E. Tremain. "Perceptually Based Distortion
Measurements for Spectrum Quantization," Digest of papers, IEEE workshop
on Speech Coding for Telecommunications, pp. 49-50, 1995.
13. A. McCree, K. Truong, E.B. George and T.P. Barnwell. "An Enhanced 2.4
kbit/s MELP Coder," Digest of papers, IEEE workshop on Speech Coding for
Telecommunications, pp. 101-102, 1995.
14. J.R.B. de Marca. "An LSF Quantizer for the North-American Half-Rate Speech
Coder," IEEE Transactions on Vehicular Technology, Vol. 43, No. 3, pp. 413-
419, 1994.
15. L. Dong, A.R. Kaye and S.A. Mahmoud. "Transmission of compressed voice
over integrated services frame relay networks - priority service and adaptive
buildout delay," IEE Proceedings on Communications, Vol. 141, p. 265, 1994.
16. A. Gersho. "Advances in Speech and Audio Compression," Proceedings of the
IEEE, Vol. 82, No. 6, pp. 900-918, 1994.
17. J. Pan and T.R. Fischer. "Vector Quantization - Lattice Vector Quantization
of Speech LPC Coefficients," Proc. IEEE Int. Conf. on Acoustics Speech and
Signal Processing, pp. I-513 - I-516, 1994.
18. W.Y. Chan and D. Chemla. "Low Complexity Encoding of Speech LSF Param-
eters using Constrained Storage TSVQ," Proc. IEEE Int. Conf. on Acoustics
Speech and Signal Processing, pp. I-521 - I-524, 1994.
References
[1] J. G. Ables. Maximum Entropy Spectral Analysis. Astron. Astrophys. Suppl.
Series, 15:383-393, 1974.
[2] J. P. Adoul, P. Mabilleau, M. Delprat, and S. Morissette. Fast CELP Coding
Based on Algebraic Codes. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 1957-1960, April 1987.
[3] L. B. Almeida and J. M. Tribolet. Harmonic coding: a low bit-rate good-quality
speech coding technique. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 1664-1667, Paris, 1982.
[4] J. Anderson and J. Bodie. Least Squares Quantization in PCM. IEEE Trans.
Info. Theory, IT-21:379-387, July 1975.
[5] B. Atal and J. Remde. A New Model of LPC Excitation for Producing Natural
Sounding Speech at Low Bit Rates. In Proc. IEEE Inter. Conf. Acoust., Speech,
Signal Process., pages 614-617, Paris, 1982.
[6] B. S. Atal. Stochastic Gaussian Model for Low-Bit Rate Coding of LPC Area
Parameters. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages
2404-2407, 1987.
[7] B. S. Atal, R. V. Cox, and P. Kroon. Spectral Quantization and Interpolation
for CELP Coders. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 69-72, Glasgow, Scotland, May 1989.
[8] B. S. Atal and S. L. Hanauer. Speech Analysis and Synthesis by Linear Prediction
of the Speech Wave. J. Acoust. Soc. Amer., 50(2, Part 2):637-655, 1971.
[9] B. S. Atal and M. R. Schroeder. Predictive Coding of the Speech Signals. In
Proc. Conf. Speech Comm. and Processing, pages 360-361, Nov. 1967.
[10] B. S. Atal and M. R. Schroeder. Adaptive Predictive Coding of the Speech
Signals. Bell Syst. Tech. J., 49:1973-1986, Oct. 1970.
[11] B. S. Atal and M. R. Schroeder. Predictive Coding of Speech Signals and Sub-
jective Error Criteria. IEEE Trans. Acoust. Speech Signal Processing, ASSP-
27(3):247-254, June 1979.
[12] B. S. Atal and M. R. Schroeder. Stochastic Coding of Speech Signals at Very
Low Bit Rates. In Proc. Int. Conf. Comm., pages 1610-1613, May 1984.
[13] C. F. Barnes and R. L. Frost. Vector Quantizers with Direct Sum Codebooks.
IEEE Trans. Info. Theory, 39(2):565-580, Mar. 1993.
[14] C. F. Barnes, S. A. Rizvi, and N. M. Nasrabadi. Advances in Residual Vector
Quantization: A Review. IEEE Trans. Image Processing, 5(2):226-262, Feb.
1996.
[15] B. Bhattacharya, W. LeBlanc, S. Mahmoud, and V. Cuperman. Tree Searched
Multi-Stage Vector Quantization for 4 kb/s Speech Coding. In Proc. IEEE Inter.
Conf. Acoust., Speech, Signal Process., pages 1-105 - 1-108, San Francisco, March
1992.
[16] M. S. Brandstein, P. A. Monta, J. C. Hardwick, and J. S. Lim. A Real-Time
Implementation of the Improved MBE Speech Coder. In Proc. IEEE Inter. Conf.
Acoust., Speech, Signal Process., pages 5-8, Albuquerque, April 1990.
[17] A. Buzo, A. H. Gray Jr., R. M. Gray, and J . D. Markel. Speech Coding Based
Upon Vector Quantization. IEEE Trans. Acoust. Speech Signal Processing, ASSP-
28(5):562-574, Oct. 1980.
[18] J. P. Campbell and T. E. Tremain. Voiced/Unvoiced Classification of Speech
with Applications to the U.S. Government LPC-1OE Algorithm. In Proc. IEEE
Inter. Conf. Acoust., Speech, Signal Process., pages 473-476, Tokyo, April 1986.
[19] J . P. Campbell, T . E. Tremain, and V. C. Welch. The Federal Standard 1016
4800 bps CELP Voice Coder. Digital Signal Processing, 1(3):145-155, July 1991.
[20] W. Y. Chan, S. Gupta, and A. Gersho. Enhanced Multistage Quantization by
Joint Codebook Design. IEEE Trans. Comm., 40(11):1693-1697, Nov. 1992.
[21] R. E. Crochiere, S. A. Weber, and J. L. Flanagan. Digital Coding of Speech in
Sub-Bands. Bell Syst. Tech. J., 55:1069-1085, Oct. 1976.
[22] V. Cuperman. On Adaptive Vector Transform Quantization for Speech Coding.
IEEE Trans. Comm., 37(3):261-267, March 1989.
[23] V. Cuperman. Speech coding. Advances in Electronics and Electron Physics,
82:97-196, 1991.
[24] V. Cuperman and A. Gersho. Vector Predictive Coding of Speech at 16 Kbit/s.
IEEE Trans. Comm., COM-33(7):685-696, July 1985.
[25] V. Cuperman, P. Lupini, and B. Bhattacharya. Spectral Excitation Coding of
Speech at 2.4 kb/s. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 496-499, Detroit, May 1995.
[26] A. Das, A.V. Rao, and A. Gersho. Variable-Dimension Vector Quantization of
Speech Spectra for Low-Rate Vocoders. In Proc. Data Compression Conference,
pages 421-429, 1994.
[27] G. Davidson and A. Gersho. Complexity Reduction Methods for Vector Excita-
tion Coding. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages
3055-3058, Tokyo, April 1986.
[28] L. D. Davisson. Rate-Distortion Theory and Application. Proc. IEEE, 60:800-
808, 1972.
[29] A. De and P. Kabal. Cochlear discrimination: An auditory information-theoretic
distortion measure for speech coders. In Proc. 16th Biennial Symp. on Commun.,
pages 419-423, Kingston, Canada, May 1992.
[30] A. De and P. Kabal. Rate-Distortion Function for Speech Coding based on
Perceptual Distortion Measure. In Proc. Globecom, pages 452-456, Orlando,
Florida, Dec. 1992.
[31] H. Dudley. The Vocoder. Bell Labs Rec., 18:122-126, Dec. 1939.
[32] DVSI. INMARSAT M Voice Codec. USA, Feb. 1991. Version 1.3.
[33] N. Farvardin. A Study of Vector Quantization for Noisy Channels. IEEE Trans.
Info. Theory, IT-36:799-809, July 1990.
[34] N. Farvardin and R. Laroia. Efficient Encoding of Speech LSP Parameters Using
the Discrete Cosine Transform. In Proc. IEEE Inter. Conf. Acoust., Speech,
Signal Process., pages 168-171, Glasgow, May 1989.
[35] J. L. Flanagan. Speech Analysis, Synthesis and Perception. Springer-Verlag, New
York, 1972.
[36] J. L. Flanagan, M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S. Jayant, and
J . M. Tribolet. Speech Coding. IEEE Trans. Comm., COM-27(4):710-737, April
1979.
[37] P. Fleischer. Sufficient Conditions for Achieving Minimum Distortion in a Quan-
tizer. In IEEE Int Conv. Rec., pages 104-111, 1964.
[38] J . B. Fraleigh and R. A. Beauregard. Linear Algebra. Addison-Wesley Publishing
Company, second edition, 1990.
[39] S. Furui. Digital Speech Processing, Synthesis, and Recognition. Marcel Dekker
Inc., New York, 1989.
[40] A. Gersho. Principles of Quantization. IEEE Trans. Circuits and Systems, CAS-
25(7):427-436, July 1978.
[41] A. Gersho. Advances in Speech and Audio Compression. Proceedings of the
IEEE, 82(6):900-918, June 1994.
[42] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer
Academic Publishers, 1992.
[43] I. Gerson and M. Jasiuk. Vector Sum Excited Linear Prediction (VSELP) Speech
Coding at 4.8 kbps. In Proc. of Inter. Mob. Sat. Conf., pages 678-683, Ottawa,
1990.
[44] I. Gerson and M. Jasiuk. Vector Sum Excited Linear Prediction (VSELP) Speech
Coding at 8 Kb/s. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 461-464, Albuquerque, April 1990.
[45] O. Ghitza and J. L. Goldstein. Scalar LPC Quantization Based on Formant
JND's. IEEE Trans. Acoust. Speech Signal Processing, ASSP-34(4):697-708, Aug.
1986.
[46] J . D. Gibson and K. Sayood. Lattice Quantization. Advances in Electronics and
Electron Physics, 72:259-330, 1988.
[47] H. Gish and J.N. Pierce. Asymptotically Efficient Quantizing. IEEE Trans. Info.
Theory, IT-14:676-681, Sept. 1968.
[48] R.M. Gray, J.C. Kieffer, and Y. Linde. Locally Optimal Block Quantizer Design.
Inform. and Control, 45:178-198, May 1980.
[49] A.H. Gray Jr., R. M. Gray, and J.D. Markel. Comparison of Optimal Quan-
tizations of Speech Reflection Coefficients. IEEE Trans. Acoust. Speech Signal
Processing, ASSP-25:9-23, Feb. 1977.
[50] A.H. Gray Jr. and J.D. Markel. Distance Measures for Speech Processing. IEEE
Trans. Acoust. Speech Signal Processing, ASSP-24(5):380-391, Oct. 1976.
[51] A.H. Gray Jr. and J.D. Markel. Quantization and Bit Allocation in Speech
Processing. IEEE Trans. Acoust. Speech Signal Processing, ASSP-24(6):459-473,
Dec. 1976.
[52] D. W. Griffin. Multi-Band Excitation Vocoder. PhD thesis, Massachusetts Insti-
tute of Technology, 1987.
[53] D. W. Griffin and J. S. Lim. Signal Estimation from Modified Short Time Fourier
Transform. IEEE Trans. Acoust. Speech Signal Processing, ASSP-32(2):236-243,
April 1984.
[54] D. W. Griffin and J. S. Lim. Multiband Excitation Vocoder. IEEE Trans. Acoust.
Speech Signal Processing, 36(8):1223-1235, August 1988.
[55] P. Hedelin. A Tone Oriented Voice-Excited Vocoder. In Proc. IEEE Inter. Conf.
Acoust., Speech, Signal Process., pages 205-208, 1981.
[56] F. Itakura. Line Spectrum Representation of Linear Predictive Coefficients of
Speech Signals. J. Acoust. Soc. Amer., 57, Supplement No. 1:S35, 1975.
[57] F. I. Itakura and S. Saito. Analysis-Synthesis Telephony Based on the Maximum
Likelihood Method. In Proc. 6th Intern. Congr. Acoust., pages C17-20, Tokyo,
August 21-28 1968.
[58] N. S. Jayant. Digital Coding of Speech Waveforms: PCM, DPCM, and DM
Quantizers. Proceedings of the IEEE, 62:611-632, May 1974.
[59] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice Hall, Englewood
Cliffs, New Jersey, 1984.
[60] B. Juang, D. Y. Wong, and A. H. Gray Jr. Distortion Performance of Vector
Quantization for LPC Voice Coding. IEEE Trans. Acoust. Speech Signal Pro-
cessing, ASSP-30(2):294-303, April 1982.
[61] P. Kabal and R. P. Ramachandran. The Computation of Line Spectral Frequencies
Using Chebyshev Polynomials. IEEE Trans. Acoust. Speech Signal Processing,
ASSP-34(6):1419-1426, Dec. 1986.
[62] G.S. Kang and L.J. Fransen. Low-Bit Rate Speech Encoders Based on Line-
Spectrum Frequencies (LSFs). NRL Report 8857, Naval Research Laboratory,
Washington, D.C., Jan. 1985.
[63] W. B. Kleijn. Continuous Representations in Linear Predictive Coding. In Proc.
IEEE Inter. Conf. Acoust., Speech, Signal Process., pages 201-204, Toronto, May
1991.
[64] W. B. Kleijn and W. Granzow. Methods for Waveform Interpolation in Speech
Coding. Digital Signal Processing, 1(4):215-230, Oct. 1991.
[65] A. M. Kondoz. Digital Speech: Coding for Low Bit Rate Communication Systems.
John Wiley & Sons, Chichester, England, 1994.
[66] F. Kossentini, M. J. T. Smith, and C. F. Barnes. Necessary Conditions for
the Optimality of Variable-Rate Residual Vector Quantizers. IEEE Trans. Info.
Theory, 41(6):1903-1914, Nov. 1995.
[67] P. Kroon, E. Deprettere, and R. Sluyter. Regular-Pulse Excitation, A Novel
Approach to Effective and Efficient Multipulse Coding of Speech. IEEE Trans.
Acoust. Speech Signal Processing, ASSP-34:1054-1063, 1986.
[68] G. Kubin, B. S. Atal, and W. B. Kleijn. Performance of Noise Excitation for
Unvoiced Speech. In Proc. IEEE Workshop on Speech Coding for Telecommuni-
cations, pages 35-36, 1993.
[69] W. P. LeBlanc, B. Bhattacharya, S. A. Mahmoud, and V. Cuperman. Efficient
Search and Design Procedures for Robust Multi-Stage VQ of LPC Parameters for
4 kb/s Speech Coding. IEEE Trans. Speech and Audio Processing, 1(4):373-385,
Oct. 1993.
[70] W. P. LeBlanc and S. A. Mahmoud. Structured Codebook Design in CELP. In
Proc. Inter. Mob. Sat. Conf., pages 667-672, Ottawa, June 1990.
[71] D. Lin. New Approaches to Stochastic Coding of Speech Sources at Very Low
Bit Rates. In I.T. Young et al., editors, Signal Processing III: Theories and
Applications, pages 445-447. Elsevier, North-Holland, Amsterdam, 1986.
[72] Y. Linde, A. Buzo, and R.M. Gray. An Algorithm for Vector Quantizer Design.
IEEE Trans. Comm., COM-28(1):84-95, Jan. 1980.
[73] S.P. Lloyd. Least Squares Quantization in PCM. IEEE Trans. Info. Theory,
IT-28:129-137, March 1982. (Originally, unpublished memorandum, Bell Labo-
ratories, 1957).
[74] P. Lupini and V. Cuperman. Non-Square Transform Vector Quantization. IEEE
Signal Processing Letters, 3(1):1-3, Jan. 1996.
[75] J. Makhoul, S. Roucos, and H. Gish. Vector Quantization in Speech Coding.
Proceedings of the IEEE, 73(11):1551-1588, Nov. 1985.
[76] J. D. Markel. The SIFT Algorithm for Fundamental Frequency Estimation. IEEE
Trans. Audio Electroacoust., AU-20:367-377, Dec. 1972.
[77] J. D. Markel and A. H. Gray Jr. A Linear Prediction Vocoder Simulation Based
upon the Autocorrelation Method. IEEE Trans. Acoust. Speech Signal Processing,
ASSP-23(2):124-134, April 1974.
[78] J. D. Markel and A. H. Gray Jr. Linear Prediction of Speech. Springer Verlag,
Berlin, 1976.
[79] J. D. Markel and A. H. Gray Jr. Implementation and Comparison of Two Trans-
formed Reflection Coefficient Scalar Quantization Methods. IEEE Trans. Acoust.
Speech Signal Processing, ASSP-28(5):575-583, Oct. 1980.
[80] J. S. Marques, L. B. Almeida, and J. M. Tribolet. Harmonic Coding at 4.8
Kb/s. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages 17-20,
Albuquerque, April 1990.
[81] J. Max. Quantizing for Minimum Distortion. IEEE Trans. Info. Theory, IT-6:7-
12, March 1960.
[82] R. J. McAulay and T. F. Quatieri. Speech Analysis/Synthesis Based on a Si-
nusoidal Representation. IEEE Trans. Acoust. Speech Signal Processing, ASSP-
34(4):744-754, August 1986.
[83] R. J. McAulay and T. F. Quatieri. Low-Rate Speech Coding Based on the
Sinusoidal Model. In S. Furui and M. Sondhi, editors, Advances in Speech Signal
Processing, chapter 6, pages 165-208. Marcel Dekker Inc., New York, 1992.
[84] D. L. Neuhoff and N. Moayeri. Tree Searched Vector Quantization with Interblock
Noiseless Coding. In Proc. 1988 Conf. Infor. Scien. Sys., pages 781-783, Mar.
1988.
[85] M. Nishiguchi, J. Matsumoto, R. Wakatsuki, and S. Ono. Vector Quantized MBE
with Simplified V/UV Decision at 3.0 kbps. In Proc. IEEE Inter. Conf. Acoust.,
Speech, Signal Process., pages 151-154, Minneapolis, April 1993.
[86] K. K. Paliwal and B. S. Atal. Efficient Vector Quantization of LPC Parameters
at 24 bits/frame. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 661-664, Mar. 1991.
[87] K. K. Paliwal and B. S. Atal. Vector Quantization of LPC Parameters in the
Presence of Channel Errors. In IEEE Workshop on Speech Coding for Telecom-
munications, pages 33-35, Sept. 1991.
[88] P. E. Papamichalis and T. P. Barnwell III. Variable Rate Speech Compression
by Encoding Subsets of the PARCOR Coefficients. IEEE Trans. Acoust. Speech
Signal Processing, ASSP-31(3):704-713, June 1983.
[89] A. Papoulis. Signal Analysis. McGraw-Hill Book Co., Singapore, international
student edition, 1984.
[90] N. Phamdo and N. Farvardin. Coding of Speech LSP Parameters Using TSVQ
with Interblock Noiseless Coding. In Proc. IEEE Inter. Conf. Acoust., Speech,
Signal Process., pages 193-196, 1990.
[91] N. Phamdo, N. Farvardin, and T. Moriya. Combined Source-Channel Coding of
LSP parameters Using Multi-Stage Vector Quantization. In IEEE Workshop on
Speech Coding for Telecommunications, pages 36-38, 1991.
[92] L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Prentice
Hall, Englewood Cliffs, N.J., 1978.
[93] K.R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, and
Applications. Harcourt Brace Jovanovich, Boston, 1990.
[94] M. J. Sabin and R. M. Gray. Product Code Vector Quantizers for Waveform and
Voice Coding. IEEE Trans. Acoust. Speech Signal Processing, ASSP-32(3):474-
488, June 1984.
[95] R. A. Salami, L. Hanzo, and D. G. Appleby. A Fully Vector Quantized Self-
Excited Vocoder. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process.,
pages 124-127, Glasgow, May 1989.
[96] C. E. Shannon. Coding Theorems for a Discrete Source with a Fidelity Criterion.
In IRE Nat. Conv. Rec., Part 4, pages 142-163, Mar. 1959.
[97] Y . Shoham. High Quality Speech Coding at 2.4 to 4.0 Kbps Based on Time-
Frequency Interpolation. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 167-170, Minneapolis, April 1993.
[98] B. Smith. Instantaneous Companding of Quantized Signals. Bell Syst. Tech. J.,
27:446-472, 1948.
[99] F.K. Soong and B.H. Juang. Line Spectrum Pair (LSP) and Speech Data Com-
pression. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal Process., pages
1.10.1-1.10.4, San Diego, CA, March 1984.
[100] N. Sugamura and N. Farvardin. Quantizer Design in LSP Speech Analysis-
Synthesis. IEEE J. Selected Areas in Comm., 6(2):432-440, Feb. 1988.
[101] Y. Tohkura and F. Itakura. Spectral Sensitivity Analysis of PARCOR Parameters
for Speech Data Compression. IEEE Trans. Acoust. Speech Signal Processing,
ASSP-27(3):273-280, June 1979.
[102] Y. Tohkura, F. Itakura, and S. Hashimoto. Spectral Smoothing Technique in
PARCOR Speech Analysis-Synthesis. IEEE Trans. Acoust. Speech Signal Pro-
cessing, ASSP-26(6):587-596, Dec. 1978.
[103] T. E. Tremain. The Government Standard Linear Predictive Coding Algorithm:
LPC-10. Speech Technology, pages 40-49, April 1982.
[104] F. F. Tzeng. Analysis-by-Synthesis Linear Predictive Speech Coding at 2.4 kbit/s.
In Proc. Globecom, pages 1253-1257, 1989.
[105] T. Umezaki and F. Itakura. Analysis of Time Fluctuating Characteristics of
Linear Predictive Coefficients. In Proc. IEEE Inter. Conf. Acoust., Speech, Signal
Process., pages 1257-1261, 1986.
[106] C. K. Un and D. T. Magill. The Residual-Excited Linear Prediction Vocoder with
Transmission Rate Below 9.6 kbits/s. IEEE Trans. Comm., COM-23(12):1466-
1474, Dec. 1975.
[107] R. Viswanathan and J. Makhoul. Quantization Properties of Transmission Pa-
rameters in Linear Predictive Systems. IEEE Trans. Acoust. Speech Signal Pro-
cessing, ASSP-23:309-321, June 1975.
[108] D. Wong, B. Juang, and A. H. Gray Jr. An 800 bit/s Vector Quantization LPC
Vocoder. IEEE Trans. Acoust. Speech Signal Processing, ASSP-30(5):770-780,
Oct. 1982.
[109] R.C. Wood. On Optimum Quantization. IEEE Trans. Info. Theory, IT-15:248-
252, March 1969.
[110] S. Yeldener, A.M. Kondoz, and B.G. Evans. High Quality Multiband LPC Coding
of Speech at 2.4 kbit/s. Electronics Letters, 27(14):1287-1289, July 4 1991.
[111] M. Young, G. Davidson, and A. Gersho. Encoding of LPC Spectral Parameters
Using Switched-Adaptive Interframe Vector Prediction. In Proc. IEEE Inter.
Conf. Acoust., Speech, Signal Process., pages 402-405, 1988.
[112] K. A. Zeger and A. Gersho. Zero Redundancy Channel Coding in Vector Quan-
tization. Electronics Letters, 23:654-656, May 1987.
[113] R. Zelinski and P. Noll. Adaptive Transform Coding of Speech Signals. IEEE
Trans. Acoust. Speech Signal Processing, ASSP-25:299-309, Aug. 1977.