4. INTRODUCTION TO DIFFERENT SPEECH CODERS
4.1 INTRODUCTION TO PROPERTIES AND PRODUCTION OF SPEECH
Speech is an excellent way to communicate with other people. Speech is an acoustic sound wave that travels from the speaker's vocal organs to the listener's ears. The smallest posited structural unit of speech is the phoneme. Phonemes can be divided into two groups, voiced and unvoiced. Voiced phonemes are periodic in the time domain and harmonic in the frequency domain, whereas unvoiced phonemes are noise-like, without any periodicity. The peaks in the spectrum of a voiced phoneme are the resonance frequencies of the vocal tract and are called formants. It is desirable to retain the higher formants when transmitting the speech signal.
Speech is produced by a filtering operation performed by the larynx, pharynx, and the oral and nasal cavities [24]. The vocal tract acts as a physiological filter which shapes the excitation coming from the lungs; different sounds are formed by changing the filtering characteristics, which depend on the position of the tongue and lips. The formation of voiced sounds starts in the lungs: the diaphragm and chest muscles compress the lungs and cause overpressure in the trachea, and the vocal cords start to vibrate at a rate called the fundamental frequency. The frequency of vibration is about 100-110 Hz for males and about 200 Hz for females. The main differences between voiced and unvoiced sounds are that voiced sounds have greater amplitude, while unvoiced sounds are formed in a narrow or closed part of the vocal tract and resemble random noise. One way to represent speech production is the simplified source-filter model of speech shown in figure 4.1. This kind of model is used to produce synthetic speech. Voiced sounds are produced from the glottal excitation and unvoiced sounds from the noise source; both excitations are connected to a binary switch. The switch output feeds a linear filter, which represents the vocal tract [24]. The gain G is needed to balance the speech signal energy for every excitation and filter combination. Voiced speech is produced by exciting the vocal tract with a periodic pulse train, whose period is called the pitch. The short-time spectrum of voiced speech is characterized by its fine structure and its formant structure; the formant structure is due to the interaction of the source and the vocal tract. The spectral envelope is characterized by a set of peaks, or formants. The first three formants usually occur below 3 kHz; their locations are important in speech perception and largely determine which sound was produced.
Fig: 4.1 Source-filter model (a V/UV switch selects either an impulse train at the pitch period or white noise as the innovation U(n); the gain G and the LPC filter H(z) produce the speech signal S(n))
4.2 EVALUATION OF SPEECH CODERS
Speech communication is at present the most dominant and common service in telecommunication networks, and the attractions of digitally encoded speech are obvious. Since digitally encoded speech ultimately condenses down to a binary sequence, all the advantages offered by digital systems are available for exploitation. Digitally encoded speech signals are easy to regenerate, easy to signal with, highly flexible and secure, and can be integrated into integrated services digital networks [ISDN]. Digitally encoded speech has many advantages over its analog counterpart, but the digital signal requires extra bandwidth. This disadvantage can be overcome using speech compression techniques [25]. Speech encoding is defined as a digital representation of the speech sound that provides efficient storage, transmission, recovery and faithful reconstruction of the speech signal. Speech coding has become an intensive area of research. All speech coding systems involve lossy compression, where the reconstructed speech signal is not an exact replica of the original signal, causing some degradation in quality. As the complexity of the algorithm increases, the implementation cost increases, so the designer of a communication system must strike a balance between cost and quality.
Speech coding techniques are evaluated by their transmission rate, implementation complexity, coding delay, robustness to channel noise and implementation cost. The most important criterion is the quality of the reconstructed signal. Speech quality can be measured using subjective or objective measures [8].
Subjective measurements are obtained from listening tests. Speech quality is the
result of a subjective perception-and-judgment process, during which a listener compares
the perceptual event (speech signal heard) to an internal reference of what is judged to be
good quality. Subjective assessment plays a key role in characterizing the quality of
emerging telecommunications products and services, as it attempts to quantify the end
user's experience with the system under test. Typical subjective measures often quoted are:
Mean opinion score [MOS],
Diagnostic Rhyme Test [DRT], and
Paired Comparison Test [PCT].
The MOS is the most widely used: a group of listeners is asked to rate the quality of a speech signal on a five-point scale, with 1 corresponding to unsatisfactory speech quality and 5 corresponding to excellent speech quality, and the test results are averaged. The average of the listener scores is termed the subjective listening MOS or, as suggested by the relevant ITU-T Recommendation, MOS-LQS (listening quality subjective). Formal subjective tests, however, are expensive and time consuming, and thus unsuitable for "on-the-fly" applications.
The scale attributes are:
MOS 5: excellent speech quality.
MOS 4: good speech quality.
MOS 3: fair quality with noticeable impairments.
MOS 2: poor quality with strong impairments.
MOS 1: bad, highly degraded quality.
In the DRT, experienced listeners are asked to distinguish between pairs of single-syllable words such as "meat" and "beat". The DRT is quite widely used and provides valuable diagnostic information about how reliably the initial consonant is recognized, which makes it very useful as a development tool. However, it tests neither vowels nor prosodic features, so it is not suitable for overall quality evaluation. Another deficiency is that the test material is quite limited and the test items do not occur with equal probability, so it does not test all possible confusions between consonants; the confusions, presented as matrices, are therefore hard to evaluate [8]. In the PCT, the listener is asked to choose a coder output from a pair of coders.
The objective speech quality measurement replaces the listener panel with a
computational algorithm, thus facilitating automated real-time quality measurement.
Indeed, for the purpose of real-time quality monitoring and control on a network-wide
scale, objective speech quality measurement is the only viable option. Objective
measurement methods aim to deliver quality estimates that are highly correlated with
those obtained from subjective listening experiments. Objective quality measurement can
be classified as either signal based or parameter based. Widely used objective measures are based on the mean squared error; the most popular is the signal-to-noise ratio (SNR):
$$\mathrm{SNR} = 10\log_{10}\frac{\displaystyle\sum_{n=0}^{M-1} S^2(n)}{\displaystyle\sum_{n=0}^{M-1}\bigl[S(n)-\hat{S}(n)\bigr]^2} \qquad (4.1)$$
where $S(n)$ is the original speech data, $\hat{S}(n)$ is the coded speech data and $M$ is the number of samples. The SNR is a measure of the accuracy of the reconstructed speech signal. The segmental SNR (SEGSNR) is defined as the dB average of the short-time SNRs.
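As an illustration, minimal Python sketches of these two measures might look as follows (numpy assumed; the 80-sample segment, i.e. 10 ms at 8 kHz sampling, is an illustrative choice):

```python
import numpy as np

def snr_db(s, s_hat):
    """Overall SNR per eq. 4.1: original energy over error energy, in dB."""
    noise = s - s_hat
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2))

def segsnr_db(s, s_hat, seg_len=80):
    """Segmental SNR: the dB average of short-time SNRs over fixed segments."""
    vals = []
    for start in range(0, len(s) - seg_len + 1, seg_len):
        seg = s[start:start + seg_len]
        err = np.sum((seg - s_hat[start:start + seg_len]) ** 2)
        if err > 0 and np.sum(seg ** 2) > 0:      # skip silent/exact segments
            vals.append(10.0 * np.log10(np.sum(seg ** 2) / err))
    return float(np.mean(vals))
```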
4.3 SPEECH CODING METHODOLOGY
Speech coders represent an analog signal by a sequence of binary digits. The simplest codec consists of a sampler and a quantizer, wherein each sample is represented by a digital code. Coding algorithms seek to minimize the bit rate in the digital representation of a signal without an objectionable loss of signal quality in the process. High quality is attained at low bit rates by exploiting signal redundancy as well as the knowledge that certain types of coding distortion are imperceptible because they are masked by the signal. Speech coding schemes can be broadly classified into three main classes:
Waveform coders
Hybrid coders
Vocoders
4.3.1 Waveform Coders
Waveform coders are low-complexity codecs. They are signal independent and work well with both speech and non-speech signals. Waveform coders are characterized by their attempt to preserve the general shape of the signal waveform, and they can work well on any input waveform bounded by certain limits in amplitude and bandwidth. These coders produce high-quality speech at rates above 16 kbit/s; when the data rate is lowered below this level, the reconstructed speech quality degrades rapidly. The different types of waveform coders are:
4.3.1.1 Pulse code modulation [PCM]: This merely involves sampling and quantization of the input speech signal. Narrow-band speech is typically band-limited to 4 kHz and sampled at 8 kHz. If linear quantization is used, around twelve bits per sample are needed for good-quality speech, giving a bit rate of 96 kbit/s. This bit rate can be reduced by using non-uniform quantization of the samples; in speech coding an approximation to a logarithmic quantizer is often used. Such quantizers give a signal-to-noise ratio which is almost constant over a wide range of input levels, and at a rate of eight bits per sample (64 kbit/s) give a reconstructed signal which is almost indistinguishable from the original.
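For illustration, one widely used logarithmic companding law, the μ-law characteristic with μ = 255, can be sketched as follows (the quantizer here is a plain 8-bit uniform quantizer applied to the compressed value):

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """Logarithmic compression of samples normalized to [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse characteristic applied at the decoder."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# Compress, quantize uniformly to 8 bits, expand: ~64 kbit/s at 8 kHz sampling.
x = np.linspace(-1.0, 1.0, 9)
y_q = np.round(mu_law_compress(x) * 127) / 127
x_hat = mu_law_expand(y_q)
```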
4.3.1.2 Adaptive pulse code modulation [APCM]: A variant of PCM in which the quantizer characteristic (step size or gain) adapts to the short-time level of the signal, so that fewer bits per sample are needed for a given quality.
4.3.1.3 Delta modulation [DM]: This is the simplest form of DPCM. In this codec the difference between successive samples is encoded using only one bit of quantization, so the difference is coded into just two levels. The quantizer in DM is realized as a comparator with two output levels, 1 and 0, and the demodulator is a simple integrator. The two sources of noise in delta modulation are "slope overload", which occurs when the steps are too small to track steep segments of the original waveform, and "granularity", which occurs when the steps are too large for slowly varying segments.
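A minimal sketch of a one-bit delta modulator and its integrating demodulator (the fixed step size is an illustrative choice; adaptive variants vary it):

```python
import numpy as np

def dm_encode(x, step=0.1):
    """Transmit one bit per sample: the sign of (input - staircase estimate)."""
    bits = np.zeros(len(x), dtype=np.uint8)
    estimate = 0.0
    for n in range(len(x)):
        bits[n] = 1 if x[n] >= estimate else 0
        estimate += step if bits[n] else -step   # staircase tracks the input
    return bits

def dm_decode(bits, step=0.1):
    """The demodulator is a simple integrator of +/- step."""
    return np.cumsum(np.where(bits == 1, step, -step))
```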
4.3.1.4 Adaptive differential pulse code modulation [ADPCM]: This codec quantizes the difference between the speech signal and a prediction of the speech signal. If the prediction is accurate, the difference between the real and predicted speech samples has a lower variance than the real speech samples, so the difference signal can be accurately quantized with fewer bits than would be needed for the original speech samples. At the decoder the quantized difference signal is added to the predicted signal to give the reconstructed speech signal. The performance of the codec is aided by using adaptive prediction and quantization, so that the predictor and difference quantizer adapt to the changing characteristics of the speech being coded. This codec is standardized as G.721 and gives very good quality speech at 32 kbit/s.
4.3.1.5 Differential pulse code modulation [DPCM]: When a signal is sampled at the Nyquist rate, the obtained samples are correlated and therefore carry redundant information. DPCM is specifically designed to take advantage of the sample-to-sample redundancies in typical speech waveforms: the next sample is predicted from the previously decoded samples. Good prediction results in a reduction in the dynamic range needed to code the prediction residual, and hence a reduction in the bit rate.
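As a sketch, a DPCM loop with a fixed first-order predictor (the coefficient 0.9 and the step size are illustrative; practical codecs adapt both, as in ADPCM above):

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.05):
    """Quantize the prediction residual; the predictor runs on decoded
    samples so that encoder and decoder stay in lock-step."""
    codes = np.zeros(len(x), dtype=int)
    recon = 0.0
    for n in range(len(x)):
        prediction = a * recon
        codes[n] = int(np.round((x[n] - prediction) / step))  # uniform quantizer
        recon = prediction + codes[n] * step                  # local decoder
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    out = np.zeros(len(codes))
    recon = 0.0
    for n in range(len(codes)):
        recon = a * recon + codes[n] * step
        out[n] = recon
    return out
```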
4.3.2 Vocoder
Vocoders, or voice coders, take natural speech as their input and extract from it a set of acoustic parameters which usually take up less transmission bandwidth than the original speech. These parameters are then transmitted to a re-synthesis device that regenerates the speech. Vocoders are speech specific in their principles, as no attempt is made to preserve the original speech waveform. A vocoder consists of an analyzer and a synthesizer: the analyzer at the transmitter extracts a small set of parameters from the speech signal to be transmitted, and at the receiver the speech is synthesized from these parameters. The speech produced is often crude, below toll quality. The different types of vocoders are:
LPC
Homomorphic
MBE
Channel
Formant
Phase
Sinusoidal
RELP
4.3.3 Hybrid Coders
Hybrid coders attempt to fill the gap between waveform coders and vocoders. Waveform coders are capable of providing good quality speech at higher bit rates, but the signal deteriorates as the bit rate is reduced. Vocoders, on the other hand, can provide intelligible speech at 2.4 kbit/s and below, but cannot provide natural-sounding speech at any bit rate. Hybrid coding methods have been developed to overcome these disadvantages by incorporating the advantages of both schemes. Hybrid coders are broadly classified into two sub-categories:
Frequency domain hybrid coders
Time domain hybrid coders
4.3.3.1 Frequency domain hybrid coders: The basic concept in frequency domain coding is to divide the speech spectrum into frequency bands or components using either a filter bank or a block transform analysis. After encoding and decoding, these frequency components are used to regenerate a replica of the input waveform by either filter-bank summation or the inverse transform method. A primary assumption in frequency domain coding is that the signal to be coded is slowly time varying, so that it can be locally modeled by its short-time spectrum. The two well-known frequency domain speech coding techniques are:
Subband coding technique
Adaptive transform coding technique
4.3.3.1.1 Subband coding technique [SBC]:
Subband coding breaks the signal into a number of different frequency bands and encodes each band independently. It is generally viewed as a waveform coding technique which uses wideband short-time analysis and synthesis. After partitioning the speech spectrum into a number of bands, each band is lowpass-translated to zero frequency, sampled at its Nyquist rate, quantized, encoded, multiplexed and transmitted. At the receiver the subbands are demultiplexed, decoded and translated back to their original frequency positions; the resulting subband signals are summed to give an approximation of the original speech signal. SBC exploits a limitation of the human auditory system: the ear is normally sensitive to a wide range of frequencies, but when a sufficiently loud signal is present at one frequency, it cannot hear weaker signals at nearby frequencies. The louder signal is said to mask the softer ones; it is called the masker, and the level at which masking occurs is known as the masking threshold. The basic idea of SBC is to enable a data reduction by discarding information about frequencies which are masked. The result differs from the original signal, but if the discarded information is chosen carefully, the difference will not be noticeable or, more importantly, objectionable. The speech spectrum can be split into the desired number of bands using several techniques, such as the following (a toy two-band example is sketched after the list):
Integer band sampling
Tree structure quadrature mirror filters
Discrete cosine transform
Parallel filter banks
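As a toy illustration of the quadrature-mirror-filter idea listed above, the sketch below splits a signal into two decimated bands with the two-tap Haar pair; practical subband coders use much longer filters, but the analysis/synthesis structure is the same:

```python
import numpy as np

def qmf_analysis(x):
    """Split into lowpass and highpass bands, each decimated by two."""
    x = x[: len(x) // 2 * 2]                    # even length for pairing
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)    # sum of pairs: low band
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # difference: high band
    return low, high

def qmf_synthesis(low, high):
    """Upsample and recombine; exact reconstruction for this filter pair."""
    x = np.zeros(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

x = np.random.randn(16)
lo, hi = qmf_analysis(x)
assert np.allclose(qmf_synthesis(lo, hi), x)   # the bands sum back to the input
```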
4.3.3.1.2 Adaptive transform coder [ATC]:
This is a more complex frequency-analysis technique involving the block transformation of windowed segments of the input speech. Each segment is represented by a set of transform coefficients, which are inverse-transformed at the receiver to produce a replica of the original speech; adjacent segments are joined together to form the synthesized speech. ATC has better frequency resolution than the subband coding technique, but more bits are required to encode the data.
The advantage of frequency domain coders is their exploitation of the non-flat spectral density of the speech signal, which allows unequal quantization to be applied to the different frequency bands.
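A toy sketch of the transform/quantize/inverse-transform chain (an orthonormal DCT and a crude bit allocation that keeps only the highest-energy coefficients; both choices are illustrative, not taken from any standard ATC scheme):

```python
import numpy as np
from scipy.fft import dct, idct

def atc_block(block, keep=16, step=0.02):
    """Encode and decode one windowed segment of speech."""
    coeffs = dct(block, norm='ortho')           # block transform
    quant = np.round(coeffs / step) * step      # uniform quantization
    drop = np.argsort(np.abs(coeffs))[:-keep]   # all but the `keep` largest
    quant[drop] = 0.0                           # crude bit allocation
    return idct(quant, norm='ortho')            # receiver-side inverse

segment = np.random.randn(64)
replica = atc_block(segment)
```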
4.3.3.2 Time domain hybrid coders: Time domain hybrid coders are dominated by schemes employing linear predictors. The statistical characteristics of speech signals can be accurately modeled by a source-filter model, which assumes that speech results from exciting a linear time-varying filter with a periodic pulse train for voiced speech or a random noise source for unvoiced speech. These coders can be classified as analysis-by-synthesis (AbS) LPC, in which the system parameters are determined by linear prediction and the excitation sequence by a closed-loop or open-loop optimization. The optimization process determines an excitation sequence which minimizes a measure of the weighted difference between the input speech and the coded speech; the weighting or filtering function is chosen so that the coder is optimized for the human ear. The most commonly used excitation models for AbS LPC are multipulse, regular pulse excitation and vector or code excitation. Since these methods combine the features of model-based vocoders, by representing the formant and pitch structure of speech, with the properties of waveform coders, they are called hybrid. Although other forms of hybrid codecs exist, the most successful and commonly used are time domain analysis-by-synthesis (AbS) codecs. Such coders use the same linear prediction filter model of the vocal tract as found in LPC vocoders. However, instead of applying a simple two-state voiced/unvoiced model to find the necessary input to this filter, the excitation signal is chosen by attempting to match the reconstructed speech waveform as closely as possible to the original speech waveform. A general model for AbS codecs is shown in figure 4.2 below. AbS codecs work by splitting the input speech into frames, typically about 20 ms long. For each frame, parameters are determined for a synthesis filter, and then the excitation to this filter is determined by finding the excitation signal which, when passed through the synthesis filter, minimizes the error between the input speech and the reconstructed speech. Hence the name analysis-by-synthesis: the encoder analyses the input speech by synthesizing many different approximations to it.
Fig: 4.2 Analysis-by-synthesis codec structure (encoder: an excitation generator produces U(n), a synthesis filter gives the reconstructed speech Ŝ(n), and the difference from the input speech S(n) is error-weighted to e_w(n) and minimized; decoder: the same excitation generation and synthesis filter reproduce the speech)
Finally for each frame the encoder transmits information representing the
synthesis filter parameters and the excitation to the decoder, and at the decoder the given
excitation is passed through the synthesis filter to give the reconstructed speech. The
synthesis filter is usually an all-pole, short-term, linear filter of the form
$$H(z) = \frac{1}{A(z)} \qquad (4.2)$$
where
$$A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i} \qquad (4.3)$$
In the above equation, A(z) is the prediction-error filter, determined by minimizing the energy of the residual signal produced when the original speech segment is passed through it. The order p of the filter is typically around ten. This filter is intended to model the correlations introduced into the speech by the action of the vocal tract. The synthesis filter may also include a pitch filter to model the long-term periodicities present in voiced speech. Alternatively, these long-term periodicities may be exploited by including an adaptive codebook in the excitation generator, so that the excitation signal U(n) includes a component of the form Gu(n-α), where α is the estimated pitch period. Generally MPE and RPE codecs will work without a pitch filter, although their performance is improved if one is included; for CELP codecs, however, a pitch filter is extremely important, for reasons discussed below. The error weighting block is used to shape the spectrum of the error signal in order to reduce its subjective loudness. This is possible because error in frequency regions where the speech has high energy is at least partially masked by the speech. The weighting filter emphasizes the noise in frequency regions where the speech energy is low; minimizing the weighted error therefore concentrates the energy of the error signal in frequency regions where the speech has high energy, where it is at least partially masked and its subjective importance is reduced. Such weighting is found to produce a significant improvement in the subjective quality of the reconstructed speech for AbS codecs.
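The text above does not fix a particular weighting filter, but a common choice in AbS coders is W(z) = A(z/γ1)/A(z/γ2) with 0 < γ2 < γ1 ≤ 1, which de-emphasizes error in the formant regions. A sketch under that assumption (the γ values are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def weighting_filter(a, gamma1=0.9, gamma2=0.6):
    """Return numerator/denominator of W(z) = A(z/gamma1) / A(z/gamma2).

    `a` holds a_1..a_p of the prediction-error filter A(z) = 1 - sum a_i z^-i,
    so A(z/g) has coefficients [1, -a_1 g, -a_2 g^2, ...].
    """
    powers = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a * gamma1 ** powers))
    den = np.concatenate(([1.0], -a * gamma2 ** powers))
    return num, den

a = np.array([1.2, -0.5])            # illustrative 2nd-order LPC coefficients
num, den = weighting_filter(a)
error = np.random.randn(160)
weighted_error = lfilter(num, den, error)   # e_w(n) used in the minimization
```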
The distinguishing feature of AbS codecs is how the excitation waveform U(n) for the synthesis filter is chosen. Conceptually, every possible waveform is passed through the filter to see what reconstructed speech signal it would produce; the excitation signal which gives the minimum weighted error between the original and the reconstructed speech is then chosen by the encoder and used to drive the synthesis filter at the decoder. It is this 'closed-loop' determination of the excitation signal which allows AbS codecs to produce good quality speech at low bit rates. However, the numerical complexity involved in passing every possible excitation signal through the synthesis filter is huge, and some means of reducing this complexity without compromising the performance of the codec is usually needed.
The time domain coders can be classified as:
Adaptive predictive coding [APC]
Residual excited linear predictive coding [RELP]
Multipulse linear predictive coding [MP-LPC]
Code excited linear predictive coding [CELP]
Vector sum excited linear predictive coding [VSELP]
4.3.3.2.1 Adaptive predictive coding [APC]
This coder employs both short-term and long-term linear predictors. The residual signal obtained after inverse filtering is scalar quantized on a sample-by-sample basis. APC schemes have been proposed for 16 kbit/s and below, with variations in their treatment of the residual signal.
4.3.3.2.2 Residual excited linear predictive coding [RELP]
This is basically APC which transmits only a portion of the low-frequency residual signal. The motivation behind RELP is that the residual information is assumed to be concentrated in the low-frequency baseband, so encoding only this segment reduces the number of bits [25]. In RELP the basic LPC analysis yields the spectral coefficients, which are transmitted as side information and also used to inverse-filter the speech signal to obtain the residual e(n). The baseband signal b(n), a low-frequency signal, is extracted by a lowpass filter and waveform coded. The RELP receiver interpolates b(n) back to the original sampling rate Fs and attempts to reconstruct the original signal. Since the low-frequency signals are waveform coded, a very good quality speech signal is obtained in the baseband, but the high-frequency signals are artificially reconstructed, so the excitation there is very poor. RELP [28] is noisier than APC but sounds more natural. The main advantage of RELP is its ability to operate in noisy environments, but its performance is limited at lower bit rates.
4.3.3.2.3 Multipulse linear predictive coding [MP-LPC]
This is an alternative time-domain method for reducing the bit rate of the LPC residual. In MP-LPC the residual signal is represented by a small number of pulses per frame: a multipulse residual signal is constructed by choosing the pulse amplitudes and positions to minimize the perceptually weighted spectral error [28]. The analysis of each frame of speech considers the multipulse residual from the prior frame, which continues to excite an LPC synthesizer to yield the output speech for the current frame. The so-called Skyphone service employs 9.6 kbit/s MP-LPC with half-rate convolutional FEC, which was chosen as an international standard for aeronautical mobile satellite telecommunication. Subjective tests found it best at satisfying requirements such as burst and random error tolerance and robustness to background noise [25]. The disadvantage of MP-LPC is its relatively high computational load.
4.3.3.2.4 Code excited linear predictive coding [CELP]
CELP is best suited to lower bit rates. Linear time-varying filters are used to represent the coarse and fine spectral information [25]. The CELP algorithm is based on four main ideas: speech production is modeled by a source-filter model through linear prediction; the codec uses an adaptive or fixed codebook; the input excitation signal to the LP model is taken from the codebook; and the codebook search is performed in closed loop in a perceptually weighted domain. The modeling of the vocal tract uses the source-filter model [27] as explained below. One of the main principles behind CELP is analysis-by-synthesis [AbS], wherein the encoder performs its analysis by perceptually optimizing the decoded (synthesized) signal in a closed loop. In theory, the best CELP stream would be produced by trying all possible bit combinations and selecting the one that produces the best-sounding decoded signal. This is obviously not possible in practice for two reasons: first, the required complexity is beyond any currently available hardware, and second, the "best sounding" selection criterion implies a human listener.
The working of the CELP is as follows:
1. The original speech signal x(n) is first partitioned into analysis frames of around 20-30 ms. LPC analysis is performed on each frame of x(n) to obtain the set of LPC coefficients, which are used in the short-term predictor [STP] to model the spectral envelope of the speech.
2. The STP is assumed to be of memoryless type, hence it stores only the present value.
Fig: 4.3 CELP coder block schematic (the original speech is weighted by W(z); zero-excitation responses of the weighted LPC filters 1/A(z) and 1/A(z)·1/P(z) are subtracted; the delay D and gain β are selected for minimum error, then the codebook index and gain are selected for minimum error)
3. Once the LPC coefficients are found, they are passed to the long-term predictor [LTP]. The LTP analysis is performed on sub-multiples of the LPC frame, of 5-10 ms. Both analysis methods introduce the delay D and associated scaling factors β, with i representing the number of filter taps. The LTP introduces voice periodicity into the synthesized speech.
4. Once the filter parameters are found, the excitation signal is selected from the codebook: the codebook vector giving the minimum squared objective error, together with the corresponding scaling factor, is selected.
The block diagram of the standard CELP algorithm is shown in figure 4.3. The overall computation can be broken into three blocks:
LPC analysis
LTP analysis
Codebook search
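As a sketch of the closed-loop codebook search described in step 4 (perceptual weighting, the adaptive codebook and filter-memory handling are omitted for brevity; all names and sizes are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def codebook_search(target, codebook, a):
    """Pick the codebook index and gain minimizing the squared error between
    the target and 1/A(z) applied to each candidate excitation vector."""
    den = np.concatenate(([1.0], -np.asarray(a)))   # A(z) = 1 - sum a_i z^-i
    best_index, best_gain, best_err = 0, 0.0, np.inf
    for index, code in enumerate(codebook):
        synth = lfilter([1.0], den, code)           # pass through 1/A(z)
        energy = np.dot(synth, synth)
        if energy == 0.0:
            continue
        gain = np.dot(target, synth) / energy       # optimal scaling factor
        err = np.sum((target - gain * synth) ** 2)  # squared objective error
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain

codebook = np.random.randn(128, 40)   # 128 stochastic vectors, 40-sample subframe
a = np.array([1.2, -0.5])             # illustrative STP coefficients
target = np.random.randn(40)
index, gain = codebook_search(target, codebook, a)
```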
Short term prediction [STP]: The role of the STP is to represent the general spectral shape of the speech signal. The STP coefficients are calculated on a frame-by-frame basis; the important problems that can occur are delay and inaccuracy. The idea behind the CELP [24] concept is to predict the signal x(n) using a linear combination of its past samples:
$$y(n) = \sum_{i=1}^{N} a_i\, x(n-i) \qquad (4.4)$$
where y(n) is the linear prediction of x(n). The prediction error e(n) is thus given by
$$e(n) = x(n) - y(n) = x(n) - \sum_{i=1}^{N} a_i\, x(n-i) \qquad (4.5)$$
The goal of the LPC analysis is to find the prediction coefficients $a_i$ which minimize the quadratic error function
$$E = \sum_{n=0}^{L-1} e^2(n) = \sum_{n=0}^{L-1} \bigl(x(n) - y(n)\bigr)^2 = \sum_{n=0}^{L-1} \Bigl(x(n) - \sum_{i=1}^{N} a_i\, x(n-i)\Bigr)^2 \qquad (4.6)$$
This can be done by making all the derivatives equal to zero:
$$\frac{\partial E}{\partial a_i} = \frac{\partial}{\partial a_i} \sum_{n=0}^{L-1} \Bigl(x(n) - \sum_{j=1}^{N} a_j\, x(n-j)\Bigr)^2 = 0, \qquad i = 1,\dots,N \qquad (4.7)$$
The coefficients $a_i$ for an $N$th-order filter can be found by solving the $N \times N$ linear system $Ra = r$, where
$$R = \begin{bmatrix} R(0) & R(1) & \cdots & R(N-1) \\ R(1) & R(0) & \cdots & R(N-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(N-1) & R(N-2) & \cdots & R(0) \end{bmatrix}, \qquad r = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(N) \end{bmatrix}$$
Here $R(m)$ is the autocorrelation function of the signal x(n), computed as
$$R(m) = \sum_{i=0}^{N-1} x(i)\, x(i+m) \qquad (4.8)$$
The system can be solved efficiently by exploiting the Toeplitz (Hermitian) structure of R, for example with the Levinson-Durbin algorithm. Theoretically this yields a stable A(z), with all roots inside the unit circle; in practice, because of finite precision, R(0) is first multiplied by a number slightly above one, which is equivalent to adding a small amount of noise and reduces the sharp resonances of the resulting filter.
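A sketch of the Levinson-Durbin recursion for the system above might look as follows (sign convention as in eq. 4.4; the scaling of R(0) mirrors the correction just described):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz system R a = r for the prediction coefficients."""
    a = np.zeros(order)
    error = r[0]                                   # prediction error energy
    for i in range(order):
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])  # uses r[i], r[i-1], ..., r[1]
        k = acc / error                            # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[:i] = a_prev[:i] - k * a_prev[i - 1::-1][:i]
        error *= 1.0 - k * k
    return a, error

frame = np.random.randn(160) * np.hamming(160)     # windowed analysis frame
order = 10
r = np.array([np.dot(frame[: len(frame) - m], frame[m:]) for m in range(order + 1)])
r[0] *= 1.0001      # multiply R(0) by a number slightly above one (see text)
a, err = levinson_durbin(r, order)
```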
Forward LPC analysis cannot proceed until a whole frame (or more) of samples is available for computation, so a delay of at least one frame is introduced. To overcome this disadvantage, backward LPC can be used instead, but backward LPC operates successfully only above about 10 kbit/s.
The main drawback of LPC occurs when transition regions, which are believed to be perceptually more important, fall within a frame. This drawback can be overcome using a popular technique known as frame interpolation: an improved spectral representation is achieved by evaluating intermediate sets of parameters between frames, so that transitions are introduced more smoothly at the frame edges, without any increase in coding capacity but at the cost of increased delay.
Long Term Prediction [LTP]: The LTP has a small number of coefficients compared to the STP. It is given by the general form
$$P(z) = 1 - \sum_{i=1}^{N} \beta_i\, z^{-(D+i)} \qquad (4.9)$$
where D is the delay and the β_i are the predictor taps.
The LTP used in CELP models the long-term correlation, which depends mainly on the pitch excitation; hence the long-term predictor is realized as a pitch predictor. There are two types of LTP:
Open-loop LTP [OLM]
Closed-loop LTP [CLM]
In the open-loop method, a residual signal is obtained by inverse filtering the original speech with the LPC coefficients, and the delay D and gain G are found from it. Usually the delay D will be much greater than the length of the frame L, otherwise the effectiveness of the LTP is reduced, as D would not be able to adapt quickly enough to the onset of voiced speech [25]. The disadvantage of the OLM is the error between the original speech and the quantized speech; this drawback can be overcome using the closed-loop method.
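As an illustration of the open-loop search, a sketch under the assumption of a single-tap predictor (the 20-147 sample delay range, roughly 54-400 Hz at 8 kHz sampling, is an illustrative choice):

```python
import numpy as np

def open_loop_ltp(residual, d_min=20, d_max=147):
    """Find the delay D maximizing the normalized correlation of the residual
    with its past, then the optimal single-tap gain G for that delay."""
    best_d, best_score = d_min, -np.inf
    for d in range(d_min, d_max + 1):
        past, present = residual[:-d], residual[d:]
        energy = np.dot(past, past)
        if energy == 0.0:
            continue
        score = np.dot(present, past) / np.sqrt(energy)
        if score > best_score:
            best_d, best_score = d, score
    past, present = residual[:-best_d], residual[best_d:]
    gain = np.dot(present, past) / np.dot(past, past)
    return best_d, gain

residual = np.random.randn(240)   # e(n) from inverse filtering with A(z)
D, G = open_loop_ltp(residual)
```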
In the closed-loop method the aim is to reduce the error between the synthesized and the original speech signal by finding the parameters gain G, delay D and scaling factor β. This is done in two steps: first, G is assumed to be zero and the LTP parameters (D and β) are found so as to minimize the error; second, the LTP is held constant and G is found.
4.3.3.2.5 Vector sum excited linear predictive coding [VSELP]:
The main disadvantage of the normal CELP coding technique is the exhaustive codebook search needed to find the best match for synthesizing the speech signal with minimum error. This drawback can be overcome using VSELP, in which the excitation is formed as a vector combination of codebook entries chosen to minimize the error of the synthesized signal. For the majority of the analysis in VSELP the mean-square approximation is used, and it is very important to construct the basis vectors in a perceptually meaningful way [25]. In VSELP the LTP is treated as an adaptive codebook for LTP lag values less than the subframe size, and hence only the effect of the STP filter is considered; the total STP excitation is obtained by adding the gain-scaled secondary excitation to the LTP excitation. The main drawback of VSELP is its limited ability to encode non-speech sounds, and its performance degrades in the presence of background noise [29].
4.4 SUMMARY
This chapter deals with the basic properties of the speech signal and briefly explains how speech is produced, along with the evaluation methods, subjective and objective, used to measure speech quality. Different speech compression techniques are studied, namely the waveform coder, the vocoder and the hybrid coder, and a comparative analysis is made between them. The chapter explains in particular the CELP coder, which mainly uses the AbS technique to synthesize the speech.