
Project 1. Image compression system

The representation of color images called Bitmap (BMP) is the most commonly used format for keeping half-tone information in the Windows 95, Windows 98, Windows 2000, Windows NT, and Windows XP operating systems. Its particular case is the representation of color images by their RGB components.

A BMP file consists of either 3 or 4 parts, as shown in Fig. 1. The first part is a header, which is followed by an information section; if the image is indexed colour, then the palette follows, and last of all comes the pixel data. The image width and height, the type of compression (or its absence), and the number of colours are contained in the information header.

Images in BMP representation can be color (4, 8, 16, 24, or 32 bits/pixel) or monochrome (1 bit/pixel). The data can be compressed by run-length coding or stored without compression.

The header is 14 bytes long and the information section is 40 bytes long. The useful fields of the header are the type field (should be 'BM'), the file size, and the offset field, which gives the number of bytes before the actual pixel data. The most important fields of the image info data are: the image width and height, the number of bits per pixel (should be 1, 4, 8, or 24), the number of planes (assumed to be 1 here), and the compression type (assumed to be 0 here).

The RGB format, i.e. the simplest 24-bit true-colour images we deal with, represents a colour image with a header of length 54 bytes followed by the R, G, and B components of the image. In this case the image data follows immediately after the information header, that is, there is no colour palette. Each pixel of each component takes 8 bits (has a value in the range 0…255), so each component is an array of W×H bytes for an image of W×H pixels. In other words, the data consists of three bytes per pixel in B, G, R order, each byte giving the saturation of that colour component. For example, a colour image of size 120×160 pixels represented in RGB format takes 54 + 120·160·3 = 57654 bytes.
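As an illustration, a minimal C sketch of reading these header fields might look as follows (the file name and helper names are ours; an uncompressed 24-bit file is assumed, and note that in general BMP pixel rows are padded to a multiple of 4 bytes):

#include <stdio.h>
#include <stdint.h>

// Little-endian field readers (BMP stores multi-byte fields LSB first).
static uint32_t u32(const uint8_t *p) {
    return p[0] | p[1] << 8 | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}
static uint16_t u16(const uint8_t *p) { return (uint16_t)(p[0] | p[1] << 8); }

int main(void) {
    FILE *f = fopen("image.bmp", "rb");            // hypothetical input file
    uint8_t hdr[54];
    if (!f || fread(hdr, 1, 54, f) != 54) return 1;
    if (hdr[0] != 'B' || hdr[1] != 'M') return 1;  // type field 'BM'
    uint32_t offset = u32(hdr + 10);               // bytes before pixel data
    int32_t  width  = (int32_t)u32(hdr + 18);
    int32_t  height = (int32_t)u32(hdr + 22);
    uint16_t bpp    = u16(hdr + 28);               // 24 for the images here
    uint32_t comp   = u32(hdr + 30);               // 0 = no compression
    printf("%dx%d, %u bpp, compression %u, data at byte %u\n",
           width, height, bpp, comp, offset);
    // The B, G, R triples start at 'offset'.
    fclose(f);
    return 0;
}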

Color representation such as RGB is not always the most convenient. The components R, G, and B of the image are highly correlated, but they are processed independently. Since all components are equally important from the image-reproduction point of view, we have to apply the same compression algorithm to each component and cannot favour any of them. On the other hand, other colour

Fig. 1. BMP format: header, info header, optional palette, image data.

formats are available that use colour components that are closely related to the criteria used to describe colour perception: brightness, hue, and saturation. Brightness describes the intensity of the light (revealing whether it is white, gray, or black) and this can be related to the luminance of the source. Hue describes what color is present (red, green, yellow, etc.) and this can be related to the dominant wavelength of the light source. Saturation describes how vivid the color is (very strong, pastel, nearly white) and this can be related to the purity or narrowness of the spectral distribution of the source.

Color spaces or color coordinate systems in which one component is the luminance and the other two components are related to hue and saturation are called luminance-chrominance representations. The luminance provides a grayscale version of the image (such as the image on a monochrome receiver), and the chrominance components provide the extra information that converts the grayscale image to a color image. Luminance-chrominance representations are particularly important for good image compression. One of the luminance-chrominance representations is called YUV format and the other is called YCbCr format. We can convert RGB format to YUV format and to YCbCr format using the following linear transforms, respectively

Y = 0.299 R + 0.587 G + 0.114 B,
U = -0.147 R - 0.289 G + 0.436 B,
V = 0.615 R - 0.515 G - 0.100 B,

Y = 0.299 R + 0.587 G + 0.114 B,
Cb = -0.1687 R - 0.3313 G + 0.5 B + 128,
Cr = 0.5 R - 0.4187 G - 0.0813 B + 128.

The inverse transforms can be described as follows:

R = Y + 1.140 V,
G = Y - 0.395 U - 0.581 V,
B = Y + 2.032 U,

R = Y + 1.402 (Cr - 128),
G = Y - 0.3441 (Cb - 128) - 0.7141 (Cr - 128),
B = Y + 1.772 (Cb - 128).
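For illustration, the per-pixel YCbCr transform pair could be sketched in C as follows (the function and type names are ours; clamping is added because rounding may leave the range 0…255):

typedef struct { unsigned char r, g, b; } RGB;
typedef struct { unsigned char y, cb, cr; } YCC;

static unsigned char clamp255(double v) {
    return (unsigned char)(v < 0 ? 0 : v > 255 ? 255 : v + 0.5);
}

// Forward transform, one pixel, using the coefficients above.
YCC rgb_to_ycc(RGB p) {
    YCC q;
    q.y  = clamp255( 0.299  * p.r + 0.587  * p.g + 0.114  * p.b);
    q.cb = clamp255(-0.1687 * p.r - 0.3313 * p.g + 0.5    * p.b + 128);
    q.cr = clamp255( 0.5    * p.r - 0.4187 * p.g - 0.0813 * p.b + 128);
    return q;
}

// Inverse transform, one pixel.
RGB ycc_to_rgb(YCC q) {
    RGB p;
    p.r = clamp255(q.y + 1.402   * (q.cr - 128));
    p.g = clamp255(q.y - 0.3441  * (q.cb - 128) - 0.7141 * (q.cr - 128));
    p.b = clamp255(q.y + 1.772   * (q.cb - 128));
    return p;
}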

The components Y, U, and V (or Y, Cb, Cr) are almost uncorrelated. Moreover, the most important information is concentrated in the luminance component. Thus we do not lose much information if we decimate the chrominance components. Usually the U and V components are decimated by a factor of 2: 4 neighboring pixels forming a 2×2 square are described by 4 values of the component Y, one value of the component U, and one value of the component V. Each chrominance value is computed as the rounded-off arithmetic mean of the corresponding 4 pixel values belonging to the considered square. As a result we obtain the so-called YUV 4:1:1 standard video format, which is usually used as the input format for most video codecs. It is easy to compute that this format spends only 6 bytes for each 2×2 square instead of the 12 bytes spent by the original YUV format. Thus we have already compressed the image by a factor of two without any visible distortions.
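A sketch of this decimation in C (assuming the plane width w and height h are even; the names are ours):

// Decimate one chrominance plane by 2 in each direction: each output
// sample is the rounded arithmetic mean of a 2x2 square of inputs.
void decimate2(const unsigned char *u, int w, int h, unsigned char *ud) {
    for (int y = 0; y < h; y += 2)
        for (int x = 0; x < w; x += 2) {
            int s = u[y*w + x] + u[y*w + x + 1]
                  + u[(y+1)*w + x] + u[(y+1)*w + x + 1];
            ud[(y/2)*(w/2) + x/2] = (unsigned char)((s + 2) / 4); // round off
        }
}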

The JPEG standard is based on the DCT coding technique. The component Y and the decimated components U and V are processed by blocks. Each block of Y contains 8×8 pixels and the corresponding blocks of U and V contain 4×4 pixels. The 2-D DCT is applied to each block of pixels.

The one-dimensional DCT for a sequence of length 8 is given by the formula

X(k) = (c(k)/2) Σ_{n=0}^{7} x(n) cos((2n+1)kπ/16),  k = 0, 1, …, 7,

where

c(0) = 1/√2,  c(k) = 1 for k > 0.

The inverse transform can be written as

x(n) = Σ_{k=0}^{7} (c(k)/2) X(k) cos((2n+1)kπ/16),  n = 0, 1, …, 7.
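A direct (unoptimized) C implementation of this 8-point transform pair might look as follows:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// 1-D 8-point DCT: X[k] = c(k)/2 * sum_n x[n] cos((2n+1)k*pi/16),
// with c(0) = 1/sqrt(2) and c(k) = 1 otherwise.
void dct8(const double x[8], double X[8]) {
    for (int k = 0; k < 8; k++) {
        double s = 0.0;
        for (int n = 0; n < 8; n++)
            s += x[n] * cos((2*n + 1) * k * M_PI / 16.0);
        X[k] = s * 0.5 * (k == 0 ? 1.0 / sqrt(2.0) : 1.0);
    }
}

// Inverse 8-point DCT: x[n] = sum_k c(k)/2 * X[k] cos((2n+1)k*pi/16).
void idct8(const double X[8], double x[8]) {
    for (int n = 0; n < 8; n++) {
        double s = 0.0;
        for (int k = 0; k < 8; k++)
            s += 0.5 * (k == 0 ? 1.0 / sqrt(2.0) : 1.0) * X[k]
                 * cos((2*n + 1) * k * M_PI / 16.0);
        x[n] = s;
    }
}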

We decompose the image using a set of eight different cosine waveforms sampled at eight points. The coefficient that scales the constant basis function (k = 0) is called the DC coefficient. The other coefficients are called AC coefficients.

Since the DCT is a separable transform, the two-dimensional DCT of an 8×8 block can be performed by first applying the 1-D transform to the rows of the block and then applying the 1-D transform to the columns of the resulting block, that is,

X(k, l) = (c(k)c(l)/4) Σ_{n=0}^{7} Σ_{m=0}^{7} x(n, m) cos((2n+1)kπ/16) cos((2m+1)lπ/16),

where x(n, m) denotes a pixel of the image component (Y, U, or V) and X(k, l) is the transform coefficient.

Notice that performing the 2-D DCT is equivalent to decomposing the original image block over a set of 64 2-D cosine basis functions. These functions are created by multiplying a horizontally oriented set of 1-D 8-point basis functions by a vertically oriented set of the same functions. The horizontally oriented set of basis functions represents horizontal frequencies and the other set represents vertical frequencies. By convention, the DC term of the horizontal basis functions is at the left, and the DC term of the vertical functions is at the top. Because the 2-D basis functions are products of two 1-D DCT basis functions, the only constant basis function is in the upper left corner of the array. The coefficient for this basis function is called the DC coefficient, whereas the rest of the coefficients are called AC coefficients. The horizontal DCT frequency of the basis functions increases from left to right and the vertical DCT frequency increases from top to bottom.
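Assuming the dct8 routine from the previous sketch, the separable 2-D transform is simply two passes of the 1-D transform:

// Assumes the dct8 routine from the previous sketch.
void dct8(const double x[8], double X[8]);

// Separable 2-D DCT of an 8x8 block: 1-D transform of each row,
// then 1-D transform of each column of the intermediate result.
void dct8x8(const double in[8][8], double out[8][8]) {
    double tmp[8][8], col[8], res[8];
    for (int r = 0; r < 8; r++)
        dct8(in[r], tmp[r]);                       // rows
    for (int c = 0; c < 8; c++) {
        for (int r = 0; r < 8; r++) col[r] = tmp[r][c];
        dct8(col, res);                            // columns
        for (int r = 0; r < 8; r++) out[r][c] = res[r];
    }
}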

The obtained transform coefficients are quantized by a uniform scalar quantizer. The quantization is implemented as rounding-off of the DCT coefficients divided by a quantization step. The values of the steps are set individually for each DCT coefficient, using criteria based on the visibility of the basis functions. Thus the quantized coefficient is

Xq(k, l) = round( X(k, l) / Q(k, l) ),

where Q(k, l) is the (k, l)-th entry of the quantization matrix of size 8×8. The JPEG standard uses two different quantization matrices. The first matrix is used to quantize the luminance component (Y) and has the form

16 11 10 16  24  40  51  61
12 12 14 19  26  58  60  55
14 13 16 24  40  57  69  56
14 17 22 29  51  87  80  62
18 22 37 56  68 109 103  77
24 35 55 64  81 104 113  92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103  99.

The second matrix is used to quantize the chrominance components and looks like

17 18 24 47 99 99 99 99
18 21 26 66 99 99 99 99
24 26 56 99 99 99 99 99
47 66 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99.
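A sketch of the quantization step in C (Q is one of the matrices above; the decoder reconstructs X'(k, l) = Xq(k, l)·Q(k, l)):

#include <math.h>

// Uniform scalar quantization of an 8x8 block of DCT coefficients:
// Xq(k,l) = round(X(k,l) / Q(k,l)).
void quantize(const double X[8][8], const int Q[8][8], int Xq[8][8]) {
    for (int k = 0; k < 8; k++)
        for (int l = 0; l < 8; l++)
            Xq[k][l] = (int)lround(X[k][l] / Q[k][l]);
}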

The quantized DCT coefficients are coded by a variable-length coder. The coding procedure is performed in two steps. At the first step the DC coefficient is DPCM coded using the first-order predictor; that is, the DC value of the previous coded 8×8 block is subtracted from the current DC coefficient. The AC coefficients are coded by a run-length coder. At the second step the obtained values are coded by the Huffman code.

Let DC_i and DC_{i-1} denote the DC coefficients of the i-th and (i-1)-th blocks, respectively. Due to the high correlation between DC coefficients they are DPCM coded, that is, their difference DIFF = DC_i - DC_{i-1} is computed and then coded. For gray-scale images (or one of the Y, U, V components of a color image) a pixel is represented by 8 bits. Thus the difference takes values from the range [-2047, 2047]. This range is split into 12 categories, where the i-th category includes the differences whose binary representation is i bits long. These categories are the first 12 categories shown in Table 1.

Each DC coefficient is described by a pair (category, amplitude). If the value DIFF > 0, then the amplitude is the binary representation of this value with length equal to the category. If DIFF < 0, then the amplitude is the codeword of the complement binary code for the absolute value of DIFF, which also has length equal to the category. The category value is then coded by the Huffman code.

Example. Let the difference of the DC coefficients be DIFF = -11. It follows from Table 1 that this value belongs to category 4. The binary representation of the value 11 is 1011 and the codeword of the complement code is 0100. Thus, the value is represented as (4, 0100). If the codeword of the Huffman code for 4 is 110, then DIFF is coded by the codeword 1100100 of length 7. The decoder first processes the category value (in our case it is 4); then the next 4 bits correspond to the value of DIFF. Since the most significant bit is equal to 0, the value is negative. Inverting the bits we obtain the binary representation of 11. Notice that using the categories simplifies the Huffman code. Without categories we would need a Huffman code for an alphabet of much larger size, that is, coding and decoding would be much more complicated.
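A possible C helper computing the (category, amplitude) pair of a value (the names are ours; category 0 carries no amplitude bits):

typedef struct { int category; unsigned bits; } CatAmp;

// Category = bit length of |v|; amplitude = v itself for v >= 0,
// the complement code of |v| for v < 0 (as in the example above:
// v = -11 gives category 4 and amplitude 0100).
CatAmp encode_value(int v) {
    CatAmp ca = {0, 0};
    unsigned a = v < 0 ? (unsigned)(-v) : (unsigned)v;
    while (a >> ca.category) ca.category++;          // bit length of |v|
    if (v >= 0)
        ca.bits = (unsigned)v;
    else
        ca.bits = (~a) & ((1u << ca.category) - 1);  // complement code
    return ca;
}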

Table 1. Categories of integer numbers.

Category  Numbers
0         0
1         -1, 1
2         -3, -2, 2, 3
3         -7, ..., -4, 4, ..., 7
4         -15, ..., -8, 8, ..., 15
5         -31, ..., -16, 16, ..., 31
6         -63, ..., -32, 32, ..., 63
7         -127, ..., -64, 64, ..., 127
8         -255, ..., -128, 128, ..., 255
9         -511, ..., -256, 256, ..., 511
10        -1023, ..., -512, 512, ..., 1023
11        -2047, ..., -1024, 1024, ..., 2047
12        -4095, ..., -2048, 2048, ..., 4095
13        -8191, ..., -4096, 4096, ..., 8191
14        -16383, ..., -8192, 8192, ..., 16383
15        -32767, ..., -16384, 16384, ..., 32767
16        32768

For gray-scale images (or the Y, U, or V components) the AC coefficients can take values from the range [-1023, 1023]. After quantization many of these coefficients become zeros. In other words, it is necessary to code only a small number of non-zero coefficients together with an indication of their positions. To do this efficiently, the 2-D array of DCT coefficients is rearranged into a 1-D linear array by scanning in the zigzag order shown in Fig. 2. This zigzag index sequence creates a 1-D vector of coefficients where the lower DCT frequencies tend to be at lower indices. The zigzag sequence is an important part of the coding model, as it affects the statistics of the symbols: when the coefficients are ordered in this fashion, the probability of a coefficient being zero is an approximately monotonically increasing function of the index.
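Rather than tabulating the 64 positions of Fig. 2, the zigzag order can be generated; a sketch:

// Fill zz[64] with the zigzag scan order of Fig. 2: zz[i] is the
// (row*8 + col) position of the i-th scanned coefficient.
void zigzag_order(int zz[64]) {
    int r = 0, c = 0;
    for (int i = 0; i < 64; i++) {
        zz[i] = r * 8 + c;
        if ((r + c) % 2 == 0) {                 // moving up-right
            if (c == 7)      r++;
            else if (r == 0) c++;
            else { r--; c++; }
        } else {                                // moving down-left
            if (r == 7)      c++;
            else if (c == 0) r++;
            else { r++; c--; }
        }
    }
}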

The run-length coder generates a codeword (run-length, category, amplitude), where run-length is the length of the zero run preceding the given non-zero coefficient, amplitude is the value of this non-zero coefficient, and category is the number of bits needed to represent the amplitude. The pair (run-length, category) is coded by the 2-D Huffman code, and the amplitude is coded as in the case of the DC coefficients and is appended to the codeword.

Example. Let the nonzero coefficient preceded by 6 zeros be equal to -18. It follows from Table 1 that -18 belongs to category 5. The codeword of the complement code is 01101. Thus, the coefficient is represented by ((6, 5), 01101). The pair (6, 5) is coded by the Huffman code and the value 01101 is appended to the codeword. If the codeword of the Huffman code for (6, 5) is 1101, then the codeword for -18 is 110101101.

There are two special cases when we encode the AC coefficients.
1. After a non-zero coefficient all other AC coefficients are zero. In this case the special symbol EOB is transmitted, which codes the end-of-block condition.
2. A pair (run-length, category) appears which is not included in the table of the Huffman code. In this case a special codeword called the escape code is transmitted, followed by uniform codes for the run-length and the non-zero value.

Coding efficiency is usually evaluated by the compression ratio, which is the original file size in bits divided by the size of the compressed file in bits. The quality of the synthesized image is characterized by the signal-to-noise ratio (SNR) at the output of the decoder:

SNR = 10 log10(Ex / En) (dB),

where Ex denotes the energy of the original signal and En is the energy of the quantization noise, that is, the energy of the difference between the original and the reconstructed images. More often the peak signal-to-noise ratio (PSNR) is used to characterize the quality of the synthesized image. It is defined as follows:

PSNR = 10 log10(255^2 / MSE) (dB),

where 255 is the maximal pixel value and MSE = En/N is the mean squared error (N is the number of pixels).
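A sketch of the PSNR computation in C (n is the number of pixels):

#include <math.h>

// PSNR between an original and a reconstructed image, as defined above:
// PSNR = 10 log10(255^2 / MSE) in dB.
double psnr(const unsigned char *orig, const unsigned char *rec, int n) {
    double mse = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)orig[i] - rec[i];
        mse += d * d;
    }
    mse /= n;
    return mse == 0 ? INFINITY : 10.0 * log10(255.0 * 255.0 / mse);
}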

Fig.2 The zigzag scanning.

1. Transform the image given in BMP format into YUV format.
2. Decimate the U and V components and compute the PSNR for the reconstructed U and V components. Plot the PSNR as a function of the decimation factor. Perform the YUV to RGB transform. Compare the reconstructed and the original images for different decimation factors.

3. Compute the DCT coefficients for the Y component. Uniformly scalar quantize the obtained coefficients. Reconstruct the Y component using the IDCT of the quantized coefficients. Plot the PSNR (for Y) as a function of the quantization step.

4. Code the quantized transform coefficients of the Y component using the run-length variable-length encoder. Estimate the entropy of the obtained streams of run lengths and levels.

5. Plot PSNR (for Y) as a function of estimated compression ratio. Compare the reconstructed and the original images for different compression ratios.

6. Use standard archivers to compress the original file. Compare compression ratio with entropy estimates.

7. Compress the original file using any available image editor. Compare compression ratios.

Project 2. Wavelet image coder.

The DFT and DCT are linear transforms based on the decomposition of the input signal over a system of orthogonal harmonic functions. The main shortcoming of these transforms is that the basis functions are uniformly distributed over the frequency axis. It means that all frequencies of the input signal are considered equally important for recovering the original signal from the transform coefficients. On the other hand, it is clear that the low-frequency components of the signal are more important than the high-frequency components, that is, the resolution of the system of basis functions should be non-uniform over the frequency axis. The problem of constructing such a transform is solved by using filter banks. One of the most efficient transforms is based on wavelet filter banks and is called wavelet filtering.

The wavelet filter banks have special properties. The most important feature of these filter banks is their hierarchical structure. The input signal is decomposed by two filters into low-frequency and high-frequency parts. Then each component is decimated, that is, only the even-numbered samples are kept. The downsampled high-frequency part represents a final output because it is not transformed again. Since this part of the signal contains a rather insignificant part of the signal energy, it can be encoded using a small number of bits. The decimated low-frequency component contains the main part of the signal energy, and it is filtered again by the same pair of filters. The decimated high-frequency part of the low-frequency component is not transformed again, but the decimated low-frequency part of the low-frequency component can be filtered again, and so on. By choosing the filter bank in a proper way it is possible to provide roughly twice the compression ratio of DCT-based codecs at the same quality of the synthesized signal.

Thus the main idea of the wavelet transform is a hierarchical decomposition of the input sequence into so-called reference (low-frequency) subsequences with diminishing resolutions and the related so-called detail (high-frequency) subsequences. At each level of decomposition the wavelet transform is invertible, that is, the reference signal of this level together with the corresponding detail signal provides perfect reconstruction of the reference signal of the next level (with higher resolution). Fig. 3 illustrates one level of wavelet decomposition followed by reconstruction. The input sequence x(n) is filtered by a lowpass filter with impulse response h0(n) and by a highpass filter with impulse response h1(n). The downsampling step is symbolized by ↓2. The sequence r1(n) is the reference signal (the decimated result of lowpass filtering), and d1(n) is the detail signal (the decimated result of highpass filtering). It is evident that this scheme transforms one sequence of length N into two subsequences of length N/2 each. In the theory of wavelet filter banks such pairs of filters h0(n) and h1(n) are found that there exist pairs of inverse filters g0(n) and g1(n) providing perfect reconstruction of the input signal. To reconstruct the input signal from the signals r1(n) and d1(n), these signals are first upsampled by a factor of 2. In Fig. 3 upsampling is symbolized by ↑2. Then the upsampled low-frequency and high-frequency components are filtered by the inverse lowpass filter with impulse response g0(n) and the inverse highpass filter with impulse response g1(n), respectively. The sum of the results of the filtering is the output signal y(n). The wavelet transform (wavelet filtering) provides perfect reconstruction of the input signal, that is, the output signal is determined as

y(n) = A x(n - n0),

where A is the gain factor and n0 is the delay.
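The particular filters h0, h1, g0, g1 are specified in the assignment; purely as an illustration of one analysis/synthesis level, the sketch below uses the two-tap Haar pair h0 = (1/√2, 1/√2), h1 = (1/√2, -1/√2) — an assumed choice, not the assignment's filter bank:

#include <math.h>

// One level of wavelet analysis with the (assumed) Haar filters:
// r[k] = (x[2k] + x[2k+1])/sqrt(2), d[k] = (x[2k] - x[2k+1])/sqrt(2).
void haar_analyze(const double *x, int n, double *r, double *d) {
    for (int k = 0; k < n / 2; k++) {
        r[k] = (x[2*k] + x[2*k + 1]) / sqrt(2.0);
        d[k] = (x[2*k] - x[2*k + 1]) / sqrt(2.0);
    }
}

// Synthesis: here reconstruction is perfect with gain A = 1 and delay 0.
void haar_synthesize(const double *r, const double *d, int n, double *y) {
    for (int k = 0; k < n / 2; k++) {
        y[2*k]     = (r[k] + d[k]) / sqrt(2.0);
        y[2*k + 1] = (r[k] - d[k]) / sqrt(2.0);
    }
}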

In the case of multilevel decomposition the reference signal represents the input signal of the next decomposition level. Filtering is performed iteratively as shown in Fig. 4. At the k-th level of decomposition we obtain the reference signal r_k(n) with resolution 2^k times scaled down compared to the resolution of the input signal, and the detail signals d_1(n), …, d_k(n) with resolutions scaled down 2, …, 2^k times compared to the input signal, respectively. Each detail signal d_k(n) contains such information that together with the reference signal r_k(n) it allows recovering r_{k-1}(n), which represents the reference signal of the next (higher-resolution) level. At the k-th level of decomposition the total length of the reference and detail subsequences is equal to the length N of the input sequence.

Fig. 3. Wavelet decomposition: the input x(n) is filtered by h0(n) and h1(n) and downsampled (↓2), giving r1(n) and d1(n); upsampling (↑2), filtering by g0(n) and g1(n), and summation give y(n).

Fig. 4. Multiresolution wavelet decomposition: the reference signal r1(n) is filtered by h0(n) and h1(n) and downsampled (↓2) again, producing r2(n) and d2(n).

Consider how wavelet filtering can be used to perform an L-level wavelet decomposition of an image (to be more exact, we usually decompose one of the so-called Y, U, or V components of the original image, that is, a matrix of pixel values). It is evident that the two-dimensional wavelet decomposition is a separable transform. Thus we first perform the wavelet transform over the matrix rows and then the obtained matrix is filtered over the columns. At the first level of the wavelet hierarchical decomposition, the image is decomposed, using times-two subsampling, into high horizontal-high vertical (HH0), high horizontal-low vertical (HL0), low horizontal-high vertical (LH0), and low horizontal-low vertical (LL0) frequency subbands. They correspond to filtering by the highpass filter over rows and columns; by the highpass filter over rows and the lowpass filter over columns; by the lowpass filter over rows and the highpass filter over columns; and by the lowpass filter over rows and columns, respectively. The LL subband is then further subsampled by two to produce a set of HH1, HL1, LH1, and LL1 subbands. This is done recursively L times to produce an array such as that illustrated in Fig. 5, where three subsampling stages have been used. As a result we obtain matrices of decreasing size. Most of the energy is in the low-lowpass subband LL3. This upper left subimage is a coarse approximation of the original; the other bands add details. Bit allocation becomes crucial: clearly, subimages with low energy levels should get fewer bits.

Each matrix is quantized by a scalar or vector quantizer and then encoded. The quantization step is chosen depending on the required compression ratio and bit allocation.

The quantized highpass subbands usually contain many zeros. They can be efficiently compressed using zero run-length coding followed by Huffman coding of the pairs (run length, amplitude) or by arithmetic coding. The lowpass subbands usually contain no zeros at all, or only a small number of them, and can be coded by the Huffman code or by the arithmetic code.

Fig. 5. Wavelet decomposition of an image into the subbands LL3, HL3, LH3, HH3, HL2, LH2, HH2, HL1, LH1, HH1.

More advanced coding procedures, used for instance in the MPEG-4 standard, try to take into account dependencies between subbands. One such method is called zerotree coding. Fig. 6 illustrates the parent-child dependencies of the subbands. A single parent node has four child nodes corresponding to the same region in the image with times-four subsampling. Each child node has four corresponding next-generation child nodes with a further times-four subsampling. Fig. 7 is a flowchart for encoding a coefficient of the significance map; in other words, it illustrates how a zerotree-based encoder classifies wavelet coefficients and generates the zerotree. The characteristic features of the zerotree coding method are the following. Typically, a large fraction of the bit budget when wavelet coefficients are coded must be spent on encoding the significance map, that is, the binary decision as to whether a coefficient has a zero or nonzero quantized value. To reduce the number of bits for significance-map coding, the zerotree method implies the following classification of wavelet coefficients. A coefficient is said to be an element of a zerotree for a given threshold if it and all of its descendants (children) are insignificant with respect to this threshold. An element of a zerotree is a zerotree root if it is not the descendant of a previously found zerotree root, i.e. it is not predictably insignificant from the discovery of a zerotree root at a coarser scale. A zerotree root is encoded with a special symbol indicating that the insignificance of the coefficients at finer scales is completely predictable. Thus the following four symbol types are used: zerotree root; isolated zero, which means that the coefficient is insignificant but has some significant descendants; positive significant; and negative significant.

Fig. 6. Wavelet hierarchical subband decomposition and parent-child dependencies of subbands.

1. Perform wavelet decomposition of the component Y from Project 1 using the wavelet filters h0(n), h1(n) and the inverse filters g0(n), g1(n) specified for this assignment. Notice that the filter bank described above can also be given in an equivalent form.

Fig. 7. Zerotree coding: a flowchart that classifies each input coefficient as a positive significant symbol, a negative significant symbol, a zerotree root symbol, an isolated zero symbol, or as predictably insignificant (a descendant of a zerotree root, which is not coded).


2. Uniformly scalar quantize the wavelet coefficients (use different quantization steps for different subbands). Reconstruct the component Y from the quantized wavelet coefficients using inverse wavelet filtering.

3. Estimate the entropy of the stream of coefficients for each subband. Estimate compression ratio. Plot PSNR as a function of the estimated compression ratio. Compare the reconstructed and the original images for different compression ratios.

4. Compress the original image using any available image editor. Compare compression ratios.

Project 3. Speech compression system.

The input signal is a discrete-time speech signal sampled at a rate of 8 kHz. Each sample is represented as a 16-bit integer.

The input signal is represented as a so-called WAV file, which uses the standard RIFF (Resource Interchange File Format). The WAV file format is the Windows native file format for storing digital audio data. The RIFF standard groups the file contents (sample format, digital samples, etc.) into separate chunks, each containing its own header and data bytes. The chunk header specifies the type and size of the chunk data bytes. In the simplest case the following C data types can be used to access the header of a WAV file:

#include <stdint.h>

typedef uint16_t WORD;
typedef uint32_t DWORD;
typedef uint8_t  BYTE;

typedef struct {
    BYTE  riff[4];         // chunk ID = symbols 'RIFF'
    DWORD riff_size;       // chunk data size = file size - 8 bytes
    BYTE  wavefmt[8];      // RIFF type = symbols 'WAVEfmt '
    DWORD frmt_size;       // size of the fmt chunk data (16 for PCM)
    WORD  wFormatTag;      // format of audio data (1 for PCM)
    WORD  nChannels;       // number of channels (1 or 2)
    DWORD nSamplesPerSec;  // sampling frequency (8000, 11025, ...)
    DWORD nAvgBytesPerSec; // average rate of the data stream
    WORD  nBlockAlign;     // alignment of a data block
    WORD  wBitsPerSample;  // number of bits per sample (8 or 16)
    BYTE  data[4];         // symbols 'data'
    DWORD data_size;       // data size in bytes
} WAVE_HEADER;

The length of the header described above is 44 bytes. The values of some of the fields depend on each other:

nAvgBytesPerSec = (nChannels * nSamplesPerSec * wBitsPerSample) / 8;
nBlockAlign = (nChannels * wBitsPerSample) / 8;

If the header has the form described above, then

data_size = riff_size - 36;

Wave files usually contain only one data chunk, but they may contain more than one if they are contained within a WAVE LIST chunk.
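A minimal C sketch of reading such a file using the WAVE_HEADER structure above (it assumes PCM, 16 bits per sample, a single data chunk, a little-endian machine, and a compiler that inserts no padding into the structure, so that its size is 44 bytes):

#include <stdio.h>
#include <stdlib.h>

// Read the header defined above, then the 16-bit PCM samples.
int read_wav(const char *name, WAVE_HEADER *h, short **samples, long *n) {
    FILE *f = fopen(name, "rb");
    if (!f) return -1;
    if (fread(h, 1, sizeof(WAVE_HEADER), f) != sizeof(WAVE_HEADER) ||
        h->wFormatTag != 1 || h->wBitsPerSample != 16) {  // PCM, 16 bit only
        fclose(f);
        return -1;
    }
    *n = (long)(h->data_size / 2);         // number of 16-bit samples
    *samples = malloc(h->data_size);
    if (!*samples || fread(*samples, 2, *n, f) != (size_t)*n) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}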

A popular class of speech coders for bit rates between 4.8 and 16 kbit/s are model-based coders. They use a linear prediction analysis-by-synthesis (LPAS) method. A linear prediction model of speech production (an adaptive linear prediction filter) is excited by an appropriate excitation signal in order to model the signal over time. The parameters of both the filter and the excitation are estimated and updated at regular time intervals (frames). The compressed speech file contains these model parameters estimated for each frame.

Fig. 8 shows a model of speech generation. The linear prediction filter models the changes which occur in the speech path formed by the mouth, tongue, and teeth when we speak. Roughly speaking, each sound corresponds to a set of filter coefficients. Rather often this filter is also represented by the poles of its frequency response, called formant frequencies or formants. The filter excitation depends on the type of the sound: voiced, unvoiced, vowel, hissing, or nasal. Voiced sounds are generated by the oscillation of the vocal cords and are represented by a quasi-periodic impulse train. Unvoiced sounds are generated by noise-like signals. The simplest model, shown in Fig. 8, uses two types of excitation signal (a periodic impulse train and quasi-random noise) which are switched at each frame. The period of the vocal cord oscillation is called the pitch period or pitch. It is estimated from the original speech signal and determines the period of the impulses in the impulse train.

Fig. 8. Model of speech generation: a generator of a periodic impulse train (controlled by the pitch period) and a random noise generator are selected by a tone/noise switch; the excitation u(n), scaled by the gain g, drives the adaptive linear prediction filter that produces the synthesized speech.

In order to construct the linear prediction filter we use the so-called linear prediction method. Let x(n) be a sequence of samples of the input speech signal. Each sample is predicted by the previous M samples according to the formula

x̂(n) = Σ_{i=1}^{M} a_i x(n-i),   (1)

where x̂(n) denotes the predicted value of the n-th sample, a_i are the prediction coefficients, and M is the order of prediction. The prediction error is determined as follows:

e(n) = x(n) - x̂(n).   (2)

The prediction coefficients are determined by minimizing the sum of squared errors over a given finite interval (called a frame). Let [n0, n1] be this interval; then the sum of squared prediction errors is calculated as follows:

E = Σ_{n=n0}^{n1} e(n)^2.   (3)

By inserting (1) into (3) we obtain

E = Σ_{n=n0}^{n1} ( x(n) - Σ_{i=1}^{M} a_i x(n-i) )^2.   (4)

Differentiating (4) over a_j, j = 1, …, M, and setting the derivatives to zero yields

Σ_{n=n0}^{n1} x(n) x(n-j) = Σ_{i=1}^{M} a_i Σ_{n=n0}^{n1} x(n-i) x(n-j).

Thus we obtain a system of M linear equations with M unknown quantities

Σ_{i=1}^{M} a_i c(i, j) = c(0, j),  j = 1, …, M,   (5a)

where

c(i, j) = Σ_{n=n0}^{n1} x(n-i) x(n-j).   (5b)

This system of linear equations is called the Yule-Walker equations. If a_i are the solutions of (5a), then we can evaluate the minimal achievable prediction error. Inserting (5b) into (4), we obtain

E = c(0, 0) - 2 Σ_{i=1}^{M} a_i c(0, i) + Σ_{i=1}^{M} Σ_{j=1}^{M} a_i a_j c(i, j).   (6)

Using (5a) we reduce (6) to the expression

E_min = c(0, 0) - Σ_{i=1}^{M} a_i c(0, i).

It is easy to see that equation (1) describes the M-th order predictor with transfer function equal to

P(z) = Σ_{i=1}^{M} a_i z^{-i}.   (7)

It follows from (1), (2), and (7) that the z-transform of the prediction error has the form

E(z) = X(z) (1 - P(z)).

In other words, the prediction error is the output signal of the discrete-time system with transfer function

A(z) = 1 - P(z) = 1 - Σ_{i=1}^{M} a_i z^{-i}.

It is said that the problem of finding the optimal set of linear prediction coefficients reduces to the problem of constructing the optimal prediction filter of M-th order. This filter is a discrete-time FIR filter.

Another name for the linear prediction model (1) is the autoregressive model of the signal x(n). It is assumed that the signal can be obtained as the output of the so-called autoregressive filter with transfer function

H(z) = 1/A(z) = 1/(1 - Σ_{i=1}^{M} a_i z^{-i}),

that is, as the output of the filter which is inverse with respect to the prediction filter. It is evident that this filter is a discrete-time IIR filter.

In order to find the optimal set of prediction coefficients it is necessary to solve the Yule-Walker equations (5). To do that we have to evaluate the values c(i, j). There are two approaches to estimating these values; moreover, it will be shown that the computational complexity of solving equations (5) depends on the way c(i, j) is evaluated. The first method is called the autocorrelation method and the second one is called the covariance method.

Autocorrelation method

The values c(i, j) are computed as

c(i, j) = Σ_{n} x(n-i) x(n-j).

We set x(n) = 0 for n < 0 and for n ≥ N, where N is the length of the interval of analysis. In this case c(i, j) depends only on the difference k = |i - j|, and the expression for c(i, j) simplifies to

c(i, j) = r(k) = Σ_{n=0}^{N-1-k} x(n) x(n+k),  k = |i - j|.   (8)

Assuming that x(n) is a stationary discrete-time signal with zero mean, we can conclude that the values r(k) normalized by N coincide with estimates of the autocorrelation function of the discrete-time signal x(n) computed for the interval of analysis. Dividing equations (5a) by N we obtain the Yule-Walker equations for the autocorrelation method:

Σ_{i=1}^{M} a_i r(|i - j|) = r(j),  j = 1, …, M.   (9)

The system (9) can be given by the matrix equation

R a = r,

where a = (a_1, …, a_M)^T, r = (r(1), …, r(M))^T, and R is the M×M matrix with entries R_{ij} = r(|i - j|). It is said that the system of linear equations (9) relates the parameters of the autoregressive model of M-th order to the autocorrelation sequence.

The matrix R of the autocorrelation method has two important properties: it is symmetric, that is R_{ij} = R_{ji}, and it has the Toeplitz property, that is, its entries depend only on the difference i - j. The Toeplitz property of the matrix makes it possible to simplify the solution of (5a). For instance, the fast Levinson-Durbin recursive algorithm, which will be considered below, requires only O(M^2) operations. Notice that solving an arbitrary system of linear equations of order M would require O(M^3) operations.
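A sketch of the computation of r(k) in C, following (8):

// Autocorrelation estimates over one frame of N samples, as in (8):
// r(k) = sum_{n=0}^{N-1-k} x(n) x(n+k), k = 0, ..., M.
void autocorr(const double *x, int N, int M, double *r) {
    for (int k = 0; k <= M; k++) {
        r[k] = 0.0;
        for (int n = 0; n + k < N; n++)
            r[k] += x[n] * x[n + k];
    }
}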

Covariance method

We choose n0 = 0, n1 = N - 1, and the signal is not constrained in time (it is not set to zero outside the frame). In this case the values c(i, j) can be expressed as follows:

c(i, j) = Σ_{n=0}^{N-1} x(n-i) x(n-j).   (10)

Setting m = n - i, we can represent (10) in the form

c(i, j) = Σ_{m=-i}^{N-1-i} x(m) x(m+i-j).   (11)

Expression (11) resembles expression (8) used by the autocorrelation method, but it has a different range of definition for the index m. It is evident that (11) uses signal values outside the range 0, …, N-1. In other words, in order to use the covariance method for computing c(i, j) we have to know the signal values x(-M), …, x(N-1), that is, we need to know the signal over an interval of N + M samples instead of the N samples required by the autocorrelation method. Usually this is not a critical point since N >> M. This method leads to the so-called cross-correlation function between two similar but not exactly identical finite segments of the signal (for instance, c(1, 2) is the correlation coefficient between the segment x(-1), …, x(N-2) and the segment x(-2), …, x(N-3), which differ from each other in their end samples and cannot be considered as a signal and its shifted copy).

It is easy to see that c(i, j) = c(j, i), but c(i, j) is no longer a function of |i - j| as it was for the autocorrelation method. Equations (5a) with these values c(i, j) are the Yule-Walker equations for the covariance method:

Σ_{i=1}^{M} a_i c(i, j) = c(0, j),  j = 1, …, M.   (12)

Equations (12) can be given by the matrix equation

C a = c,

where a = (a_1, …, a_M)^T, c = (c(0, 1), …, c(0, M))^T, and C is the M×M matrix with entries C_{ij} = c(i, j). Unlike the matrix of the autocorrelation method, the matrix C is symmetric but not Toeplitz. Thus in the general case O(M^3) operations are required to solve (12).

Algorithms for the solution of the Yule-Walker equations

The computational complexity of solving the Yule-Walker equations depends on the method of evaluating the values c(i, j). Let us assume that they are found by the autocorrelation method. In this case the Yule-Walker equations have the form (9) and the matrix of the system is symmetric and Toeplitz. These properties make it possible to find the solution of (9) by fast methods requiring O(M^2) operations. There are a few methods of this type: the Levinson-Durbin algorithm, the Euclidean algorithm, and the Berlekamp-Massey algorithm.

Consider the so-called Levinson-Durbin recursive algorithm. It was suggested by Levinson in 1948 and then improved by Durbin in 1960. Notice that this algorithm works efficiently if the matrix of coefficients is simultaneously symmetric and Toeplitz. The Berlekamp-Massey and Euclidean algorithms do not require the matrix of coefficients to be symmetric.

We consecutively solve the equations (9) of orders m = 1, 2, …, M. Let a^{(m)} = (a_1^{(m)}, …, a_m^{(m)}) denote the solution of the system of order m. Given a^{(m)}, we find the solution a^{(m+1)} for the system of order m + 1. At each step of the algorithm we evaluate the prediction error E_m of the m-th order system and an auxiliary (reflection) coefficient k_m. The formal description of the algorithm is given below.

Initialization: E_0 = r(0).

Recursive procedure. For m = 1, 2, …, M compute

k_m = ( r(m) - Σ_{i=1}^{m-1} a_i^{(m-1)} r(m-i) ) / E_{m-1},
a_m^{(m)} = k_m,
a_i^{(m)} = a_i^{(m-1)} - k_m a_{m-i}^{(m-1)},  i = 1, …, m-1,   (13a)

E_m = (1 - k_m^2) E_{m-1}.   (13b)

At the last step of the algorithm, that is, when m = M, we obtain the solution a_i = a_i^{(M)}, i = 1, …, M.
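A C sketch of the recursion (13a)-(13b); a[1..M] holds the coefficients, and index 0 is left unused to match the formulas:

// Levinson-Durbin recursion solving (9). r[0..M] are the autocorrelation
// values; on return a[1..M] are the prediction coefficients and the
// returned value is the prediction error E_M.
double levinson_durbin(const double *r, int M, double *a) {
    double E = r[0];
    double anew[M + 1];                        // C99 variable-length array
    for (int i = 0; i <= M; i++) a[i] = 0.0;
    for (int m = 1; m <= M; m++) {
        double k = r[m];                       // numerator of (13a)
        for (int i = 1; i < m; i++) k -= a[i] * r[m - i];
        k /= E;                                // reflection coefficient k_m
        for (int i = 1; i < m; i++)
            anew[i] = a[i] - k * a[m - i];     // update a^(m-1) -> a^(m)
        anew[m] = k;
        for (int i = 1; i <= m; i++) a[i] = anew[i];
        E *= 1.0 - k * k;                      // (13b)
    }
    return E;
}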

1. For the given speech file in WAV format find the autocorrelation coefficients using the autocorrelation method (take the frame length equal to 180-240 samples).
2. Using the Levinson-Durbin procedure construct the prediction filter for each frame.
3. Filter the input speech signal by the prediction filter to find the ideal excitation signal for each frame.
4. Uniformly scalar quantize the filter coefficients. Reconstruct the speech signal using the ideal excitation signal and the quantized inverse filter. Compare the reconstructed and the original signals.
5. Uniformly scalar quantize the excitation signal. Reconstruct the speech signal using the quantized excitation signal and the quantized inverse filter. Compare the reconstructed and the original signals. Estimate the entropies of the quantized excitation signal and of the quantized filter coefficients. Estimate the compression ratio. Estimate the MSE. Plot the MSE as a function of the estimated compression ratio. Find the bit rate at which the quality of the synthesized speech signal is acceptable.

Project 4. Coding of linear spectral pairs.

The representation of the prediction filter by its coefficients a_i, i = 1, …, M, is called the filter representation in the time domain. Let us consider the description of the prediction filter by means of the so-called linear spectral parameters (or pairs, LSPs) in the frequency domain. Although the LSPs are uniquely determined by the coefficients a_i, these parameters play a very important role in speech coding techniques. The LSPs for a linear filter of M-th order represent an ordered sequence of M numbers taking values in the closed finite interval [0, f_s/2], where f_s is the sampling frequency. It has been found experimentally that each of the LSPs varies in a rather narrow range. Moreover, for voiced speech frames the LSPs change much more slowly than the filter coefficients a_i. Due to the enumerated properties, quantization of the LSPs provides a better rate-distortion function than quantization of the prediction filter coefficients.

Filter description by means of LSP

Let {x(n)} = x(1), x(2), …, x(N) be a sequence of N speech samples. Its Fourier transform is determined as follows:

X(e^{jω}) = X(z)|_{z=e^{jω}} = Σ_{n=1}^{N} x(n) e^{-jωn}.

The Fourier transform of x(n) and the Fourier transform of the prediction error sequence are related as follows:

X(e^{jω}) = E(e^{jω}) H(e^{jω}),

where H(e^{jω}) is the frequency response of the prediction (synthesis) filter: H(e^{jω}) = 1/A(e^{jω}). The amplitude function is |H(e^{jω})| = 1/|A(e^{jω})|. The frequencies corresponding to the poles of the transfer function, i.e. to the zeros z_i = r_i e^{jω_i} of the polynomial A(z), are called formant frequencies or formants.

As was mentioned in Project 3, a speech signal can be obtained as the response of a linear system with time-varying parameters to the corresponding excitation (for voiced sounds a quasi-periodic impulse sequence, for noise-like sounds a quasi-noise signal). The Fourier transform of the output (synthesized) speech signal is equal to the product of the Fourier transform of the excitation and the frequency response of the filter. The spectrum of the periodic excitation is a line spectrum. The frequency response of the speech path formed by the mouth, tongue, and teeth is a rather flat function of frequency characterized by acoustic resonances which correspond to the resonance frequencies of this speech path and are called formants. The spectrum of the synthesized speech signal is the product of the line spectrum and the frequency response of the speech path and therefore is also a line spectrum. Its envelope characterizes the frequency response of the speech path.

If the prediction filter is given by its coefficients, then the problem of speech compression is reduced to the quantization of the filter coefficients a_i. If the filter is given in the frequency domain, then we quantize the linear spectral parameters, which are functions of the formant frequencies and can be obtained by the algorithm described below.

Algorithm for computing the linear spectral parameters

The polynomial A(z) of degree M has exactly M roots. If the filter is stable, its roots lie inside the unit circle of the complex plane. Since the polynomial coefficients are real numbers, its roots form complex conjugate pairs. First, using A(z) we construct an auxiliary symmetric polynomial P(z) and an auxiliary antisymmetric polynomial Q(z), both of degree M+1. They have their zeros located on the unit circle. Each of the polynomials has one trivial real root (z = -1 and z = 1, respectively). These zeros can be easily removed by dividing P(z) and Q(z) by (z + 1) and (z - 1), respectively. Therefore we obtain two polynomials of degree M. Since their roots are points on the unit circle, they can be given by their phases. Moreover, the roots form complex conjugate pairs with opposite phases. Taking this fact into account, we can reduce each of the two equations of power M to an equation of power M/2 with respect to the cosines of the phases. Then we compute the arccosines of the roots and obtain exactly M numbers ω_i lying in the interval [0, π]. These numbers determine the frequencies corresponding to the poles of the filter transfer function H(z). The lower bound of the range (the number 0) corresponds to the constant component of the signal, and the upper bound (the number π) corresponds to the maximal digital frequency f_s/2, where f_s is the sampling frequency. The normalized values f_i = ω_i f_s/(2π) are called the linear spectral parameters. The detailed description of the algorithm is given below.

Step 1. Construct the auxiliary polynomials P(z) and Q(z) from the polynomial A(z). Represent A(z) in the form

A(z) = 1 + A(1)z + A(2)z^2 + … + A(M)z^M,

where A(i) = -a_i (the variable z here stands for z^{-1} of the prediction filter).

Construct the polynomial P(z) of degree M+1 according to the rule:

p_0 = 1,  p_i = A(i) + A(M+1-i) for i = 1, …, M,  p_{M+1} = 1.

For example, for the filter of the second order we get A(z) = 1 + A(1)z + A(2)z^2, and then

P(z) = 1 + (A(1)+A(2))z + (A(1)+A(2))z^2 + z^3.

Construct the polynomial Q(z) of degree M+1 according to the rule:

q_0 = 1,  q_i = A(i) - A(M+1-i) for i = 1, …, M,  q_{M+1} = -1.

For example, for the filter of order M = 2 we have

Q(z) = 1 + (A(1)-A(2))z - (A(1)-A(2))z^2 - z^3.

It is easy to see that A(z) = (P(z) + Q(z))/2.

Step 2. Reduce the degree of the polynomials by 1. Construct the polynomial PL(z) of degree M from the polynomial P(z) as

pl_0 = 1,  pl_k = p_k - pl_{k-1} for k = 1, …, M.

Notice that if the filter is stable, then the obtained polynomial has only complex roots and they are located on the unit circle. For example, for the filter of order M = 2,

PL(z) = 1 + (A(1)+A(2)-1)z + z^2,

which is equivalent to dividing P(z) by (z + 1).

Construct QL(z) of degree M as

ql_0 = 1,  ql_k = q_k + ql_{k-1} for k = 1, …, M.

If the filter is stable, the roots of QL(z) are complex and located on the unit circle. For the filter of order M = 2,

QL(z) = 1 + (A(1)-A(2)+1)z + z^2,

which is equivalent to dividing Q(z) by (z - 1).

Step 3. Reduce the degree from M to M/2. Taking into account that the roots of PL(z) and QL(z) are located on the unit circle, that is, have the form z = e^{jω} with 0 < ω < π, we can easily reduce the degree of the equations which have to be solved to find the zeros of PL(z) and QL(z).

Step 4. Construct the polynomials P*(z) and Q*(z) of degree M/2: dividing PL(z) by z^{M/2} groups its terms into pairs z^k + z^{-k} = 2cos(kω), and expressing cos(kω) through powers of cos ω turns PL(z) into a polynomial P*(z) of degree M/2 in the variable z = cos ω (and similarly QL(z) into Q*(z)).

The formal rule is the following. In the formula for p*_0, the coefficient 1 in the brackets is followed by the sequence of alternating coefficients -2 and +2. In the remaining formulas the first coefficient is always equal to 1 and the signs of the other coefficients alternate. The absolute value of the i-th coefficient, i > 1, in the series for p*_k is equal to the sum of the absolute values of the i-th coefficient in the series for p*_{k-1} and the (i-1)-th coefficient in the series for p*_{k-2}. For example, in the formula for p*_2 we obtain 4 = 3 + 1, 9 = 4 + 5, then follows 16 = 7 + 9, and so on. The coefficients of Q*(z) are obtained analogously from the coefficients of QL(z). In particular, for the filter of order M = 2 we obtain

P*(z) = z + (A(2)+A(1)-1)/2,  Q*(z) = z + (A(1)-A(2)+1)/2.

Step 5. Solve the equations P*(z) = 0 and Q*(z) = 0. These equations have real roots of the form z_i = cos ω_i, where 0 < ω_i < π. For the filter of order M = 2 we obtain

zp = -(A(2)+A(1)-1)/2,  zq = -(A(1)-A(2)+1)/2.

For solving the equations P*(z) = 0 and Q*(z) = 0 we can use any numerical method of solving algebraic equations. The most commonly used method is the following. The roots are searched for in two steps. First we search for approximate root values in the closed interval [-1, 1]: the interval is split into n equal subintervals, where the value of n depends on the filter order (for example, for M = 10 we choose n = 128), and the presence of a root in a subinterval is detected by a change of the sign of the polynomial. At the second step the root value is refined by linear interpolation between the endpoints a and b = a - 2/n of the subinterval in which the sign change is detected:

xroot = a - (2/n) F(a) / (F(a) - F(b)),

where F denotes the value of the polynomial at the corresponding point.
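A sketch of this two-step root search in C (F is the polynomial to be evaluated; the names are ours):

// First pass: scan n equal subintervals of [-1, 1] from +1 downward;
// a sign change of F marks a subinterval containing a root. Second
// pass: refine by linear interpolation between the endpoints.
int scan_roots(double (*F)(double), int n, double *roots) {
    int found = 0;
    double a = 1.0, fa = F(a);
    for (int i = 1; i <= n; i++) {
        double b = 1.0 - 2.0 * i / n, fb = F(b);
        if (fa * fb < 0.0)
            roots[found++] = a - (2.0 / n) * fa / (fa - fb);
        a = b; fa = fb;
    }
    return found;
}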

Step 6. Denote by zp_i, i = 1, …, M/2, the roots of P*(z) and by zq_i, i = 1, …, M/2, the roots of Q*(z). Find the LSPs by the formulas

ωp_i = arccos(zp_i),  ωq_i = arccos(zq_i),  i = 1, …, M/2.

Sort the found values in increasing order and normalize them by multiplying by the sampling frequency and dividing by 2π.

The obtained LSPs can be scalar or vector quantized and coded together with other parameters of the speech signal.

The decoder reconstructs the filter coefficients from the quantized LSPs.

Step 1. The quantized LSP values are split into two subsets. The first of them contains the LSPs with even numbers and the second contains the LSPs with odd numbers. They correspond to the polynomials P*(z) and Q*(z), respectively. The LSP values are multiplied by 2π and divided by the sampling frequency. We compute the cosines of the obtained values, which represent the roots of P*(z) and Q*(z).

Step 2. Using Vieta's theorem we express the coefficients of P*(z) and Q*(z) via their roots. For the filter of order M = 2 we obtain

p*_0 = -zp,  q*_0 = -zq,  p*_1 = q*_1 = 1.

Step 3. Using the formulas given above which connect the coefficients of P*(z) and PL(z) (Q*(z) and QL(z)), reconstruct PL(z) and QL(z). For the filter of order M = 2 we obtain

PL(z) = 1 + 2p*_0 z + z^2,  QL(z) = 1 + 2q*_0 z + z^2.

Step 4. Reconstruct the polynomials P(z), Q(z) of degree M+1 according to the rule:

p_0 = pl_0,  p_j = pl_j + pl_{j-1} for j = 1, …, M,  p_{M+1} = 1,
q_0 = 1,  q_j = ql_j - ql_{j-1} for j = 1, …, M,  q_{M+1} = -1.

For the filter of order M = 2 we obtain

p_0 = 1,  p_1 = p_2 = A(2) + A(1),  p_3 = 1,
q_0 = 1,  q_1 = A(1) - A(2),  q_2 = -(A(1) - A(2)),  q_3 = -1.

We have reconstructed the polynomials

P(z) = 1 + (A(1)+A(2))z + (A(1)+A(2))z^2 + z^3,
Q(z) = 1 + (A(1)-A(2))z - (A(1)-A(2))z^2 - z^3.

Step 5. Reconstruct A(z):

A(z) = (P(z) + Q(z))/2.

It is evident that for M = 2 this formula gives the correct solution.

1. Using the filter coefficients obtained in Project 3, find the LSPs (frame length 180-240 samples).
2. For each LSP determine the range in which this parameter varies.
3. Uniformly scalar quantize the found LSPs.
4. Reconstruct the filter from the quantized LSPs.
5. Reconstruct the speech signal by filtering the ideal excitation (see Project 3). Investigate the dependence of the subjective quality of the synthesized speech on the number of bits spent on the LSP representation.
6. Plot the MSE as a function of the number of bits for each LSP.
7. Estimate the entropies of the quantized LSPs. Plot the MSE as a function of the number of bits spent on the LSPs.
8. Using the Linde-Buzo-Gray algorithm construct a vector quantizer of the LSPs. Estimate the efficiency of nonuniform vector quantization compared to uniform scalar quantization.