
8/13/2019 Speech Coders for Wireless Communication

Speech Coders for Wireless Communication


Courtesy: Communication Networks Research (CNR) Lab., EECS, KAIST

Digital representation of the speech waveform

x(t) -> Sampler -> x(n) = x(nT) -> Quantizer -> quantized x(n)

x(t): continuous-time, continuous-amplitude

x(n) = x(nT): discrete-time, continuous-amplitude

Quantized x(n): discrete-time, discrete-amplitude


Three acoustic signals

Signal             Frequency range    Sampling rate   PCM bits per sample   PCM bit rate
Telephone speech   300-3,400 Hz*      8 kHz           8                     64 kb/s
Wideband speech    50-7,000 Hz        16 kHz          14                    224 kb/s
Wideband audio     10-20,000 Hz       48 kHz          16                    768 kb/s

* Bandwidth in Europe; 200-3,200 Hz in the United States and Japan.
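The PCM bit rates in the table follow from multiplying the sampling rate by the number of bits per sample. A minimal sketch of that arithmetic (the function and dictionary names are illustrative, not from the slides):

```python
def pcm_bit_rate_kbps(sampling_rate_hz, bits_per_sample):
    """PCM bit rate in kb/s: samples per second times bits per sample."""
    return sampling_rate_hz * bits_per_sample / 1000

# (sampling rate in Hz, PCM bits per sample) for the three signal classes
signals = {
    "telephone speech": (8_000, 8),
    "wideband speech": (16_000, 14),
    "wideband audio": (48_000, 16),
}

for name, (rate_hz, bits) in signals.items():
    print(f"{name}: {pcm_bit_rate_kbps(rate_hz, bits):.0f} kb/s")
```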

[Figure: Frequency response of telephone transmission channel]

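[Figure: frequency axis marked 0 Hz, 4 kHz, 7 kHz and 20 kHz, with bands labelled Telephone, Speech and Music (CD quality)]

[Block diagram: Talk -> A/D -> Encoder (compress) -> Storage (record/store) -> Decoder (decompress, play) -> D/A -> Listen]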




Hybrid coders

Multi-Pulse Excitation
  Efficient at medium bit rates.
  A sequence of nonuniformly spaced pulses as an excitation signal.
  Amplitudes and positions are the excitation parameters.

Regular-Pulse Excitation (RPE)
  Efficient at medium bit rates.
  A sequence of uniformly spaced pulses as an excitation signal.
  The position of the first pulse within a vector and the amplitudes are the excitation parameters.

Code-Excited Linear Prediction (CELP)
  Efficient at low bit rates (below 8 kbps).
  A codebook of excitation sequences.
  Two key issues: the design and search of a codebook.
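To make the codebook search concrete, here is a minimal sketch of an exhaustive search that picks the codevector and gain giving the smallest squared error against a target vector. It is a simplification under stated assumptions: a real CELP coder passes each codevector through the synthesis filter and applies perceptual weighting before comparing, and the tiny codebook and target below are made up for illustration.

```python
def search_codebook(target, codebook):
    """Return (index, gain, error) of the best-matching codevector.

    For each codevector c, the gain that minimises sum((t - g*c)^2)
    is g = <t, c> / <c, c>.
    """
    best_index, best_gain, best_error = None, 0.0, float("inf")
    for index, codevector in enumerate(codebook):
        energy = sum(c * c for c in codevector)
        if energy == 0.0:
            continue  # an all-zero codevector cannot match anything
        correlation = sum(t * c for t, c in zip(target, codevector))
        gain = correlation / energy
        error = sum((t - gain * c) ** 2 for t, c in zip(target, codevector))
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain, best_error

# Hypothetical codebook with N = 4 codevectors, so M = 2 index bits (2^M = N)
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, -1.0, 1.0, -1.0],
]
target = [2.0, 2.1, 0.0, 0.1]

index, gain, error = search_codebook(target, codebook)
index_bits = (len(codebook) - 1).bit_length()
print(index, round(gain, 3), round(error, 4), index_bits)
```

Only the index (M bits) and a quantized gain need to be transmitted, which is why the approach suits low bit rates; the cost is the search over all N codevectors.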



[Figure: Examples of excitations. a) Multipulse: pulses with amplitudes g1 ... gk at positions n1 ... nk. b) Regular-pulse: pulses with amplitudes g1 ... g6 at uniform spacing K. c) Code-excited linear prediction: a codebook of N codevectors (Codevector #1 ... Codevector #N), where 2^M = N and M is the number of transmitted bits.]


Speech Compression Standards

64 kbps          µ-law/A-law PCM (CCITT G.711)
64 kbps          7 kHz Subband/ADPCM (CCITT G.722)
32 kbps          ADPCM (CCITT G.721)
16 kbps          Low Delay CELP (CCITT G.728)
13 kbps          RPE-LTP (GSM 06.10)
12.2 kbps        ACELP (GSM 06.60)
13 kbps          QCELP (US CDMA Cellular)
8 kbps           QCELP (US CDMA Cellular)
8 kbps           VSELP (US TDMA Cellular)
8 kbps           CS-ACELP (ITU G.729)
6.7 kbps         VSELP (Japan Digital Cellular)
6.4 kbps         IMBE (Inmarsat Voice Coding Standard)
5.3 & 6.3 kbps   True Speech Coder (ITU G.723.1)
4.8 kbps         CELP (Fed. Standard 1016-STU-3)
2.4 kbps         LPC (Fed. Standard 1015 LPC-10E)




Performance of speech codec

Speech Quality (SNR/SEGSNR, MOS, etc)

Bit Rate (bits per second)

Complexity (MIPS)

Coding Delay (msec)


Requirements of speech codec for digital cellular

More channel capacity

Noise immunity

Encryption

Reasonable complexity and encoding delay



Vocoders


Anatomy of Speech Organs:

The source of most speech is the larynx.

It contains two folds of tissue, called the vocal folds or vocal cords, which can open and shut like a pair of fans.

The gap between the vocal cords is called the glottis. As air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow.

This process is known as phonation.

The frequency of vibration determines the pitch of the voice:
  for a male it is typically in the range 50-200 Hz;
  for a female the range can extend up to 500 Hz.


Glottal Pulse

[Figure: glottal pulse amplitude versus time (ms), showing the opening phase, the closing phase and closure. Rosenberg, JASA 49, 1971]

Period = 12.5 ms

Fundamental frequency = 1/0.0125 = 80 Hz


Spectrum of glottal pulse

[Figure: intensity versus frequency (Hz)]

Harmonics of the spectrum are spaced at 80 Hz, corresponding to a pitch period of 12.5 ms.


Spectrum of glottal pulse filtered by the vocal tract

[Figure: intensity versus frequency (Hz)]

Harmonics of the spectrum are spaced at 80 Hz, corresponding to a pitch period of 12.5 ms.

[Figure: panels labelled /ee/, /ar/ and /uu/]


Properties of Speech in Brief

Vowels (ee in key, o in spot, oo in blue, e in again)
  Quasi-periodic
  Relatively high signal power

Consonants (s in spot, k in key)
  Non-periodic (random)
  Relatively low signal power


[Figure: "Wrong": /r/ /o/ /ng/]

[Figure: "Moving": /m/ /uu/ /v/ /i/ /ng/]

[Figure: "Southampton": /s/ /ou/ /th/ /aa/ /m/ /p/ /t/ /a/ /n/]


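Digital speech model

A basic digital model for speech production

[Block diagram: a periodic signal generator or a random signal generator supplies the excitation, which is multiplied (x) by a gain and passed through a linear time-variant filter]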


Vocoder

Send three kinds of information to the receiver:

(1) whether the signal is voiced or unvoiced,

(2) if it is voiced, the period of the excitation signal,

(3) the parameters of the prediction filter.


Vocoder

Encoder/Decoder

[Block diagram. Encoder: voice classification, pitch recognition, determine filter coefficients. Decoder: excitation signal generator driving a digital filter.]


LPC Introduction

These speech coders are called vocoders (voice coders).

They usually provide more bandwidth compression than is possible with waveform coding (2400-9600 bps).

Basic idea:

[Block diagram: Estimate parameters -> Encode parameters -> Transmit parameters -> Decode parameters -> Synthesize speech]



Generalities

LP Model

Parameter Estimation

Typical Memory requirements


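LP Model

[Block diagram: an impulse generator controlled by the pitch period (voiced) and a white noise generator (unvoiced) feed a voice/unvoice switch. The selected excitation is scaled by the gain and passed through an all-pole filter, which combines the glottal filter, the vocal tract filter and the lip radiation filter, to produce the speech signal.]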

Parameter Estimation

Therefore, for each frame:

estimate the LP coefficients (the a_i's)

estimate the gain

estimate the type of excitation (voiced or unvoiced)

estimate the pitch.
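As one way to estimate the LP coefficients and the gain for a frame, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion. This is a swapped-in technique chosen for brevity, not necessarily the method used in the slides (the LPC-10 diagram later names the covariance method). The sign convention of the reflection coefficients varies between texts, and the recursion assumes r[0] > 0.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum_n frame[n] * frame[n - k] for k = 0..max_lag."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the predictor
    s_hat(n) = sum_{i=1..order} a[i] * s(n - i).

    Returns (a, reflection, error): coefficients a[1..order],
    the reflection coefficients, and the final prediction error energy.
    """
    a = [0.0] * (order + 1)
    reflection = []
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
        reflection.append(k)
    return a[1:], reflection, error

# Autocorrelation sequence of an ideal first-order process with a_1 = 0.5
coeffs, refl, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, refl, err)
```

With this convention a gain estimate can be taken as the square root of the final error energy.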


V/UV Estimation

Several methods:

Energy of signal

Zero crossing rate

Autocorrelation coefficient

[Figure: waveform with segments labelled S, U and V]


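Speech Measurements (1)

Zero Crossing Rate

Log Energy E_s:

$$E_s = 10 \log_{10}\left(\frac{1}{N}\sum_{n=1}^{N} s^2(n)\right)$$

Normalized Autocorrelation Coefficient C_1:

$$C_1 = \frac{\sum_{n=1}^{N} s(n)\,s(n-1)}{\sqrt{\left(\sum_{n=1}^{N} s^2(n)\right)\left(\sum_{n=0}^{N-1} s^2(n)\right)}}$$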
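The three measurements can be sketched in a few lines. The small `epsilon` added inside the logarithm is an assumption to keep the log finite on silent frames, and the helper names are illustrative. The frame is taken to hold samples s(0) through s(N).

```python
import math

def zero_crossing_count(s):
    """Number of sign changes between consecutive samples."""
    return sum(1 for n in range(1, len(s)) if (s[n - 1] >= 0) != (s[n] >= 0))

def log_energy_db(s, epsilon=1e-5):
    """10*log10 of the mean squared sample value (epsilon is an assumption)."""
    return 10 * math.log10(epsilon + sum(x * x for x in s) / len(s))

def normalized_autocorrelation(s):
    """Correlation between adjacent samples, normalised to [-1, 1]."""
    numerator = sum(s[n] * s[n - 1] for n in range(1, len(s)))
    energy_a = sum(s[n] ** 2 for n in range(1, len(s)))
    energy_b = sum(s[n] ** 2 for n in range(0, len(s) - 1))
    denominator = math.sqrt(energy_a * energy_b)
    return numerator / denominator if denominator > 0 else 0.0

# A slowly varying (voiced-like) frame and a rapidly alternating (noise-like) one
smooth = [math.sin(2 * math.pi * n / 40) for n in range(81)]
alternating = [1.0, -1.0] * 4

print(normalized_autocorrelation(smooth), zero_crossing_count(smooth))
print(normalized_autocorrelation(alternating), zero_crossing_count(alternating))
```

Voiced speech tends to show high energy, few zero crossings and a coefficient near 1, while unvoiced speech shows the opposite, which is what makes these measurements useful for the V/U/S decision.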


[Figure: Comparison between actual data and V/U/S determination results; waveform segments are labelled V, U and S.]


Pitch Detection

Voiced sounds

Produced by forcing air through the glottis

Vocal cords oscillate and modulate the air flow into quasi-periodic pulses

Pulses excite resonances in the remainder of the vocal tract

Different sounds are produced as muscles work to change the shape of the vocal tract

Resonant frequencies, or formant frequencies

Fundamental frequency, or pitch: the rate of the pulses


Pitch Detection

[Figure: short sections of voiced speech and of unvoiced speech; amplitude (-400 to 400) versus sample number (0 to 700)]


Time-domain pitch estimation

Well-studied area

Variations of the fundamental frequency are evident

Time-domain speech processing should be capable of detecting the pitch frequency

[Figure: speech waveform; amplitude (-400 to 400) versus sample number (0 to 700)]


Pitch Period Estimation Using the Auto-correlation Function

Periodic signals have a periodic auto-correlation function

Basic problem in choosing the window length N:

Speech changes over time, so N should be small, but the window must span at least 2 periods of the waveform

Approaches:

Choose the window to catch the longest period

Adaptive N

Use the modified short-time auto-correlation function

$$\hat{R}_n(k) = \sum_{m=0}^{N-1-k} \left[x(n+m)\,w'(m)\right]\left[x(n+m+k)\,w'(k+m)\right]$$
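A minimal sketch of pitch period estimation by picking the auto-correlation peak, assuming a rectangular window (w' = 1) and a lag search range supplied by the caller. The synthetic decaying pulse train and the 8 kHz sampling rate are illustrative assumptions.

```python
def short_time_autocorrelation(frame, k):
    """R(k) = sum_{m=0}^{N-1-k} frame[m] * frame[m + k], rectangular window."""
    return sum(frame[m] * frame[m + k] for m in range(len(frame) - k))

def estimate_pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) with the largest auto-correlation in the search range."""
    return max(range(min_lag, max_lag + 1),
               key=lambda k: short_time_autocorrelation(frame, k))

# Synthetic voiced-like frame: a decaying pulse repeated every 100 samples,
# i.e. a 12.5 ms period (80 Hz) at an assumed 8 kHz sampling rate
frame = [0.9 ** (n % 100) for n in range(400)]

period = estimate_pitch_period(frame, min_lag=20, max_lag=160)
print(period, 8000 / period)
```

The frame spans four periods, which satisfies the "at least 2 periods" requirement; on real speech the many secondary peaks make this simple peak picking less reliable.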


Pitch Period Estimation Using the Auto-correlation Function (Contd)

The auto-correlation representation retains too much of the information in the speech signal

=> the auto-correlation function has many peaks

[Figure: auto-correlation function of a speech segment, lags 0 to 700]


Spectrum flattening techniques

Remove the effects of the vocal tract transfer function

Center clipping: a nonlinear transformation; the clipping value depends on the maximum amplitude

=> Strong peak at the pitch frequency

[Figure: speech waveform, amplitude (-400 to 400) versus sample number (0 to 700), followed by two further plots over sample numbers 0 to 700]
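Center clipping can be sketched as follows. The clipping level is set as a fraction of the maximum absolute amplitude in the frame; the default of 0.3 is an assumption for illustration, since the slides only say that the level depends on the maximum amplitude.

```python
def center_clip(frame, fraction=0.3):
    """Zero out samples below the clipping level and shift the rest toward zero."""
    threshold = fraction * max(abs(x) for x in frame)
    clipped = []
    for x in frame:
        if x > threshold:
            clipped.append(x - threshold)
        elif x < -threshold:
            clipped.append(x + threshold)
        else:
            clipped.append(0.0)
    return clipped

clipped = center_clip([1.0, 0.2, -0.5, -1.0])
print(clipped)
```

Only the largest excursions of each pitch period survive, so the auto-correlation of the clipped signal has fewer competing peaks.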


Fundamental Frequency

F0 estimation (Hess): determining the main period in a quasi-periodic waveform, usually using the autocorrelation function and the average magnitude difference function (AMDF)

$$\mathrm{AMDF}_t(m) = \frac{1}{N_p} \sum_{n} \left| s_t(n) - s_t(n+m) \right|, \qquad 0 \le n,\; n+m \le L-1$$

where L is the frame length and N_p is the number of point pairs (a peak in the ACF and a valley in the AMDF indicate F0)
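A minimal AMDF sketch, assuming the sum runs over every sample pair that fits inside the frame, so the number of point pairs is the frame length minus the lag. The test signal and search range are illustrative.

```python
def amdf(frame, lag):
    """Average magnitude difference between the frame and itself shifted by lag."""
    pairs = len(frame) - lag  # number of point pairs inside the frame
    return sum(abs(frame[n] - frame[n + lag]) for n in range(pairs)) / pairs

def estimate_pitch_period_amdf(frame, min_lag, max_lag):
    """Lag (in samples) at the deepest AMDF valley in the search range."""
    return min(range(min_lag, max_lag + 1), key=lambda m: amdf(frame, m))

# Decaying pulse repeated every 100 samples (80 Hz at an assumed 8 kHz rate)
frame = [0.9 ** (n % 100) for n in range(400)]

period = estimate_pitch_period_amdf(frame, min_lag=20, max_lag=160)
print(period, amdf(frame, period))
```

The AMDF needs only subtractions and absolute values, which made it attractive for early fixed-point implementations.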


Typical Memory Requirements

Pitch coefficient (6 bits)

Gain (5 bits)

Model parameters:

LP coefficients (8-10 bits)
  Small changes in the LP coefficients result in large changes in the pole positions.

Reflection coefficients (6 bits)
  If |r_k| is near 1, quantization causes large distortion.

Log-Area Ratio:
  A non-linear transformation of the reflection coefficients that expands the scale where |r_k| is near 1.
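The log-area ratio mapping can be sketched as below, assuming the common definition LAR = ln((1 + r) / (1 - r)); some texts use the reciprocal ratio, which only flips the sign. The function names are illustrative.

```python
import math

def reflection_to_lar(r):
    """Log-area ratio of a reflection coefficient with |r| < 1."""
    return math.log((1 + r) / (1 - r))

def lar_to_reflection(lar):
    """Inverse mapping: r = (e^lar - 1) / (e^lar + 1) = tanh(lar / 2)."""
    return math.tanh(lar / 2)

# The same step of 0.01 in r covers far more of the LAR scale near |r| = 1
step_near_zero = reflection_to_lar(0.01) - reflection_to_lar(0.00)
step_near_one = reflection_to_lar(0.99) - reflection_to_lar(0.98)
print(step_near_zero, step_near_one)
```

A uniform quantizer applied to the LAR values therefore spends finer resolution on reflection coefficients close to 1, where the filter is most sensitive.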


The main difference between the LP vocoders is how the excitation source is calculated.

LPC-10


LPC-10

[Block diagram: LPC-10 encoder and decoder connected by a channel]

Encoder: speech signal -> ADC (8 kHz) -> sampled speech -> window (180 samples) -> LP analysis (covariance method) -> reflection coefficients (4 bits); non-linear warping -> LAR coefficients (4 bits and 5 bits). AMDF and zero crossing -> voice/unvoice switch (1 bit) and pitch frequency (7 bits).

Decoder: impulse generator controlled by the pitch period (7 bits), white noise generator, voice/unvoice switch (1 bit), gain (5 bits), filter 1/A(z) with 10 reflection coefficients (5 bits for one and 4 bits for the others) -> synthesized speech signal.


RELP

A simple vocoder offers poor sound quality and is usually unsatisfactory.

An improvement is to use the prediction error, rather than the periodic pulse train (for voiced signals) or the random noise (for unvoiced signals), to excite the digital filter that reproduces the speech. The prediction error is also called the residual.

This scheme is called Residual Excited Linear Prediction (RELP) coding.
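The idea can be illustrated with a short sketch: the encoder filters the speech with the prediction-error filter to get the residual, and the decoder drives the all-pole filter with that residual. Without quantization the speech is recovered exactly. The coefficients and samples below are made up, and the predictor convention s_hat(n) = sum_i a_i*s(n-i) is an assumption.

```python
def predict(past_output, n, lp_coeffs):
    """Linear prediction of sample n from the samples before it."""
    return sum(a * past_output[n - i]
               for i, a in enumerate(lp_coeffs, start=1) if n - i >= 0)

def prediction_residual(speech, lp_coeffs):
    """Encoder side: residual = speech minus its linear prediction."""
    return [speech[n] - predict(speech, n, lp_coeffs)
            for n in range(len(speech))]

def synthesize_from_residual(residual, lp_coeffs):
    """Decoder side: excite the all-pole filter with the residual."""
    speech = []
    for n in range(len(residual)):
        speech.append(residual[n] + predict(speech, n, lp_coeffs))
    return speech

speech = [1.0, 2.0, 0.5, -1.0, -0.25]
lp_coeffs = [0.5, -0.25]  # hypothetical second-order predictor

residual = prediction_residual(speech, lp_coeffs)
rebuilt = synthesize_from_residual(residual, lp_coeffs)
print(residual)
print(rebuilt)
```

In a real coder the residual is quantized before transmission, so the reconstruction is only approximate; the quality gain over a simple vocoder comes from the residual carrying what the two-state excitation model misses.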


RELP

[Block diagram: determine filter coefficients; digital filter; subtraction (-); quantization; encoder]


RELP

RELP follows essentially the same idea as DPCM.

However, in RELP the speech signal is divided into blocks (20 ms/block). The optimum linear predictor is designed for each block.

For each block, the filter coefficients and the prediction error are sent to the receiver.

In DPCM, the predictor can be fixed or adaptive. Only the prediction error is sent to the receiver.


Modeling of the prediction error

In each block of the speech signal (a frame), the prediction error may also be correlated.

To decorrelate the prediction error, each frame is further divided into 4 sub-frames (5 ms). The prediction error u(n) is then modelled as

where M (40


Long-term prediction

The decorrelation of the prediction error is called long-term prediction.

[Block diagram: s(n) -> determine filter coefficients and digital filter A(z) -> u(n) (also U(z)) -> subtraction (-) of the long-term prediction -> e(n) -> encoder]
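A minimal sketch of long-term prediction on one sub-frame, assuming the usual single-tap form e(n) = u(n) - b*u(n - M), where the lag M and gain b are chosen to minimise the energy of e(n). The symbol b, the function name and the default lag range of 40 to 120 samples are assumptions for illustration.

```python
import random

def long_term_predict(current, history, min_lag=40, max_lag=120):
    """Find the lag and gain that best predict `current` from past samples.

    `history` holds earlier samples of u(n), most recent last, and must
    contain at least max_lag samples. Lags shorter than the sub-frame are
    not supported by this sketch.
    """
    assert min_lag >= len(current) and len(history) >= max_lag
    best_lag, best_gain, best_score = min_lag, 0.0, -1.0
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        past = history[start:start + len(current)]
        energy = sum(p * p for p in past)
        if energy == 0.0:
            continue
        correlation = sum(c * p for c, p in zip(current, past))
        score = correlation * correlation / energy  # reduction in squared error
        if score > best_score:
            best_lag, best_gain, best_score = lag, correlation / energy, score
    start = len(history) - best_lag
    past = history[start:start + len(current)]
    residual = [c - best_gain * p for c, p in zip(current, past)]
    return best_lag, best_gain, residual

# Synthetic u(n) that repeats every 70 samples with an attenuation of 0.8
rng = random.Random(1)
u = [rng.uniform(-1, 1) for _ in range(70)]
for n in range(70, 160):
    u.append(0.8 * u[n - 70])

lag, gain, e = long_term_predict(u[120:], u[:120])
print(lag, round(gain, 3), max(abs(x) for x in e))
```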


RPE-LTP

RPE-LTP has been adopted as the speech coding method in the GSM 06.10 standard.

[Block diagram: determine filter coefficients; digital filter; subtraction (-) of the long-term prediction; regular pulse selection and coding]


RPE-LTP

Speech is sampled at 8 kHz and quantised to 8 bits/sample.

The speech signal is pre-processed to remove any DC component and to pre-emphasize the high-frequency components, partly compensating for their low energy.

The signal is then divided into frames (20 ms, 160 samples). An eighth-order optimum linear predictor is designed using the Schur algorithm.

The reflection coefficients (related to the filter coefficients) are nonlinearly mapped to another set of values called log-area ratios (LAR).


RPE-LTP

The 8 LAR parameters are quantized using 6, 6, 5, 5, 4, 4, 3, 3 bits, for a total of 36 bits for the LAR (that is, for the filter coefficients).

The frame is filtered using this filter, which produces u(n).

u(n) is then divided into 4 sub-frames (5 ms each, 40 samples).

Long-term prediction is performed for each sub-frame. The lag M is quantized to 7 bits and the gain h is represented by 2 bits.

Long-term prediction produces e(n).


RPE-LTP

e(n) is down-sampled by a factor of 3. For each sub-frame there are 4 possible down-sampling patterns, so 2 bits are needed to specify the pattern used.

The down-sampled e(n) has 13 samples. Their maximum is quantized to 6 bits; the 13 samples are then normalised and each is represented by 3 bits.

So in each sub-frame, e(n) is represented by 6 + 13*3 = 6 + 39 bits.

A frame has 4 sub-frames: 4*(6+39) = 180 bits.

The above method is called regular-pulse excitation (RPE).
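The pattern selection step can be sketched as follows. The slides do not state the selection rule; this sketch assumes the pattern whose 13 samples carry the most energy is chosen, and it leaves out the 6-bit and 3-bit quantization. Function and parameter names are illustrative.

```python
def select_rpe_grid(residual, decimation=3, num_grids=4, pulses=13):
    """Pick the down-sampling pattern (grid offset) with the most energy.

    Grid g keeps samples g, g + 3, g + 6, ... (13 samples per sub-frame).
    Returns (grid index, selected samples, block maximum).
    """
    def grid_energy(g):
        return sum(residual[g + decimation * i] ** 2 for i in range(pulses))

    best_grid = max(range(num_grids), key=grid_energy)
    samples = [residual[best_grid + decimation * i] for i in range(pulses)]
    block_maximum = max(abs(x) for x in samples)
    return best_grid, samples, block_maximum

# Synthetic 40-sample sub-frame whose energy sits on grid offset 2
subframe = [0.0] * 40
for i in range(13):
    subframe[2 + 3 * i] = 1.0

grid, samples, block_maximum = select_rpe_grid(subframe)
print(grid, len(samples), block_maximum)
```

The 2-bit grid index, the 6-bit block maximum and the thirteen 3-bit normalised samples together give the 2 + 6 + 39 bits per sub-frame used in the bit count.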


13 kbps RPE-LTP coder

- Encoder -

[Block diagram. Blocks: input signal; pre-processing; short-term LPC analysis; short-term analysis filter; subtraction (-); LTP analysis; RPE grid selection and coding; RPE grid decoding; adder (+); synthesis filter 1/A(z/5). Outputs: reflection coefficients (36 bits / 20 ms), LTP parameters, RPE parameters (13 pulses / 5 ms).]


- Decoder -

[Block diagram. Inputs: RPE parameters, LTP parameters, reflection coefficients. Blocks: RPE grid decoding; adder (+); synthesis filter 1/A(z/5); short-term synthesis; post-processing. Output: output signal.]


RPE-LTP

Summary                       Bits
8 LAR coefficients            36
For each sub-frame:
  pattern code                2
  lag                         7
  gain                        2
  regular pulse               6+39
  total                       56
4 sub-frames                  4*56 = 224
Total, one frame              224+36 = 260
Bit rate                      260 bits / 20 ms = 13 kbps
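The bit budget in the summary can be checked with a few lines of arithmetic. The constant names are illustrative; the numbers are the ones given in the slides.

```python
LAR_BITS = [6, 6, 5, 5, 4, 4, 3, 3]   # one entry per LAR coefficient
SUBFRAME_BITS = {
    "pattern code": 2,
    "lag": 7,
    "gain": 2,
    "block maximum": 6,
    "13 pulses x 3 bits": 13 * 3,
}
SUBFRAMES_PER_FRAME = 4
FRAME_DURATION_MS = 20

def frame_bits():
    """Total bits per 20 ms frame."""
    return sum(LAR_BITS) + SUBFRAMES_PER_FRAME * sum(SUBFRAME_BITS.values())

def bit_rate_bps():
    """Bits per second for back-to-back frames."""
    return frame_bits() * 1000 / FRAME_DURATION_MS

print(sum(LAR_BITS), sum(SUBFRAME_BITS.values()), frame_bits(), bit_rate_bps())
```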

