Speech-Coding Techniques
Chapter 3
3-2Internet Telephony
Introduction
Efficient speech-coding techniques
  Advantages for VoIP
  Digital streams of ones and zeros
  The lower the bandwidth, the lower the quality
  RTP payload types
Processing power
  Better quality (for a given bandwidth) requires a more complex algorithm
A balance between quality and cost
Voice Quality
Bandwidth is easily quantified
Voice quality is subjective
MOS, Mean Opinion Score
  ITU-T Recommendation P.800
  Excellent – 5, Good – 4, Fair – 3, Poor – 2, Bad – 1
  A minimum of 30 people
  Listen to voice samples or take part in conversations
P.800 recommendations
  The selection of participants
  The test environment
  Explanations to listeners
  Analysis of results
Toll quality
  An MOS of 4.0 or higher
Subjective and objective quality-testing techniques
PSQM – Perceptual Speech Quality Measurement
  ITU-T P.861
  Intended to faithfully represent human judgement and perception
  Algorithmic comparison between the output signal and a known input
  Factors: type of speaker, loudness, delay, active/silence frames, clipping, environmental noise
A Little About Speech
Speech
  Air pushed from the lungs past the vocal cords and along the vocal tract
  The basic vibrations – vocal cords
  The sound is altered by the disposition of the vocal tract (tongue and mouth)
Model the vocal tract as a filter
  The shape changes relatively slowly
The vibrations at the vocal cords
  The excitation signal
Speech sounds
Voiced sound
  The vocal cords vibrate open and closed
  Interrupt the air flow
  Quasi-periodic pulses of air
  The rate of the opening and closing – the pitch
  A high degree of periodicity at the pitch period
    2-20 ms
Figure: voiced speech and its power spectrum density
Unvoiced sounds
  Forcing air at high velocities through a constriction
  The glottis is held open
  Noise-like turbulence
  Show little long-term periodicity
  Short-term correlations still present
Figure: unvoiced speech and its power spectrum density
Plosive sounds
  A complete closure in the vocal tract
  Air pressure is built up and released suddenly
A vast array of sounds
The speech signal is relatively predictable over time
  The reduction of transmission bandwidth can be significant
Voice Sampling
A-to-D
  Take discrete samples of the waveform and represent each sample by some number of bits
A signal can be reconstructed if it is sampled at a minimum of twice its maximum frequency
Human speech
  300-3800 Hz
  8000 samples per second
Each sample is encoded into an 8-bit PCM code word (e.g. 01100101)
  => 8000 x 8 bit/s = 64 kbit/s
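The sampling arithmetic on this slide can be checked with a short sketch. The constant and function names are illustrative only; the numbers are the ones quoted above.

```python
# Figures quoted on the slide; the names are hypothetical.
MAX_SPEECH_FREQ_HZ = 3800   # upper edge of the 300-3800 Hz speech band
SAMPLE_RATE_HZ = 8000       # samples per second
BITS_PER_SAMPLE = 8         # one 8-bit PCM code word per sample


def nyquist_rate(max_freq_hz):
    """Minimum sampling rate: twice the maximum frequency."""
    return 2 * max_freq_hz


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of a PCM stream in bit/s."""
    return sample_rate_hz * bits_per_sample


min_rate = nyquist_rate(MAX_SPEECH_FREQ_HZ)
bit_rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
print(min_rate, bit_rate)
```

Sampling at 8000 Hz leaves a small margin above the minimum rate needed for a 3800 Hz band.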
Quantization
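How many bits are used to represent each sample
Quantization noise
  The difference between the actual level of the input analog signal and the quantized level that represents it
  More bits reduce it
  Diminishing returns
Uniform quantization levels
  Louder talkers sound better
  11.2/11 vs. 2.2/2 (the same 0.2 error is a much smaller fraction of the louder signal)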
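The 11.2/11 versus 2.2/2 comparison can be reproduced with a minimal sketch. It assumes a uniform quantizer with a step size of 1; the helper names are hypothetical.

```python
def quantize_uniform(level, step=1.0):
    """Round a signal level to the nearest multiple of `step`."""
    return round(level / step) * step


def relative_error(level, step=1.0):
    """Quantization error as a fraction of the signal level."""
    return abs(level - quantize_uniform(level, step)) / abs(level)


loud = relative_error(11.2)   # 11.2 is represented as 11
quiet = relative_error(2.2)   # 2.2 is represented as 2
print(f"loud: {loud:.1%}, quiet: {quiet:.1%}")
```

Both samples are off by the same absolute amount, but the quiet one loses a much larger share of its level, which is why louder talkers fare better with uniform steps.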
Non-uniform quantization
  Smaller quantization steps at smaller signal levels
  Spread signal-to-noise ratio more evenly
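One way to get smaller steps at small signal levels is to compress the signal, quantize uniformly, and expand again. The sketch below uses the continuous mu-law curve with mu = 255 as an illustration; it is not the segmented encoding table of any particular standard, and all names and the 127-level resolution are assumptions.

```python
import math

MU = 255  # companding constant commonly used for mu-law


def mu_compress(x):
    """Map x in [-1, 1] onto [-1, 1], stretching small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def uniform_quantize(x, levels=127):
    """Round x in [-1, 1] to the nearest of `levels` uniform steps per polarity."""
    return round(x * levels) / levels


def companded_quantize(x, levels=127):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(uniform_quantize(mu_compress(x), levels))


quiet = 0.01  # a low-level sample, full scale being 1.0
uniform_error = abs(quiet - uniform_quantize(quiet))
companded_error = abs(quiet - companded_quantize(quiet))
print(uniform_error, companded_error)
```

For the quiet sample the companded quantizer lands much closer to the input than the uniform one with the same number of levels.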
DTX and Comfort Noise
DTX is Discontinuous Transmission
A voice activity detector (VAD) detects whether there is active speech
When there is no active speech, different DTX procedures can be used:
  No transmission at all
  Comfort Noise (CN) using RFC 3389
  CN built into the codec, like AMR SID (Silence Descriptor)
The frequency of Comfort Noise packets varies but is usually some fraction of the normal packet rate
Types of Speech Coders
Waveform codecs
  Sample and code
  High-quality and not complex
  Large amount of bandwidth
Source codecs (vocoders)
  Match the incoming signal to a mathematical model
    Linear-predictive filter model of the vocal tract
    A voiced/unvoiced flag for the excitation
  The model information is sent rather than the signal
  Low bit rates, but sounds synthetic
  Higher bit rates do not improve much
Hybrid codecs
  Attempt to provide the best of both
  Perform a degree of waveform matching
  Utilize the sound production model
  Quite good quality at low bit rate
G.711
The most commonplace codec
  Used in the circuit-switched telephone network
PCM, Pulse-Code Modulation
If uniform quantization
  12 bits * 8 k/sec = 96 kbps
Non-uniform quantization
  64 kbps DS0 rate
  mu-law
    North America
  A-law
    Other countries, a little friendlier to lower signal levels
An MOS of about 4.3
DPCM
DPCM, Differential PCM
  Only transmit the difference between the predicted value and the actual value
  Voice changes relatively slowly
  It is possible to predict the value of a sample based on the values of previous samples
  The receiver performs the same prediction
The simplest form
  No prediction
  No algorithmic delay
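A minimal sketch of the differential idea follows. It assumes the predicted value is simply the previous sample and that differences are sent unquantized, so it illustrates the principle rather than any standardized coder; the function names are hypothetical.

```python
def dpcm_encode(samples):
    """Return the difference between each sample and its predicted value."""
    prediction = 0          # assumed starting prediction
    diffs = []
    for sample in samples:
        diffs.append(sample - prediction)
        prediction = sample  # predict the next sample from this one
    return diffs


def dpcm_decode(diffs):
    """Rebuild the samples by running the same prediction as the encoder."""
    prediction = 0
    samples = []
    for diff in diffs:
        sample = prediction + diff
        samples.append(sample)
        prediction = sample
    return samples


samples = [100, 102, 105, 107, 106, 104]
diffs = dpcm_encode(samples)
print(diffs)
print(dpcm_decode(diffs))
```

Because voice changes slowly, the differences after the first value are small and need fewer bits than the samples themselves.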
ADPCM
ADPCM, Adaptive DPCM
  Predicts sample values based on
    Past samples
    Factoring in some knowledge of how speech varies over time
  The error is quantized and transmitted
  Fewer bits required
G.721
  32 kbps
G.726
  A-law/mu-law PCM -> 16, 24, 32, 40 kbps
  An MOS of about 4.0 at 32 kbps
Analysis-by-Synthesis (AbS) Codecs
Hybrid codec
  Fill the gap between waveform and source codecs
  The most successful and commonly used
Time-domain AbS codecs
  Not a simple two-state, voiced/unvoiced
  Different excitation signals are attempted
  Closest to the original waveform is selected
  MPE, Multi-Pulse Excited
  RPE, Regular-Pulse Excited
  CELP, Code-Excited Linear Predictive
G.728 LD-CELP
CELP codecs
  A filter; its characteristics change over time
  A codebook of acoustic vectors
    A vector = a set of elements representing various characteristics of the excitation
  Transmit
    Filter coefficients, gain, a pointer to the vector chosen
Low Delay CELP
  Backward-adaptive coder
    Use previous samples to determine filter coefficients
  Operates on five samples at a time
    Delay < 1 ms
  Only the pointer is transmitted
1024 vectors in the code book
  10-bit pointer (index)
16 kbps
LD-CELP encoder
  Minimize a frequency-weighted mean-square error
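The G.728 figures fit together arithmetically: a 1024-entry code book needs a 10-bit index, and one index per five samples at 8000 samples per second gives 16 kbps. A short sketch, with hypothetical names and the 8 kHz rate carried over from the voice-sampling slide:

```python
SAMPLE_RATE_HZ = 8000      # assumed telephony sampling rate
SAMPLES_PER_VECTOR = 5     # LD-CELP operates on five samples at a time
CODEBOOK_SIZE = 1024       # vectors in the code book

# Bits needed to address every code book entry.
index_bits = (CODEBOOK_SIZE - 1).bit_length()

# Duration of speech covered by one transmitted index, in milliseconds.
vector_ms = SAMPLES_PER_VECTOR * 1000 / SAMPLE_RATE_HZ

# One index is sent per vector, so the bit rate follows directly.
bit_rate = index_bits * SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR

print(index_bits, vector_ms, bit_rate)
```

The 0.625 ms vector duration is consistent with the stated delay of under 1 ms.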
LD-CELP decoder
An MOS of about 3.9
One-quarter of G.711 bandwidth
G.723.1 ACELP
6.3 or 5.3 kbps
  Both mandatory
  Can change from one to another during a conversation
The coder
  A band-limited input speech signal
  Sampled at 8 kHz, 16-bit uniform PCM quantization
  Operate on blocks of 240 samples at a time
  A look-ahead of 7.5 ms
  A total algorithmic delay of 37.5 ms + other delays
  A high-pass filter to remove any DC component
Various operations to determine the appropriate filter coefficients
5.3 kbps, Algebraic Code-Excited Linear Prediction
6.3 kbps, Multi-pulse Maximum Likelihood Quantization
The transmission
  Linear prediction coefficients
  Gain parameters
  Excitation codebook index
  24-octet frames at 6.3 kbps, 20-octet frames at 5.3 kbps
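The frame duration, algorithmic delay and on-the-wire rates quoted for G.723.1 can be cross-checked with a few lines. The names are hypothetical; the inputs are the slide's own figures.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_FRAME = 240
LOOK_AHEAD_MS = 7.5

# 240 samples at 8 kHz span one frame.
frame_ms = SAMPLES_PER_FRAME * 1000 / SAMPLE_RATE_HZ

# The coder must collect a full frame plus the look-ahead before encoding.
algorithmic_delay_ms = frame_ms + LOOK_AHEAD_MS


def frame_bit_rate(octets_per_frame):
    """Bit rate in bit/s implied by sending one frame per frame period."""
    return octets_per_frame * 8 * 1000 / frame_ms


high_rate = frame_bit_rate(24)
low_rate = frame_bit_rate(20)
print(frame_ms, algorithmic_delay_ms, high_rate, round(low_rate))
```

The octet-aligned frames come out slightly above the nominal 6.3 and 5.3 kbps because each frame is rounded up to a whole number of octets.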
G.723.1 Annex A
  Silence Insertion Description (SID) frames of size four octets
The two lsbs of the first octet
  00   6.3 kbps    24 octets/frame
  01   5.3 kbps    20 octets/frame
  10   SID frame    4 octets/frame
An MOS of about 3.8
At least 37.5 ms delay
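Because the two least significant bits of the first octet identify the frame type, a receiver can split a payload carrying several frames without any extra length field. The sketch below is illustrative: the function names are hypothetical, and treating the remaining bit pattern (11) as an error is an assumption, since the slide does not cover it.

```python
# Frame types keyed by the two lsbs of the first octet, as listed on the slide.
G7231_FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}


def g7231_frame_info(first_octet):
    """Return (frame kind, frame size in octets) for a frame's first octet."""
    code = first_octet & 0b11
    if code not in G7231_FRAME_TYPES:
        # Assumption: the pattern 11 is not described above, so reject it.
        raise ValueError(f"unknown frame type bits: {code:02b}")
    return G7231_FRAME_TYPES[code]


def split_frames(payload):
    """Split a byte string of back-to-back frames into (kind, frame) pairs."""
    frames = []
    offset = 0
    while offset < len(payload):
        kind, size = g7231_frame_info(payload[offset])
        if offset + size > len(payload):
            raise ValueError("truncated frame")
        frames.append((kind, payload[offset:offset + size]))
        offset += size
    return frames


payload = bytes(24) + bytes([0b10, 0, 0, 0]) + bytes([0b01] + [0xFF] * 19)
for kind, frame in split_frames(payload):
    print(kind, len(frame))
```

This also shows how a sender can switch between the two rates, or insert a SID frame, from one frame to the next.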
G.729
8 kbps
Input frames of 10 ms, 80 samples for 8 kHz sampling rate
5 ms look-ahead
  Algorithmic delay of 15 ms
An 80-bit frame for 10 ms of speech
A complex codec
G.729.A (Annex A), a number of simplifications
  Same frame structure
  G.729 and G.729.A encoders/decoders interoperate
  Slightly lower quality
G.729.B
  VAD, Voice Activity Detection
    Based on analysis of several parameters of the input
    The current frame plus two preceding frames
  DTX, Discontinuous Transmission
    Send nothing or send an SID frame
    SID frame contains information to generate comfort noise
  CNG, Comfort Noise Generation
G.729, an MOS of about 4.0
G.729A, an MOS of about 3.7
G.729 Annex D
  A lower-rate extension
  6.4 kbps; 10 ms speech samples, 64 bits/frame
  MOS 6.3 kbps G.723.1
G.729 Annex E
  A higher bit rate enhancement
  The linear prediction filter of G.729 has 10 coefficients; that of G.729 Annex E has 30 coefficients
  The codebook of G.729 has 35 bits; that of G.729 Annex E has 44 bits
  118 bits/frame; 11.8 kbps
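All three G.729 variants use 10 ms frames, so each bit rate follows from the bits per frame. A small sketch with hypothetical names, using only the frame sizes quoted in the slides:

```python
FRAME_MS = 10  # frame duration shared by G.729 and its Annexes D and E

# Bits per 10 ms frame, as quoted on the slides.
BITS_PER_FRAME = {
    "G.729": 80,
    "G.729 Annex D": 64,
    "G.729 Annex E": 118,
}


def bit_rate_bps(bits_per_frame, frame_ms=FRAME_MS):
    """Bit rate in bit/s for one frame of the given size per frame period."""
    return bits_per_frame * 1000 // frame_ms


rates = {name: bit_rate_bps(bits) for name, bits in BITS_PER_FRAME.items()}
for name, rate in rates.items():
    print(f"{name}: {rate / 1000} kbps")
```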
Other Codecs
CDMA QCELP defined in IS-733
  Variable-rate coder
  Two most common rates
    The high rate, 13.3 kbps
    A lower rate, 6.2 kbps
  Silence suppression
  For use with RTP, RFC 2658
GSM Enhanced Full-Rate (EFR)
  GSM 06.60
  An enhanced version of GSM Full-Rate
  ACELP-based codec
  The same bit rate and the same overall packing structure
    12.2 kbps
  Support discontinuous transmission
  For use with RTP, RFC 1890
GSM Adaptive Multi-Rate (AMR) codec
  20 ms coding delay
  Eight different modes
    4.75 kbps to 12.2 kbps
    12.2 kbps, GSM EFR
    7.4 kbps, IS-641 (TDMA cellular systems)
  Change the mode at any time
  Offer discontinuous transmission
    The SID (Silence Descriptor) is sent in every 8th frame and is 5 bytes in size
  The coding choice of many 3G wireless networks
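The saving from discontinuous transmission can be estimated from the SID figures above. The sketch assumes 20 ms frames, in line with the 20 ms coding delay quoted, and ignores packet header overhead; the names are hypothetical.

```python
FRAME_MS = 20              # assumed AMR frame duration
SID_INTERVAL_FRAMES = 8    # a SID is sent in every 8th frame
SID_BYTES = 5              # size of one SID

# Average bit rate during silence, in bit/s.
sid_bit_rate = SID_BYTES * 8 * 1000 / (SID_INTERVAL_FRAMES * FRAME_MS)

# Compare with the lowest speech mode, 4.75 kbps.
LOWEST_MODE_BPS = 4750
saving_factor = LOWEST_MODE_BPS / sid_bit_rate

print(sid_bit_rate, saving_factor)
```

Under these assumptions the codec payload during silence is a small fraction of even the lowest speech mode.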
The MOS values are for laboratory conditions
  G.711 does not deal with lost packets
  G.729 can accommodate a lost frame by interpolating from previous frames
    But this causes errors in subsequent speech frames
Processing power
  G.728 or G.729, 40 MIPS
  G.726, 10 MIPS
iLBC
A FREE codec for robust VoIP
13.33 kbit/s with an encoding frame length of 30 ms, and 15.20 kbit/s with 20 ms
Computational complexity in the range of G.729A
Speex
Open-source patent-free speech codec
CELP (code-excited linear prediction) codec
Operating modes:
  narrowband (8 kHz sampling rate)
    2.15 – 24.6 kb/s
    delay of 30 ms
  wideband (16 kHz sampling rate)
    4-44.2 kb/s
    delay of 34 ms
  ultra-wideband (32 kHz sampling rate)
Intensity stereo encoding
Variable bit rate (VBR) possible
Voice activity detection (VAD)
Cascaded Codecs
  E.g., G.711 stream -> G.729 encoder/decoder
  Might not even come close to G.729
  Each coder only generates an approximation of the incoming signal
Audio samples
  http://www.cs.columbia.edu/~hgs/audio/codecs.html
Effects of packetization
Tones, Signal, and DTMF Digits
The hybrid codecs are optimized for human speech
Other data may need to be transmitted
  Tones: fax tones, dialing tone, busy tone
  DTMF digits for two-stage dialing or voice-mail
G.711 is OK
With G.723.1 and G.729, tones can be unintelligible
The ingress gateway needs to intercept
  The tones and DTMF digits
Use an external signaling system
Easy at the start of a call
Difficult in the middle of a call
Encode the tones differently from the speech
  Send them along the same media path
  An RTP packet provides the name of the tone and the duration
  Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
RFC 2198
  An RTP payload format for redundant audio data
  Sending both types of RTP payload
RTP Payload Format for DTMF Digits
  An Internet Draft
  Both methods described before
  A large number of tones and events
    DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
  The named events
    E: the end of the tone, R: reserved
Payload format
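A named-event payload can be packed and parsed in a few lines. The sketch assumes the four-octet layout with which this draft was later published as RFC 2833: an 8-bit event code, the E bit, the R bit, a 6-bit volume, and a 16-bit duration in timestamp units. The function names are hypothetical.

```python
import struct


def pack_named_event(event, end, volume, duration):
    """Build a 4-octet named-event payload (layout assumed from RFC 2833)."""
    if not (0 <= event <= 0xFF and 0 <= volume <= 0x3F and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    # E is the top bit of the second octet; R (0x40) is left at zero.
    flags = (0x80 if end else 0x00) | volume
    return struct.pack("!BBH", event, flags, duration)


def unpack_named_event(payload):
    """Parse the first four octets of a named-event payload."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,
        "end": bool(flags & 0x80),
        "reserved": bool(flags & 0x40),
        "volume": flags & 0x3F,
        "duration": duration,
    }


# DTMF digit 5, end of tone, volume 10, duration 800 timestamp units.
payload = pack_named_event(5, True, 10, 800)
print(payload.hex(), unpack_named_event(payload))
```

At an 8000 Hz timestamp clock, a duration of 800 units corresponds to 100 ms of tone.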