Speech Signal Analysis and Coding
Dr. Arun Kumar
Centre for Applied Research in Electronics (CARE), IIT Delhi
Contents
• Speech Processing Applications
• Speech Signal Understanding
– Speech Production
– Speech Signal Characteristics and Analysis
• Speech Coding
– Coding Standards
– Coder Attributes including Quality Evaluation
– Coding Methodologies
• Speech Transmission
– Trunk-line telephony
– Wireless telephony
• Speech Storage
– Voice Mail, Voice Memo, Answering machines
• Speech Synthesis
– Text-to-speech-synthesis
– Automatic information services
Speech Processing Applications
• Speaker Verification and Identification
– Phone banking
– Secure entry
• Aids for the Handicapped
– Variable rate playback
– Hearing aids
– Reading machine for visually impaired
– Visual display of speech information for hearing impaired
Speech Processing Applications
• Speech Enhancement
– Echo and noise cancellation
• Speech Recognition
– Automatic language translation
• Voice Personality Transformation
– Voice conversion from “source” to “target”
Speech Processing Applications
“It is the variation of pressure, from atmospheric pressure, as a function of time, caused by traveling waves from the speaker’s mouth (apart from nostrils, cheeks and throat).”
The Speech Signal
Units:
SPL (Sound Pressure Level) in dB relative to a reference level.
Reference: 10^-16 W/cm^2
– Corresponds to ‘just barely audible’
The Intensity Level of Speech
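The dB scale above can be made concrete with a short sketch. It assumes the usual definition of level as 10·log10 of an intensity ratio, taken against the 10^-16 W/cm^2 reference from the slide; the function name and the sample intensities are illustrative only.

```python
import math

I_REF_W_PER_CM2 = 1e-16  # reference intensity quoted on the slide


def intensity_level_db(intensity_w_per_cm2: float) -> float:
    """Level in dB of an intensity relative to the 1e-16 W/cm^2 reference."""
    return 10.0 * math.log10(intensity_w_per_cm2 / I_REF_W_PER_CM2)


# Hypothetical intensities chosen only to illustrate the scale.
for intensity in (1e-16, 1e-10, 1e-4):
    print(f"{intensity:.0e} W/cm^2 -> {intensity_level_db(intensity):.0f} dB")
```

Every factor of 10 in intensity adds 10 dB, so the reference itself sits at 0 dB.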
[Figure: sound intensity scale in dB with marks at 0, 20, 55, 60, 70, 80, 100 and 120 dB. Labeled levels: just barely audible, whisper, variations in normal voice level (1 meter distance from mouth), heavy traffic, airplane, rock concert]
The Intensity Level of Speech
• Energy of speech during 1 s
– 2 x 10^-5 Joules
(It takes 100 Joules to light a 100 W bulb for 1 s)
• Strongest vowel: /a/ as in “talk”
• Weakest vowel: /i/ as in “see”
• Strongest consonant: /r/ as in “run”
• Weakest consonant: /Θ/ as in “thin”
The Intensity Level of Speech
Audio Signal Category    Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
Telephone Band Speech    300-3400         8.0                   128
Wideband Speech          50-7000          16.0                  256
Wideband Audio           20-20,000        44.1/48.0             705/768
Speech & Audio Signal Specs.
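The source rates in the table follow from the sampling rates if each sample is stored as 16-bit linear PCM on a single channel. That word length is an assumption here, chosen because it reproduces the listed figures; the function name is illustrative.

```python
def source_rate_kbps(sampling_rate_khz: float, bits_per_sample: int = 16) -> float:
    """Uncompressed single-channel bit rate in kbps."""
    return sampling_rate_khz * bits_per_sample


SAMPLING_RATES_KHZ = {
    "Telephone band speech": 8.0,
    "Wideband speech": 16.0,
    "Wideband audio (44.1 kHz)": 44.1,
    "Wideband audio (48 kHz)": 48.0,
}

for name, rate_khz in SAMPLING_RATES_KHZ.items():
    print(f"{name}: {source_rate_kbps(rate_khz):.1f} kbps")
```

The 44.1 kHz case works out to 705.6 kbps, which the table lists as 705.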
Speech Articulation by the Vocal System
Reproduced from: D. O’Shaughnessy, Human and machine speech communication, IEEE Press, 2000
Speech Classes by Articulation
• Voiced speech
• Unvoiced speech
• Transient (stop) sounds
The relationship between speech sounds (phonemes) and their acoustic realizations
– Waveform
– Spectrum
– Spectrogram
Acoustic Analysis of Speech
Time Waveform of a Speech Sentence
[Figure: time waveform of the sentence "THIS IS GOOD", amplitude versus time (0 to 1.4 s), with the segments labeled (TH) (i) (s), (i) (s), (G) (O) (D)]
• Vowels
– High energy, periodic, steady-state utterance
• Unvoiced fricatives
– Low energy, noise-like, steady-state utterance
• Voiced fricatives
– Low energy, element of periodicity, steady-state utterance
• Stops
– Transient release, medium to low energy
• Nasals
– Low-to-medium energy, periodic, steady-state utterance
Waveform Analysis of a Speech
Fundamental frequency F0 / Pitch period
F0 Male Female
Average (Hz) 132 223
Range (Hz) 50-250 120-500
Acoustic Analysis of Vowels
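One common way to measure F0 on a voiced frame is to find the lag of the strongest autocorrelation peak inside a plausible pitch range. The sketch below is an illustration, not a method taken from the slides: the 50-500 Hz search range is chosen to span both columns of the table, and the test frame is synthetic.

```python
import numpy as np


def estimate_f0(frame, fs, f0_min=50.0, f0_max=500.0):
    """Estimate F0 (Hz) as fs / lag of the strongest autocorrelation peak."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # Autocorrelation for lags 0 .. len(x) - 1.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / f0_max)                    # shortest pitch period searched
    lag_max = min(int(fs / f0_min), len(x) - 1)   # longest pitch period searched
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag


fs = 8000                                  # telephone-band sampling rate (Hz)
t = np.arange(int(0.040 * fs)) / fs        # one 40 ms frame
f0_true = 132.0                            # average male F0 from the table
# Synthetic "voiced" frame: five harmonics of f0_true with decaying amplitude.
frame = sum(np.cos(2 * np.pi * k * f0_true * t) / k for k in range(1, 6))

f0_hat = estimate_f0(frame, fs)
print(f"estimated F0 = {f0_hat:.1f} Hz, pitch period = {1000.0 / f0_hat:.2f} ms")
```

The estimate is limited to integer lags, so at 8 kHz it lands within a few Hz of the true value rather than exactly on it.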
• Stop Consonants
– Momentary blockage of the vocal tract (50-100 ms): Closure phase
– Release burst (shortest acoustic event)
– Voice-onset time (VOT)
• Fricatives
– Narrow constriction somewhere in vocal tract
– Turbulent airflow through the constriction
Acoustic Analysis of Consonants
The International Phonetic Alphabet (IPA)
Universal Speech Production Model
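[Block diagram: an impulse train generator followed by a glottal pulse model (scaled by the voiced gain) and a white noise generator (scaled by the unvoiced gain) feed a voiced or unvoiced switch. The selected excitation passes through the vocal tract filter and the radiation model to give the output speech]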
Vocal Tract Model
• Time-varying all-pole linear filter excited by a
source signal.
• H(z) models the vocal tract system.
e[n] → H(z) = 1/A(z) → s[n]
$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$
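A minimal sketch of how the coefficients a_i of the all-pole model might be estimated and used. It assumes the autocorrelation method with the Levinson-Durbin recursion, which the slides do not specify. The function names, the second-order test filter and the white-noise excitation are illustrative.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Autocorrelation method + Levinson-Durbin; returns a_1..a_P
    for the predictor s[n] ~ sum_i a_i * s[n - i]."""
    x = np.asarray(frame, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)          # a[0] is unused
    err = r[0]                       # prediction error energy
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(s, a):
    """Inverse filter A(z): e[n] = s[n] - sum_i a_i * s[n - i]."""
    s = np.asarray(s, dtype=float)
    e = s.copy()
    for i, ai in enumerate(a, start=1):
        e[i:] -= ai * s[:-i]
    return e


def lp_synthesize(e, a):
    """Synthesis filter 1/A(z): s[n] = e[n] + sum_i a_i * s[n - i]."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        s[n] = e[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                s[n] += ai * s[n - i]
    return s


rng = np.random.default_rng(0)
a_true = np.array([1.3, -0.6])               # hypothetical stable 2nd-order model
excitation = rng.standard_normal(4000)       # white-noise source e[n]
speech_like = lp_synthesize(excitation, a_true)

a_hat = lp_coefficients(speech_like, order=2)
residual = lp_residual(speech_like, a_hat)
print("estimated a_i:", np.round(a_hat, 3))
print("variance ratio s/e:", np.var(speech_like) / np.var(residual))
```

Filtering the residual back through 1/A(z) with the same coefficients reproduces the signal, which is the basis of the coders discussed later.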
[Figures: voiced speech spectrum, magnitude (dB) versus frequency (0 to 4000 Hz), shown first alone and then with superimposed LP envelopes of order 2; orders 2 and 6; orders 2, 6 and 10; and orders 2, 6, 10 and 16]
[Figure: Unvoiced Speech and 10th-order LP Residual, amplitude versus time (0 to 40 ms)]
[Figure: Voiced Speech and 10th-order LP Residual, amplitude versus time (0 to 40 ms)]
• Short-term correlation
• Long-term correlation
Speech Coding
• For telephone band (or narrowband) speech:
– Signal Bandwidth: 300-3400 Hz
– Sampling Rate: 8000 Hz
– Resolution: 16 bits / sample linear PCM
• Uncompressed bit rate:
– 16 bits/sample x 8000 samples/s = 128 kbit/s
• What is the minimum coding rate for transmitting the message information?
Coding Rates
Coder Classes according to Bit-Rate
B > 16 kbps         High bit rate coders
4 < B <= 16 kbps    Medium bit rate coders
1 < B <= 4 kbps     Low bit rate coders
B < 1 kbps          Very low bit rate coders
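The bit-rate classes can be written as a small lookup. The slide leaves B = 1 kbps unassigned; the sketch below places it in the very low class, which is an assumption made only so that the function is total.

```python
def coder_class(bit_rate_kbps: float) -> str:
    """Map a bit rate B (kbps) to the coder class from the slide."""
    if bit_rate_kbps > 16:
        return "high bit rate"
    if bit_rate_kbps > 4:
        return "medium bit rate"
    if bit_rate_kbps > 1:
        return "low bit rate"
    # Assumption: B = 1 kbps exactly is treated as very low bit rate.
    return "very low bit rate"


for rate in (64, 16, 8, 4.8, 2.4, 0.8):
    print(f"{rate} kbps -> {coder_class(rate)}")
```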
• ITU-T: International Telecommunication Union (UN)
• MPEG: Moving Picture Experts Group (ISO/IEC)
• INMARSAT: International Maritime Satellite Organization, for geo-synchronous satellites
• US Government: DoD, NATO
• TIA: Telecommunications Industry Association, for North American telecom standards
• ETSI: European Telecommunications Standards Institute
Standards Organizations
Name                       Coding Type        Bit-rate (kbps)   Organization   Year
G.711/G.712                PCM µ-law/A-law    64                ITU-T          1972
G.721/G.723, G.726/G.727   ADPCM              32/24/40/16       ITU-T          1984/86/88/90
G.728                      LD-CELP            16                ITU-T          1992
G.729                      CS-ACELP           8.0               ITU-T          1995
G.723.1                    ACELP              6.3/5.3           ITU-T          1995
G.722 (Wideband)           SB-ADPCM           48/56/64          ITU-T          1985
Speech Coding Standards
Name                 Coding Type   Bit-rate (kbps)   Organization   Year
G.722.1 (Wideband)   Transform     24/32             ITU-T          1999
Inmarsat             IMBE          4.15              INMARSAT       1990
IS-54 (old)          VSELP         7.95              TIA            1992
GSM-FR               RPE-LTP       13                GSM            1991
GSM-HR               CELP          5-6               GSM            1994
GSM-EFR              CELP          12.2              GSM            1997
Speech Coding Standards
Name           Coding Type   Bit-rate (kbps)   Organization   Year
IS-641 (new)   ACELP         7.4               TIA            1997
Iridium        AMBE          2.4               Iridium        1996
MPEG-4         HVXC          2-4               MPEG/ISO       1999
MPEG-4         CELP          4-24              MPEG/ISO       1999
FS-1015        LPC-10        2.4               US-DoD/NATO    1984
FS-1016        CELP          4.8               US-DoD/NATO    1989
MELP           MELP          2.4               US-DoD/NATO    1996
Speech Coding Standards
• Coding Methodologies
– Waveform coding
– Vocoding or parametric coding
– Hybrid coding
Coding Methodologies
Classes according to Coding Type
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for parametric coders, hybrid coders and waveform-approximating coders]
Coding Standards
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for parametric, hybrid and waveform-approximating coders, with coding standards placed on the chart: FS1015, MELP, GSM/2, IS96, G.723.1, G.729, GSM FR, GSM EFR, G.728, G.726, G.711 and linear PCM]
PCM Coding
x[n] → Q[.] → x’[n], with code index i[n]
• Instantaneous, non-uniform quantization
• For signals with time-varying energy, e.g. speech, uniform quantization is inefficient.
• If the signal amplitude is halved, the SQNR of a uniform quantizer falls by about 6 dB.
• SQNR is nearly independent of signal level in a logarithmic quantizer.
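The contrast between uniform and logarithmic quantization can be checked numerically. The sketch below assumes 8-bit quantizers, the continuous µ-law curve with µ = 255 (not the segmented G.711 tables) and a sine test signal; all of these are illustrative choices.

```python
import numpy as np

MU = 255.0  # assumed companding constant


def uniform_quantize(x, bits=8):
    """Mid-rise uniform quantizer on [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step


def mu_law_quantize(x, bits=8):
    """Compress with the mu-law curve, quantize uniformly, then expand."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    q = uniform_quantize(compressed, bits)
    return np.sign(q) * np.expm1(np.abs(q) * np.log1p(MU)) / MU


def sqnr_db(x, xq):
    """Signal-to-quantization-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - xq) ** 2))


fs = 8000
n = np.arange(fs)
results = {}
for amplitude in (0.5, 0.25, 0.0625):
    x = amplitude * np.sin(2 * np.pi * 440.3 * n / fs)
    results[amplitude] = (sqnr_db(x, uniform_quantize(x)),
                          sqnr_db(x, mu_law_quantize(x)))
    print(f"A = {amplitude}: uniform {results[amplitude][0]:.1f} dB, "
          f"mu-law {results[amplitude][1]:.1f} dB")
```

Each halving of the amplitude costs the uniform quantizer about 6 dB, while the µ-law figure stays roughly flat over the same range.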
ADPCM Coding
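[Block diagram: ADPCM encoder and decoder. Encoder: the prediction x’[n] from the predictor P is subtracted from the input x[n] to give d[n], which the quantizer Q[.] codes as c[n]; the quantized difference d’[n] is added to x’[n] to give the reconstruction x”[n] that feeds P. Decoder: d’[n] is recovered from c[n] and added to the prediction x’[n] from P to give x”[n]]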
Prediction in the context of Coding
[Figure: Signal and first-difference signal, amplitude versus time (0 to 20 ms)]
• DPCM with fixed predictor can give 4-11 dB improvement over PCM.
• PCM with adaptive quantization can give ~5 dB improvement over µ-law non-adaptive PCM.
• DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.
ADPCM Coding
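The benefit of coding the prediction error can be seen in a minimal DPCM loop. This sketch uses a fixed first-order predictor and a fixed step size, so it omits the adaptation that makes ADPCM adaptive; the predictor coefficient 0.9, the step size and the test tone are assumptions.

```python
import numpy as np


def dpcm_codec(x, step, a=0.9):
    """First-order DPCM with the quantizer inside the prediction loop.

    Returns the integer codes c[n] and the reconstruction x''[n] that a
    decoder running the same predictor would produce.
    """
    codes = np.zeros(len(x), dtype=int)
    recon = np.zeros(len(x))
    prev = 0.0
    for n, sample in enumerate(x):
        pred = a * prev                       # x'[n], from past reconstruction
        d = sample - pred                     # d[n], prediction error
        codes[n] = int(np.round(d / step))    # c[n], uniform quantizer index
        recon[n] = pred + codes[n] * step     # x''[n] = x'[n] + d'[n]
        prev = recon[n]
    return codes, recon


fs = 8000
n = np.arange(400)
x = 0.5 * np.sin(2 * np.pi * 200.0 * n / fs)   # slowly varying test tone
step = 0.01

codes, recon = dpcm_codec(x, step)
pcm_codes = np.round(x / step).astype(int)      # same step, no prediction

print("largest PCM code magnitude: ", np.max(np.abs(pcm_codes)))
print("largest DPCM code magnitude:", np.max(np.abs(codes)))
print("max reconstruction error:   ", np.max(np.abs(x - recon)))
```

Because the predictor runs on the reconstructed signal, the error stays within half a step, while the codes span a much smaller range than direct PCM and so need fewer bits.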
Code Excited Linear Prediction (CELP) Coding
• Most coders in the 4.8-16 kbps range are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.
• CELP belongs to the LPAS paradigm of speech coding.
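Generic Linear Prediction Analysis-by-Synthesis (LPAS) Coder
[Block diagram: LP analysis of the input speech sets the synthesis filter. The excitation generator drives the synthesis filter, the synthesized output is subtracted from the input speech, and the error minimization block adjusts the excitation generator to minimize the difference]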
CELP Decoder
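[Block diagram: the excitation parameters drive the excitation generator, whose output passes through the synthesis filter G/A(z), set by the LP and gain parameters, to give the synthesized speech]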
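The analysis-by-synthesis idea can be sketched as an exhaustive codebook search: each candidate excitation is passed through 1/A(z), the best gain is computed in closed form, and the entry with the smallest squared error wins. This toy version assumes a random fixed codebook and omits perceptual weighting, the adaptive (pitch) codebook and filter memory. All names and sizes are hypothetical.

```python
import numpy as np


def synthesize(excitation, a):
    """All-pole filter 1/A(z) with zero initial state."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        s[n] = excitation[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                s[n] += ai * s[n - i]
    return s


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codevector whose synthesized,
    gain-scaled output is closest to the target in squared error."""
    best = (None, 0.0, np.inf)
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        gain = np.dot(target, y) / np.dot(y, y)    # optimal gain for this entry
        err = np.sum((target - gain * y) ** 2)
        if err < best[2]:
            best = (index, gain, err)
    return best


rng = np.random.default_rng(1)
a = np.array([1.3, -0.6])                      # hypothetical LP coefficients
codebook = rng.standard_normal((64, 40))       # 64 entries, 40-sample subframe

# Build a target from a known entry so the search result can be checked.
target = 0.7 * synthesize(codebook[17], a)
index, gain, err = search_codebook(target, codebook, a)
print(f"chosen index = {index}, gain = {gain:.3f}, error = {err:.2e}")
```

Only the index and the gain would need to be transmitted, since the decoder holds the same codebook and receives the LP parameters separately.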
• Speech Quality
– Objective measures
• Segmental SNR
• Itakura-Saito distance measure
• Spectral distortion (SD)
• ITU-T P.862 Recommendation
– Subjective measures
• Mean opinion score (MOS)
• Diagnostic Rhyme Test (DRT)
• Diagnostic Acceptability Measure (DAM)
Speech Quality Measurement
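Of the objective measures listed, segmental SNR is the simplest to compute: the SNR is evaluated frame by frame in dB and then averaged. The sketch below assumes 20 ms frames at 8 kHz and clamps each frame's SNR to the range -10 to 35 dB, a commonly used convention that the slides do not state.

```python
import numpy as np


def segmental_snr_db(clean, coded, frame_len=160, floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs (dB), each clamped to [floor_db, ceil_db]."""
    clean = np.asarray(clean, dtype=float)
    coded = np.asarray(coded, dtype=float)
    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        noise = ref - coded[start:start + frame_len]
        signal_energy = np.sum(ref ** 2)
        if signal_energy == 0.0:
            continue                               # skip silent frames
        noise_energy = max(np.sum(noise ** 2), 1e-12)
        snr = 10.0 * np.log10(signal_energy / noise_energy)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no non-silent frames to evaluate")
    return float(np.mean(frame_snrs))


fs = 8000
n = np.arange(1600)
clean = np.sin(2 * np.pi * 200.0 * n / fs)
coded = 0.9 * clean                                # hypothetical "coder" output
print(f"segmental SNR = {segmental_snr_db(clean, coded):.1f} dB")
```

Averaging in dB keeps low-energy frames from being swamped by loud ones, which is the motivation for using the segmental form over a single global SNR.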
• Listening quality scale
Excellent 5
Good 4
Fair 3
Poor 2
Bad 1
Absolute Category Rating Tests (MOS)
• Measures speech intelligibility
• Listeners are presented with one of two words which differ only in the leading consonant
– Examples:
• Meet - Beat
• Than - Dan
• Met - Net
• Jest - Guest
Diagnostic Rhyme Test
• Total possible pairs = 96
• Intelligibility score, S, is given by:
$$S = 100 \times \frac{N(\text{correct}) - N(\text{incorrect})}{N(\text{test pairs})}$$
Coder Rate (kbps) DRT MOS
FS1016 4.8 91.7 3.3
G.728 16 93.0 3.9
Diagnostic Rhyme Test
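The intelligibility score is a one-line calculation. The counts used below are hypothetical and are not data from any listening test.

```python
def drt_score(n_correct: int, n_incorrect: int, n_test_pairs: int) -> float:
    """DRT intelligibility score S = 100 * (correct - incorrect) / pairs."""
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs


# Hypothetical listener results over the 96 word pairs.
print(f"{drt_score(92, 4, 96):.1f}")   # mostly correct
print(f"{drt_score(48, 48, 96):.1f}")  # pure guessing on a two-choice test
```

Subtracting the incorrect responses means that a listener who guesses at random on every pair scores about zero instead of 50.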
• Part of ITU-T P.862 standard
• Objective is to mimic sound perception by persons in real life
• PESQ simulates experiments in which subjects judge speech quality
• Physical signals are mapped to psychophysical representations that match internal representations in the head
Perceptual evaluation of speech quality (PESQ)
• Complexity
– Computational complexity
• Simplex/half-duplex/full-duplex real time performance on a single DSP
• Fixed point vs. floating point
• CELP coders are computationally complex
– Memory requirement
• Storage of look-up tables, codebooks etc.
Speech Coder Complexity Issues
Timing Diagram for various Coding Delays
[Figure: timing diagram against time in frame index (0 to 5), with one row per stage: buffering of input speech frames 1 to 5, encoding of frames 1 to 4, transmission of the bits of frames 1 to 3, decoding of the received frames, and playback of decoded speech frames 1 and 2]
• Total one-way coding delay is made up of:
– Algorithmic buffering delay
– Encoder processing delay
– Bit transmission delay
– Decoder processing delay
• The sum of the encoder and decoder processing delays is the total processing delay
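The delay components add up as in the sketch below. The millisecond values are hypothetical and only illustrate the bookkeeping; real figures depend on the coder, the processor and the channel.

```python
from dataclasses import dataclass


@dataclass
class CodingDelay:
    """One-way coding delay components, all in milliseconds."""
    buffering_ms: float       # algorithmic buffering delay (one frame)
    encoder_ms: float         # encoder processing delay
    transmission_ms: float    # bit transmission delay
    decoder_ms: float         # decoder processing delay

    @property
    def processing_ms(self) -> float:
        """Total processing delay: encoder plus decoder."""
        return self.encoder_ms + self.decoder_ms

    @property
    def total_one_way_ms(self) -> float:
        """Total one-way coding delay."""
        return self.buffering_ms + self.processing_ms + self.transmission_ms


# Hypothetical coder with 20 ms frames.
delay = CodingDelay(buffering_ms=20.0, encoder_ms=10.0,
                    transmission_ms=20.0, decoder_ms=5.0)
print(f"processing delay: {delay.processing_ms} ms")
print(f"total one-way coding delay: {delay.total_one_way_ms} ms")
```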
Thank You!