DSR Front-end Extension for Tonal-language Recognition and Speech Reconstruction
Aurora Group Meeting, April 2003
By IBM & Motorola
Outline
• Introduction
• Part I – Terminal Side Algorithm Description
• Part II – Server Side Algorithm Description
• Part III – Results vs. Requirements
– Algorithmic Requirements
– Tonal Language Recognition Evaluation
– Intelligibility Evaluation
Introduction
Historical Snapshots
• July 2000 – Speech reconstruction defined as one of the areas to be addressed by the committee
• Feb. 2001 – Tonal Language Recognition added to speech reconstruction
• July 2001 – New work item for extension of FE (WI-030) opened
• April 2002 – Joint-development contract signed between IBM and Motorola
• August 2002 – Work item for extension of AFE (WI-034) opened
Introduction
System Overview
[Block diagram. Terminal side: Speech In feeds the ETSI Standard DSR Front-End (MFCC & log-E @ 4800 bps) and Pitch & Class Estimation (Pitch & Class @ 800 bps); both streams cross the CHANNEL. Server side: DSR Back-End, Pitch Tracking and Smoothing (Tonal Information), and Speech Reconstruction (Speech Out).]
Part I – Terminal Side Algorithm Description
• XFE block diagram
• XAFE block diagram
• Voice activity detection
• Low band noise detection
• Pre-processing of speech signal
• Pitch estimation
• Voicing classification
• Quantization of voicing class and pitch
• Bit-stream formatting and error protection
XFE Block Diagram
[Block diagram. Input speech passes through the FE chain ADC, Offcom, Framing, PE, W, FFT, MF, LOG, DCT, then Feature Compression and Bit Stream Formatting / Framing, to the transmission channel. The diagram also shows EC, logE, VAD, LBND, PP, PITCH, and CLS, with pitch (P) and voicing class (VC) as outputs. Legend: FE blocks, Extension blocks, Interface blocks.]
Abbreviations
• EC – Energy computation
• logE – Log energy measure computation
• VAD – Voice activity detection
• LBND – Low-band noise detection
• PP – Pre-processing
• PITCH – Pitch estimation
• CLS – Classification
XAFE Block Diagram
[Block diagram. Labels: Sin(n), Spectrum Estimation, Rest of the Noise Reduction Blocks, SEC, MF, VADVC, LBND, PP, PITCH, CLS, with pitch (P) and voicing class (VC) as outputs. Legend: AFE blocks, Extension blocks, Interface blocks.]
Abbreviations
• SEC – Spectrum and energy computation
• MF – Mel-filtering
• VADVC – Voice activity detection for voicing classification
• LBND – Low-band noise detection
• PP – Pre-processing
• PITCH – Pitch estimation
• CLS – Classification
Voice Activity Detection
• Inputs – Filter bank output (23)
• Outputs – vad_flag, hangover_flag
[Block diagram from INPUT to OUTPUT, with blocks numbered 201–210: Channel Energy Estimator, Channel SNR Estimator, Signal SNR Estimator, Peak to Average Ratio Estimator, Spectral Deviation Estimator, Voice Metric Calculator, Update Decision Determiner, Noise Energy Smoother, Noise Energy Estimate Storage, Voice Activity Determiner. Signal labels: F(m), Ech(m), En(m), En(m+1), E(m), V(m), q(m), SNRq(m), P2A(m), UPDATE_FLAG, FUPDATE_FLAG.]
Low Band Noise Detection
• Inputs – power spectrum, vad_flag, frame energy
• Output – lbn_flag
• Low-band – below 380 Hz
Flowchart:
• Start
• vad_flag == false? No: End. Yes: continue
• E >= enrg_thld? No: End. Yes: continue
• Find max. power in low band
• Find max. power in high band
• Find ratio low / high
• Filter ratio
• ratio > ratio_thld? Yes: lbn_flag = true. No: lbn_flag = false
• End
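The low-band noise decision can be sketched in a few lines. This is only an illustration of the flowchart, not the standardized procedure: the threshold values, the first-order smoothing used for the "filter ratio" step, the `state` dictionary, and the choice to leave lbn_flag unchanged on the early-exit branches are all assumptions.

```python
def update_lbn_flag(power_spectrum, freqs_hz, vad_flag, frame_energy, state,
                    enrg_thld=1.0e4, ratio_thld=2.0, smooth=0.9,
                    low_band_hz=380.0):
    """Update and return lbn_flag for one frame.

    enrg_thld, ratio_thld and smooth are placeholder values (assumptions).
    state holds the smoothed ratio and the previous flag between frames.
    """
    # Early exits of the flowchart: speech frames and low-energy frames
    # leave the previous decision untouched (assumed behaviour).
    if vad_flag or frame_energy < enrg_thld:
        return state["lbn_flag"]

    # Maximum power below and above the 380 Hz boundary.
    low = max((p for p, f in zip(power_spectrum, freqs_hz) if f < low_band_hz),
              default=0.0)
    high = max((p for p, f in zip(power_spectrum, freqs_hz) if f >= low_band_hz),
               default=0.0)
    ratio = low / high if high > 0 else float("inf")

    # "Filter ratio": assumed to be first-order recursive smoothing.
    state["ratio"] = smooth * state["ratio"] + (1.0 - smooth) * ratio
    state["lbn_flag"] = state["ratio"] > ratio_thld
    return state["lbn_flag"]
```

A caller would keep one `state = {"ratio": 0.0, "lbn_flag": False}` per utterance and call the function once per frame.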
Pre-Processing of Speech Signal
• Inputs – input speech signal (Sin), lbn_flag
• Outputs – low-pass filtered, down-sampled speech signal (Slpds); high-pass filtered speech signal (Sub)
[Block diagram. Sin feeds Low-pass Filter # 1 or Low-pass Filter # 2, selected by lbn_flag (TRUE / FALSE), followed by Down-sample to give Slpds. Sin also feeds the High-pass Filter to give Sub.]
Pitch Estimation
• Inputs – vad_flag, lbn_flag, low-pass filtered, down-sampled speech signal, Fourier spectrum, power spectrum, spectral average, log-E
• Output – pitch period P (P = 0 for unvoiced frames)
• Frequency ranges (Hz) – [200, 420], [100, 210], [52, 120]
• Stable track with frequency F0 – [0.666*F0, 2.2*F0] ∩ the above 3 ranges
[Flowchart. Inputs: STFT, PS, and low-pass filtered, down-sampled speech. F0 candidates generation, Correlation calculation, Pitch selection, then Found pitch? Yes: Convert pitch and output, History update. No: Select next freq. range.]
Pitch Estimation
• Find F0 among common integer dividers of spectral peak frequencies
• Give preference to higher dividers
[Figure: spectrum over 0–4000 Hz (amplitude axis 0–0.2) with F0 marked.]
Pitch Estimation
• Utility function generalizes concept of integer divider
• Utility function – superposition of components generated by spectral peaks
$U(F_0) = \sum_i PeakMag_i \cdot I(PeakF_i / F_0)$
where
$I(r) = I(r - 1)$, and over one period
$I(r) = 1$ for $|r| \le D_1$; $I(r) = 0.5$ for $D_1 < |r| \le D_2$; $I(r) = 0$ for $D_2 < |r| \le 0.5$
$D_1 = 65/512$, $D_2 = 100/512$
[Figure: one period of influence function I(r), plotted against Fi/F0 - N from -0.25 to 0.25.]
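A small sketch may help show how the utility function rewards common integer dividers of the peak frequencies. It assumes a three-level periodic influence function with breakpoints 65/512 and 100/512, which is one reading of the slide's formula; the function names and the peak list are hypothetical.

```python
D1 = 65 / 512    # assumed inner breakpoint of the influence function
D2 = 100 / 512   # assumed outer breakpoint of the influence function


def influence(r):
    """Periodic influence function I(r), evaluated via the distance of r
    to the nearest integer (assumed three-level shape: 1, 0.5, 0)."""
    d = abs(r - round(r))
    if d <= D1:
        return 1.0
    if d <= D2:
        return 0.5
    return 0.0


def utility(f0, peaks):
    """Sum the contributions of all spectral peaks for candidate f0.

    peaks is a list of (frequency_hz, magnitude) pairs.
    """
    return sum(mag * influence(freq / f0) for freq, mag in peaks)


# Hypothetical peaks at the first three harmonics of 200 Hz.
peaks = [(200.0, 1.0), (400.0, 1.0), (600.0, 1.0)]
for candidate in (100.0, 150.0, 200.0):
    print(candidate, utility(candidate, peaks))
```

Both 100 Hz and 200 Hz collect the full score here, which is why the slides add a preference for higher dividers when choosing among candidates.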
Pitch Estimation
[Figure: utility function component generated by a peak of unit magnitude at 700 Hz (pitch candidates generated by peak at 700 Hz), plotted over 0–800 Hz with F0min and F0max marked.]
Pitch Estimation
F0 candidates generation and Correlation calculation
• Process spectrum (input: STFT, PS) – process power spectrum: double resolution (double frame-size), de-emphasize, and smooth
• Pick peaks – pick local peaks, scale down high-freq. peaks, limit number of peaks, refine locations and amplitudes, normalize
• Build utility function – select at most two F0 candidates with high spectral scores, giving preference to higher frequencies, and frequencies near previous F0 estimates
• Convert F0 candidates into corresponding lags
• Compute correlation (input: low-pass filtered, down-sampled speech) – compute correlation scores at each lag using speech segments having the highest energy and separated by the lag
• Output goes to pitch selection
Pitch Estimation
Pitch selection (CS – correlation score, SS – spectral score)
• Class1 – (CS > 0.79 AND SS > 0.78) OR (SS > 0.68 AND SS + CS > 1.6)
• Class2 – (CS > 0.7 AND SS > 0.7) AND (0.82*Ref < F0 < 1.22*Ref)
• Class3 – (CS > 0.85 OR SS > 0.82)
[Flowchart. Sort F0; Find best class1 cand.; if found, Set pitch. Otherwise the decisions Full list?, Stable Track? (Set ref. to stable pitch) and Cont. pitch? (Set ref. to previous pitch) lead to Find best class2 cand. and Find best class3 cand., with an additional check ss > .95 & cs > .95; each path ends in Set pitch or Set uv pitch.]
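The three candidate classes translate directly into predicates. The sketch below only encodes the threshold conditions from the slide; the function names are hypothetical, and the surrounding flowchart logic (sorting, reference selection, fallbacks) is not reproduced.

```python
def is_class1(cs, ss):
    """Class1: strong on both scores, or a good spectral score with a high sum."""
    return (cs > 0.79 and ss > 0.78) or (ss > 0.68 and ss + cs > 1.6)


def is_class2(cs, ss, f0, ref):
    """Class2: moderate scores, with F0 close to the reference pitch frequency."""
    return (cs > 0.7 and ss > 0.7) and (0.82 * ref < f0 < 1.22 * ref)


def is_class3(cs, ss):
    """Class3: either score alone is very high."""
    return cs > 0.85 or ss > 0.82
```

For example, a candidate with CS = 0.95 and SS = 0.6 fails Class1 but passes Class3.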
Voicing Classification
• Inputs – vad_flag, hangover_flag, input speech signal, high-pass filtered speech signal, frame energy, pitch period
• Outputs – voicing class (non-speech, unvoiced, mixed-voiced, and fully-voiced speech)
Flowchart:
• Start
• vad_flag == false? Yes: VC = non-speech, End
• pitch period == 0? Yes: VC = unvoiced, End
• (zcm >= zcm_thld || ef_ub <= ef_ub_thld || hangover_flag == true)? Yes: VC = mixed-voiced, End
• No: VC = fully-voiced, End
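The voicing decision is a short cascade of tests, sketched below. The names zcm and ef_ub are taken from the flowchart, which does not define them on the slide; here they are simply passed in together with their thresholds, so no particular definition or threshold value is implied.

```python
def voicing_class(vad_flag, hangover_flag, pitch_period,
                  zcm, ef_ub, zcm_thld, ef_ub_thld):
    """Return the voicing class of one frame following the flowchart order."""
    if not vad_flag:
        return "non-speech"
    if pitch_period == 0:
        return "unvoiced"
    # Any one of the three conditions demotes the frame to mixed-voiced.
    if zcm >= zcm_thld or ef_ub <= ef_ub_thld or hangover_flag:
        return "mixed-voiced"
    return "fully-voiced"
```

Because the tests run in this order, a frame with vad_flag false is classed as non-speech regardless of its pitch period.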
Quantization of Voicing Class and Pitch
Class Quantization
Voicing Class (VC) | Pitch Index (Pidx) | Class Index (Cidx)
Non-speech | 0 | 0
Unvoiced speech | 0 | 1
Mixed-voiced speech | > 0 | 0
Fully-voiced speech | > 0 | 1
Pitch Quantization
In each frame-pair, the first frame’s pitch period (19 – 140) is absolutely quantized using 7 bits; the second frame’s pitch period is differentially quantized using 5 bits.
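The class table means the voicing class is carried jointly by the pitch index and a single class bit. A minimal sketch of the mapping in both directions follows; the function names are hypothetical.

```python
def encode_class(vc, pitch_index):
    """Return (Pidx, Cidx) for a voicing class.

    pitch_index is the quantized pitch index (> 0) of a voiced frame;
    it is ignored for non-speech and unvoiced frames, which use Pidx = 0.
    """
    if vc == "non-speech":
        return 0, 0
    if vc == "unvoiced":
        return 0, 1
    if pitch_index <= 0:
        raise ValueError("voiced frames need a pitch index > 0")
    return pitch_index, (1 if vc == "fully-voiced" else 0)


def decode_class(pidx, cidx):
    """Recover the voicing class from (Pidx, Cidx)."""
    if pidx == 0:
        return "unvoiced" if cidx else "non-speech"
    return "fully-voiced" if cidx else "mixed-voiced"
```

Every class survives an encode and decode round trip, which is all the server side needs to restore the four-way decision.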
Quantization of Voicing Class and Pitch
Choice of reference pitch period and 31 quantization levels for (m+1)th frame, from the pitch indices of the preceding three frames
Pidx (m-2) | Pidx (m-1) | Pidx (m) | Choice
0 | 0 OR > 0 but unreliable | 0 | No suitable reference is available. Use 5-bit absolute quantization. The 31 quantization levels are chosen to span the range from 19 to 140 uniformly in the log-domain.
Don’t care | Don’t care | > 0 | The quantized pitch period value of the mth frame is chosen as the reference. Out of the 31 quantization levels, 27 are chosen to cover the range from (0.8163*reference) to (1.2250*reference) uniformly in the log-domain. The other 4 levels depend on the reference value as follows:
– 19 <= reference <= 30: (2.00, 3.00, 4.00, 5.00)*reference
– 30 < reference <= 60: (1.50, 2.00, 2.50, 3.00)*reference
– 60 < reference <= 95: (0.50, 0.67, 1.50, 2.00)*reference
– 95 < reference <= 140: (0.25, 0.33, 0.50, 0.67)*reference
Don’t care | > 0 Reliable | 0 | The quantized pitch period value of the (m-1)th frame is chosen as the reference. The choice of quantization levels is the same as shown in the row below.
> 0 | 0 OR > 0 but unreliable | 0 | The quantized pitch period value of the (m-2)th frame is chosen as the reference. Out of the 31 quantization levels, 25 are chosen to cover the range from (0.7781*reference) to (1.2852*reference) uniformly in the log-domain. The other 6 levels depend on the reference value as follows:
– 19 <= reference <= 30: (1.50, 2.00, 2.50, 3.00, 4.00, 5.00)*reference
– 30 < reference <= 60: (0.67, 1.50, 2.00, 2.50, 3.00, 4.00)*reference
– 60 < reference <= 95: (0.33, 0.50, 0.67, 1.50, 1.75, 2.00)*reference
– 95 < reference <= 140: (0.20, 0.25, 0.33, 0.50, 0.67, 1.50)*reference
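To make the differential scheme concrete, the sketch below builds the 31 levels for the case where the mth frame supplies the reference (27 log-uniform levels plus 4 range-dependent multiples). The ordering of the levels, the nearest-neighbour search in the log domain, and the absence of any clipping to the 19–140 range are assumptions; the slides do not say how indices are assigned.

```python
import math

# (upper bound of reference range, extra multipliers) as listed in the table.
EXTRA_MULTIPLIERS = [
    (30, (2.00, 3.00, 4.00, 5.00)),
    (60, (1.50, 2.00, 2.50, 3.00)),
    (95, (0.50, 0.67, 1.50, 2.00)),
    (140, (0.25, 0.33, 0.50, 0.67)),
]


def levels_from_mth_frame(ref):
    """Return the 31 candidate pitch periods, sorted (ordering is assumed)."""
    if not 19 <= ref <= 140:
        raise ValueError("reference pitch period must lie in [19, 140]")
    lo, hi = math.log(0.8163 * ref), math.log(1.2250 * ref)
    # 27 levels spaced uniformly in the log domain, end points included.
    core = [math.exp(lo + i * (hi - lo) / 26) for i in range(27)]
    mult = next(m for upper, m in EXTRA_MULTIPLIERS if ref <= upper)
    return sorted(core + [m * ref for m in mult])


def quantize(pitch_period, levels):
    """Index of the level closest to pitch_period in the log domain (assumed rule)."""
    target = math.log(pitch_period)
    return min(range(len(levels)), key=lambda i: abs(target - math.log(levels[i])))
```

With a reference of 50 samples the extra levels land at 75, 100, 125 and 150 samples, which lets the coder follow pitch doubling or halving that the narrow log-uniform core cannot reach.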
Bit-Stream Formatting and Error Protection
Multi-Frame Format
Sync Sequence | Header Field | Frame Packet Stream | Total
2 octets | 4 octets | 162 octets | 168 octets
Header Field Format (bits 8 down to 1 of each octet)
Octet 1 | Ext, MframeCnt, FeType, SampRate
Octet 2 | EXP8, EXP7, EXP6, EXP5, EXP4, EXP3, EXP2, EXP1
Octet 3 | P8, P7, P6, P5, P4, P3, P2, P1
Octet 4 | P16, P15, P14, P13, P12, P11, P10, P9
Header Field Definitions
Field | No. Bits | Meaning | Code | Indicator
SampRate | 2 | sampling rate | 00 | 8 kHz
 | | | 01 | 11 kHz
 | | | 10 | undefined
 | | | 11 | 16 kHz
FeType | 1 | Front-end specification | 0 | standard
 | | | 1 | Noise robust
MframeCnt | 4 | multiframe counter | xxxx | Modulo-16 number
Ext | 1 | Extended front-end | 0 | Not extended (4800 bps)
 | | | 1 | Extended (5600 bps)
EXP2 - EXP9 | 8 | Expansion bits (TBD) | 0 | (zero pad)
P1 - P16 | 16 | Cyclic code parity bits (see below) | |
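As an illustration of the header layout, the first header octet can be packed as below. The bit positions are inferred from the field order and widths in the table (Ext in bit 8, MframeCnt in bits 7–4, FeType in bit 3, SampRate in bits 2–1); that placement and the function name are assumptions, not a statement of the standardized format.

```python
# Sampling-rate codes from the header definitions table (kHz -> 2-bit code).
SAMP_RATE_CODES = {8: 0b00, 11: 0b01, 16: 0b11}


def pack_header_octet1(samp_rate_khz, noise_robust, mframe_cnt, extended):
    """Pack Ext | MframeCnt | FeType | SampRate into one octet (assumed layout)."""
    return ((int(extended) << 7)
            | ((mframe_cnt % 16) << 3)        # modulo-16 multiframe counter
            | (int(noise_robust) << 2)        # 0 = standard, 1 = noise robust
            | SAMP_RATE_CODES[samp_rate_khz])


print(hex(pack_header_octet1(16, True, 5, True)))
```

An unsupported sampling rate raises KeyError, which mirrors the "undefined" code 10 in the table.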
Bit-Stream Formatting and Error Protection
Frame information for the mth and (m+1)th frames (bits 8 down to 1 of each octet)
Higher bits | Lower bits | Octet
Idx2,3(m) | Idx0,1(m) | 1
Idx4,5(m) | Idx2,3(m) (cont) | 2
Idx6,7(m) | Idx4,5(m) (cont) | 3
Idx10,11(m) | Idx8,9(m) | 4
Idx12,13(m) | Idx10,11(m) (cont) | 5
Idx0,1(m+1) | Idx12,13(m) (cont) | 6
Idx2,3(m+1) | Idx0,1(m+1) (cont) | 7
Idx6,7(m+1) | Idx4,5(m+1) | 8
Idx8,9(m+1) | Idx6,7(m+1) (cont) | 9
Idx10,11(m+1) | Idx8,9(m+1) (cont) | 10
Idx12,13(m+1) | | 11
Pidx(m) | CRC(m,m+1) | 12
Pidx(m+1) | Pidx(m) (cont) | 13
PC-CRC(m,m+1), Cidx(m+1) | Cidx(m) | 14
The first 11½ bytes correspond to FE or AFE. The last 2 bytes are added for the extension.
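The octet counts and the quoted bit rates can be cross-checked with a little arithmetic. The sketch assumes 12 frame pairs (24 frames of 10 ms) per multi-frame, which is what 162 octets divided by 13.5 octets per pair implies; the function names are hypothetical.

```python
FRAME_MS = 10            # update rate from the requirements table
PAIRS_PER_MULTIFRAME = 12  # assumed: 162 octets / 13.5 octets per frame pair
BASE_PAIR_BITS = 92      # 11.5 octets of FE/AFE data per frame pair
EXT_PAIR_BITS = 16       # 2 extra octets: Pidx(m), Pidx(m+1), Cidx x2, PC-CRC


def multiframe_octets(extended):
    """Sync (2) + header (4) + frame packet stream, in octets."""
    pair_bits = BASE_PAIR_BITS + (EXT_PAIR_BITS if extended else 0)
    return 2 + 4 + PAIRS_PER_MULTIFRAME * pair_bits // 8


def bit_rate_bps(extended):
    """Bits per multi-frame divided by the multi-frame duration."""
    duration_ms = 2 * PAIRS_PER_MULTIFRAME * FRAME_MS
    return multiframe_octets(extended) * 8 * 1000 // duration_ms


print(multiframe_octets(True), bit_rate_bps(True))    # extended stream
print(multiframe_octets(False), bit_rate_bps(False))  # un-extended stream
```

The difference between the two multi-frame sizes is 24 octets, and the two rates differ by the 800 bps allotted to pitch and class.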
Part II – Server Side Algorithm Description
• Bit-stream decoding and error mitigation
• Speech reconstruction block diagram
• Pitch tracking and smoothing
• Cepstra de-equalization (XAFE)
• Features transformation at 16kHz sampling rate (XAFE)
• Harmonic magnitudes reconstruction
• Harmonic phases synthesis
• Line spectrum to time-domain transformation
• Overlap-add
Bit-stream Decoding and Error Mitigation
• Extract pitch and voicing class indices and check PC-CRC
• Error free frame pair – decode
– Decode voicing class using VC encoding table
– First frame – pitch index points to quantization level
– Second frame – decode pitch using Pitch encoding table
• Corrupt frame pair – keep receiving until error free pair is determined
• Assign pitch and class parameters of corrupt frames
Pitch, Class, & logE Assignment for Corrupt Frames
[Diagram: a run of corrupt frames bounded by the last good frame and the first good frame; B is marked on the run.]
• B ≤ 2 – copy from last/first good frame
• 2 < B ≤ 12
– copy pitch and class from last/first good frame
– “fully-voiced” class → “mixed-voiced” class
– logE(n) = max(logE(n − 1) – 2, 4.7)
• B > 12 – class = “unvoiced”, pitch = 0, logE = 4.7
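The three cases can be sketched as a single function. This is a simplification under stated assumptions: B is treated as the length of the corrupt run, every corrupt frame is filled from the last good frame only (the slide copies from the last or the first good frame, presumably depending on position), and the dictionary layout is hypothetical.

```python
LOG_E_FLOOR = 4.7  # floor value used on the slide


def fill_corrupt_run(last_good, b):
    """Return b substitute frames for a corrupt run following last_good.

    last_good is a dict with keys "pitch", "vclass" and "logE".
    """
    if b > 12:
        return [{"pitch": 0, "vclass": "unvoiced", "logE": LOG_E_FLOOR}
                for _ in range(b)]
    if b <= 2:
        return [dict(last_good) for _ in range(b)]

    frames = []
    log_e = last_good["logE"]
    # Fully-voiced frames are softened to mixed-voiced in medium-length runs.
    vclass = ("mixed-voiced" if last_good["vclass"] == "fully-voiced"
              else last_good["vclass"])
    for _ in range(b):
        log_e = max(log_e - 2, LOG_E_FLOOR)   # logE(n) = max(logE(n-1) - 2, 4.7)
        frames.append({"pitch": last_good["pitch"], "vclass": vclass,
                       "logE": log_e})
    return frames
```

The decaying log-energy fades the reconstructed speech out over a few frames instead of holding a stale level.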
Speech Reconstruction Block Diagram
[Block diagram. Inputs: MFCC, logE, pitch, voicing class. Blocks: PTS, HSI, CDE, T16 kHz, HOCR, UPH, SFEQ, CTM, COMB, APM, PF, VPH, LSTD, OLA. Output: speech.]
• PTS – pitch tracking & smoothing
• HSI – harmonics structure initialization
• CDE – cepstra de-equalization, only for XAFE
• T16kHz – features transformation, only for XAFE at 16 kHz
• HOCR – high order cepstra recovery
• UPH – unvoiced phase synthesis
• SFEQ – solving front-end equations
• CTM – cepstra to magnitudes transformation
• COMB – combined magnitudes estimate
• APM – all-pole modeling
• VPH – voiced phase synthesis
• PF – postfiltering
• LSTD – line spectrum to time domain transformation
• OLA – overlap-add
Pitch Tracking and Smoothing
• 1-st stage
– Handle short voiced segments
– Find the most energetic set of similar pitch values (track) and determine reference pitch value
– Do integer scaling
• 2-nd stage – correct outliers
• 3-rd stage – smoothing by a 5-tap symmetric filter
• Voicing class correction
– Voiced → Unvoiced – Voicing Class = “unvoiced”
– Unvoiced → Voiced – Voicing Class = “mixed-voiced”
[Diagram: frame buffer ordered from oldest to most recent, with input and output positions; labels 10 and 8.]
Pitch Contours Clean vs. Babble Noise
[Figure: pitch (samples, 0–50) vs. time (0–2500 msec).]
Pitch Contours XAFE vs. XFE
[Figure: pitch (samples, 0–50) vs. time (0–2500 msec).]
Speech Synthesis Input/Output
• Input
– 13 low order cepstra (LOC): C0, C1, …, C12
– Pitch period p8kHz, rescaled as p = p8kHz × out_sampling_rate / 8 (out_sampling_rate in kHz)
– Log-energy logE
– Voicing class: fully voiced, mixed-voiced, unvoiced (“unvoiced” + “non-speech”)
• Output speech signal
– XFE: output sampling rate = input sampling rate – 8, 11, 16 kHz
– XAFE: output sampling rate = 8 kHz
Harmonic Model of Speech Frame
• Time-domain – sum of sinusoidal waves
$s(n) = \sum_{k=1}^{N_h} A_k \sin(2\pi f_k n + \varphi_k)$
• Frequency domain – line spectrum
$S(f) = \sum_{k=1}^{N_h} A_k e^{j\varphi_k} \delta(f - f_k)$
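The time-domain form of the harmonic model is easy to evaluate directly. The sketch below assumes the sine form with an additive phase term and frequencies normalized to cycles per sample; it is a direct summation for illustration, not the frequency-domain synthesis the deck actually uses.

```python
import math


def synth_frame(amps, freqs, phases, n_samples):
    """Evaluate s(n) = sum_k A_k * sin(2*pi*f_k*n + phi_k) for n = 0..n_samples-1.

    freqs are normalized frequencies in cycles per sample (assumed convention).
    """
    return [
        sum(a * math.sin(2.0 * math.pi * f * n + ph)
            for a, f, ph in zip(amps, freqs, phases))
        for n in range(n_samples)
    ]


# One harmonic at a quarter of the sampling rate: a 4-sample period.
print(synth_frame([1.0], [0.25], [0.0], 4))
```

The cost grows with the number of harmonics times the number of samples, which is one reason a line-spectrum plus inverse FFT route is attractive for long pitch periods.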
Harmonic Structure Initialisation
• Fully voiced frame – voiced harmonics array
$f_k = k/p,\quad k = 1, \dots, N_v,\quad N_v = \lfloor p/2 \rfloor;\quad A_k = ?,\quad \varphi_k = ?$
• Unvoiced frame – unvoiced harmonics array
$f_k = k/FFTL,\quad k = 1, \dots, N_u,\quad N_u = FFTL/2;\quad A_k = ?,\quad \varphi_k = RAND(0, 2\pi)$
• Mixed-voiced frame – voiced and unvoiced harmonic arrays
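A minimal sketch of the initialisation step is given below, assuming voiced harmonics at multiples of 1/p up to half the sampling rate and unvoiced harmonics on the FFT grid with uniformly random phases. Magnitudes (and voiced phases) are left unset, as on the slide; the function names and the use of `random.Random` are illustrative choices.

```python
import math
import random


def voiced_harmonics(p):
    """Normalized frequencies k/p for k = 1..floor(p/2); p is the pitch period in samples."""
    n_v = math.floor(p / 2)
    return [k / p for k in range(1, n_v + 1)]


def unvoiced_harmonics(fftl, rng):
    """Frequencies k/FFTL for k = 1..FFTL/2 with phases drawn uniformly from [0, 2*pi]."""
    n_u = fftl // 2
    freqs = [k / fftl for k in range(1, n_u + 1)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_u)]
    return freqs, phases


freqs_v = voiced_harmonics(85.3)
freqs_u, phases_u = unvoiced_harmonics(256, random.Random(0))
print(len(freqs_v), len(freqs_u))
```

A mixed-voiced frame would carry both arrays, to be combined later by frequency band.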
Cepstra De-equalization - XAFE
• Purpose: to reverse AFE blind equalization of the C1,…,C12 cepstra coefficients
[Update equations: stepSize = stepSize(logE); new_bias is computed from bias, stepSize, C, RefCep and the factor 0.999; C is corrected using bias; bias = new_bias.]
• Applied to quantized cepstra – regularization factor 0.999 guarantees stability
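AFE 16kHz Features Transformation
• Purpose: to restore plain MFCC and energy representing [0, 4 kHz] frequency band
[Equations: cosine transforms linking the cepstra C_n (n = 0, …, 12) with values indexed k = 1, …, 26 (scale factor 2/26) and with a 23-point grid, plus a log-energy correction of the form logE = ln(…) involving exp(logE) and an energy term E. The exact expressions are not recoverable from the extracted text.]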
High Order Cepstra Recovery
• Purpose – to estimate high order cepstra (HOC) C13,…,C22 not transmitted from client side
– Increases accuracy of harmonic magnitudes estimation
• Implemented through look-up table using pitch as parameter
• Pitch range is partitioned into sub-ranges
• Representative HOC vector for each sub-range stored in the table has been obtained by averaging over large speech database
• Further refinement of HOC is built in the magnitudes reconstruction procedure
Harmonic Magnitudes Reconstruction
• Two independent estimates of harmonic magnitudes are obtained by two methods:
– Solving front-end equations (SFEQ)
– Cepstra to magnitudes transformation (CTM)
• The estimates are mixed together using frequency and pitch dependent mixing ratio
Solving Front-End Equations (SFEQ)
• Front-end equation ties harmonic parameters with mel-filter bank outputs
$\sum_n MF_i(n) \left| \sum_k A_k e^{j\varphi_k} FTW_{ham}(n/FFTL - f_k) \right| = B_i,\quad i = 1, \dots, 23$
where
– $MF_i$ – i-th mel-filter weighting function
– $FTW_{ham}$ – Fourier Transform of Hamming window
– $B = \exp(IDCT(C))$ (the slide gives a separate form for AFE)
• Linearization – especially applicable for voiced frames
$\sum_n MF_i(n) \sum_k A_k \left| FTW_{ham}(n/FFTL - f_k) \right| = B_i,\quad i = 1, \dots, 23$
SFEQ
• 23 basis vectors are derived from mel-filter weighting functions sampled at harmonic frequencies
$BV_i(f_k) = 0.4\, MF_i(f_k) + 0.6\, MF_i^2(f_k),\quad i = 1, \dots, 23$
[Figure: 8-th mel-filter and basis vector vs. FFT point (18–30), values 0–1; pitch = 85.3 samples.]
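SFEQ
• Harmonic magnitudes vector is represented as linear combination of basis vectors
$A = \sum_{i=1}^{23} \gamma_i BV_i$
• Front-end equations in $\gamma$: a linear system in $\gamma$ with right-hand side $B$
• Least square equations: [matrix equation in $\gamma$ involving transposes and an identity term; not recoverable from the extracted text]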
SFEQ
• Built-in high order cepstra recovery for voiced frames
• 3 iterations
• SFEQ estimate:
$A_{SFEQ} = \left( \sum_{i=1}^{23} \gamma_i BV_i \right) ./\ Preemp$
[Diagram: iteration over Solve Equations, Compute Magnitudes and Compute HOC; inputs LOC, PITCH and HOC; output ASFEQ.]
Cepstra to Magnitudes Transformation (CTM)
• Modify cepstra to compensate influence of pre-emphasis and variable mel-channel width
[Equation: C is adjusted by a fixed correction vector $C^{fix}$.]
• Find location of Mel-scaled harmonic frequency at Mel-scaled channel centers grid
$\tilde{f}_k = 24 \cdot \dfrac{Mel(f_k) - Mel(64\,Hz)}{Mel(F_{Nyquist}) - Mel(64\,Hz)},\quad \kappa_k = MAX(0.5,\ MIN(\tilde{f}_k,\ 23.5))$
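CTM
• Compute IDCT coefficient corresponding to (non-integer) index
$a_k = \dfrac{2}{23} \sum_{n} C_n \cos\left( \dfrac{\pi n (\kappa_k - 0.5)}{23} \right),\quad k = 1, \dots, N_{harm}$ (sum over the cepstra starting at n = 0)
• Compute estimate of harmonic magnitudes as:
[Equations: $A_k^{CTM}$ is obtained from $\exp(a_k)$ for $k = 1, \dots, N_{harm}$, with separate XFE and XAFE forms that differ by an exponent 1/2.]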
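Combined Magnitudes Estimate (COMB) – Scaling SFEQ Magnitudes
• Unvoiced harmonics or short pitch period (p8kHz ≤ 55) – constant scaling factor SF, defined from $A_{SFEQ}$ and $A_{CTM}$
• Long pitch period (p8kHz > 55) – frequency dependent scaling factor, with $SF_{LOW}$ associated with the band up to 1.2 kHz and $SF_{HIGH}$ with the band from 1.2 kHz to $F_{Nyquist}$
[Figure: SF vs. frequency (Hz), moving between the levels SFLOW and SFHIGH; axis marks at 200 and 2500.]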
COMB – Mixing SFEQ and CTM Magnitudes
• Unvoiced harmonics: $A = 0.9\, A_{SFEQ} + 0.1\, A_{CTM}$
• Voiced harmonics – pitch dependent mixture ratio specified by a table: $A = \lambda(p)\, A_{SFEQ} + (1 - \lambda(p))\, A_{CTM}$
[Figure: λ(p) (XAFE) vs. pitch period p (20–160), values 0–0.9.]
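The mixing rule is a per-harmonic convex combination of the two estimates. In the sketch below the mixture ratio for voiced harmonics is passed in by the caller, since the table of λ(p) values is not reproduced in the slides; the function name and argument layout are hypothetical.

```python
def mix_magnitudes(a_sfeq, a_ctm, voiced, lam=None):
    """Blend SFEQ and CTM magnitude estimates harmonic by harmonic.

    voiced=False uses the fixed 0.9 / 0.1 weights for unvoiced harmonics.
    voiced=True requires lam, the pitch dependent ratio lambda(p) looked up
    by the caller (the table itself is not given in the slides).
    """
    if voiced:
        if lam is None:
            raise ValueError("lam (lambda(p)) is required for voiced harmonics")
        w = lam
    else:
        w = 0.9
    return [w * s + (1.0 - w) * c for s, c in zip(a_sfeq, a_ctm)]


print(mix_magnitudes([1.0, 2.0], [2.0, 4.0], voiced=True, lam=0.5))
```

Keeping λ outside the function makes it easy to swap in separate XFE and XAFE tables.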
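All-pole Modeling of Spectral Envelope for Voiced Harmonics (APM)
[Flowchart. Blocks: Interpolate Magnitudes (input A), Inverse DFT (giving the ACF), Durbin Levinson (giving {an}), decision Long pitch period? (YES / NO), Magnitudes Synthesis (giving Anew). The slide also shows an argmin over $a_1, \dots, a_M$ of a squared-error criterion between the magnitudes $A_k$ and the all-pole model response at the harmonic frequencies $f_k$.]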
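The Durbin Levinson step converts an autocorrelation sequence into all-pole coefficients. Below is a textbook Levinson-Durbin recursion, shown as a generic sketch rather than the deck's exact implementation; it uses the convention A(z) = 1 + a_1 z^-1 + … + a_M z^-M, which is an assumption about sign convention.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for an all-pole model.

    r is the autocorrelation sequence r[0..order] with r[0] > 0.
    Returns (a, err): a[0] = 1 and a[1..order] are the coefficients of
    A(z) = 1 + sum_i a[i] z^-i, err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor with the next lag.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


# Autocorrelation of a first-order process with pole at 0.5.
print(levinson_durbin([1.0, 0.5, 0.25], 2))
```

For that first-order example the second coefficient comes out as zero, confirming that the recursion does not invent extra poles.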
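Postfiltering (PF)
• Purpose – formants emphasis of voiced frames
• Weighting of voiced harmonic magnitudes by filter derived from all-pole model parameters
[Equation: $W(z)$ is built from the all-pole polynomial $A_a(z)$ evaluated at $z/0.75$ and $z/0.95$, together with a first-order term with coefficient 0.5, where $A_a(z)$ is a polynomial in $z^{-1}$ with coefficients $a_i$ and $z = e^{j 2\pi f}$.]
• Weights $W(\exp(-j 2\pi f_k))$ are normalized, bounded and applied to voiced harmonics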
Voiced Phase Synthesis (VPH)
• Three additive components
$\varphi_k = \varphi_k^{al} + \varphi_k^{AP} + \varphi_k^{exc},\quad k = 1, \dots, N_v$
• Linear in frequency phase providing alignment relative to previous frame: $\varphi_k^{al}$, a function of $f_1^{prev}$, $f_1^{curr}$ and $k$
• Vocal tract phase derived from all-pole model parameters: $\varphi_k^{AP} = \arg(1 / A_a(z))$
• Pre-stored vocal cords excitation phase $\varphi_k^{exc}$ from table
Line Spectrum to Time Domain Transformation (LSTD)
• Mixed-voiced frames – low band (0 – 1200 Hz) voiced harmonics are combined with high band (1200 Hz – FNyquist) unvoiced harmonics
• Energy normalization
– Simulate (non-windowed) analysis frame spectrum by convolution of line spectrum with Dirichlet kernel
– Compute energy Eout
– Compute scaling factor SC from exp(logE) and Eout
– Multiply harmonic magnitudes by SC
LSTD
• Synthesis of output frame discrete spectrum by convolution of line spectrum with Hann window Fourier transform
$S_{out}(n) = \sum_k A_k e^{j\varphi_k} FTW_{hann}(n/FFTL - f_k),\quad n = 0, \dots, FFTL - 1$
[Figure: Hann window (amplitude 0–1 over 0–160 samples) with Frame Shift marked; FTWhann is obtained from it by FFT.]
• Inverse FFT: $s_{out} = IFFT(S_{out})$
Overlap-add (OLA)
• FRAME_SHIFT samples long segment of reconstructed speech is available
• Next FRAME_SHIFT samples long segment is initialized
[Diagram: sout(n), n = 0, …, LFFT-1, is added (+) to the Output Buffer; a segment is sent to Output and the buffer contents are moved. Labels: FS samples, FS-1 samples.]
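A generic overlap-add loop illustrates the buffer handling. This is a plain sketch of the technique, assuming every synthesized frame has the same length and that exactly FRAME_SHIFT samples are released per frame; the exact buffer sizes in the deck's diagram (FS versus FS-1 samples) are not modelled.

```python
def overlap_add(frames, frame_shift):
    """Overlap-add equally long frames with a hop of frame_shift samples.

    Returns the samples released so far: frame_shift samples per input frame.
    """
    frame_len = len(frames[0])
    buf = [0.0] * frame_len
    out = []
    for frame in frames:
        # Add the new frame on top of the tails of earlier frames.
        buf = [b + s for b, s in zip(buf, frame)]
        # The first frame_shift samples are now complete and can be output.
        out.extend(buf[:frame_shift])
        # Move the remainder to the front and clear the newly exposed tail.
        buf = buf[frame_shift:] + [0.0] * frame_shift
    return out


print(overlap_add([[1.0] * 4, [1.0] * 4], 2))
```

With a Hann synthesis window and a hop of half the window length, the overlapping windows sum to a constant, so the output has no amplitude ripple at frame boundaries.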
Speech Samples
[Audio samples embedded in the original slides, one slide per language. Each slide lists the coders ORIGINAL, LPC-10, MELP, XFE and XAFE against a clean and a noisy condition, with speakers Male 1, Female 1, Male 2 and Female 2.]
Language | Noisy condition
English | Car noise (10 dB SNR)
French | Street noise (15 dB SNR)
German | Babble noise (15 dB SNR)
Italian | Car noise (10 dB SNR)
Japanese | Street noise (15 dB SNR)
Mandarin | Babble noise (15 dB SNR)
Part III – Results vs. Requirements
• Algorithmic Requirements
– Data Rate, Update Rate, and Latency
– Bit-stream Formatting and Error Protection
– Complexity
• Tonal Language Recognition Evaluation
– Motorola Evaluation
– IBM Evaluation
• Intelligibility Evaluation
– Diagnostic Rhyme Test (DRT)
– Transcription Test (TT)
Algorithmic Requirements
Data Rate, Update Rate, and Latency
Algorithm Parameter | Required Value | Actual Value
Data Rate (bps) | <= 5800 | 5600
Update Rate (ms) | 10 | 10
Latency | Same as un-extended Front-ends | Same as un-extended Front-ends
Bit-stream Formatting and Error Protection
• Requirements: 1) Same multi-frame format, 2) EXP? bit to be used to indicate extension, 3) 4-bit CRC may be extended to protect pitch and class bits
• Actual implementation: 1) Same multi-frame format (24 extra bytes), 2) EXP1 bit indicates extension, 3) Separate 2-bit CRC used for pitch and class bits
Algorithmic Requirements
Complexity
Complexity Requirements / Objectives
Measure | Requirement
WMOPS | < 17
ROM size (kWords) | < 15
RAM size (kWords) | < 6
XFE Complexity (meets requirement)
Measure | FE* | Extension | XFE
WMOPS | 6.28 | 5.62 | 11.9
ROM size (kWords) | 1.88 | 5.65 | 7.53
RAM size (kWords) | 1.92 | 3.81 | 5.73
* Assumed to be one half the complexity of the Advanced Front End
XAFE Complexity (close to objective)
Measure | AFE* | Extension | XAFE
WMOPS | 12.55 | 5.21 | 17.76
ROM size (kWords) | 3.752 | 5.396 | 9.148
RAM size (kWords) | 3.830 | 3.290 | 6.864**
* From Motorola-France Telecom-Alcatel Advanced Front-End Proposal, January 31, 2002
** With scratch memory reuse
FE/AFE Extension Complexity Analysis
(Where two values are given, they are FE extension / AFE extension.)
Block | WMOPS | ROM | RAM
LBND+Preprocessing | 0.874 / 0.489 | 0.235 | 1.12 / 0.6
VAD | 0.385 | 0.676 | 0.017
Pitch estimation | 4.152 / 4.126 | 3.010 / 2.754 | 2.645
Voicing classification | 0.164 | 0.047 | 0.007
Compression | 0.046 | 1.684 | 0.021
Total | 5.62 / 5.21 | 5.65 / 5.396 | 3.81 / 3.29
Tonal Language Recognition Evaluation
Motorola Results
Error Rate (%) for C rows, Improvement (%) for D rows
Configuration | Mandarin Digits | Mandarin Commands | Cantonese Digits, Clean Training | Cantonese Digits, Multi Training
C1 (no F0) | 31.13 | 40.12 | 16.70 | 10.74
C2 (proprietary F0) | 29.84 | 33.92 | 19.14 | 10.58
C3 (WI30 F0) | 29.45 | 33.89 | 14.52 | 8.89
D21 | 4.14 | 15.45 | -14.61 | 1.49
D31 | 5.40 | 15.53 | 13.05 | 17.23
D32 | 1.31 | 0.09 | 24.14 | 15.97
TLR Requirement: D31 >= D21 OR D32 >= 0
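The D rows are consistent with relative error-rate reductions, Dij being the improvement of configuration Ci over Cj as a percentage of Cj's error rate. That reading is inferred from the numbers rather than stated on the slide; the sketch below reproduces the Mandarin Digits column under that assumption and evaluates the TLR requirement.

```python
def improvement(base_err, new_err):
    """Relative error-rate reduction in percent (assumed definition of Dij)."""
    return 100.0 * (base_err - new_err) / base_err


def meets_tlr_requirement(c1, c2, c3):
    """TLR requirement from the slide: D31 >= D21 OR D32 >= 0."""
    d21 = improvement(c1, c2)
    d31 = improvement(c1, c3)
    d32 = improvement(c2, c3)
    return d31 >= d21 or d32 >= 0


# Mandarin Digits, Motorola results.
c1, c2, c3 = 31.13, 29.84, 29.45
print(round(improvement(c1, c2), 2),
      round(improvement(c1, c3), 2),
      round(improvement(c2, c3), 2),
      meets_tlr_requirement(c1, c2, c3))
```

A negative value, as in the Cantonese clean-training D21 entry, means the second configuration has a higher error rate than the baseline.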
Tonal Language Recognition Evaluation
IBM Results
Error Rate (%) for C rows, Improvement (%) for D rows
Configuration | Mandarin Digits | Cantonese Digits
C1 (no F0) | 3.31 | 4.00
C2 (proprietary F0) | 3.08 | 4.41
C3 (WI30 F0) | 3.04 | 3.99
D21 | 6.95 | -10.25
D31 | 8.16 | 0.25
D32 | 1.30 | 9.52
TLR Requirement: D31 >= D21 OR D32 >= 0
Intelligibility Evaluation
Diagnostic Rhyme Test (DRT)
DRT Experiment I – Background Noise Conditions
Coder | Clean | Car 10 dB | Street 15 dB | Babble 15 dB
Unprocessed | 95.7 | 95.5 | 92.4 | 93.8
XFE Reconstruction | 93.0 | 88.8 | 85.0 | 87.1
XAFE Reconstruction | 92.8 | 88.9 | 87.5 | 87.9
LPC-10 | 86.9 | 81.3 | 81.2 | 81.2
MELP | 91.6 | 86.8 | 85.0 | 85.3
DRT Requirement: XFE and XAFE scores >= LPC-10 scores
DRT Objective: XFE and XAFE scores >= MELP scores
Intelligibility Evaluation
Diagnostic Rhyme Test (DRT)
DRT Experiment II – Input Signal Levels
Coder | -10 dB | 0 dB* | +10 dB
XFE Reconstruction | 91.7 | 93.0 | 93.3
XAFE Reconstruction | 92.1 | 92.8 | 92.8
* From Experiment I
DRT Experiment III – Channel Errors
Coder | None* | EP1 (C/I 10 dB) | EP2 (C/I 7 dB) | EP3 (C/I 4 dB)
XFE Reconstruction | 93.0 | 92.6 | 92.1 | 83.4
XAFE Reconstruction | 92.8 | 92.6 | 92.0 | 83.4
AFE Results | 91.6 | 91.6 | 91.4 | 85.6
* From Experiment I
DRT Experiment IV – Sampling Frequencies
Coder | 8 kHz* | 11 kHz | 16 kHz
XFE Reconstruction | 93.0 | 92.9 | 94.2
XAFE Reconstruction | 92.8 | 93.5 | 92.1
* From Experiment I
Intelligibility Evaluation
Transcription Test (TT)
Number of Missed, Wrongly transcribed, and Partially transcribed Words (given as three comma-separated counts per background noise condition)
Coder | Clean | Car | Street | Babble | Clean | Average Error (%)
Uncoded (Original) | 1,1,2 | 1,0,1 | 0,2,4 | 3,9,3 | 0,4,1 | 0.5492
XFE Reconstruction | 1,6,1 | 0,3,6 | 2,9,4 | 5,9,2 | 1,4,5 | 0.9954
XAFE Reconstruction | 0,6,2 | 0,5,4 | 0,4,3 | 3,5,2 | 1,6,5 | 0.7894
LPC-10 Coder | 8,18,6 | 62,26,7 | 67,22,7 | 47,12,3 | 18,10,9 | 5.5260
MELP Coder | 0,3,1 | 1,6,3 | 4,6,2 | 16,10,3 | 1,9,5 | 1.2013
No. of words in msg. | 1166 | 1153 | 1155 | 1149 | 1204 | Tot: 5827
TT Requirement: XFE and XAFE error rates <= LPC-10 error rate
TT Objective: XFE and XAFE error rates <= MELP error rate
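The Average Error column matches the sum of all missed, wrongly transcribed and partially transcribed words divided by the 5827 words presented. That formula is inferred from the table rather than stated on the slide; the sketch below recomputes the column under that assumption.

```python
TOTAL_WORDS = 1166 + 1153 + 1155 + 1149 + 1204   # 5827 words over five messages

# (missed, wrong, partial) counts per condition, copied from the table.
COUNTS = {
    "Uncoded (Original)":  [(1, 1, 2), (1, 0, 1), (0, 2, 4), (3, 9, 3), (0, 4, 1)],
    "XFE Reconstruction":  [(1, 6, 1), (0, 3, 6), (2, 9, 4), (5, 9, 2), (1, 4, 5)],
    "XAFE Reconstruction": [(0, 6, 2), (0, 5, 4), (0, 4, 3), (3, 5, 2), (1, 6, 5)],
    "LPC-10 Coder":        [(8, 18, 6), (62, 26, 7), (67, 22, 7), (47, 12, 3), (18, 10, 9)],
    "MELP Coder":          [(0, 3, 1), (1, 6, 3), (4, 6, 2), (16, 10, 3), (1, 9, 5)],
}


def average_error_percent(counts, total_words=TOTAL_WORDS):
    """All error types pooled over all conditions, as a percentage of words."""
    return 100.0 * sum(sum(triple) for triple in counts) / total_words


for coder, counts in COUNTS.items():
    print(f"{coder}: {average_error_percent(counts):.4f}")
```

On these figures both XFE and XAFE fall below the MELP error rate, so they satisfy the TT objective as well as the requirement.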