DSR Front-end Extension for Tonal-language Recognition and Speech Reconstruction

Aurora Group Meeting, April 2003. By IBM & Motorola.



DSR Front-end Extension for Tonal-language Recognition and Speech Reconstruction

Aurora Group Meeting, April 2003

By IBM & Motorola

Outline

• Introduction

• Part I – Terminal Side Algorithm Description

• Part II – Server Side Algorithm Description

• Part III – Results vs. Requirements
– Algorithmic Requirements
– Tonal Language Recognition Evaluation
– Intelligibility Evaluation

Introduction

Historical Snapshots

• July 2000 – Speech reconstruction defined as one of the areas to be addressed by the committee

• Feb. 2001 – Tonal Language Recognition added to speech reconstruction

• July 2001 – New work item for extension of FE (WI-030) opened

• April 2002 – Joint-development contract signed between IBM and Motorola

• August 2002 – Work item for extension of AFE (WI-034) opened

Introduction

System Overview

[Block diagram. Terminal side: Speech In feeds the ETSI Standard DSR Front-End (MFCC & log-E @ 4800 bps) and the Pitch & Class Estimation block (Pitch & Class @ 800 bps). Both streams pass through the CHANNEL. Server side: DSR Back-End, Pitch Tracking and Smoothing (Tonal Information), and Speech Reconstruction (Speech Out).]


Part I – Terminal Side Algorithm Description

• XFE block diagram

• XAFE block diagram

• Voice activity detection

• Low band noise detection

• Pre-processing of speech signal

• Pitch estimation

• Voicing classification

• Quantization of voicing class and pitch

• Bit-stream formatting and error protection

XFE Block Diagram

[Block diagram. Input speech enters the FE chain ADC, Offcom, Framing, PE, W, FFT, MF, LOG, DCT (with EC and logE), followed by Feature Compression and Bit Stream Formatting / Framing, whose output goes to the transmission channel. The blocks VAD, LBND, PP, PITCH and CLS produce the pitch period P and the voicing class VC. The legend distinguishes FE blocks, extension blocks and interface blocks.]

Abbreviations

EC - Energy computation
logE - Log energy measure computation
VAD - Voice activity detection
LBND - Low-band noise detection
PP - Pre-processing
PITCH - Pitch estimation
CLS - Classification

XAFE Block Diagram

[Block diagram. Sin(n) enters Spectrum Estimation and the rest of the noise reduction blocks of the AFE. The blocks SEC, MF, VADVC, LBND, PP, PITCH and CLS produce the pitch period P and the voicing class VC. The legend distinguishes AFE blocks, extension blocks and interface blocks.]

Abbreviations

SEC - Spectrum and energy computation
MF - Mel-filtering
VADVC - Voice activity detection for voicing classification
LBND - Low-band noise detection
PP - Pre-processing
PITCH - Pitch estimation
CLS - Classification

Voice Activity Detection

[Block diagram, blocks numbered 201 to 210: Channel Energy Estimator, Channel SNR Estimator, Signal SNR Estimator, Peak to Average Ratio Estimator, Spectral Deviation Estimator, Voice Metric Calculator, Update Decision Determiner, Noise Energy Smoother, Noise Energy Estimate Storage, Voice Activity Determiner. Signals shown: F(m), Ech(m), En(m), En(m+1), E(m), V(m), q(m), SNRq(m), P2A(m), UPDATE_FLAG, FUPDATE_FLAG.]

Inputs – Filter bank output (23)

Outputs – vad_flag, hangover_flag

Low Band Noise Detection

Inputs – power spectrum, vad_flag, frame energy

Output – lbn_flag

Low-band – Below 380 Hz

[Flowchart. Decision nodes: vad_flag == false?; E >= enrg_thld?; ratio > ratio_thld? Processing steps: find max. power in low band; find max. power in high band; find ratio low / high; filter ratio. The ratio test sets lbn_flag = true (Yes) or lbn_flag = false (No).]

Pre-Processing of Speech Signal

Inputs – input speech signal, lbn_flag

Outputs – low-pass filtered, down-sampled speech signal Slpds; high-pass filtered speech signal Sub

[Block diagram. Sin passes through Low-pass Filter # 1 or Low-pass Filter # 2, with the path chosen by lbn_flag (TRUE / FALSE), and is then down-sampled to give Slpds. Sin also passes through the High-pass Filter to give Sub.]

Pitch Estimation

Inputs – vad_flag, lbn_flag, low-pass filtered, down-sampled speech signal, Fourier spectrum, power spectrum, spectral average, log-E

Output – pitch period P (P = 0 for unvoiced frames)

Frequency ranges (Hz): [200,420], [100,210], [52,120]

Stable track with frequency F0: [0.666*F0, 2.2*F0] combined with the above 3 ranges

[Flowchart. Inputs: low-pass filtered, down-sampled speech; STFT, PS. Steps: F0 candidates generation; correlation calculation; pitch selection; Found pitch? Yes: convert pitch and output, then history update. No: select next freq. range and repeat.]

Pitch Estimation

• Find F0 among common integer dividers of spectral peak frequencies

• Give preference to higher dividers

[Figure: spectrum with peaks and F0 marked; x-axis 0 to 4000 Hz, y-axis 0 to 0.2.]

Pitch Estimation

• Utility function generalizes concept of integer divider

• Utility function – superposition of components generated by spectral peaks

$$U(F_0) = \sum_i \mathrm{PeakMag}_i \cdot I\!\left(\frac{\mathrm{PeakF}_i}{F_0}\right)$$

where

$$I(r+1) = I(r), \qquad I(r) = \begin{cases} 1, & |r| \le D_1 \\ 0.5, & D_1 < |r| \le D_2 \\ 0, & D_2 < |r| \le 0.5 \end{cases} \qquad D_1 = 65/512,\quad D_2 = 100/512$$

[Figure: one period of influence function I(r); x-axis Fi/F0 - N from -0.25 to 0.25, y-axis 0 to 1.4.]
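The utility function on this slide can be illustrated with a short Python sketch. It assumes the influence function is the two-level plateau read from the slide (1 within D1 = 65/512 of an integer, 0.5 out to D2 = 100/512, 0 elsewhere). The function names and the peak list format are illustrative and are not taken from the presentation.

```python
import math

D1 = 65 / 512    # inner plateau half-width, as read from the slide
D2 = 100 / 512   # outer plateau half-width, as read from the slide


def influence(r):
    """Periodic influence function I(r) with period 1 (assumed shape)."""
    r = r - math.floor(r + 0.5)   # wrap to [-0.5, 0.5)
    a = abs(r)
    if a <= D1:
        return 1.0
    if a <= D2:
        return 0.5
    return 0.0


def utility(f0, peaks):
    """U(F0): sum of PeakMag_i * I(PeakF_i / F0) over (freq_hz, mag) pairs."""
    return sum(mag * influence(freq / f0) for freq, mag in peaks)


# A single peak of unit magnitude at 700 Hz supports F0 = 700, 350, 233.3, ...
peaks = [(700.0, 1.0)]
for f0 in (350.0, 320.0, 400.0):
    print(f0, utility(f0, peaks))
```

A candidate F0 scores highest when many spectral peaks fall close to its integer multiples, which is how the utility function generalizes the "common integer divider" idea of the previous slide.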

Pitch Estimation

[Figure: utility function component generated by peak of unit magnitude at 700 Hz; x-axis 0 to 800 Hz, y-axis utility function (0 to 1.4). The pitch candidates generated by the peak at 700 Hz are marked between F0min and F0max.]

Pitch Estimation

F0 candidates generation and Correlation calculation

Inputs: STFT, PS; low-pass filtered, down-sampled speech. Output: to pitch selection.

• Process spectrum – process power spectrum: double resolution (double frame-size), de-emphasize, and smooth

• Pick peaks – pick local peaks, scale down high-freq. peaks, limit number of peaks, refine locations and amplitudes, normalize

• Build utility function – build utility function, select at most two F0 candidates with high spectral scores, giving preference to higher frequencies and to frequencies near previous F0 estimates

• Convert F0 candidates into corresponding lags

• Compute correlation – compute correlation scores at each lag using speech segments having the highest energy and separated by the lag

Pitch Estimation

Pitch selection (CS – correlation score, SS – spectral score)

Class 1: (CS > 0.79 AND SS > 0.78) OR (SS > 0.68 AND SS + CS > 1.6)

Class 2: (CS > 0.7 AND SS > 0.7) AND (0.82*Ref < F0 < 1.22*Ref)

Class 3: (CS > 0.85 OR SS > 0.82)

[Flowchart. Processing nodes: sort F0; find best class 1 candidate; set ref. to stable pitch; set ref. to previous pitch; find best class 2 candidate; find best class 3 candidate; set pitch; set uv pitch. Decision nodes: Found?; Full list?; Stable track?; Cont. pitch?; ss > .95 & cs > .95?]
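The three candidate classes can be written as simple predicates. The Python sketch below covers only the class membership tests stated on the slide. The order in which the flowchart tries them, and how the reference `ref` is chosen, is not reproduced. The function names are illustrative.

```python
def is_class1(cs, ss):
    """Class 1: strong on both scores, or a good spectral score with a high sum."""
    return (cs > 0.79 and ss > 0.78) or (ss > 0.68 and ss + cs > 1.6)


def is_class2(cs, ss, f0, ref):
    """Class 2: moderate scores and F0 close to a reference pitch frequency."""
    return (cs > 0.7 and ss > 0.7) and (0.82 * ref < f0 < 1.22 * ref)


def is_class3(cs, ss):
    """Class 3: at least one very high score."""
    return cs > 0.85 or ss > 0.82


# Example: a candidate near the reference that misses class 1 but meets class 2.
print(is_class1(0.75, 0.75), is_class2(0.75, 0.75, f0=110.0, ref=100.0))
```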

Voicing Classification

Inputs – vad_flag, hangover_flag, input speech signal, high-pass filtered speech signal, frame energy, pitch period

Outputs – voicing class (non-speech, unvoiced, mixed-voiced, and fully-voiced speech)

Decision flow:

• vad_flag == false? Yes: VC = non-speech

• Otherwise, pitch period == 0? Yes: VC = unvoiced

• Otherwise, (zcm >= zcm_thld || ef_ub <= ef_ub_thld || hangover_flag == true)? Yes: VC = mixed-voiced

• Otherwise: VC = fully-voiced
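The decision flow of this slide maps directly onto a few conditionals. The Python sketch below assumes the caller supplies the measures zcm and ef_ub and their thresholds. The slide does not give threshold values, so the numbers in the example call are placeholders.

```python
def voicing_class(vad_flag, pitch_period, zcm, ef_ub, hangover_flag,
                  zcm_thld, ef_ub_thld):
    """Return the voicing class following the slide's decision order."""
    if not vad_flag:
        return "non-speech"
    if pitch_period == 0:
        return "unvoiced"
    if zcm >= zcm_thld or ef_ub <= ef_ub_thld or hangover_flag:
        return "mixed-voiced"
    return "fully-voiced"


# Placeholder thresholds, chosen only to exercise the branches.
print(voicing_class(True, 57, zcm=0.1, ef_ub=0.4, hangover_flag=False,
                    zcm_thld=0.3, ef_ub_thld=0.2))
```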

Quantization of Voicing Class and Pitch

Pitch Quantization

In each frame-pair, the first frame’s pitch period (19 – 140) is absolutely quantized using 7 bits; the second frame’s pitch period is differentially quantized using 5 bits.

Class Quantization

Voicing Class (VC)     Pitch Index (Pidx)   Class Index (Cidx)
Non-speech             0                    0
Unvoiced speech        0                    1
Mixed-voiced speech    > 0                  0
Fully-voiced speech    > 0                  1
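The class table shows that the voicing class needs only one extra bit, because a zero pitch index already separates the unvoiced classes from the voiced ones. A minimal Python sketch of that mapping follows. The function names are illustrative, and the pitch index itself is taken as given.

```python
def encode_class(vc, pitch_index):
    """Map (voicing class, pitch index) to (Pidx, Cidx) per the class table."""
    voiced = vc in ("mixed-voiced", "fully-voiced")
    if voiced and pitch_index <= 0:
        raise ValueError("voiced frames need a pitch index > 0")
    pidx = pitch_index if voiced else 0
    cidx = 1 if vc in ("unvoiced", "fully-voiced") else 0
    return pidx, cidx


def decode_class(pidx, cidx):
    """Inverse mapping: recover the voicing class from (Pidx, Cidx)."""
    if pidx == 0:
        return "unvoiced" if cidx else "non-speech"
    return "fully-voiced" if cidx else "mixed-voiced"


print(encode_class("mixed-voiced", 42), decode_class(0, 1))
```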

Quantization of Voicing Class and Pitch

Choice of reference pitch period and 31 quantization levels for the (m+1)th frame, as a function of the pitch indices of the preceding three frames, Pidx(m-2), Pidx(m-1) and Pidx(m):

• Pidx(m-2) = 0; Pidx(m-1) = 0, or > 0 but unreliable; Pidx(m) = 0
No suitable reference is available. Use 5-bit absolute quantization. The 31 quantization levels are chosen to span the range from 19 to 140 uniformly in the log-domain.

• Pidx(m-2): don’t care; Pidx(m-1): don’t care; Pidx(m) > 0
The quantized pitch period value of the mth frame is chosen as the reference. Out of the 31 quantization levels, 27 are chosen to cover the range from (0.8163*reference) to (1.2250*reference) uniformly in the log-domain. The other 4 levels depend on the reference value as follows:
19 <= reference <= 30 - (2.00, 3.00, 4.00, 5.00)*reference
30 < reference <= 60 - (1.50, 2.00, 2.50, 3.00)*reference
60 < reference <= 95 - (0.50, 0.67, 1.50, 2.00)*reference
95 < reference <= 140 - (0.25, 0.33, 0.50, 0.67)*reference

• Pidx(m-2): don’t care; Pidx(m-1) > 0 and reliable; Pidx(m) = 0
The quantized pitch period value of the (m-1)th frame is chosen as the reference. The choice of quantization levels is the same as shown in the case below.

• Pidx(m-2) > 0; Pidx(m-1) = 0, or > 0 but unreliable; Pidx(m) = 0
The quantized pitch period value of the (m-2)th frame is chosen as the reference. Out of the 31 quantization levels, 25 are chosen to cover the range from (0.7781*reference) to (1.2852*reference) uniformly in the log-domain. The other 6 levels depend on the reference value as follows:
19 <= reference <= 30 - (1.50, 2.00, 2.50, 3.00, 4.00, 5.00)*reference
30 < reference <= 60 - (0.67, 1.50, 2.00, 2.50, 3.00, 4.00)*reference
60 < reference <= 95 - (0.33, 0.50, 0.67, 1.50, 1.75, 2.00)*reference
95 < reference <= 140 - (0.20, 0.25, 0.33, 0.50, 0.67, 1.50)*reference
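As an illustration of the case where the mth frame supplies the reference, the Python sketch below builds the 31 levels: 27 log-uniform levels between 0.8163 and 1.2250 times the reference, plus the 4 reference-dependent multiples. It assumes the 27 levels include both end points and that no clipping to the 19 to 140 range is applied. Neither detail is stated on the slide, and the function names are illustrative.

```python
import math

# (upper bound of reference band, multiples of the reference) from the slide
EXTRA_RATIOS = [
    (30, (2.00, 3.00, 4.00, 5.00)),
    (60, (1.50, 2.00, 2.50, 3.00)),
    (95, (0.50, 0.67, 1.50, 2.00)),
    (140, (0.25, 0.33, 0.50, 0.67)),
]


def log_uniform(lo, hi, n):
    """n values from lo to hi, equally spaced in the log domain (ends included)."""
    step = (math.log(hi) - math.log(lo)) / (n - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(n)]


def levels_from_frame_m(ref):
    """31 differential quantization levels when frame m is the reference."""
    if not 19 <= ref <= 140:
        raise ValueError("reference pitch period must lie in [19, 140]")
    near = log_uniform(0.8163 * ref, 1.2250 * ref, 27)
    for upper, ratios in EXTRA_RATIOS:
        if ref <= upper:
            far = [r * ref for r in ratios]
            break
    return sorted(near + far)


levels = levels_from_frame_m(100.0)
print(len(levels), round(levels[0], 2), round(levels[-1], 2))
```

Because 0.8163 * 1.2250 is very close to 1, the middle of the 27 log-uniform levels falls almost exactly on the reference itself, so an unchanged pitch period can be coded with negligible error.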

Bit-Stream Formatting and Error Protection

Multi-Frame Format

Sync Sequence    Header Field    Frame Packet Stream    Total
2 octets         4 octets        162 octets             168 octets

Header Field Format

Bit:   8       7 6 5 4      3        2 1         Octet
       Ext     MframeCnt    FeType   SampRate    1
       EXP8 EXP7 EXP6 EXP5 EXP4 EXP3 EXP2 EXP1   2
       P8 P7 P6 P5 P4 P3 P2 P1                   3
       P16 P15 P14 P13 P12 P11 P10 P9            4

Header Field Definitions

Field          No. Bits   Meaning                    Code   Indicator
SampRate       2          sampling rate              00     8 kHz
                                                     01     11 kHz
                                                     10     undefined
                                                     11     16 kHz
FeType         1          Front-end specification    0      standard
                                                     1      Noise robust
MframeCnt      4          multiframe counter         xxxx   Modulo-16 number
Ext            1          Extended front-end         0      Not extended (4800 bps)
                                                     1      Extended (5600 bps)
EXP2 - EXP9    8          Expansion bits (TBD)       0      (zero pad)
P1 - P16       16         Cyclic code parity bits    (see below)

Bit-Stream Formatting and Error Protection

Frame information for the mth and (m+1)th frames

Bit 8 ................................ Bit 1      Octet
Idx2,3(m)          Idx0,1(m)                      1
Idx4,5(m)          Idx2,3(m) (cont)               2
Idx6,7(m)          Idx4,5(m) (cont)               3
Idx10,11(m)        Idx8,9(m)                      4
Idx12,13(m)        Idx10,11(m) (cont)             5
Idx0,1(m+1)        Idx12,13(m) (cont)             6
Idx2,3(m+1)        Idx0,1(m+1) (cont)             7
Idx6,7(m+1)        Idx4,5(m+1)                    8
Idx8,9(m+1)        Idx6,7(m+1) (cont)             9
Idx10,11(m+1)      Idx8,9(m+1) (cont)             10
Idx12,13(m+1)                                     11
Pidx(m)            CRC(m,m+1)                     12
Pidx(m+1)          Pidx(m) (cont)                 13
PC-CRC(m,m+1)  Cidx(m+1)  Cidx(m)                 14

The first 11½ bytes correspond to FE or AFE. The last 2 bytes are added for the extension.
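The octet counts above can be checked against the quoted bit rates with a few lines of arithmetic. The Python sketch below assumes a multiframe of 24 frames (12 frame pairs), which is the multiframe size of the ETSI DSR front-end standards and is not stated on the slide, together with the 10 ms update rate quoted elsewhere in the deck. The constant and function names are illustrative.

```python
FRAME_MS = 10                # update rate quoted in the requirements table
FRAMES_PER_MULTIFRAME = 24   # assumption: 12 frame pairs per multiframe
SYNC_BITS = 2 * 8            # 2-octet sync sequence
HEADER_BITS = 4 * 8          # 4-octet header field

BASE_PAIR_BITS = 92                    # 11.5 octets per frame pair (FE or AFE)
EXT_PAIR_BITS = BASE_PAIR_BITS + 16    # plus 2 octets for pitch, class, PC-CRC


def bit_rate_bps(frame_pair_bits):
    """Bit rate of one multiframe carrying the given frame-pair payload."""
    pairs = FRAMES_PER_MULTIFRAME // 2
    total_bits = SYNC_BITS + HEADER_BITS + pairs * frame_pair_bits
    duration_ms = FRAMES_PER_MULTIFRAME * FRAME_MS
    return total_bits * 1000 // duration_ms


print(bit_rate_bps(BASE_PAIR_BITS))   # un-extended stream
print(bit_rate_bps(EXT_PAIR_BITS))    # extended stream
print(12 * EXT_PAIR_BITS // 8)        # frame packet stream size in octets
```

Under these assumptions the extended frame packet stream comes to 162 octets and the two rates differ by 800 bps, matching the pitch and class rate shown in the system overview.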


Part II – Server Side Algorithm Description

• Bit-stream decoding and error mitigation

• Speech reconstruction block diagram

• Pitch tracking and smoothing

• Cepstra de-equalization (XAFE)

• Features transformation at 16kHz sampling rate (XAFE)

• Harmonic magnitudes reconstruction

• Harmonic phases synthesis

• Line spectrum to time-domain transformation

• Overlap-add

Bit-stream Decoding and Error Mitigation

• Extract pitch and voicing class indices and check PC-CRC

• Error free frame pair – decode
– Decode voicing class using VC encoding table
– First frame – pitch index points to quantization level
– Second frame – decode pitch using Pitch encoding table

• Corrupt frame pair – keep receiving until error free pair is determined

• Assign pitch and class parameters of corrupt frames

Pitch, Class, & logE Assignment for Corrupt Frames

• B ≤ 2 – copy from last/first good frame

• 2 < B ≤ 12
– copy pitch and class from last/first good frame
– “fully-voiced” class → “mixed-voiced” class
– logE(n) = max(logE(n−1) – 2, 4.7)

• B > 12 – class = “unvoiced”, pitch = 0, logE = 4.7

[Figure: run of corrupt frames between the last good frame and the first good frame, with the distance B marked from each end.]

Speech Reconstruction Block Diagram

[Block diagram. Inputs: MFCC, logE, pitch, voicing class. Blocks: PTS, HSI, CDE, T16 kHz, HOCR, UPH, SFEQ, CTM, COMB, APM, PF, VPH, LSTD, OLA. Output: speech.]

• PTS – pitch tracking & smoothing
• HSI – harmonics structure initialization
• CDE – cepstra de-equalization, only for XAFE
• T16kHz – features transformation, only for XAFE at 16 kHz
• HOCR – high order cepstra recovery
• UPH – unvoiced phase synthesis
• SFEQ – solving front-end equations
• CTM – cepstra to magnitudes transformation
• COMB – combined magnitudes estimate
• APM – all-pole modeling
• VPH – voiced phase synthesis
• PF – postfiltering
• LSTD – line spectrum to time domain transformation
• OLA – overlap-add

Pitch Tracking and Smoothing

• 1-st stage
– Handle short voiced segments
– Find the most energetic set of similar pitch values (track) and determine reference pitch value
– Do integer scaling

• 2-nd stage – correct outliers

• 3-rd stage – smoothing by a 5-tap symmetric filter

[Figure: pitch buffer ordered from oldest to most recent frame, with input and output positions marked (labels 10 and 8).]

• Voicing class correction
– Voiced → Unvoiced – Voicing Class = “unvoiced”
– Unvoiced → Voiced – Voicing Class = “mixed-voiced”

Pitch Contours Clean vs. Babble Noise

[Figure: pitch contours; x-axis time (msec), 0 to 2500; y-axis pitch (samples), 0 to 50.]

Pitch Contours XAFE vs. XFE

[Figure: pitch contours; x-axis time (msec), 0 to 2500; y-axis pitch (samples), 0 to 50.]

Speech Synthesis Input/Output

• Input
– 13 low order cepstra (LOC): C0,C1,…,C12
– Pitch period p8kHz, converted as p = p8kHz × out_sampling_rate / 8 (sampling rate in kHz)
– Log-energy logE
– Voicing class:
  • fully voiced
  • mixed-voiced
  • unvoiced (“unvoiced” + “non-speech”)

• Output speech signal
– XFE: output sampling rate = input sampling rate – 8, 11, 16 kHz
– XAFE: output sampling rate = 8 kHz

Harmonic Model of Speech Frame

• Time-domain – sum of sinusoidal waves

$$s(n) = \sum_{k=1}^{N_h} A_k \sin(2\pi f_k n + \varphi_k)$$

• Frequency domain – line spectrum

$$S(f) = \sum_{k=1}^{N_h} A_k e^{j\varphi_k}\, \delta(f - f_k)$$
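The time-domain form of the harmonic model can be sketched in a few lines of Python. The sketch assumes normalized frequencies f_k in cycles per sample and a sine convention for the phase. The reconstruction chain described later in the deck works in the frequency domain instead, so this is only an illustration of the model, and the function name is illustrative.

```python
import math


def synthesize_frame(amps, freqs, phases, n_samples):
    """Sum of sinusoids: s(n) = sum_k A_k * sin(2*pi*f_k*n + phi_k)."""
    return [
        sum(a * math.sin(2.0 * math.pi * f * n + ph)
            for a, f, ph in zip(amps, freqs, phases))
        for n in range(n_samples)
    ]


# One harmonic at a quarter of the sampling rate: 0, 1, 0, -1, ...
print([round(x, 6) for x in synthesize_frame([1.0], [0.25], [0.0], 4)])
```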

Harmonic Structure Initialisation

• Fully voiced frame – voiced harmonics array

$$f_k = k/p,\quad k = 1,\dots,N_v,\quad N_v = \mathrm{floor}(p/2),\qquad A_k = ?,\quad \varphi_k = ?$$

• Unvoiced frame – unvoiced harmonics array

$$f_k = k/FFTL,\quad k = 1,\dots,N_u,\quad N_u = FFTL/2,\qquad A_k = ?,\quad \varphi_k = RAND(0, 2\pi)$$

• Mixed-voiced frame – voiced and unvoiced harmonic arrays

Cepstra De-equalization - XAFE

• Purpose: to reverse AFE blind equalization of the C1,…,C12 cepstra coefficients

[Equations garbled in the source. Recoverable sequence: stepSize is computed from logE; new_bias is computed from bias, stepSize, C and RefCep, with the factor 0.999; C is corrected using bias; bias is then set to new_bias.]

• Applied to quantized cepstra - regularization factor 0.999 guarantees stability

AFE 16kHz Features Transformation

• Purpose: to restore plain MFCC and energy representing [0–4 kHz] frequency band

[Equations garbled in the source. Recoverable elements: cosine-transform sums with index ranges 0,…,12 and 1,…,26, channel counts 23 and 26, exp and ln operations, and a recomputed logE.]

High Order Cepstra Recovery

• Purpose – to estimate high order cepstra (HOC) C13,…,C22 not transmitted from client side
– Increases accuracy of harmonic magnitudes estimation

• Implemented through look-up table using pitch as parameter

• Pitch range is partitioned into sub-ranges

• Representative HOC vector for each sub-range stored in the table has been obtained by averaging over large speech database

• Further refinement of HOC is built in the magnitudes reconstruction procedure

Harmonic Magnitudes Reconstruction

• Two independent estimates of harmonic magnitudes are obtained by two methods:
– Solving front-end equations (SFEQ)
– Cepstra to magnitudes transformation (CTM)

• The estimates are mixed together using frequency and pitch dependent mixing ratio

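Solving Front-End Equations (SFEQ)

• Front-end equation ties harmonic parameters with mel-filter bank outputs

$$MF_i = \sum_n \left| \sum_k A_k e^{j\varphi_k}\, FTW_{ham}(n/FFTL - f_k) \right| B_i(n), \qquad i = 1,\dots,23$$

where FTW_ham is the Fourier transform of the Hamming window, B_i is the i-th mel-filter weighting function, and the mel-filter bank outputs are obtained from the cepstra as MF = exp(IDCT(C)). [The slide also gives an AFE variant that is not recoverable from the source.]

• Linearization – especially applicable for voiced frames

$$MF_i \approx \sum_n \sum_k A_k \left| FTW_{ham}(n/FFTL - f_k) \right| B_i(n), \qquad i = 1,\dots,23$$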

SFEQ

• 23 basis vectors are derived from mel-filter weighting functions sampled at harmonic frequencies

$$BV_i(f_k) = 0.4\, MF_i(f_k) + 0.6\, MF_i^2(f_k), \qquad i = 1,\dots,23$$

[Figure: 8-th mel-filter and basis vector, pitch = 85.3 samples; x-axis FFT point (18 to 30), y-axis 0 to 1.]

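SFEQ

• Harmonic magnitudes vector is represented as linear combination of basis vectors

$$\mathbf{A} = \sum_{i=1}^{23} \gamma_i\, \mathbf{BV}_i$$

• Front-end equations in γ [matrix equation garbled in the source; it involves the matrix B and the vector γ]

• Least square equations [garbled in the source; they involve B, its transpose, the identity matrix I and γ]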

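SFEQ

• Built-in high order cepstra recovery for voiced frames

• 3 iterations

[Block diagram: LOC, HOC and PITCH feed Solve Equations; Compute Magnitudes produces A_SFEQ; Compute HOC returns updated HOC to Solve Equations.]

• SFEQ estimate:

$$\mathbf{A}^{SFEQ} = \left( \sum_{i=1}^{23} \gamma_i\, \mathbf{BV}_i \right) ./\ \mathbf{Preemp}$$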

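Cepstra to Magnitudes Transformation (CTM)

• Modify cepstra to compensate influence of pre-emphasis and variable mel-channel width

[Equation garbled in the source: C is adjusted by a correction term C_fix.]

• Find location of Mel-scaled harmonic frequency at Mel-scaled channel centers grid

$$\iota_k = 24 \cdot \frac{Mel(f_k) - Mel(64\,\mathrm{Hz})}{Mel(F_{Nyquist}) - Mel(64\,\mathrm{Hz})}, \qquad \iota_k \leftarrow MAX(0.5,\ MIN(23.5,\ \iota_k))$$

[The symbol used for the index on the slide is lost; ι_k is used here.]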

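CTM

• Compute IDCT coefficient corresponding to (non-integer) index

[Equation garbled in the source. Recoverable elements: A_k is computed for k = 1,…,N_harm as a cosine sum over the cepstra C_n, with scale factor 2/23 and a cosine argument containing (index − 0.5)/23.]

• Compute estimate of harmonic magnitudes as:

[Equation garbled in the source. A_k^CTM is obtained from A_k through exp(·) for k = 1,…,N_harm, with separate XFE and XAFE variants; one variant carries an exponent 1/2.]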

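Combined Magnitudes Estimate (COMB) – Scaling SFEQ Magnitudes

• Unvoiced harmonics or short pitch period (p8kHz ≤ 55) – constant scaling factor SF, computed from A_CTM and A_SFEQ [formula garbled in the source]

• Long pitch period (p8kHz > 55) – frequency dependent scaling factor: SF_LOW for the band up to 1.2 kHz and SF_HIGH for the band from 1.2 kHz to F_Nyquist

[Figure: SF versus frequency (Hz), showing the levels SF_LOW and SF_HIGH with axis marks at 200 and 2500 Hz.]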

COMB – Mixing SFEQ and CTM Magnitudes

• Unvoiced harmonics:

$$\mathbf{A} = 0.9\,\mathbf{A}^{SFEQ} + 0.1\,\mathbf{A}^{CTM}$$

• Voiced harmonics – pitch dependent mixture ratio specified by a table

$$\mathbf{A} = \lambda(p)\,\mathbf{A}^{SFEQ} + (1 - \lambda(p))\,\mathbf{A}^{CTM}$$

[Figure: λ(p) for XAFE; x-axis pitch period from 20 to 160, y-axis 0 to 0.9.]
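The mixing rule is a per-harmonic weighted average of the two magnitude estimates. In the Python sketch below the table of λ(p) is not reproduced, since its values are not given in the transcript, so the caller passes the mixing ratio directly. The function names are illustrative.

```python
def mix_unvoiced(a_sfeq, a_ctm):
    """Unvoiced harmonics: fixed 0.9 / 0.1 mix of SFEQ and CTM magnitudes."""
    return [0.9 * s + 0.1 * c for s, c in zip(a_sfeq, a_ctm)]


def mix_voiced(a_sfeq, a_ctm, lam):
    """Voiced harmonics: lam * A_SFEQ + (1 - lam) * A_CTM, lam looked up by pitch."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("mixing ratio must lie in [0, 1]")
    return [lam * s + (1.0 - lam) * c for s, c in zip(a_sfeq, a_ctm)]


print(mix_voiced([1.0, 2.0], [3.0, 4.0], lam=0.5))
```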

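All-pole Modeling of Spectral Envelope for Voiced Harmonics (APM)

[Block diagram: magnitudes A → Interpolate Magnitudes → Inverse DFT → ACF → Durbin-Levinson → {a_n}. Long pitch period? YES: Magnitudes Synthesis → A_new. NO: the branch bypasses Magnitudes Synthesis.]

[Equations garbled in the source. Recoverable elements: the all-pole coefficients a = (a_1 … a_M) are defined through an argmin over a, and the synthesized magnitudes are evaluated from the all-pole model at the harmonic frequencies f_k.]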
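The Durbin-Levinson step named in the diagram is the standard recursion that turns an autocorrelation sequence into all-pole coefficients. The Python sketch below is a textbook version, using the convention A(z) = 1 + a_1 z^-1 + … + a_M z^-M. The model order, the interpolation step and the gain handling used in the presented system are not given in the transcript, so the `envelope` helper here simply evaluates 1/|A| at a normalized frequency.

```python
import cmath
import math


def levinson_durbin(r, order):
    """Solve the normal equations for autocorrelation r[0..order].

    Returns (a, err) with a[0] == 1 and err the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


def envelope(a, f):
    """Magnitude of 1 / A(z) at normalized frequency f (cycles per sample)."""
    z_inv = cmath.exp(-2j * math.pi * f)
    denom = sum(coef * z_inv ** i for i, coef in enumerate(a))
    return 1.0 / abs(denom)


# Autocorrelation of a first-order process with pole at 0.5.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print([round(c, 6) for c in a], round(err, 6), round(envelope(a, 0.0), 6))
```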

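Postfiltering (PF)

• Purpose – formants emphasis of voiced frames

• Weighting of voiced harmonic magnitudes by filter derived from all-pole model parameters

[Equation garbled in the source. Recoverable elements: the filter W(z) is built from A_a(z/0.75) and A_a(z/0.95) together with a term containing the constant 0.5, where A_a(z) = 1 + Σ a_i z^{-i} and z = e^{j2πf}.]

• Weights W(exp(-j2πf_k)) are normalized, bounded and applied to voiced harmonics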

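Voiced Phase Synthesis (VPH)

• Three additive components

$$\varphi_k = \varphi_k^{al} + \varphi_k^{AP} + \varphi_k^{exc}, \qquad k = 1,\dots,N_v$$

• Linear in frequency phase providing alignment relative to previous frame [φ_k^al depends on k and on the fundamental frequencies f_1^prev and f_1^curr; the exact expression is garbled in the source]

• Vocal tract phase derived from all-pole model parameters

$$\varphi_k^{AP} = \arg\left( 1 / A_{\mathbf{a}}(z_k) \right)$$

• Pre-stored vocal cords excitation phase φ_k^exc from table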

Line Spectrum to Time Domain Transformation (LSTD)

• Mixed-voiced frames – low band (0 – 1200 Hz) voiced harmonics are combined with high band (1200 Hz – FNyquist) unvoiced harmonics

• Energy normalization
– Simulate (non-windowed) analysis frame spectrum by convolution of line spectrum with Dirichlet kernel
– Compute energy Eout
– Compute scaling factor SC from exp(logE) and Eout
– Multiply harmonic magnitudes by SC

LSTD

• Synthesis of output frame discrete spectrum by convolution of line spectrum with Hann window Fourier transform

$$S_{out}(n) = \sum_k A_k e^{j\varphi_k}\, FTW_{hann}(n/FFTL - f_k), \qquad n = 0,\dots,FFTL-1$$

[Figure: Hann window, x-axis 0 to 160 samples with the frame shift marked, y-axis 0 to 1.]

• Inverse FFT

$$s_{out} = IFFT(S_{out})$$

Overlap-add (OLA)

• FRAME_SHIFT samples long segment of reconstructed speech is available

• Next FRAME_SHIFT samples long segment is initialized

[Diagram: sout(n), n=0,…,LFFT-1 is added (+) into the Output Buffer; a FRAME_SHIFT (FS) samples long segment is sent to Output and the buffer contents are moved. Segment labels in the figure: FS samples, FS-1 samples.]
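The add, output and move cycle can be sketched as follows in Python. This is a generic overlap-add loop under simplifying assumptions: every synthesized frame is LFFT samples long and is added starting at the beginning of the buffer. The exact alignment shown in the figure (the FS and FS-1 sample segments) is not reproduced, and the function name is illustrative.

```python
def overlap_add(frames, frame_shift):
    """Overlap-add equal-length frames, emitting frame_shift samples per frame."""
    lfft = len(frames[0])
    buf = [0.0] * lfft
    out = []
    for frame in frames:
        for n in range(lfft):
            buf[n] += frame[n]                # add new frame into the buffer
        out.extend(buf[:frame_shift])         # completed segment goes to output
        buf = buf[frame_shift:] + [0.0] * frame_shift   # move, init next segment
    return out


print(overlap_add([[1.0] * 4, [1.0] * 4], frame_shift=2))
```

With Hann-windowed frames and a suitable frame shift the overlapping window tails sum to a constant, which is why the synthesis stage windows each frame before this step.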

Speech Samples

[Audio samples, not reproduced in this transcript. Each language slide offers samples from four speakers (Male 1, Female 1, Male 2, Female 2) for the coders ORIGINAL, LPC-10, MELP, XFE and XAFE, in a clean and a noisy condition.]

Language    Noisy condition
English     Car noise (10 dB SNR)
French      Street noise (15 dB SNR)
German      Babble noise (15 dB SNR)
Italian     Car noise (10 dB SNR)
Japanese    Street noise (15 dB SNR)
Mandarin    Babble noise (15 dB SNR)


Part III – Results vs. Requirements

• Algorithmic Requirements
– Data Rate, Update Rate, and Latency
– Bit-stream Formatting and Error Protection
– Complexity

• Tonal Language Recognition Evaluation
– Motorola Evaluation
– IBM Evaluation

• Intelligibility Evaluation
– Diagnostic Rhyme Test (DRT)
– Transcription Test (TT)

Algorithmic Requirements

Data Rate, Update Rate, and Latency

Algorithm Parameter    Required Value                    Actual Value
Data Rate (bps)        <= 5800                           5600
Update Rate (ms)       10                                10
Latency                Same as un-extended Front-ends    Same as un-extended Front-ends

Bit-stream Formatting and Error Protection

Requirements: 1) Same multi-frame format, 2) EXP? bit to be used to indicate extension, 3) 4-bit CRC may be extended to protect pitch and class bits

Actual implementation: 1) Same multi-frame format (24 extra bytes), 2) EXP1 bit indicates extension, 3) Separate 2-bit CRC used for pitch and class bits

Algorithmic Requirements

Complexity

Complexity Requirements / Objectives

Measure              Requirement
WMOPS                < 17
ROM size (kWords)    < 15
RAM size (kWords)    < 6

XFE Complexity (meets requirement)

Measure              FE      Extension   XFE
WMOPS                6.28    5.62        11.9
ROM size (kWords)    1.88    5.65        7.53
RAM size (kWords)    1.92    3.81        5.73

FE figures are assumed to be one half the complexity of the Advanced Front End.

XAFE Complexity (close to objective)

Measure              AFE*     Extension   XAFE
WMOPS                12.55    5.21        17.76
ROM size (kWords)    3.752    5.396       9.148
RAM size (kWords)    3.830    3.290       6.864**

* From Motorola-France Telecom-Alcatel Advanced Front-End Proposal, January 31, 2002
** With scratch memory reuse

FE/AFE Extension Complexity Analysis

Where two values are given they are FE extension / AFE extension; ROM and RAM are in kWords.

Block                     WMOPS            ROM              RAM
LBND+Preprocessing        0.874 / 0.489    0.235            1.12 / 0.6
VAD                       0.385            0.676            0.017
Pitch estimation          4.152 / 4.126    3.010 / 2.754    2.645
Voicing classification    0.164            0.047            0.007
Compression               0.046            1.684            0.021
Total                     5.62 / 5.21      5.65 / 5.396     3.81 / 3.29

Tonal Language Recognition Evaluation

Motorola Results

Error Rate (%) OR Improvement (%)

Configuration          Mandarin   Mandarin    Cantonese Digits   Cantonese Digits
                       Digits     Commands    Clean Training     Multi Training
C1 (no F0)             31.13      40.12       16.70              10.74
C2 (proprietary F0)    29.84      33.92       19.14              10.58
C3 (WI30 F0)           29.45      33.89       14.52              8.89
D21                    4.14       15.45       -14.61             1.49
D31                    5.40       15.53       13.05              17.23
D32                    1.31       0.09        24.14              15.97

TLR Requirement: D31 >= D21 OR D32 >= 0
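The slide does not define the D rows, but the tabulated values are consistent with Dxy being the relative error-rate reduction of configuration Cx over Cy, in percent. The Python sketch below applies that inferred definition and the stated TLR requirement to the Motorola figures. The function names are illustrative.

```python
def rel_improvement(base, new):
    """Relative error-rate reduction of `new` over `base`, in percent."""
    return 100.0 * (base - new) / base


def tlr_check(c1, c2, c3):
    """Return (D21, D31, D32, requirement_met) for one test set."""
    d21 = rel_improvement(c1, c2)
    d31 = rel_improvement(c1, c3)
    d32 = rel_improvement(c2, c3)
    return d21, d31, d32, (d31 >= d21 or d32 >= 0)


motorola = {
    "Mandarin Digits": (31.13, 29.84, 29.45),
    "Mandarin Commands": (40.12, 33.92, 33.89),
    "Cantonese Digits, Clean Training": (16.70, 19.14, 14.52),
    "Cantonese Digits, Multi Training": (10.74, 10.58, 8.89),
}
for name, rates in motorola.items():
    d21, d31, d32, ok = tlr_check(*rates)
    print(f"{name}: D21={d21:.2f} D31={d31:.2f} D32={d32:.2f} met={ok}")
```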

Tonal Language Recognition Evaluation

IBM Results

Error Rate (%) OR Improvement (%)

Configuration          Mandarin Digits   Cantonese Digits
C1 (no F0)             3.31              4.00
C2 (proprietary F0)    3.08              4.41
C3 (WI30 F0)           3.04              3.99
D21                    6.95              -10.25
D31                    8.16              0.25
D32                    1.30              9.52

TLR Requirement: D31 >= D21 OR D32 >= 0

Intelligibility Evaluation

Diagnostic Rhyme Test (DRT)

DRT Experiment I – Background Noise Conditions

Coder \ Noise Type     Clean   Car 10dB   Street 15dB   Babble 15dB
Unprocessed            95.7    95.5       92.4          93.8
XFE Reconstruction     93.0    88.8       85.0          87.1
XAFE Reconstruction    92.8    88.9       87.5          87.9
LPC-10                 86.9    81.3       81.2          81.2
MELP                   91.6    86.8       85.0          85.3

DRT Requirement: XFE and XAFE scores >= LPC-10 scores
DRT Objective: XFE and XAFE scores >= MELP scores

Intelligibility Evaluation

Diagnostic Rhyme Test (DRT)

DRT Experiment II – Input Signal Levels

Coder \ Input level    -10dB   0 dB*   +10dB
XFE Reconstruction     91.7    93.0    93.3
XAFE Reconstruction    92.1    92.8    92.8

* From Experiment I

DRT Experiment III – Channel Errors

Coder \ Bit Errors     None*   EP1 (C/I 10 dB)   EP2 (C/I 7 dB)   EP3 (C/I 4 dB)
XFE Reconstruction     93.0    92.6              92.1             83.4
XAFE Reconstruction    92.8    92.6              92.0             83.4
AFE Results            91.6    91.6              91.4             85.6

* From Experiment I

DRT Experiment IV – Sampling Frequencies

Coder \ Sampling Frequency    8 kHz*   11 kHz   16 kHz
XFE Reconstruction            93.0     92.9     94.2
XAFE Reconstruction           92.8     93.5     92.1

* From Experiment I

Intelligibility Evaluation

Transcription Test (TT)

Number of Missed, Wrongly transcribed, and Partially transcribed Words

Coder \ Bckgnd noise    Clean     Car        Street     Babble     Clean      Average Error (%)
Uncoded (Original)      1,1,2     1,0,1      0,2,4      3,9,3      0,4,1      0.5492
XFE Reconstruction      1,6,1     0,3,6      2,9,4      5,9,2      1,4,5      0.9954
XAFE Reconstruction     0,6,2     0,5,4      0,4,3      3,5,2      1,6,5      0.7894
LPC-10 Coder            8,18,6    62,26,7    67,22,7    47,12,3    18,10,9    5.5260
MELP Coder              0,3,1     1,6,3      4,6,2      16,10,3    1,9,5      1.2013
No. of words in msg.    1166      1153       1155       1149       1204       Tot: 5827

TT Requirement: XFE and XAFE error rates <= LPC-10 error rate
TT Objective: XFE and XAFE error rates <= MELP error rate
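The "Average Error (%)" column can be reproduced from the word counts. The slide does not state the formula, but the figures match the sum of missed, wrongly transcribed and partially transcribed words over all five messages, divided by the total of 5827 words. The Python sketch below checks that reading against the table. The variable and function names are illustrative.

```python
WORDS_PER_MESSAGE = [1166, 1153, 1155, 1149, 1204]

# (missed, wrongly transcribed, partially transcribed) per message
ERRORS = {
    "Uncoded (Original)":  [(1, 1, 2), (1, 0, 1), (0, 2, 4), (3, 9, 3), (0, 4, 1)],
    "XFE Reconstruction":  [(1, 6, 1), (0, 3, 6), (2, 9, 4), (5, 9, 2), (1, 4, 5)],
    "XAFE Reconstruction": [(0, 6, 2), (0, 5, 4), (0, 4, 3), (3, 5, 2), (1, 6, 5)],
    "LPC-10 Coder":        [(8, 18, 6), (62, 26, 7), (67, 22, 7), (47, 12, 3), (18, 10, 9)],
    "MELP Coder":          [(0, 3, 1), (1, 6, 3), (4, 6, 2), (16, 10, 3), (1, 9, 5)],
}


def average_error_pct(counts):
    """All error words over all messages, as a percentage of total words."""
    total_errors = sum(sum(triple) for triple in counts)
    return 100.0 * total_errors / sum(WORDS_PER_MESSAGE)


for coder, counts in ERRORS.items():
    print(f"{coder}: {average_error_pct(counts):.4f}")
```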


Summary

The Extended Mel-Cepstrum Front-End (XFE) and the Extended Advanced Front-End (XAFE) algorithms meet or exceed all the algorithmic, tonal language recognition performance, and intelligibility requirements!