Front-end Audio Processing: Reflections on Issues, Requirements, and Solutions
Tomas Gaensler
mh acoustics
www.mhacoustics.com
Summit NJ/Burlington VT, USA
Front-end Audio Processing
Processing to enhance perceived and/or measured sound quality
in communication and recording devices
Not So Famous Quotes (Acoustic Jewelry/Bluetooth Headset)
Gary Elko (mh/Bell labs colleague)
At IWAENC 1995: “Acoustic Echo cancellation will not be needed in the future when people wear acoustic jewelry”
Arno Penzias (1978 Nobel prize laureate)
“No one would want acoustic jewelry because people would think the users talking to themselves are crazy”
I’m glad the success of Bluetooth headsets shows that both were completely wrong!
Classical Front-end Architectures - POTS
[Block diagram: receive side (Rx) and send side (Tx), each with a BPF, and a switch loss]
Carbon microphone with expansion effect that reduces noise
Large coupling loss in handset mode
Switch loss in telephones that support speakerphone mode
Classical Front-end Architectures – Cellphone 1995
[Block diagram: receive side (Rx) and send side (Tx) with blocks BPF/ADC, BPF/DAC, EQ (both paths), Encoder/Decoder, Vol, AEC and NLP]
Classical Front-end Architectures – Cellphone 2005 - 2010
[Block diagram: receive side (Rx) and send side (Tx) with blocks BPF/ADC, BPF/DAC, EQ (both paths), Encoder/Decoder, Vol, AEC, TXLEV, RXLEV, NS (both paths) and NLP]
Cellphones and Handsfree
Common problems:
Far-end listener does not hear near-end talker
Near-end listener does not understand far-end talker
Why?
Form factor – Size
Limited understanding of physics and acoustics(?)
RX/TX Levels, Coupling and Doubletalk
Far-end: 95-100 dBSPL at loudspeaker, 85-90 dBSPL at mic
Near-end talker: 55-70 dBSPL at mic
Echo louder than near-end:
Linear AEC: ERLE 20-30 dB
After cancellation, Residual Echo to Near-end Ratio (RENR):
RENR = 90 - 20 - 70 = 0 dB
>20 dB of residual echo suppression required
Duplexness suffers
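The level budget on this slide is simple dB arithmetic, and a small sketch may make it easier to try other operating points. The function name `renr_db` and its parameter names are illustrative assumptions; the numbers are the ones quoted on the slide (90 dBSPL echo at the mic, 20 dB ERLE, 70 dBSPL near-end speech).

```python
def renr_db(echo_at_mic_dbspl, erle_db, near_end_dbspl):
    """Residual Echo to Near-end Ratio (RENR) in dB after linear cancellation."""
    residual_echo_dbspl = echo_at_mic_dbspl - erle_db
    return residual_echo_dbspl - near_end_dbspl


# Slide example: residual echo is as loud as the near-end talker.
print(renr_db(90, 20, 70))  # 0

# Better ERLE but a quiet talker: residual echo is still louder than speech.
print(renr_db(90, 30, 55))  # 5
```

With RENR near 0 dB, pushing the residual echo well below the near-end speech is what drives the >20 dB of additional suppression, at the cost of duplexness.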
TX: Dynamic Range and Noise
[Figure: TX path ADC with a level diagram relating SPL [dB] to digital level. Labels: RAIL (e.g. 32768 or 1) at 110; speech level at 70; room noise level at 43; mic circuit noise (1/f); Q-noise (white), 14 bits; mic SNR = 65 dB. Remaining axis ticks: 94, 29, 26]
Echo level: 90 dBSPL, peak echo 105-110 dB
No saturation of echo in TX path
Near-end speech level: 70 dBSPL
Actual speech to room noise ratio is only about 27 dB at best
Gain is required to get loud enough output
Perceived noise level is ~20 dB above normal room noise level
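The numbers in the TX level diagram can be reproduced with a few lines of arithmetic. This is only a sketch under stated assumptions: the rail is taken to sit at 110 dBSPL, quantization noise is approximated by the usual rule of thumb of about 6 dB per bit, and the 65 dB microphone SNR is assumed to be referenced to 94 dBSPL (the common 1 Pa convention). Function names are illustrative.

```python
DB_PER_BIT = 6  # rule-of-thumb approximation (assumption)


def q_noise_dbspl(rail_dbspl, bits):
    """Approximate quantization noise level for a converter with `bits` bits."""
    return rail_dbspl - DB_PER_BIT * bits


def mic_noise_dbspl(mic_snr_db, ref_dbspl=94):
    """Equivalent input noise of a microphone whose SNR is given re `ref_dbspl`."""
    return ref_dbspl - mic_snr_db


q_noise = q_noise_dbspl(110, 14)   # 26
mic_noise = mic_noise_dbspl(65)    # 29
speech_to_room_noise = 70 - 43     # 27

print(q_noise, mic_noise, speech_to_room_noise)  # 26 29 27
```

Under these assumptions the results line up with the 26 and 29 ticks in the figure, and with the "about 27 dB at best" speech to room noise ratio.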
TX: Fixed-point Processing and Quantization Noise
[Figure: TX path ADC, AFB (FFT), SFB (FFT), with a level diagram relating SPL [dB] to digital level: RAIL (e.g. 32768 or 1) at 110; speech level at 70; Q-noise from 64-point FFT processing at 50; LSB for 16 bits at 14]
Q-noise increases by 6log2(N) dB!
N = 64: Q-noise increases by 6log2(64) = 36 dB
Double-precision “required”
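The 6log2(N) rule quoted on this slide can be turned into a quick noise-floor estimate. The sketch below simply evaluates that rule; it does not simulate a fixed-point FFT. The 110 dBSPL rail and the 6 dB per bit approximation are assumptions carried over from the level diagram, and the function names are illustrative.

```python
import math


def fft_qnoise_growth_db(n_fft):
    """Quantization noise growth for N-point fixed-point FFT processing,
    using the slide's rule of thumb of 6*log2(N) dB."""
    return 6 * math.log2(n_fft)


def lsb_level_dbspl(rail_dbspl, bits):
    """Approximate level of the LSB, assuming about 6 dB per bit."""
    return rail_dbspl - 6 * bits


growth = fft_qnoise_growth_db(64)      # 36.0 dB
lsb_16 = lsb_level_dbspl(110, 16)      # 14 dBSPL
print(lsb_16 + growth)                 # 50.0
```

A floor at about 50 dBSPL sits only 20 dB below 70 dBSPL speech, which is the motivation for double precision, block scaling, or floating point.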
RX: Dynamic Range and Distortion
[Figure: RX path block diagram with EQ, digital gain, DAC and analog gain, plus a tap labeled “To AEC”]
Small loudspeakers have a rather high cut-off frequency (high-pass)
EQ is often required to get an acceptable “sound” (frequency response). However, EQ means:
Loss of signal loudness and dynamic range
Increased (analog) distortion
Many manufacturers compensate for the loss of signal level with excessive digital gain and therefore get (digital) saturation
What Can or Should be Done?
Minimize acoustical coupling by good physical design
TX
Use noise suppression but not excessively
Double-precision, block scaling, or floating-point
RX
Compression instead of fixed gain
10% or less loudspeaker/driver THD is desired
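To illustrate why compression is preferable to fixed gain in the RX path, here is a minimal static compressor gain curve. It is a sketch, not a production design: the threshold, ratio, and make-up gain values are hypothetical, levels are in dB relative to digital full scale, and attack/release smoothing is omitted.

```python
def compressor_gain_db(level_db, threshold_db=-20.0, ratio=4.0, makeup_db=12.0):
    """Static compressor: full make-up gain below the threshold, and
    progressively less gain above it so loud input does not saturate."""
    over_db = max(0.0, level_db - threshold_db)
    return makeup_db - over_db * (1.0 - 1.0 / ratio)


for level in (-40.0, -20.0, 0.0):
    fixed_out = level + 12.0                       # fixed 12 dB gain
    comp_out = level + compressor_gain_db(level)   # compressed output
    print(level, fixed_out, comp_out)
```

With these example settings a full-scale input ends up at -3 dB after compression, whereas the fixed 12 dB gain would push it 12 dB above full scale and clip.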
What about Non-linear AEC Algorithms?
Interesting problem proposed and worked on for many years
Not practical in most AEC applications since:
Complicated model
Gain and therefore saturation possibly in both TX and RX paths
Added complexity and system cost
Often slow convergence
Difficult to fine-tune in field
Even when non-linear cancellation works perfectly, the user still perceives a distorted loudspeaker signal!
Classical Front-end Architectures – Cellphone 2005 - 2010
[Block diagram: receive side (Rx) and send side (Tx) with blocks BPF/ADC, BPF/DAC, EQ (both paths), Encoder/Decoder, Vol, AEC, TXLEV, RXLEV, NS (both paths) and NLP]
Why RX NS?
Why TX NS?
Single Channel Noise Suppression
Basic single channel noise suppressor
An extremely successful signal processing invention by Manfred Schroeder in the 1960s
Musical tones – is it a (solved) problem?
How do we evaluate and improve quality?
How about convergence rate?
Background to Single Channel Noise Suppressors
[Figure: speech s(n) plus noise v(n) gives y(n) = s(n) + v(n), which the noise suppressor (NS) maps to the “enhanced” speech \hat{s}(n)]
Block processing:
$X(k,m) = \sum_{n=0}^{K-1} w(n)\, x(m-n)\, e^{-j 2\pi k n / K}$
Frequency domain model:
$Y(k,m) = S(k,m) + V(k,m)$
Linear time-varying filter:
$\hat{S}(k,m) = H(k,m)\, Y(k,m)$
Wiener filter:
$H(k,m) = \frac{P_s(k,m)}{P_s(k,m) + P_v(k,m)} = \frac{P_y(k,m) - P_v(k,m)}{P_y(k,m)}$
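As a rough illustration of the Wiener filter on this slide, the sketch below computes the gain H = (P_y - P_v) / P_y for one frequency bin. The gain floor (`max_suppression_db`, set to 12 dB here) and the clamping to [floor, 1] are assumptions added for robustness; they are not stated on this slide.

```python
def wiener_gain(p_y, p_v, max_suppression_db=12.0):
    """Wiener-style suppression gain for one bin.

    p_y: estimated power of the noisy signal, P_y(k, m)
    p_v: estimated power of the noise, P_v(k, m)
    """
    g_min = 10.0 ** (-max_suppression_db / 20.0)  # assumed gain floor
    if p_y <= 0.0:
        return g_min
    h = (p_y - p_v) / p_y
    # Noisy estimates can make h negative or above 1, so clamp it.
    return max(g_min, min(1.0, h))


print(wiener_gain(4.0, 1.0))  # 0.75: speech dominates, little attenuation
print(wiener_gain(1.0, 1.0))  # noise only: gain limited to the floor
```

The enhanced spectrum for a bin would then be the gain multiplied by Y(k, m).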
Background to Single Channel Noise Suppressors
Estimation of spectra is often done recursively:
$P_y(k,m) = \beta\,[P_y(k,m-1) - |Y(k,m)|^2] + |Y(k,m)|^2$
$P_v(k,m) = \gamma\,[P_v(k,m-1) - |Y(k,m)|^2] + |Y(k,m)|^2$, when speech is “not” present
$\beta$, $\gamma$: time-dimension averaging constants
Frequency smoothing:
$\bar{H}(k,m) = \sum_{k'=-k_b}^{k_b} b(k',k)\, H(k-k',m)$
$b(k',k)$: frequency-dimension averaging constants
$\beta$, $\gamma$, $b(k',k)$ and $k_b$ are critical for musical tone control
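The two smoothing steps on this slide (recursive averaging over time, weighted averaging over frequency) can be sketched as follows. Several simplifications are assumptions of this example: the frequency weights are the same for every bin (the slide allows them to depend on the bin), the window is renormalized at the band edges, and all names and constants are illustrative.

```python
def smooth_psd(p_prev, y_mag2, const):
    """One recursive update: const * (P(k, m-1) - |Y|^2) + |Y|^2,
    which equals const * P(k, m-1) + (1 - const) * |Y|^2."""
    return const * (p_prev - y_mag2) + y_mag2


def smooth_gains(h, b):
    """Average the gains h[k] over neighbouring bins with an odd-length
    weight window b centred on each bin."""
    k_b = len(b) // 2
    out = []
    for k in range(len(h)):
        acc, wsum = 0.0, 0.0
        for i, w in enumerate(b):
            j = k - (i - k_b)          # neighbouring bin index
            if 0 <= j < len(h):        # skip bins outside the band
                acc += w * h[j]
                wsum += w
        out.append(acc / wsum)
    return out


print(smooth_psd(1.0, 0.0, 0.9))                                   # 0.9
print(smooth_gains([0.0, 0.0, 1.0, 0.0, 0.0], [0.25, 0.5, 0.25]))  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Larger averaging constants and wider frequency windows lower the variance of the gains, which reduces musical tones but slows convergence.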
Musical Tones – Is it a (Solved) Problem?
Examples:
Original (“Sally Sievers’ reel, June-Sept. 1964” by Manfred Schroeder and Mohan Sondhi at Bell Labs)
Original + noise (iSNR ~ 6 dB)
Schroeder – 1960s
“Generic spectral subtraction” – Boll 1979
IS-127 – 1995
“A problem of last century”, only a constraint in design
Controlling variance of suppression gains
Any NS algorithm should be constrained not to have musical tones
Must only have a small impact on voice quality
Quality Metrics
Most importantly: Listen!
SNR
Total
Segmental
During speech
Distortion metrics:
ISD (Itakura-Saito distance)
ITU-T P.862: PESQ/MOS-LQO
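Of the metrics listed, segmental SNR is easy to sketch. The version below is one common formulation and not necessarily the one used for the results in this talk: the frame length, the clamping limits of -10 and 35 dB, and the skipping of silent frames are assumptions.

```python
import math


def segmental_snr_db(clean, processed, frame_len=160, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo, hi].

    clean, processed: equal-length, time-aligned sample sequences.
    """
    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        sig = sum(a * a for a in s)
        err = sum((a - b) ** 2 for a, b in zip(s, p))
        if sig == 0.0:
            continue  # skip silent frames
        snr = hi if err == 0.0 else 10.0 * math.log10(sig / err)
        frame_snrs.append(min(hi, max(lo, snr)))
    return sum(frame_snrs) / len(frame_snrs) if frame_snrs else float("nan")


clean = [1.0] * 320
processed = [0.9] * 320
print(round(segmental_snr_db(clean, processed), 3))  # 20.0
```

Restricting the average to frames that contain speech gives the "during speech" variant.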
Quality Metric – P.862 (PESQ/MOS-LQO)
[Figure: MOS-LQO (MOS Listening Quality Objective, P.862.2; axis 1.5 to 4.5) versus SNR (dB; axis 0 to 60) for unproc, Alg-1 and Alg-2]
Alg-1/2 – Wiener methods with 12 dB noise suppression
What can the best noise suppressor achieve?
Quality Metric – “My Rule of Thumb”
[Figure: MOS-LQO (P.862.2; axis 1.5 to 4.5) versus SNR (dB; axis 0 to 60) for unproc, Alg-1, Alg-2 and Bound (12 dB)]
Ideal MOS (PESQ) performance bound is given by shifting the unprocessed PESQ curve to the left
Example for 12 dB suppression: 12 dB shift to the left
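The rule of thumb amounts to evaluating the unprocessed curve at an SNR that is higher by the amount of suppression. The sketch below shows this with linear interpolation. The curve values are made-up placeholders chosen only to exercise the code; they are not measured PESQ data, and the function names are illustrative.

```python
def interp(xs, ys, x):
    """Piecewise-linear interpolation, held constant outside [xs[0], xs[-1]]."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])


def mos_bound(snr_db, unproc_snr, unproc_mos, suppression_db=12.0):
    """Bound at snr_db: the unprocessed score at snr_db + suppression_db,
    i.e. the unprocessed curve shifted to the left."""
    return interp(unproc_snr, unproc_mos, snr_db + suppression_db)


# Hypothetical placeholder curve (NOT measured data).
snr_points = [0.0, 12.0, 24.0, 36.0]
mos_points = [1.5, 2.5, 3.5, 4.5]

print(mos_bound(12.0, snr_points, mos_points))  # 3.5
```

A suppressor that scores above this bound at a given SNR would be doing better than removing 12 dB of noise with no speech distortion at all, which is why the shifted curve serves as a sanity check.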
Convergence Rate
Important performance criterion:
Non-stationary noise conditions
Frame loss
Main objective:
Maximize convergence rate while maintaining speech
quality
Convergence Rate – A Useful Test
a) Input sequence
b) IS-127
c) Wiener Based
d) A spectral subtraction m-script retrieved from the internet
Convergence Rate and MOS-LQO
a) “Normal”
b) “Fast”
c) MOS-LQO
Current Applications and Drivers of NS Technology
Where is NS going in industry now?
Beyond “12 dB” of suppression
Multi-microphone solutions
Two- or more channel suppressors
Linear beamforming
Applications
Mobile phones (a few two-microphone models have
reached the market)
Bluetooth headsets: great "new" application for signal
processing (Ericsson BT headset 2000)
Background to Linear Beamforming
N : Number of microphones
Broadside linear beamforming (e.g. delay-sum)
Directional gain: 10log(N)
White Noise Gain (WNG)>0
Practical size: “large” (~30cm)
Endfire differential beamforming
Directional gain: 20log(N)
WNG<0
Practical size: “small” (1.5-5cm)
[Figure: microphone array feeding a processing block, with the endfire and broadside directions indicated]
Differential beamformers more suitable for small form-factors
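The two directional gain rules above are easy to compare numerically. The sketch below evaluates them for small arrays; the logarithms are taken as base 10 and the results as dB, which is an assumption about the slide's shorthand, and the function names are illustrative.

```python
import math


def broadside_gain_db(n_mics):
    """Directional gain of a broadside (e.g. delay-sum) array: 10*log10(N)."""
    return 10.0 * math.log10(n_mics)


def endfire_differential_gain_db(n_mics):
    """Directional gain of an endfire differential array: 20*log10(N)."""
    return 20.0 * math.log10(n_mics)


for n in (2, 3, 4):
    print(n, round(broadside_gain_db(n), 1), round(endfire_differential_gain_db(n), 1))
```

For the same number of microphones the differential array gets twice the gain in dB from a much smaller aperture, and pays for it with negative white noise gain, i.e. amplified sensor noise.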
Background to Linear Beamforming
What do we gain?
Less reverberation (increased intelligibility)
Less (environmental) noise
No (or low) distortion on axis
Possible interference rejection by spatial zero(s)
Some Issues:
Performance is given by critical distance!
Increase in sensor noise (WNG, differential beamforming)
Beamforming: Critical Distance
Critical distance (reverberation radius): distance at which the reverberant-to-direct path energy ratio is 0 dB:
$r_c = 0.1 \left( \frac{\Gamma V}{\pi T_{60}} \right)^{1/2}$
$\Gamma$: directivity factor $= 10^{DI/10}$
DI = Directivity Index: gain of direct to reverberant energy over an omni-directional microphone
Order: order of finite differences used (1st: 2 mics, 2nd: 3 mics, etc.)
Order   DI [dB]   r_c
0       0         r_0
1       6         2.0 r_0
2       9.5       3.0 r_0
3       12        4.0 r_0
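The table entries can be reproduced from the critical distance relation. The sketch assumes the usual form r_c = 0.1 * sqrt(G * V / (pi * T60)), with the room volume V in cubic metres, the reverberation time T60 in seconds, and the directivity factor G = 10^(DI/10). The room used in the example (100 m^3, 0.5 s) is hypothetical.

```python
import math


def critical_distance_m(volume_m3, t60_s, di_db=0.0):
    """Critical distance in metres for a microphone with directivity index di_db."""
    directivity_factor = 10.0 ** (di_db / 10.0)
    return 0.1 * math.sqrt(directivity_factor * volume_m3 / (math.pi * t60_s))


r0 = critical_distance_m(100.0, 0.5)  # omni-directional reference
for order, di in ((0, 0.0), (1, 6.0), (2, 9.5), (3, 12.0)):
    ratio = critical_distance_m(100.0, 0.5, di) / r0
    print(order, di, round(ratio, 1))
```

The ratios come out at about 1, 2, 3 and 4 times the omni-directional distance, matching the table: each additional order lets the talker stand proportionally further away for the same direct-to-reverberant ratio.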
First-Order Differential Beamforming
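$E(\omega,\theta) \approx j\omega P_0 \left[ T_1 + \frac{d}{c}\cos(\theta) \right], \quad H_L(\omega) \approx \frac{1}{j\omega}$
$Y(\theta) = E(\omega,\theta)\, H_L(\omega) \approx \left[ T_1 + \frac{d}{c}\cos(\theta) \right] P_0 \propto P_0 \left[ \alpha_1 + (1-\alpha_1)\cos(\theta) \right], \quad \alpha_1 = \frac{T_1}{T_1 + d/c}$
Beamformer response: $\alpha_1 + (1-\alpha_1)\cos(\theta)$
[Figure: first-order differential beamformer with microphones m1 and m2 spaced d apart, delay T1, subtraction, and low-pass filter H_L(ω); incident wave P_0, intermediate signal E(ω,θ), output Y(θ)]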
Classical First-Order Beamformer Responses
$\alpha_1 = 0.5$: Cardioid
$\alpha_1 = 0.25$: Hypercardioid
$\alpha_1 = 0.0$: Dipole
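The classical patterns can be explored with a short sketch of the first-order response a1 + (1 - a1) * cos(theta), where a1 = T1 / (T1 + d/c). The function names, the 2 cm spacing, and the speed of sound of 343 m/s are assumptions made for this example.

```python
import math


def first_order_response(theta_rad, alpha1):
    """Normalized first-order differential response at angle theta."""
    return alpha1 + (1.0 - alpha1) * math.cos(theta_rad)


def alpha1_from_delay(t1_s, d_m, c_m_per_s=343.0):
    """Pattern parameter from the internal delay T1 and the mic spacing d."""
    return t1_s / (t1_s + d_m / c_m_per_s)


patterns = {"cardioid": 0.5, "hypercardioid": 0.25, "dipole": 0.0}
for name, a1 in patterns.items():
    rear = first_order_response(math.pi, a1)       # response at 180 degrees
    side = first_order_response(math.pi / 2, a1)   # response at 90 degrees
    print(name, round(rear, 3), round(side, 3))

# Setting the delay equal to the acoustic travel time d/c gives a cardioid.
print(alpha1_from_delay(0.02 / 343.0, 0.02))  # 0.5
```

The null position follows from the same expression: 180 degrees for the cardioid, about 109.5 degrees for the hypercardioid, and 90 degrees for the dipole. All three have unit response on axis.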
Beamforming Demo: DEWIND processing