or morgan, me & pitch - columbia universitydpwe/talks/2015-03-morgan-pitch.pdfrobustness,...
TRANSCRIPT
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /161
1. Robustness and Separation2. An Academic Journey3. Future
Dan Ellis Columbia / ICSI
[email protected] http://labrosa.ee.columbia.edu/
COLUMBIA UNIVERSITYIN THE CITY OF NEW YORK
or
Morgan, Me & Pitch
Robustness, Separation & Pitch
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
1953: How To Separate Speech?• The “Cocktail Party Problem” [Cherry ’53]
• Spatial information: ATC over a single speaker• Pitch differences via gender differences
• Auditory Scene Analysis [Bregman ’90]
2
• Grouping cues• Onset• Harmonicity• Common Fate• “Schema”
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
The Usefulness of Pitch• Common pitch can link energy
from a single source
3
!"#$%&$'()#*"&+$,-"#$'(!$./-#01232'(4-**2&2#52.
1232'(6-**2&2#52(5$#(72'8(-#(.82257(.20&20$,-"#
1-.,2#(*"&(,72(9%-2,2&(,$'/2&:
Brungart et al.’01
Normal mix
“Pitchless”
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
1984: Perception-Inspired Separation• Model the periodicity information
in the auditory nerve
4
Weintraub 1985
Lyon 1984
Mix
Female
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
1996: An Academic Journey• Dan in 1996: The “weft”
5
5: Results 115
the energy visible in the fu ll signa l qu ite closely; note, however , the holes inthe weft envelopes, par t icu la r ly a round 300 Hz in the second weft ; a t thesepoin ts, the tota l signa l energy is fu lly expla ined by the background noiseelement , and, in the absence of st rong evidence for per iodic energy from theindividua l cor relogram channels, the weft in tensity for these channels hasbeen backed off to zero.
200400100020004000f/Hz "Bad dog"
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6501002004001000
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6time/s
−70
−60
−50
−40
dB
200400100020004000f/Hz Wefts1,2
501002004001000
200400100020004000f/Hz Noise1
Click2 Click3
200400100020004000f/Hz Click1
Figu re 5.7: The “bad dog” sound example, represen ted in the top panes by it st ime-frequency in tensity envelope and it s per iodogram (summaryautocor rela t ions for every t ime step). The noise and click elements expla in ing theexample a re displayed as their in tensity envelopes; weft elements addit iona llydisplay their per iod-t rack on axes match ing the per iodogram.
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
1999: “Size Matters”• All you need is a BDNN
• .. and the data (and patience) to train it
6
500
1000
2000
4000
9.25
18.5
37
74
32
34
36
38
40
42
44
Hidden layer / unitsTraining set / hours
WER
%
WER for PLP12N nets vs. net size & training data
1 2 5 10 20 50
178 GCUP
356 GCUP
712 GCUP
1.42 TCUP
2.8 TCUP
5.7 TCUP
11.4 TCUP
100 200 50032
34
36
38
40
42
44WER vs. frames/weight
WER
%
frames/weight
Ellis & Morgan’99
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2001: Overlap Remains
• Meeting Recorder Project• natural speech interactions• ~10% of speech frames have overlaps
7
Janin, Baron, Edwards, Ellis, Gelbart, Morgan, Peskin, Pfau, Shriberg, Stolcke, Wooters ’03
120 125 130 135 140 145 150 155 time / secs
speakeractive
level/dB
mr-2000-06-30-1600
Spkr A
Spkr B
Spkr C
Spkr D
Spkr E
Tabletop20
40
breathnoise
crosstalk
backchannel floor seizure
interruptions
speaker Bcedes floor
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2003: EARS• “Pushing the envelope (aside)”
8
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2004: Pitch Based Separation• Literal implementations of the process
described in Bregman 1990:• compute “regularity” cues:
- common onset- gradual change- harmonic patterns- common fate
9
Hu & Wang 2004
Original v3n7
Brown 1992
Ellis 1996
Hu & Wang 2004
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2004: Model-Based Separation
• Data-driven separation• Learn codebooks for individual speakers• Find best combination of sources
• Pitch gives the “grist”
10
Roweis ’01Kristjansson, Attias, Hershey ’04
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2006: Pitch for VAD• Pitch is the most robust perceptual cue to
speech
11
Lee & Ellis’06
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2004-2011: The Epic
12
Processing and Perception of Speech and Music
Speech and Audio Signal Processing
Ben Gold Nelson Morgan Dan Ellis
S E C O N D E D I T I O N
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
2012: Project Babel• Noisy speech is a challenge:
• How to disentangle speech and interference?• Energy peaks are speech (spectral subtraction)• Energy troughs are noise (Wiener, log-mmse)• Speech has a known form (Factorial HMM)• Voiced speech is periodic (Pitch-based)
13
freq /
kHz IARPA_BABEL_OP1_204_73990_20130521_162632_inLine
0
1
2
3
4
50 10 15 20 25 time / sec level / dB
-40
-20
0
20
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
Classification-based Pitch Tracker• Subband Autocorrelation Classification
(SAcC) Pitch Tracker: • Trained on noisy speech with true pitch targets
• Subband autocorrelation features• PCA to reduce dimensions
14
Lee & Ellis ’12
Cochleafilterbank Pitch
Sound
short-timeautocorrelation
delay line
frequ
ency
chan
nels
Correlogramslice
freqlag
time
BN
B3B2B1
c1
ck
uv60.061.863.665.4
392.1403.6
Neural network classifier
Viterbismoother
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
Flat-Pitch Processing• Time-varying filtering is tricky
• if pitch variation and filter impulse response are on a similar time-scale
15
• Solution: Flatten the pitch• use local pitch
estimateto resample
• process constant-pitch
• resampling is (near) invertible
noisyspeech
enhancedspeech
SAcCpitch
tracker
time-varyingresampling
pitch-flattentime map
time-varyingresampling
flat-pitch-domainenhancement
inversetime map
freq
/ Hz
Noisy signal
0
500
1000
Resampled to pitch = 200 Hz
Filtered − comb
time / s
Resampled back to original pitch and mixed with original
0 0.5 1 1.5
pitchpr(vx)
Robustness, Separation, Pitch - Dan Ellis 2015-03-14 - /16
Conclusions• Pitch is a key feature for speech separation
• “marking” signal against other speech or noise
• Some ideas don’t go away• .. though they can change shape
• Impact from collaboration• you can’t do good work with someone
without the human connection
16