http://fvalente.zxq.net/presentations/DataDrivenFeatures.pdf
![Page 1: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/1.jpg)
Data-Driven Discriminative Speech Analysis Module in DARPA GALE
Fabio Valente and Hynek Hermansky
IDIAP Research Institute, Martigny, Switzerland
![Page 2: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/2.jpg)
Motivation
ASR requires knowledge and knowledge comes from data
– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)
– task-specific knowledge (e.g. language and its phonotactics, environment, …)
data-driven features derived from English → classifier → train on small amounts of task-specific data
(Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006, and also this meeting)
PROBLEM
For some tasks, amounts of data may be limited
ONE SOLUTION
Acquire speech-specific knowledge from large amounts of American English data
![Page 3: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/3.jpg)
TANDEM and its training
TANDEM (Hermansky, Ellis and Sharma, ICASSP 2000)
[Diagram: evidence → TANDEM → transformed phoneme posteriors; training data feeds TANDEM]
training data for TANDEM: can be from another application domain
[Plot: WER on OGI digit data (Sivadas and Hermansky, ICASSP 2002). Axes: word error rate vs. amount of task-specific training data for training of the HMM models. Curves: PLP; TANDEM trained on OGI stories. Surviving tick labels: 100%, 20%, 70%, 0%]
![Page 4: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/4.jpg)
• preprocessing of input data for TANDEM is beneficial
– e.g. TRAP technique (nonlinear and data-driven)
evidence: anything that carries the relevant information
[Diagram: data (time-frequency plane) → linear processing → evidence → TANDEM (trained NN) → posteriogram (/f/ /ay/ /v/) → some function of phoneme posteriors → features for HMM]
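The last stage of the diagram, "some function of phoneme posteriors", can be sketched as follows. The slide does not say which function is used; this sketch assumes a logarithm followed by PCA decorrelation, a choice commonly described for TANDEM systems. The function name `tandem_features` and the synthetic posteriogram are hypothetical.

```python
import numpy as np

def tandem_features(posteriors, n_components=None, eps=1e-10):
    """Map a posteriogram (frames x phonemes) to decorrelated features.

    Assumption: log followed by PCA; the slide only says
    "some function of phoneme posteriors".
    """
    log_post = np.log(posteriors + eps)            # compress the skewed posteriors
    centered = log_post - log_post.mean(axis=0)    # remove the per-dimension mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # strongest directions first
    basis = eigvecs[:, order[:n_components]]
    return centered @ basis

# Synthetic stand-in for the output of the trained NN (softmax over 40 classes).
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 40))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
feats = tandem_features(post, n_components=25)
```

The resulting columns are uncorrelated, which suits HMM systems that model features with diagonal-covariance Gaussians.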
![Page 5: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/5.jpg)
The Current Research
• Where is the information?
• study linear preprocessing using LDA
– data-driven technique
– straightforward interpretation in terms of basis functions
[Diagram: time-frequency plane analyzed by spectral projections, FIR RASTA filters, and 2-D projections]
Applied earlier to the American English portion of OGI Stories (about 3 hours of telephone-quality monologues from 210 adult talkers).
Hypothesis: if the analysis extracts speech-specific information, the general conclusions should hold for a different database (30 hours of RM and Switchboard from SRI).
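How LDA yields interpretable basis functions can be sketched in a few lines. This is a generic textbook LDA on phoneme-labeled vectors, not the authors' code; `lda_basis` and the toy data are hypothetical. Applied to spectral vectors it would give spectral projections, and applied to temporal trajectories it would give FIR filter impulse responses.

```python
import numpy as np

def lda_basis(X, labels, n_components):
    """Return LDA discriminant vectors (as columns) and their eigenvalues."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Whiten the within-class scatter, then diagonalize the between-class scatter.
    w_vals, w_vecs = np.linalg.eigh(Sw)
    w_vals = np.maximum(w_vals, 1e-10 * w_vals.max())   # guard against a singular Sw
    whiten = w_vecs / np.sqrt(w_vals)                   # whiten.T @ Sw @ whiten = I
    b_vals, b_vecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(b_vals)[::-1][:n_components]
    return whiten @ b_vecs[:, order], b_vals[order]

# Toy data: three classes that differ only along dimension 0.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 200)
X = rng.normal(size=(600, 10))
X[:, 0] += 4.0 * labels
W, eigvals = lda_basis(X, labels, n_components=2)
```

Each column of `W` is one basis function; the eigenvalues rank the columns by how much class separation they carry.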
![Page 6: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/6.jpg)
Spectral Projections
![Page 7: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/7.jpg)
Spectral sensitivity of projections
• Perturbation analysis
– project a Gaussian shape (σ = 250 Hz) on the first 16 spectral basis vectors and evaluate the effect of a 30 Hz shift in µ as a function of µ
Consistent with the auditory (Bark) spectral scale
[Plots: log spectral Euclidean distance due to the shift in µ, plotted against µ. Panels: shift in µ constant on the Hz scale; shift in µ constant on the Bark scale]
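The perturbation analysis can be sketched as below. The values σ = 250 Hz and a 30 Hz shift come from the slide; everything else is an assumption: the Gaussian is used directly as the spectral shape, and a plain cosine basis stands in for the LDA-derived spectral basis, which is not available here. The function name `perturbation_sensitivity` is hypothetical.

```python
import numpy as np

def perturbation_sensitivity(basis, freqs, centers, sigma=250.0, shift=30.0):
    """Euclidean distance in the projected space caused by shifting a Gaussian peak.

    basis   : (n_freq, n_basis) spectral basis vectors as columns
    freqs   : (n_freq,) frequency of each spectral sample in Hz
    centers : peak positions mu in Hz
    """
    def gaussian(mu):
        return np.exp(-0.5 * ((freqs - mu) / sigma) ** 2)

    distances = []
    for mu in centers:
        delta = (gaussian(mu + shift) - gaussian(mu)) @ basis
        distances.append(np.linalg.norm(delta))
    return np.array(distances)

# Stand-in basis (assumption): 16 cosines on a linear 0-4000 Hz axis with 129 samples.
freqs = np.linspace(0.0, 4000.0, 129)
basis = np.cos(np.pi * np.outer(np.arange(129) + 0.5, np.arange(16)) / 129)
centers = np.arange(500, 3501, 250)
sens = perturbation_sensitivity(basis, freqs, centers)
```

With a data-derived basis in place of the cosines, plotting `sens` against `centers` would show where on the frequency axis the projections are most sensitive to a fixed shift.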
![Page 8: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/8.jpg)
Relative importance of spectral regions
• Hilbert envelope of the first 15 spectral basis vectors, averaged
[Plot: averaged envelope vs. frequency [Hz]]
• Importance of each frequency region for articulation and intelligibility [Fletcher 1953]
[Plot: importance vs. frequency [Hz], logarithmic axis with ticks at 100, 1000, 10000]
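The averaged Hilbert envelope can be computed as sketched here. The envelope is the magnitude of the analytic signal, built with the standard FFT construction. The basis in the example is synthetic (15 cosines over 128 samples), chosen only because its envelope is known to be flat; it is not the basis from the slide.

```python
import numpy as np

def hilbert_envelope(columns):
    """Magnitude of the analytic signal of each column (FFT method)."""
    n = columns.shape[0]
    gain = np.zeros(n)
    gain[0] = 1.0                       # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0              # keep Nyquist
        gain[1:n // 2] = 2.0            # double positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    spectrum = np.fft.fft(columns, axis=0)
    return np.abs(np.fft.ifft(spectrum * gain[:, None], axis=0))

# Synthetic stand-in basis: 15 cosines with an integer number of cycles.
n = 128
k = np.arange(1, 16)
basis = np.cos(2 * np.pi * np.outer(np.arange(n), k) / n)
mean_env = hilbert_envelope(basis).mean(axis=1)   # average over the 15 vectors
```

For real basis vectors, peaks in `mean_env` mark the frequency regions where the basis concentrates its energy.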
![Page 9: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/9.jpg)
Central 600 ms of temporal discriminants (impulse responses of FIR RASTA filters)
[Plot: amplitude of impulse response vs. time [ms]]
![Page 10: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/10.jpg)
2-D discriminants
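[Figure: grid of 2-D discriminants, each panel labeled with a percentage; frequency [Hz] axis from 0 to 4000, time [ms] axis from -200 to 200]
Panel labels, in slide order:
4.2% 3.7% 3.3% 3.2% 3.0%
2.8% 2.8% 2.7% 2.1% 2.0%
1.7% 1.6% 1.5% 1.4% 1.4%
1.2% 1.1% 1.1% 1.1% 1.1%
1.1% 1.1% 1.1% 1.1% 1.1%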
![Page 11: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/11.jpg)
Multi-RASTA
[Figure: filter impulse responses, time [ms] from -500 to 500. Time-frequency diagram labels: time; frequency; average frequency derivative; 3 critical bands; example (out of 32 possible)]
| | matched training and test | mismatched channel |
|---|---|---|
| conventional (PLP) | 5.2 % | 13.5 % |
| Multi-RASTA | 3.7 % | 3.8 % |
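A bank of zero-mean temporal filters with variable resolution, in the spirit of multi-RASTA, might look like the sketch below. The slides give only the ±500 ms span and the 10 ms frame step; the Gaussian-derivative shapes and the eight widths between 8 and 130 ms are assumptions, and `multi_rasta_bank` and `apply_bank` are hypothetical names.

```python
import numpy as np

def multi_rasta_bank(sigmas_ms, half_len_ms=500, step_ms=10):
    """First and second Gaussian derivatives at several temporal widths."""
    t = np.arange(-half_len_ms, half_len_ms + step_ms, step_ms, dtype=float)
    filters = []
    for s in sigmas_ms:
        g = np.exp(-0.5 * (t / s) ** 2)
        g1 = -t / s**2 * g                        # first derivative of a Gaussian
        g2 = (t**2 / s**4 - 1.0 / s**2) * g       # second derivative
        for f in (g1, g2):
            f = f - f.mean()                      # enforce zero mean on the sampled window
            filters.append(f / np.max(np.abs(f)))
    return t, np.array(filters)

def apply_bank(trajectory, bank):
    """Project one critical-band energy trajectory on every filter."""
    return np.array([np.correlate(trajectory, f, mode="same") for f in bank])

sigmas = np.geomspace(8.0, 130.0, 8)              # assumed widths in ms
t, bank = multi_rasta_bank(sigmas)
out = apply_bank(np.ones(300), bank)              # constant input
```

Because every filter is zero-mean, a constant offset in a band trajectory (for example, a fixed channel gain in the log spectrum) produces no output away from the edges, which is one plausible reading of the small gap between the matched and mismatched columns in the table.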
![Page 12: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/12.jpg)
Conclusions
• Data-driven (ANN-based TANDEM) feature extraction module as a means of implementing speech-specific (task-independent) knowledge
– aims to reduce the need for large amounts of task-specific acoustic training data
• LDA-guided pre-processing: results qualitatively consistent across different databases (OGI Stories and forced-aligned SRI Broadcast News and Switchboard data)
– optimality of a Bark-like frequency scale
– need for larger (about 500 ms) temporal context in feature extraction
– dominant time-frequency discriminants as outer products of spectral discriminants and temporal discriminants
• Linear pre-processing of data for TANDEM
– multi-RASTA (projections on a zero-mean, variable-temporal-resolution basis)
– demonstrated improvements on a small-vocabulary task
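The statement that a time-frequency discriminant is an outer product of a temporal and a spectral discriminant can be checked numerically. The sketch below is one way to do it, not the authors' procedure: take the best rank-one approximation from an SVD and measure how much of the energy it captures. The function name and the synthetic discriminant are hypothetical.

```python
import numpy as np

def rank_one_factors(D):
    """Best rank-one factorization of a (time x frequency) matrix."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    temporal = U[:, 0] * np.sqrt(s[0])
    spectral = Vt[0] * np.sqrt(s[0])
    energy_fraction = s[0] ** 2 / np.sum(s ** 2)   # 1.0 means exactly separable
    return temporal, spectral, energy_fraction

# Synthetic, exactly separable discriminant for illustration.
t = np.linspace(-1.0, 1.0, 41)
temporal_true = -t * np.exp(-0.5 * (t / 0.3) ** 2)
spectral_true = np.hanning(15)
D = np.outer(temporal_true, spectral_true)
temporal, spectral, energy_fraction = rank_one_factors(D)
```

An `energy_fraction` close to 1 for a measured 2-D discriminant would support treating it as a product of separate spectral and temporal projections.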
![Page 13: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/13.jpg)
2-D Linear Discriminant Analysis (Ye, Janardan and Li, NIPS 2005)
Initialize temporal basis R
Project spectro-temporal matrix on R
Estimate spectral basis L by LDA
Project spectro-temporal matrix on L
Estimate temporal basis R by LDA
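The alternating steps above can be sketched as follows. This is a minimal reading of the flowchart, not the reference implementation: R is initialized to the identity (an assumption), a small ridge term keeps the within-class scatter invertible, and matrices are stored as frequency × time so that L acts on the frequency axis and R on the time axis. All function names are hypothetical.

```python
import numpy as np

def top_discriminants(Sw, Sb, k, ridge=1e-6):
    """Leading k solutions of Sb w = lambda Sw w, via whitening of Sw."""
    d = Sw.shape[0]
    Sw = Sw + ridge * np.trace(Sw) / d * np.eye(d)   # keep Sw invertible
    w_vals, w_vecs = np.linalg.eigh(Sw)
    whiten = w_vecs / np.sqrt(w_vals)                # whiten.T @ Sw @ whiten = I
    b_vals, b_vecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(b_vals)[::-1][:k]
    return whiten @ b_vecs[:, order]

def two_d_lda(X, labels, n_spectral, n_temporal, n_iter=3):
    """X has shape (n_samples, n_freq, n_time); returns (L, R)."""
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    n_freq, n_time = X.shape[1:]
    R = np.eye(n_time)   # assumed initialization: no temporal reduction yet
    for _ in range(n_iter):
        # Project on R, then estimate the spectral basis L by LDA.
        Sw = np.zeros((n_freq, n_freq))
        Sb = np.zeros((n_freq, n_freq))
        for c in classes:
            Xc = X[labels == c]
            mean_c = Xc.mean(axis=0)
            Z = (Xc - mean_c) @ R                    # (n_c, n_freq, k_t)
            Sw += np.einsum("nik,njk->ij", Z, Z)
            B = (mean_c - grand_mean) @ R
            Sb += len(Xc) * (B @ B.T)
        L = top_discriminants(Sw, Sb, n_spectral)
        # Project on L, then estimate the temporal basis R by LDA.
        Sw = np.zeros((n_time, n_time))
        Sb = np.zeros((n_time, n_time))
        for c in classes:
            Xc = X[labels == c]
            mean_c = Xc.mean(axis=0)
            Z = L.T @ (Xc - mean_c)                  # (n_c, k_f, n_time)
            Sw += np.einsum("nki,nkj->ij", Z, Z)
            B = L.T @ (mean_c - grand_mean)
            Sb += len(Xc) * (B.T @ B)
        R = top_discriminants(Sw, Sb, n_temporal)
    return L, R

# Toy data: class information sits in one time-frequency cell (freq 2, time 5).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 100)
X = rng.normal(size=(300, 6, 8))
X[:, 2, 5] += 3.0 * labels
L, R = two_d_lda(X, labels, n_spectral=2, n_temporal=2)
```

On the toy data the first column of `L` peaks at frequency index 2 and the first column of `R` at time index 5, so their outer product points at the informative cell. The scatter matrices here have the size of one axis only, which is what makes the approach workable for 201 × 129 matrices.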
![Page 14: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/14.jpg)
Eigenvalues
[Figure: eigenvalue plots in three panels: spectral discriminants; temporal discriminants; 2-D discriminants]
![Page 15: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/15.jpg)
Temporal discriminants across critical bands
[Figure: three panels: first discriminant; second discriminant; third discriminant]
![Page 16: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/16.jpg)
• preprocessing of input data for TANDEM is beneficial
– TRAP technique (nonlinear and data-driven)
– multi-RASTA filters (linear and “knowledge” guided)
evidence: anything that carries the relevant information
[Diagram: data (time-frequency plane) → linear processing → evidence → TANDEM (trained NN) → posteriogram (/f/ /ay/ /v/) → some function of phoneme posteriors → features for HMM]
![Page 17: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/17.jpg)
Temporal discriminants (frequency responses of FIR RASTA filters)
[Plot: log magnitude spectrum [dB] vs. modulation frequency [Hz]]
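A frequency response of this kind can be obtained from an impulse response with a zero-padded FFT, as sketched here. The 100 Hz sampling rate follows from the 10 ms frame step used in these slides; the example impulse response (a Gaussian derivative with σ = 40 ms) is a hypothetical stand-in for a temporal discriminant, and `modulation_response_db` is a hypothetical name.

```python
import numpy as np

def modulation_response_db(impulse_response, frame_rate_hz=100.0, n_fft=1024):
    """Log magnitude response of an FIR filter over modulation frequency."""
    spectrum = np.fft.rfft(impulse_response, n=n_fft)          # zero-padded FFT
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate_hz)      # 0 .. 50 Hz
    mag_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)         # avoid log(0)
    return freqs, mag_db

# Hypothetical impulse response: Gaussian derivative, sigma = 40 ms, +/- 500 ms span.
t_ms = np.arange(-500, 501, 10, dtype=float)
sigma_ms = 40.0
h = -t_ms / sigma_ms**2 * np.exp(-0.5 * (t_ms / sigma_ms) ** 2)
freqs, mag_db = modulation_response_db(h)
peak_hz = freqs[np.argmax(mag_db)]
```

For this stand-in filter the response is band-pass: it rejects DC and peaks near 1/(2πσ), about 4 Hz.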
![Page 18: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/18.jpg)
Experimental Setup
• 30 hours of phoneme-labeled (forced alignment) data from SRI (Switchboard, broadcast news, …)
• spectral vectors: 129 samples of LPC log power spectrum (12th order, 30 ms window, 10 ms step)
• temporal vectors: 2010 ms long (201 samples), labeled by the phoneme in the center, 10 ms step
• spectro-temporal matrix: 201 time samples x 129 spectral samples, 10 ms step
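The data layout described above can be sketched as below, assuming the log spectrogram is already computed as a frames × 129 array at a 10 ms step (the LPC analysis itself is not shown). The function name and the random stand-in data are hypothetical.

```python
import numpy as np

def spectro_temporal_matrices(log_spectrogram, phoneme_labels, context=100):
    """Cut a (n_frames, n_bins) array into (2*context+1, n_bins) windows.

    Each window is labeled by the phoneme of its center frame.
    """
    n_frames = log_spectrogram.shape[0]
    width = 2 * context + 1                                  # 201 frames
    centers = np.arange(context, n_frames - context)
    windows = np.lib.stride_tricks.sliding_window_view(
        log_spectrogram, width, axis=0)                      # (n_windows, n_bins, width)
    windows = np.swapaxes(windows, 1, 2)                     # (n_windows, width, n_bins)
    return windows, phoneme_labels[centers]

# Random stand-in for 5 seconds of labeled data (500 frames, 129 spectral samples).
rng = np.random.default_rng(0)
log_spec = rng.normal(size=(500, 129))
phonemes = rng.integers(0, 40, size=500)
windows, window_labels = spectro_temporal_matrices(log_spec, phonemes)
```

A single row of `log_spec` is a spectral vector, `windows[i][:, b]` is the temporal vector for spectral sample `b`, and `windows[i]` is the full spectro-temporal matrix. The sliding view avoids copying the overlapping windows.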
![Page 19: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/19.jpg)
Motivation
ASR requires knowledge and knowledge comes from data
– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)
– task-specific knowledge (e.g. language and its phonotactics, environment, …)
CONVENTIONAL WAY
features → classifier trained on English → adapt on small amounts of task-specific data
ALTERNATIVE
data-driven features derived from English → classifier → train on small amounts of task-specific data
(Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006, and also this meeting)
PROBLEM
For some tasks, amounts of data may be limited
ONE SOLUTION
Acquire speech-specific knowledge from large amounts of American English data
![Page 20: DataDrivenFeatures](https://reader034.vdocument.in/reader034/viewer/2022042901/568c34fe1a28ab023592842c/html5/thumbnails/20.jpg)
Frequencies around 600 Hz are the most important for decoding of nonsense syllables