learning long-term temporal features
![Page 1: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/1.jpg)
May 4, 2004 Speech Lunch Talk
Learning Long-Term Temporal Features
A Comparative Study
Barry Chen
![Page 2: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/2.jpg)
Log-Critical Band Energies
![Page 3: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/3.jpg)
Log-Critical Band Energies
Conventional Feature Extraction
![Page 4: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/4.jpg)
Log-Critical Band Energies
TRAPS/HATS Feature Extraction
![Page 5: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/5.jpg)
What is a TRAP? (Background Tangent)
• TRAPs were originally developed by our colleagues at OGI: Sharma, Jain (now at SRI), Hermansky and Sivadas (both now at IDIAP)
• Stands for TempRAl Pattern
• TRAP = the speech energy pattern within a narrow frequency band over a period of time (usually 0.5 – 1 second long)
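As a rough illustration of this definition, the sketch below cuts one TRAP out of a matrix of log critical band energies. The 10 ms frame step, the edge padding, and the names `extract_trap` and `log_energies` are assumptions made for illustration; the 51-frame window (about 0.5 s under that frame step) is borrowed from the "15 Bands x 51 Frames" system named later in the talk.

```python
import numpy as np

def extract_trap(log_energies, band, center, n_frames=51):
    """Return the temporal pattern of one critical band around a center frame.

    log_energies: array of shape (num_frames, num_bands), a hypothetical input.
    band: index of the critical band to follow over time.
    center: index of the frame the pattern is centered on.
    n_frames: odd window length; 51 frames at an assumed 10 ms step is ~0.5 s.
    """
    half = n_frames // 2
    # Repeat the first and last frames so windows near the edges stay full length.
    padded = np.pad(log_energies[:, band], (half, half), mode="edge")
    # After padding, original frame `center` sits at index center + half.
    return padded[center:center + n_frames]

# Toy input: 200 frames of random "log energies" in 15 critical bands.
rng = np.random.default_rng(0)
log_energies = rng.normal(size=(200, 15))
trap = extract_trap(log_energies, band=4, center=100)
print(trap.shape)
```

Each call yields one fixed-length vector per band and frame, which is the kind of input a per-band classifier or transform would see.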
![Page 6: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/6.jpg)
Example of TRAPS
Mean Temporal Patterns for 45 phonemes at 500 Hz
![Page 7: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/7.jpg)
TRAPS Motivation
• Psychoacoustic studies suggest that the human peripheral auditory system integrates information on a longer time scale
• Information measurements (joint mutual information) show that information still exists >100 ms away within a single critical band
• Potential robustness to speech degradations
![Page 8: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/8.jpg)
Let’s Explore
• TRAPS and HATS are examples of a specific two-stage approach to learning long-term temporal features
• Is this constrained two-stage approach better than an unconstrained one-stage approach?
• Are the non-linear transformations of critical band trajectories, provided in different ways by TRAPS and HATS, actually necessary?
![Page 9: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/9.jpg)
Learn Everything in One Step
![Page 10: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/10.jpg)
Learn in Individual Bands
![Page 11: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/11.jpg)
Learn in Individual Bands
![Page 12: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/12.jpg)
Learn in Individual Bands
![Page 13: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/13.jpg)
Learn in Individual Bands
![Page 14: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/14.jpg)
Learn in Individual Bands
![Page 15: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/15.jpg)
Learn in Individual Bands
![Page 16: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/16.jpg)
Learn in Individual Bands
![Page 17: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/17.jpg)
Learn in Individual Bands
![Page 18: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/18.jpg)
Learn in Individual Bands
![Page 19: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/19.jpg)
One-Stage Approach
![Page 20: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/20.jpg)
2-Stage Linear Approaches
![Page 21: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/21.jpg)
PCA/LDA Comments
• PCA on log critical band energy trajectories scales and rotates dimensions in the directions of highest variance
• LDA projects in directions that maximize class separability, measured as between-class covariance over within-class covariance
• Keep top 40 dimensions for comparison with MLP-based approaches
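The two linear transforms described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the setup used in the talk: the toy data, the 45 phone labels, the whitening step in PCA, the small ridge added to the within-class scatter, and the function names `pca_transform` and `lda_transform` are all hypothetical.

```python
import numpy as np

def pca_transform(X, n_keep=40):
    """Rotate rows of X onto the top principal directions and scale to unit variance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_keep]    # highest variance first
    # Dividing by sqrt(eigenvalue) is the "scaling" step (whitening, an assumption).
    W = eigvecs[:, order] / np.sqrt(eigvals[order])
    return (X - mean) @ W

def lda_transform(X, y, n_keep=40):
    """Project rows of X onto directions maximizing between- over within-class scatter."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    Sw += 1e-6 * np.eye(d)  # small ridge for invertibility (an assumption)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:n_keep]
    return (X - mean) @ eigvecs[:, order].real

# Toy data: 2000 trajectories of 51 frames each, with 45 hypothetical phone labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 51))
y = rng.integers(0, 45, size=2000)
X_pca = pca_transform(X)
X_lda = lda_transform(X, y)
print(X_pca.shape, X_lda.shape)
```

Note that LDA yields at most (number of classes - 1) useful directions, so keeping 40 dimensions presupposes at least 41 classes.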
![Page 22: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/22.jpg)
2-Stage MLP-Based Approaches
![Page 23: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/23.jpg)
MLP Comments
• As with the other 2-stage approaches, we first learn patterns independently in separate critical band trajectories, and then learn correlations among these discriminative trajectories
• Interpretation of various MLP layers:
1. Input to hidden weights – discriminant linear transformations
2. Hidden unit outputs – non-linear discriminant transforms
3. Before softmax – transforms hidden activation space to unnormalized phone probability space
4. Output activations – critical band phone probabilities
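The four layer interpretations listed above correspond to four points where a single-band MLP can be tapped. The sketch below shows those points for a three-layer network with a sigmoid hidden layer and a softmax output. The layer sizes, the random untrained weights, and the name `band_mlp_taps` are assumptions for illustration. Judging by their names, the systems "HATS Before Sigmoid", "HATS", "TRAPS Before Softmax", and "TRAPS" appear to take their features from taps 1 to 4 respectively.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def band_mlp_taps(trap, W1, b1, W2, b2):
    """Forward pass of one critical-band MLP, returning all four tap points."""
    pre_sigmoid = trap @ W1 + b1        # 1. linear discriminant transform
    hidden = sigmoid(pre_sigmoid)       # 2. non-linear discriminant transform
    pre_softmax = hidden @ W2 + b2      # 3. unnormalized phone scores
    posteriors = softmax(pre_softmax)   # 4. critical band phone probabilities
    return pre_sigmoid, hidden, pre_softmax, posteriors

# Hypothetical sizes: 51-frame input, 40 hidden units, 45 phone classes.
rng = np.random.default_rng(0)
n_in, n_hidden, n_phones = 51, 40, 45
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_phones))
b2 = np.zeros(n_phones)

traps = rng.normal(size=(8, n_in))  # a batch of 8 toy trajectories
pre_sigmoid, hidden, pre_softmax, posteriors = band_mlp_taps(traps, W1, b1, W2, b2)
print(hidden.shape, posteriors.shape)
```

In a two-stage system, the chosen tap from every band would be concatenated and passed to a second-stage merger.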
![Page 24: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/24.jpg)
Experimental Setup
• Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular
– 1/10 used as the cross-validation set for MLPs
• Testing: 2001 Hub-5 Evaluation Set (Eval2001)
– 2,255,609 frames and 62,890 words
• Back-end recognizer: SRI’s Decipher System. 1st pass decoding using a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke for all his help)
![Page 25: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/25.jpg)
Frame Accuracy Performance
[Bar chart. Y-axis: Frame Accuracy, 62.0% to 68.0%. Bars: 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, PLP 9 Frames]
![Page 26: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/26.jpg)
Standalone Feature System
• Transform MLP outputs by:
1. Log transform to make features more Gaussian
2. PCA for decorrelation
• Same as Tandem setup introduced by Hermansky, Ellis, and Sharma
• Use transformed MLP outputs as front-end features for the SRI recognizer
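The two transform steps above can be sketched as follows. This is an illustrative outline only: the toy posteriors, the `eps` floor inside the logarithm, and the name `tandem_features` are assumptions, and in practice the PCA basis would be estimated on training data and then reused at test time.

```python
import numpy as np

def tandem_features(posteriors, n_keep=None, eps=1e-10):
    """Log-transform MLP outputs, then decorrelate them with PCA.

    posteriors: array of shape (num_frames, num_classes), rows summing to 1.
    n_keep: optional number of leading principal components to retain.
    eps: small floor to avoid log(0) (an assumption).
    """
    log_post = np.log(posteriors + eps)       # 1. make features more Gaussian
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # highest variance first
    if n_keep is not None:
        order = order[:n_keep]
    return centered @ eigvecs[:, order]       # 2. PCA for decorrelation

# Toy posteriors: 1000 frames over 45 hypothetical phone classes.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(45), size=1000)
features = tandem_features(posteriors)
print(features.shape)
```

After the projection the feature dimensions are uncorrelated on the data used to fit the PCA, which suits the diagonal-covariance Gaussians commonly used in HMM back-ends.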
![Page 27: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/27.jpg)
Standalone Features
[Bar chart. Y-axis: Word Error Rate, 36.0% to 50.0%. Bars: 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, PLP 9 Frames]
![Page 28: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/28.jpg)
Combination W/State-of-the-Art Front-End Feature
• SRI’s 2003 PLP front-end feature is 12th-order PLP with three deltas. Heteroscedastic linear discriminant analysis (HLDA) then transforms this 52-dimensional feature vector into the 39-dimensional HLDA(PLP+3d)
• Concatenate PCA-truncated MLP features to HLDA(PLP+3d) and use as the augmented front-end feature
– Similar to the Qualcomm-ICSI-OGI features in AURORA
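The concatenation step can be sketched as below. The 39-dimensional baseline comes from the slide; the number of MLP feature dimensions kept after PCA truncation is not stated here, so `n_keep=25` is a purely hypothetical value, as are the toy arrays and the name `augment_features`.

```python
import numpy as np

def augment_features(hlda_plp, mlp_pca, n_keep):
    """Append the first n_keep PCA-ordered MLP features to the baseline features.

    hlda_plp: array (num_frames, 39), the HLDA(PLP+3d) baseline features.
    mlp_pca: array (num_frames, num_mlp_dims), PCA-transformed MLP outputs,
             assumed to be ordered by decreasing variance.
    """
    if hlda_plp.shape[0] != mlp_pca.shape[0]:
        raise ValueError("both feature streams must cover the same frames")
    return np.hstack([hlda_plp, mlp_pca[:, :n_keep]])

# Toy streams for 100 frames.
rng = np.random.default_rng(0)
hlda_plp = rng.normal(size=(100, 39))
mlp_pca = rng.normal(size=(100, 45))
augmented = augment_features(hlda_plp, mlp_pca, n_keep=25)  # n_keep is hypothetical
print(augmented.shape)
```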
![Page 29: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/29.jpg)
Combo W/PLP Baseline Features
[Bar chart. Y-axis: Word Error Rate, 32.0% to 38.0%. Bars: HLDA(PLP+3d), 15 Bands x 51 Frames, PCA 40, LDA 40, HATS Before Sigmoid, HATS, TRAPS Before Softmax, TRAPS, PLP 9 Frames, HATS + PLP 9 Frames]
![Page 30: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/30.jpg)
Ranking Table
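
| System | Frame Acc. | Standalone | Combination |
|---|---|---|---|
| 15 Bands x 51 Frames | 6 | 6 | 6 |
| PCA 40 | 5 | 2 | 2 |
| LDA 40 | 4 | 3 | 2 |
| HATS Before Sigmoid | 3 | 4 | 2 |
| HATS | 1 | 1 | 1 |
| TRAPS Before Softmax | 2 | 4 | 5 |
| TRAPS | 7 | 7 | 7 |
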
![Page 31: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/31.jpg)
Observations
• Across the three testing setups:
1. HATS is always #1
2. The one-stage 15 Bands x 51 Frames is always #6, i.e. second to last
3. TRAPS is always last
4. PCA, LDA, HATS before sigmoid, and TRAPS before softmax flip-flop in relative performance
![Page 32: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/32.jpg)
Interpretation
• The learning constraints introduced by the 2-stage approach are helpful if done right.
• The non-linear discriminant transform of HATS is better than the linear discriminant transforms from LDA and HATS before sigmoid
• The further mapping from hidden activations to critical-band phone posteriors is not helpful
– Perhaps mapping to critical-band phones is too difficult and inherently noisy
• Finally, like TRAPS, HATS is complementary to the more conventional features and combines synergistically with PLP 9 Frames.
![Page 33: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/33.jpg)
![Page 34: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/34.jpg)
Frame Accuracy Performance

| System | Frame Acc. | Rel. Improvement |
|---|---|---|
| 15 Bands x 51 Frames | 64.7% | - |
| PCA 40 | 65.5% | 1.2% |
| LDA 40 | 65.5% | 1.2% |
| HATS Before Sigmoid | 65.8% | 1.7% |
| HATS | 66.9% | 3.4% |
| TRAPS Before Softmax | 65.9% | 1.7% |
| TRAPS | 64.0% | -1.2% |
| PLP 9 Frames | 67.6% | N/A |

![Page 35: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/35.jpg)
Standalone Features WER

| System | WER | Rel. Improvement |
|---|---|---|
| 15 Bands x 51 Frames | 48.0% | - |
| PCA 40 | 45.3% | 5.6% |
| LDA 40 | 46.5% | 3.1% |
| HATS Before Sigmoid | 45.9% | 4.4% |
| HATS | 44.5% | 7.3% |
| TRAPS Before Softmax | 45.9% | 4.4% |
| TRAPS | 48.2% | -0.4% |
| PLP 9 Frames | 41.2% | N/A |

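The "Rel. Improvement" column in this WER table appears consistent with the relative WER reduction measured against the one-stage 15 Bands x 51 Frames baseline (48.0%). The short sketch below recomputes a few rows under that assumption; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction in percent; positive means fewer errors than baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

baseline = 48.0  # 15 Bands x 51 Frames, standalone WER in percent
systems = {"PCA 40": 45.3, "LDA 40": 46.5, "HATS": 44.5, "TRAPS": 48.2}

for name, wer in systems.items():
    print(f"{name}: {relative_improvement(baseline, wer):.1f}%")
```

A negative value, as for TRAPS, means the system makes more word errors than the baseline.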
![Page 36: Learning Long-Term Temporal Features](https://reader035.vdocument.in/reader035/viewer/2022062500/56815904550346895dc637e5/html5/thumbnails/36.jpg)
Combo W/PLP Baseline Features

| System | WER | Rel. Improvement |
|---|---|---|
| HLDA(PLP+3d) | 37.2% | - |
| 15 Bands x 51 Frames | 37.1% | 0.3% |
| PCA 40 | 36.8% | 1.1% |
| LDA 40 | 36.8% | 1.1% |
| HATS Before Sigmoid | 36.8% | 1.1% |
| HATS | 36.0% | 3.2% |
| TRAPS Before Softmax | 36.9% | 0.8% |
| TRAPS | 37.2% | 0.0% |
| PLP 9 Frames | 36.1% | 3.0% |
| Inverse Entropy Combo: HATS + PLP 9 Frames | 34.0% | 8.6% |