Speech recognition in MUMIS
Judith Kessens, Mirjam Wester
& Helmer Strik
Manual transcriptions
• Transcriptions made by SPEX:
  – orthographic transcriptions
  – transcriptions on chunk level (2-3 sec.)
• Formats:
  – *.Textgrid (Praat)
  – XML derivatives:
    • *.pri – no time information
    • *.skp – time information
Manual transcriptions
Total amount of transcribed matches on ftp-site (including the demo matches):
• Dutch: 6 matches
• German: 21 matches
• English: 3 matches
Extensions:
Dutch (_N), German (_G), English (_E)
Automatic speech recognition
1. Acoustic preprocessing
• Acoustic signal features
2. Speech recognition
• Acoustic models
• Language models
• Lexicon
Automatic transcriptions
• Problem of recorded data:
  – Commentaries and stadium noise are mixed
  – Very high noise levels
Recognition of such extremely noisy data is very difficult
Examples of data
Yug-Ned match
• Dutch: “op _t ogenblik wordt in dit stadion de opstelling voorgelezen”
• English: “and they wanna make the change before the corner”
• German: “und die beiden Tore die die Hollaender bekommen hat haben”
Examples of data
Eng-Dld match
• Dutch: “geeft nu een vrije trap in _t voordeel van Ince”
• English: “and phil neville had to really make about three yards to stop <dreisler*u> pulling it down and playing it”
• German: “wurde von allen englischen Zeitungen aus der Mannschaft”
Evaluation of aut. transcriptions
WER (%) = 100 × (insertions + deletions + substitutions) / (number of words)
WER can be larger than 100%, because insertions are not bounded by the number of words!
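The WER definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-separated words and a standard minimum edit distance alignment; the function names `edit_counts` and `wer` are hypothetical and not taken from the MUMIS tooling.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_errors, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j]; only two rows are kept in memory.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diagonal = prev[j - 1]              # correct word, no cost
            else:
                t, s, d, n = prev[j - 1]
                diagonal = (t + 1, s + 1, d, n)     # substitution
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)         # reference word missing
            t, s, d, n = cur[j - 1]
            insertion = (t + 1, s, d, n + 1)        # extra hypothesis word
            cur.append(min(diagonal, deletion, insertion))
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the number of reference words."""
    num_words = len(reference.split())
    if num_words == 0:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = edit_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / num_words
```

With a one-word reference and a three-word hypothesis containing that word, the two insertions give a WER of 200%, which illustrates why the measure is not capped at 100%.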
WERs (all words)
WER (%)    Dutch   English   German
Yug-Ned    84.5    84.5      77.4
Eng-Dld    83.2    83.3      90.8
WERs (player names)
WER (%)             Dutch   English   German
Yug-Ned, all words  84.5    84.5      77.4
Yug-Ned, names      53.0    48.2      40.9
Eng-Dld, all words  83.2    83.3      90.8
Eng-Dld, names      55.0    56.2      77.4
WERs versus SNR
                    Dutch   English   German
Yug-Ned, WER (%)    84.5    84.5      77.4
Yug-Ned, SNR (dB)   9       12        19
Eng-Dld, WER (%)    83.2    83.3      90.8
Eng-Dld, SNR (dB)   8       11        7
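The slides do not state how the SNR values were estimated. As a rough sketch only, one common approach is to compare the power of chunks containing speech plus noise with the power of noise-only chunks. The function names and the assumption that noise-only samples are available are illustrative, not the authors' procedure.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech_plus_noise, noise_only):
    """Estimate SNR in dB, assuming speech and noise are uncorrelated.

    Speech power is approximated as (power of speech+noise) - (noise power).
    """
    noise_power = mean_power(noise_only)
    speech_power = mean_power(speech_plus_noise) - noise_power
    if noise_power <= 0 or speech_power <= 0:
        raise ValueError("cannot estimate SNR from these samples")
    return 10.0 * math.log10(speech_power / noise_power)
```

Under this definition, an SNR of 10 dB means the speech power is ten times the noise power, so the values of 7 to 12 dB in most table cells correspond to very noisy recordings.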
Automatic transcriptions
The language model (LM) and lexicon (lex) are adapted to a specific match
• Start with a general LM and lex
• Add player names of the specific match
• Expand the general LM and lex when more data is available
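The effect of adding player names can be illustrated by measuring the out-of-vocabulary (OOV) rate before and after adaptation. This is a toy sketch: the lexicon is treated as a plain word set, whereas a real lexicon also needs pronunciations and the LM needs probabilities for the new words. The function names and the tiny lexicon are hypothetical.

```python
def oov_rate(words, lexicon):
    """Percentage of word tokens that are not covered by the lexicon."""
    if not words:
        return 0.0
    missing = sum(1 for word in words if word.lower() not in lexicon)
    return 100.0 * missing / len(words)


def adapt_lexicon(general_lexicon, player_names):
    """Return a match-specific lexicon: the general one plus player names."""
    return set(general_lexicon) | {name.lower() for name in player_names}


# Toy general lexicon (hypothetical) and an utterance from the Eng-Dld match.
general = {"geeft", "nu", "een", "vrije", "trap", "in", "_t", "voordeel", "van"}
utterance = "geeft nu een vrije trap in _t voordeel van Ince".split()

before = oov_rate(utterance, general)
after = oov_rate(utterance, adapt_lexicon(general, ["Ince"]))
```

In this example the only uncovered token is the player name, so adding the names of the match removes all OOV tokens from the utterance.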
WERs for various amounts of data
[Figure: WER (%) versus number of words used to train the language model (0 to 250,000). Curves: Yug-Ned (Dutch), lex: 1 CD; Eng-Dld (Dutch), lex: 1 CD; Yug-Ned (German), lex: 1 CD; Yug-Ned (German), lex: 7 CDs; Yug-Ned (German), lex: 19 CDs; Eng-Dld (German), lex: 7 CDs]
Oracle experiments - ICSLP’02
Due to the limited amount of material, we started off with oracle experiments:
• Language models are trained on target match
• Acoustic models are trained on part of target match or other match
Much lower WERs
Summary of results
Acoustic model training:
• Leaving out non-speech chunks does not hurt recognition performance
• Using more training data is beneficial, but more importantly, the SNRs of the training and test data should be matched
Summary of results
• WERs are SNR-dependent
[Figure: WER (%) as a function of SNR (dB, 0 to 20) for Dutch, English and German]
(tested on Yug-Ned match)
Summary of results
• Split words into categories, i.e. function words, content words and football players’ names:
  WER function words > WER content words > WER names
[Figure: bar chart of WER (%) per word category (function, content, names, all) for Dutch, English and German]
(tested on Yug-Ned match)
Summary of results
• Noise reduction tool (FTNR): small improvement
[Figure: WERs (%) with and without FTNR for Dutch (NL), English (Eng) and German (Dld)]
Ongoing work
Techniques to lower WERs
• Tuning of the generic language model
  – Defining different classes
  – Reduction of OOV words in lexicon and in the language model (using more material)
• Speaker adaptation in HTK
  (note: all other experiments are being carried out using Phicos)
Ongoing work
Noise robustness
• Extension of the acoustic models by using double deltas.
• Histogram Normalization and FTNR.
• SNR dependent acoustic models.
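"Double deltas" are second-order time derivatives of the acoustic features, appended to the static features and their first-order deltas. The sketch below uses the widely used regression formula with a window of two frames and edge frames repeated at the boundaries. Whether the authors' setup uses exactly this formula and window is an assumption, and the function names are hypothetical.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a list of feature vectors.

    d[t] = sum_{n=1..window} n * (c[t+n] - c[t-n]) / (2 * sum_{n} n^2),
    with the first and last frame repeated beyond the boundaries.
    """
    num_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    result = []
    for t in range(num_frames):
        vec = [0.0] * len(frames[t])
        for n in range(1, window + 1):
            ahead = frames[min(t + n, num_frames - 1)]
            behind = frames[max(t - n, 0)]
            for k in range(len(vec)):
                vec[k] += n * (ahead[k] - behind[k])
        result.append([v / denom for v in vec])
    return result


def add_deltas_and_double_deltas(static):
    """Append deltas and double deltas to each static feature vector."""
    first = deltas(static)
    second = deltas(first)  # double deltas are deltas of the deltas
    return [s + d + dd for s, d, dd in zip(static, first, second)]
```

For a feature that rises by one unit per frame, the delta is 1 and the double delta is 0 away from the boundaries, so the extended vector is three times as long as the static one.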
Recommendations
Acoustic modeling
• Record commentaries and stadium noise separately
• Speaker adaptation:
- Transcribe characteristics of commentator
- Collect more speech data of commentator
Recommendations
Lexicon and language modeling
• Collect orthographic transcriptions of spoken material, instead of written material
- Subtitles
- Closed captions