Advances in WP2
Torino Meeting – 9-10 March 2006
www.loquendo.com
Activities on WP2 since last meeting
• Study of innovative NN adaptation methods
– Models: Linear Hidden Networks
• Test on project adaptation corpora:
– WSJ0 Adaptation component
– WSJ1 Spoke-3 component
– Hiwire Non-Native Corpus
LIN Adaptation for HMM/NN
• LIN means “linear input network”
• LIN is a classical technique for speaker and channel adaptation in HMM/NN [Neto 1996];
• The LIN is placed before an MLP already trained in a speaker independent way (SI-MLP)
• The input space is rotated by a linear transform, to make the target conditions nearer to the training conditions
• The linear transform is implemented with a linear neural network inserted between the input layer and the 1st hidden layer
LIN Adaptation
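[Figure: LIN adaptation architecture. Speech signal parameters feed the input layer; the LIN sits between the input layer and the 1st hidden layer of the speaker-independent MLP (SI-MLP: 1st hidden layer, 2nd hidden layer, output layer), whose outputs are the emission probabilities of the acoustic-phonetic units.]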
LIN Training
• The global SI-MLP+LIN system is trained with speech material from the target speaker;
• The LIN is initialized with an identity matrix;
• LIN weights are trained with error back-propagation through the global net;
• The original NN weights are kept frozen
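The LIN training recipe above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the Loquendo implementation: the frozen SI-MLP is reduced to one sigmoid hidden layer with a softmax output, biases are omitted, the weights are random, and the single adaptation frame `x` with label `target` is made up. All function and variable names are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product M v for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matTvec(M, v):
    """Transposed product M^T v, used to back-propagate an error signal."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def sigmoid(a):
    return [1.0 / (1.0 + math.exp(-x)) for x in a]

def softmax(a):
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def si_mlp_forward(W1, W2, z):
    """Toy frozen SI-MLP: one sigmoid hidden layer and a softmax output."""
    h = sigmoid(matvec(W1, z))
    return h, softmax(matvec(W2, h))

def lin_step(A, W1, W2, x, target, lr):
    """One gradient step on the LIN matrix A only; W1 and W2 are never modified."""
    z = matvec(A, x)                      # LIN: linear transform of the input
    h, y = si_mlp_forward(W1, W2, z)
    d_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(y)]
    d_h = matTvec(W2, d_out)              # error back through the output layer
    d_a = [g * hj * (1.0 - hj) for g, hj in zip(d_h, h)]
    d_z = matTvec(W1, d_a)                # error back through the hidden layer
    for i in range(len(A)):
        for j in range(len(x)):
            A[i][j] -= lr * d_z[i] * x[j]
    return -math.log(y[target])           # cross-entropy measured before the update

rng = random.Random(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
W1_before = [row[:] for row in W1]

# Identity initialization: before adaptation the LIN leaves the input unchanged.
A = [[1.0 if i == j else 0.0 for j in range(n_in)] for i in range(n_in)]
x, target = [0.5, -1.0, 0.25], 1          # one made-up adaptation frame and its label

losses = [lin_step(A, W1, W2, x, target, lr=0.1) for _ in range(100)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the LIN starts as the identity, the first loss is exactly that of the unadapted SI-MLP. The error signal has to travel through every frozen layer before it reaches the LIN weights, which is what "through the global net" means in practice.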
LHN Adaptation
• LHN means “linear hidden network”
• The activations of the last hidden layer are linearly transformed to improve acoustic matching of the adaptation material
• The activation values of a hidden layer represent an internal structure of the input pattern in a space more suitable for classification and adaptation
• The linear transform is implemented with a linear neural network layer inserted between the last hidden layer and the output layer
LHN Adaptation
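[Figure: LHN adaptation architecture. Speech signal parameters feed the input layer of the speaker-independent MLP (SI-MLP: input layer, 1st hidden layer, 2nd hidden layer, output layer); the LHN sits between the last hidden layer and the output layer, whose outputs are the emission probabilities of the acoustic-phonetic units.]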
LHN Training
• The global SI-MLP+LHN system is trained with speech material from the target speaker;
• The LHN is initialized with an identity matrix;
• LHN weights are trained with error back-propagation through the last layer of weights;
• The original NN weights are kept frozen
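A minimal sketch of one LHN update, again in plain Python and under assumptions: the output layer is a bias-free softmax with random frozen weights `W_out`, and the vector `h` stands in for last-hidden-layer activations that the frozen lower layers would produce for one adaptation frame. Names and values are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product M v for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(a):
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def lhn_step(B, W_out, h, target, lr):
    """One gradient step on the LHN matrix B; the output weights W_out stay frozen."""
    hp = matvec(B, h)                     # LHN: linear transform of hidden activations
    y = softmax(matvec(W_out, hp))
    d_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(y)]
    # The error only has to cross the output layer to reach the LHN.
    d_hp = [sum(W_out[k][i] * d_out[k] for k in range(len(W_out)))
            for i in range(len(h))]
    for i in range(len(B)):
        for j in range(len(h)):
            B[i][j] -= lr * d_hp[i] * h[j]
    return -math.log(y[target])           # cross-entropy measured before the update

rng = random.Random(1)
n_hid, n_out = 4, 3
W_out = [[rng.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
W_out_before = [row[:] for row in W_out]

B = [[1.0 if i == j else 0.0 for j in range(n_hid)] for i in range(n_hid)]  # identity init
h, target = [0.9, 0.1, 0.6, 0.3], 2       # made-up hidden activations and label

losses = [lhn_step(B, W_out, h, target, lr=0.1) for _ in range(100)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Since everything below the LHN is frozen, the hidden activations of the adaptation frames never change during adaptation, so in a setup like this one they could be computed once and reused at every step.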
Paper at ICASSP 2006
ADAPTATION OF HYBRID ANN/HMM MODELS USING LINEAR HIDDEN TRANSFORMATIONS AND CONSERVATIVE TRAINING
Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface and Renato De Mori
WSJ0 LIN-LHN Adaptation
• Train: standard WSJ0 SI-84 train set, 16 kHz
• SI Test: 8 speakers and ~40 sentences for each speaker
• Vocabulary: 5K words, with a standard bigram LM
• Adaptation: the same 8 speakers as the SI test, with 40 adaptation sentences for each of them
| Adaptation Model | Spk: WV1_440 | Spk: WV1_441 | Spk: WV1_442 | Spk: WV1_443 | Spk: WV1_444 | Spk: WV1_445 | Spk: WV1_446 | Spk: WV1_447 | Average | (E.R.) |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 89.2 | 85.2 | 88.4 | 92.2 | 87.3 | 91.0 | 95.1 | 87.8 | 89.5 | - |
| LIN Adapt. | 93.1 | 87.6 | 89.2 | 90.1 | 90.0 | 89.9 | 95.1 | 89.6 | 90.6 | (10.5%) |
| LHN Adapt. | 92.6 | 90.7 | 89.2 | 92.2 | 91.4 | 92.2 | 95.3 | 89.0 | 91.6 | (20.0%) |
WSJ1 – SPOKE-3 LIN-LHN Adaptation
• Spoke-3 is the standard WSJ1 case study to evaluate adaptation to non-native speakers
• There are 10 non-native speakers (40 adaptation sentences and ~40 test sentences)
• Train: standard WSJ0 SI-84 train set, 16 kHz
• Vocabulary is 5K words, with standard bigram LM
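| Adaptation Model | 4N0 | 4N1 | 4N3 | 4N4 | 4N5 | 4N8 | 4N9 | 4NA | 4NB | 4NC | Average | (E.R.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 15.6 | 16.3 | 16.1 | 31.6 | 50.7 | 79.0 | 63.5 | 70.1 | 57.4 | 63.5 | 45.8 | - |
| LIN Adapt. | 35.6 | 17.2 | 33.7 | 34.6 | 63.0 | 77.9 | 71.8 | 72.8 | 59.3 | 72.6 | 53.5 | (14.2%) |
| LHN Adapt. | 62.0 | 38.4 | 69.3 | 59.8 | 69.3 | 83.1 | 80.5 | 76.9 | 68.3 | 82.4 | 69.4 | (43.5%) |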
THE FEMALE PRODUCES A LITTER OF TWO TO FOUR YOUNG IN NOVEMBER AND DECEMBER
Comments on WSJ0 – WSJ1 Results
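• LIN does work for speaker adaptation: E.R. 10.5% on WSJ0 and 14.2% on WSJ1
• However, with LIN performance in some cases does not improve (WV1_446 on WSJ0) or even decreases (WV1_443 and WV1_445 on WSJ0, 4N8 on WSJ1)
• LHN is a more powerful method: E.R. 20.0% on WSJ0 and 43.5% on WSJ1
• With LHN performance never decreases: it improves for every speaker except WV1_443 on WSJ0, where it is unchanged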
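The slides do not define E.R. The reported figures are consistent with reading it as the relative reduction in error rate, with the table values taken as accuracies in percent; the short check below assumes that reading. The function name `error_reduction` is hypothetical, and the input numbers are the "Average" columns of the WSJ0 and WSJ1 tables.

```python
def error_reduction(baseline_acc, adapted_acc):
    """Relative error-rate reduction in percent, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    adapted_err = 100.0 - adapted_acc
    return 100.0 * (baseline_err - adapted_err) / baseline_err

# Average accuracies as reported in the two tables.
averages = {
    "WSJ0": {"Baseline": 89.5, "LIN": 90.6, "LHN": 91.6},
    "WSJ1 Spoke-3": {"Baseline": 45.8, "LIN": 53.5, "LHN": 69.4},
}

for corpus, acc in averages.items():
    for method in ("LIN", "LHN"):
        er = error_reduction(acc["Baseline"], acc[method])
        print(f"{corpus} {method}: E.R. {er:.1f}%")
```

Under this assumption the computed values round to the 10.5%, 20.0%, 14.2% and 43.5% quoted above.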
Hiwire Non-Native Corpus (1)
• The database consists of English sentences uttered by non-native speakers.
• The speakers are of French, Italian, Greek and Spanish origin (plus an additional set of extra-European speakers).
• The uttered sentences belong to a command language used by aircraft pilots.
• The vocabulary contains 134 words.
• Each speaker recorded one list of 100 sentences.
Hiwire Non-Native Corpus (2)
Corpus composition:
• French speakers: 31
• Italian speakers: 20
• Greek speakers: 20
• Spanish speakers: 10
• World speakers: 10
Experimental conditions
• Starting models:
- standard Loquendo ASR EN-US
- Telephone models (8 kHz)
- Training set: LDC Macrophone
• Adaptation: first 50 utterances of each speaker
• Test: last 50 utterances of each speaker
• LM: Hiwire grammar (134 words voc.)
• Signal proc.: down-sampling to 8 kHz
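The adaptation/test partition described above could be expressed as follows. This is only a sketch: the helper `split_speaker` and the utterance identifiers are hypothetical, and the 100 items per speaker follow the corpus description (one list of 100 sentences per speaker).

```python
def split_speaker(utterances, n_adapt=50, n_test=50):
    """Return (adaptation, test): the first n_adapt and the last n_test utterances."""
    if len(utterances) < n_adapt + n_test:
        raise ValueError("not enough utterances for disjoint adaptation and test sets")
    return utterances[:n_adapt], utterances[-n_test:]

# Hypothetical utterance identifiers for one speaker, in recording order.
utts = [f"spk01_utt{i:03d}" for i in range(1, 101)]
adapt, test = split_speaker(utts)
print(adapt[0], adapt[-1], test[0], test[-1])
```

With exactly 100 utterances per speaker the two halves are disjoint, so no test sentence is seen during adaptation.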
Results on Hiwire corpus
• Recognition model: ANN/HMM
• Adaptation Model: LIN - LHN
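| Nationality | # of speakers | Default models | Adapted LIN WA | Adapted LIN ER % | Adapted LHN WA | Adapted LHN ER % |
|---|---|---|---|---|---|---|
| French | 31 | 88.4 | 91.3 | 25.1 | 91.8 | 29.1 |
| Italian | 20 | 88.3 | 91.7 | 28.8 | 91.7 | 28.9 |
| Greek | 20 | 88.8 | 93.4 | 41.0 | 94.5 | 50.9 |
| Spanish | 10 | 87.5 | 92.5 | 40.0 | 93.6 | 48.6 |
| World | 10 | 88.7 | 92.7 | 35.2 | 94.0 | 47.1 |
| Total | 91 | 88.4 | 92.2 | 32.1 | 92.8 | 37.9 |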
Discussion
• Adaptation of the acoustic models is also beneficial for non-native speakers
• State-of-the-art LIN is a feasible and practical way to adapt hybrid NN-HMM models
• LHN (transformation of hidden-layer activations) is a new NN adaptation method introduced in the project
• LHN outperforms LIN on all three corpora (average E.R. 20.0% vs 10.5% on WSJ0, 43.5% vs 14.2% on WSJ1, 37.9% vs 32.1% on Hiwire)
Workplan
• Selection of suitable benchmark databases (m6)
• Baseline set-up for the selected databases (m8)
• LIN adaptation method implemented and tested on the benchmarks (m12)
• Experimental results on Hiwire database with LIN (m18)
• Innovative NN adaptation methods and algorithms for acoustic modeling and experimental results (m21)
• Further advances on new adaptation methods (m24)
• Unsupervised Adaptation: algorithms and experimentation (m33)