Slide transcript (source: …aanastas/research/emnlp2016slides.pdf, EMNLP 2016)
An Unsupervised Probability Model for Speech-to-Translation Alignment
of Low-Resource Languages
Antonios Anastasopoulos1, David Chiang1, Long Duong2
1 University of Notre Dame, USA 2 University of Melbourne, Australia
Motivation
Why Speech-based MT?
90% of languages do not have a writing system
2
Motivation
Endangered languages documentation
Use speech with translations
Using the Aikuma (Bird 2010) app to collect parallel speech
3
Motivation
Low-resource languages
Utilize translations rather than transcriptions
4
Task Description
Source side: frames of the speech signal
Target side: translation text (example: "…a little bit of knowledge")
Task: find the best alignment between source and target side
Our method outperforms both baselines
5
The big picture
Example: the speech utterance "go dog go" is aligned to its translation "vaya vaya perro", and each aligned span of frames is assigned to a cluster (f8, f8, f13)
6
Distortion model
Controls the reordering of the target words; based on fast-align [Dyer et al.]
[Figures: the original fast-align distortion vs. our modification, with separate terms for span start and span end]
7
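The distortion figures from this slide do not survive in the transcript. As a rough sketch (not the paper's exact parameterization): fast-align penalizes an alignment link by its distance from the diagonal, and a speech extension can apply that penalty to a span's start and end positions separately. The function names and the `lam` sharpness parameter below are illustrative.

```python
import math

def diagonal_prior(i, n, j, m, lam=4.0):
    """fast-align-style distortion: penalize how far the relative
    position i/n is from j/m, i.e. from the diagonal (unnormalized)."""
    return math.exp(-lam * abs(i / n - j / m))

def span_score(word_idx, n_words, start, end, n_frames, lam=4.0):
    """Hypothetical speech extension: score the frame span [start, end)
    for target word word_idx by applying the diagonal prior to the
    span's start and end positions separately."""
    return (diagonal_prior(start, n_frames, word_idx, n_words, lam)
            * diagonal_prior(end, n_frames, word_idx, n_words, lam))
```

With this scoring, a span near a word's diagonal position outscores one far from it, which is the reordering behavior the slide describes.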
Clustering model
Assuming a "prototype" for each cluster (e.g. cluster f8)
8
Training
Expectation-Maximization
• Initialize spans and clusters
• M step: re-estimate prototypes
• E step: assign cluster and align
• We restrict the search space:
  - voice activity detection
  - phone boundary detection [Khanaga et al.]
9
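The EM loop above can be sketched in miniature, with several simplifications: spans are fixed-length feature vectors, distance is squared Euclidean and averaging is a plain mean (the paper uses DTW distance and DBA averaging over variable-length speech spans), initialization is a simple farthest-point heuristic, and the alignment part of the E step is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (stand-in for the paper's DTW distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def em_train(spans, k, iters=10):
    """k-means-style EM sketch over fixed-length span features."""
    # Initialize prototypes with a farthest-point heuristic
    # (illustrative; the paper initializes spans and clusters differently).
    prototypes = [list(spans[0])]
    while len(prototypes) < k:
        far = max(spans, key=lambda s: min(sq_dist(s, p) for p in prototypes))
        prototypes.append(list(far))
    assign = [0] * len(spans)
    for _ in range(iters):
        # E step: assign each span to its nearest prototype
        assign = [min(range(k), key=lambda c: sq_dist(s, prototypes[c]))
                  for s in spans]
        # M step: re-estimate each prototype as the mean of its members
        # (the paper uses DTW Barycenter Averaging here)
        for c in range(k):
            members = [s for s, a in zip(spans, assign) if a == c]
            if members:
                prototypes[c] = [sum(dim) / len(members)
                                 for dim in zip(*members)]
    return prototypes, assign
```

On well-separated data the loop converges in a couple of iterations, grouping nearby spans under a shared prototype.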
Experiments
Language pair      Dataset             Utterances
Griko-Italian      [Lekakou et al.]    330
Spanish-English    CALLHOME (sample)   2k
Spanish-English    CALLHOME (all)      17k
Spanish-English    Fisher              143k
10
Baselines
• Naive:
  - frames per word proportional to the word's character count
  - placed along the diagonal (example: "Austin is great")
• Neural [Duong et al.]:
  - DNN optimised for direct translation of speech
  - convert attention mechanism weights to alignments
11
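The naive baseline can be sketched as follows: each translation word receives a share of the speech frames proportional to its character count, laid out left to right along the diagonal. The function name and rounding policy are illustrative.

```python
def naive_alignment(n_frames, words):
    """Naive diagonal baseline: give each translation word a span of
    speech frames proportional to its character count, left to right."""
    total_chars = sum(len(w) for w in words)
    spans, start = [], 0
    for idx, w in enumerate(words):
        if idx == len(words) - 1:
            end = n_frames  # last word absorbs rounding leftovers
        else:
            end = start + round(n_frames * len(w) / total_chars)
        spans.append((w, start, end))
        start = end
    return spans
```

For "Austin is great" over 100 frames this yields contiguous spans of 46, 15, and 39 frames, roughly proportional to the 6, 2, and 5 characters of the words.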
Results
Alignment F-score
           Griko-Italian  Callhome (2k)  Callhome (17k)  Fisher (143k)
naive          27.8           35.7           35.8            46.7
neural         26.2           29.1           26.4            27.0
ours           30.8           38.6           38.8            53.8
12
Results
Alignment Precision
           Griko-Italian  Callhome (2k)  Callhome (17k)  Fisher (143k)
naive          24.0           31.8           31.9            42.2
neural         24.7           26.1           23.8            24.6
ours           33.3           38.4           38.8            56.6
13
Results
Word-level F-score
[Plot: average F-score by word length (frames)]
14
Example
15
Conclusion
Alignment model:
• extension of IBM Model 2 with fast-align distortion for speech-to-translation alignment
• k-means clustering with DTW and DBA
• improvements in F-score and particularly in Precision
https://bitbucket.org/ndnlp/speech2translation
16
Results
Alignment Recall
           Griko-Italian  Callhome (2k)  Callhome (17k)  Fisher (143k)
naive          33.2           40.7           40.8            52.2
neural         27.8           32.9           29.8            30.0
ours           28.7           38.8           38.9            51.2
17
Example
18
Example
19
Proper model
Deficient vs. proper (normalized) formulation [equations omitted]
The proper model performs much worse: it favours spans that are too long or too short.
20
Example
21
Background: DTW and DBA
Dynamic Time Warping (DTW)
DTW Barycenter Averaging (DBA)
22
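A minimal DTW sketch for 1-D sequences (real speech frames would be feature vectors, with e.g. a Euclidean frame distance in place of the absolute difference):

```python
def dtw(x, y):
    """Dynamic Time Warping distance between two sequences via the
    standard dynamic program over all monotonic warping paths."""
    INF = float("inf")
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])      # local frame distance
            D[i][j] = cost + min(D[i - 1][j],    # stretch x
                                 D[i][j - 1],    # stretch y
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]
```

Because the warping path may repeat elements, a sequence and its time-stretched copy have distance zero, which is what makes DTW suitable for comparing speech spans of different lengths.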
M-step:
Prototype estimation with DTW Barycenter Averaging
23
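The DBA idea can be sketched for 1-D sequences: repeatedly align every sequence to the current average with DTW, then reset each average element to the mean of all elements aligned to it. This is a simplified version (scalar values, average length fixed to the first sequence's length).

```python
def dba(seqs, iters=5):
    """DTW Barycenter Averaging sketch: iteratively refine an average
    sequence so it minimizes DTW distance to a set of sequences."""
    avg = list(seqs[0])  # initialize from the first sequence
    for _ in range(iters):
        buckets = [[] for _ in avg]
        for s in seqs:
            for i, j in dtw_path(avg, s):
                buckets[i].append(s[j])  # collect elements aligned to avg[i]
        avg = [sum(b) / len(b) for b in buckets]
    return avg

def dtw_path(x, y):
    """DTW alignment path between x and y as (i, j) index pairs."""
    INF = float("inf")
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # backtrack the cheapest predecessors from (n, m) to (0, 0)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]
```

Each iteration is one M-step refinement: under DTW alignment the average drifts toward the central tendency of the cluster's spans, which is how prototypes are re-estimated.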