
An Unsupervised Probability Model for Speech-to-Translation Alignment of Low-Resource Languages

Antonios Anastasopoulos¹, David Chiang¹, Long Duong²
¹ University of Notre Dame, USA   ² University of Melbourne, Australia

Motivation

Why speech-based MT? 90% of languages do not have a writing system.

Endangered language documentation: use speech paired with translations, e.g. parallel speech collected with the Aikuma app (Bird 2010).

Low-resource languages: utilize translations rather than transcriptions.

Task Description

Source side: frames of the speech signal.
Target side: translation text (e.g. "…a little bit of knowledge").
Task: find the best alignment between the source and the target side.

Our method outperforms both baselines.

The big picture

[Figure: the speech for "vaya vaya perro" is segmented into spans, each span is assigned a cluster label (f8, f8, f13), and the spans are aligned to the words of the translation "go dog go".]

Distortion model

Controls the reordering of the target words; based on fast-align [Dyer et al.].
[Figure: the original fast-align distortion and our modification, which is parameterized by the span start and the span end.]
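
For intuition, a minimal sketch of a fast-align-style diagonal distortion. The span variant and the `lam` tension value are illustrative guesses only, not the paper's exact parameterization; in a full model these unnormalized scores would be normalized over the competing candidates, as in fast-align.

```python
import math

def fastalign_distortion(i, n, j, m, lam=4.0):
    """fast-align-style distortion (Dyer et al.): favour source position i
    (out of n) and target position j (out of m) that lie near the diagonal.
    lam is the diagonal-tension hyperparameter (unnormalized score)."""
    return math.exp(-lam * abs(i / n - j / m))

def span_distortion(start, end, n_frames, j, m, lam=4.0):
    """Hypothetical span variant: score a frame span [start, end) against
    target word j by how far both endpoints lie from the diagonal."""
    h = abs(start / n_frames - (j - 1) / m) + abs(end / n_frames - j / m)
    return math.exp(-lam * h)
```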

Clustering model

Assume a "prototype" for each cluster (e.g. f8); a small sketch of prototype-based cluster assignment follows.
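
A minimal sketch of assigning a span to the cluster with the nearest prototype under DTW, which is one natural reading of this slide (the paper's exact generative probability may differ). `dtw_distance` is passed in, e.g. the helper sketched under "Background: DTW and DBA" below; the dictionary keyed by labels such as 'f8' is an illustrative layout.

```python
def assign_to_cluster(span, prototypes, dtw_distance):
    """Return the label of the cluster whose prototype is closest to `span`
    under DTW. `span` is a (frames x features) array; `prototypes` maps
    labels such as 'f8' to prototype arrays (illustrative layout)."""
    return min(prototypes, key=lambda label: dtw_distance(span, prototypes[label]))
```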

Training

Expectation-Maximization (a schematic code sketch follows the list):
• Initialize spans and clusters
• M step:
  - Re-estimate prototypes
• E step:
  - Assign cluster and align
• We restrict the search space:
  - voice activity detection
  - phone boundary detection [Khanaga et al.]
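
A schematic, self-contained hard-EM sketch of the loop above, for intuition only: it assumes the candidate spans have already been produced from the restricted search space, takes a `dtw_distance` function such as the one sketched under "Background: DTW and DBA", uses a medoid update in place of DBA to keep the code short, and leaves out the alignment of spans to translation words. All names are illustrative, not the released implementation.

```python
import random

def train_em(spans, n_clusters, dtw_distance, n_iters=10, seed=0):
    """Hard-EM sketch: alternate between re-estimating one prototype per
    cluster (here a medoid, standing in for DBA) and re-assigning every
    span to its nearest prototype under DTW. `spans` is a list of
    (frames x features) arrays from the restricted search space."""
    rng = random.Random(seed)
    # Initialization: pick random spans as the initial prototypes.
    prototypes = rng.sample(spans, n_clusters)
    assignments = [0] * len(spans)
    for _ in range(n_iters):
        # E step: assign each span to the closest prototype.
        assignments = [min(range(n_clusters),
                           key=lambda c: dtw_distance(s, prototypes[c]))
                       for s in spans]
        # M step: re-estimate each prototype as the medoid of its members
        # (the paper uses DTW Barycenter Averaging instead).
        for c in range(n_clusters):
            members = [s for s, a in zip(spans, assignments) if a == c]
            if members:
                prototypes[c] = min(members, key=lambda m: sum(
                    dtw_distance(m, other) for other in members))
    return prototypes, assignments
```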

Experiments

Language pair      Dataset               Number of utterances
Griko - Italian    [Lekakou et al.]      330
Spanish - English  CALLHOME (sample)     2k
Spanish - English  CALLHOME (all)        17k
Spanish - English  Fisher                143k

Baselines

• Naive: frames per word proportional to the word's character count, laid out along the diagonal (e.g. "Austin is great"; a small sketch follows the list).
• Neural [Duong et al.]: a DNN optimised for direct translation of speech; the attention mechanism weights are converted to alignments.
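
A minimal sketch of the naive baseline described above: frames are handed out to the words in order, each word's share proportional to its character count, so the alignment follows the diagonal. The function name and the rounding of span boundaries are illustrative choices.

```python
def naive_alignment(n_frames, words):
    """Assign consecutive frame spans to words in order, with each word's
    span length proportional to its number of characters.
    Returns a list of (word, start_frame, end_frame) triples."""
    total_chars = sum(len(w) for w in words)
    spans, start = [], 0
    for k, w in enumerate(words):
        if k == len(words) - 1:
            end = n_frames  # give the last word whatever frames remain
        else:
            end = start + round(n_frames * len(w) / total_chars)
        spans.append((w, start, end))
        start = end
    return spans

# Example: 300 frames for "Austin is great"
# -> [('Austin', 0, 138), ('is', 138, 184), ('great', 184, 300)]
print(naive_alignment(300, "Austin is great".split()))
```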

Results

Alignment F-score
[Bar chart, y-axis 0–60: alignment F-score of the naive, neural, and our models on Griko-Italian, Callhome (2k), Callhome (17k), and Fisher (143k). Our model scores highest on every dataset; the best F-score shown is 53.8.]

Results

Alignment Precision
[Bar chart, y-axis 0–60: alignment precision of the naive, neural, and our models on the same four datasets. Precision shows the largest gains; the best value shown is 56.6.]

Results

Word-level F-score
[Plot: average F-score as a function of word length in frames.]

Example

Conclusion

Alignment model: an extension of IBM Model 2 / fast-align for speech-to-translation alignment.
k-means clustering with DTW and DBA.
Improvements in F-score, and particularly in Precision.
Code: https://bitbucket.org/ndnlp/speech2translation

Results

Alignment Recall
[Bar chart, y-axis 0–50: alignment recall of the naive, neural, and our models on the same four datasets; values reach up to 52.2.]

Example

Proper model

[Figure: the deficient formulation compared with the proper formulation of the model.]
The proper model performs much worse: it favours spans that are too long or too short.

Example

Background: DTW and DBA

Dynamic Time Warping (DTW)
DTW Barycenter Averaging (DBA)
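
For reference, a minimal self-contained DTW sketch in the standard textbook formulation (not the paper's code): it computes the cumulative frame-to-frame distance between two feature sequences and backtracks the warping path. The `dtw_distance` wrapper is what the earlier sketches assume.

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping between two sequences of feature frames
    (arrays of shape [T, d]). Returns (distance, path), where path is a
    list of (i, j) index pairs aligning frames of `a` to frames of `b`."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return cost[n, m], path

def dtw_distance(a, b):
    """Convenience wrapper: DTW distance only (used by the earlier sketches)."""
    return dtw(a, b)[0]
```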

M step: prototype estimation with DTW Barycenter Averaging
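
A minimal sketch of one DBA iteration in the spirit of Petitjean et al.'s algorithm (not necessarily the paper's exact variant), reusing `dtw` from the sketch above: every member sequence is warped against the current prototype, and each prototype frame is replaced by the mean (barycenter) of the member frames aligned to it. Repeating this a few times yields the cluster's prototype.

```python
import numpy as np

def dba_update(prototype, members):
    """One DBA iteration: align every member sequence to the current
    prototype with DTW, collect the member frames mapped to each
    prototype frame, and average them. `prototype` is [T, d]; `members`
    is a list of [T_k, d] arrays. Returns the updated prototype.
    Uses the dtw() helper from the DTW sketch above."""
    buckets = [[] for _ in range(len(prototype))]
    for seq in members:
        _, path = dtw(prototype, seq)          # path of (proto_idx, seq_idx) pairs
        for p_idx, s_idx in path:
            buckets[p_idx].append(seq[s_idx])
    return np.stack([np.mean(bucket, axis=0) if bucket else frame
                     for bucket, frame in zip(buckets, prototype)])

# Typical use: call dba_update a few times until the prototype stabilizes.
```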
