Slide transcript (source: …aanastas/research/emnlp2016slides.pdf, EMNLP 2016)
An Unsupervised Probability Model for Speech-to-Translation Alignment
of Low-Resource Languages
Antonios Anastasopoulos1, David Chiang1, Long Duong2
1 University of Notre Dame, USA 2 University of Melbourne, Australia
Motivation
Why Speech-based MT?
90% of languages do not have a writing system
2
Motivation
Endangered languages documentation
Use speech with translations
Using the Aikuma (Bird 2010) app to collect parallel speech
3
Motivation
Low-resource languages
Utilize translations rather than transcriptions
4
Task Description
Source side: frames of the speech signal
Target side: translation text (example: "…a little bit of knowledge")
Task: find the best alignment between source and target side
Our method outperforms both baselines
5
The big picture
Example: the speech utterance "go dog go" is aligned to its translation "vaya vaya perro", and each aligned span of frames is assigned to a cluster (f8, f8, f13)
6
Distortion model
Controls the reordering of the target words; based on fast-align [Dyer et al.]
[Figures: the original fast-align distortion vs. our modification, with separate terms for span start and span end]
7
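The distortion figures from this slide do not survive in the transcript. As a rough sketch (not the paper's exact parameterization): fast-align penalizes an alignment link by its distance from the diagonal, and a speech extension can apply that penalty to a span's start and end positions separately. The function names and the `lam` sharpness parameter below are illustrative.

```python
import math

def diagonal_prior(i, n, j, m, lam=4.0):
    """fast-align-style distortion: penalize how far the relative
    position i/n is from j/m, i.e. from the diagonal (unnormalized)."""
    return math.exp(-lam * abs(i / n - j / m))

def span_score(word_idx, n_words, start, end, n_frames, lam=4.0):
    """Hypothetical speech extension: score the frame span [start, end)
    for target word word_idx by applying the diagonal prior to the
    span's start and end positions separately."""
    return (diagonal_prior(start, n_frames, word_idx, n_words, lam)
            * diagonal_prior(end, n_frames, word_idx, n_words, lam))
```

With this scoring, a span near a word's diagonal position outscores one far from it, which is the reordering behavior the slide describes.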
Clustering model
Assuming a "prototype" for each cluster (e.g. cluster f8)
8
Training
Expectation-Maximization
• Initialize spans and clusters
• M step: re-estimate prototypes
• E step: assign cluster and align
• We restrict the search space:
  - voice activity detection
  - phone boundary detection [Khanaga et al.]
9
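The EM loop above can be sketched in miniature, with several simplifications: spans are fixed-length feature vectors, distance is squared Euclidean and averaging is a plain mean (the paper uses DTW distance and DBA averaging over variable-length speech spans), initialization is a simple farthest-point heuristic, and the alignment part of the E step is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (stand-in for the paper's DTW distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def em_train(spans, k, iters=10):
    """k-means-style EM sketch over fixed-length span features."""
    # Initialize prototypes with a farthest-point heuristic
    # (illustrative; the paper initializes spans and clusters differently).
    prototypes = [list(spans[0])]
    while len(prototypes) < k:
        far = max(spans, key=lambda s: min(sq_dist(s, p) for p in prototypes))
        prototypes.append(list(far))
    assign = [0] * len(spans)
    for _ in range(iters):
        # E step: assign each span to its nearest prototype
        assign = [min(range(k), key=lambda c: sq_dist(s, prototypes[c]))
                  for s in spans]
        # M step: re-estimate each prototype as the mean of its members
        # (the paper uses DTW Barycenter Averaging here)
        for c in range(k):
            members = [s for s, a in zip(spans, assign) if a == c]
            if members:
                prototypes[c] = [sum(dim) / len(members)
                                 for dim in zip(*members)]
    return prototypes, assign
```

On well-separated data the loop converges in a couple of iterations, grouping nearby spans under a shared prototype.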
Experiments
Language pair      Dataset             Utterances
Griko-Italian      [Lekakou et al.]    330
Spanish-English    CALLHOME (sample)   2k
Spanish-English    CALLHOME (all)      17k
Spanish-English    Fisher              143k
10
Baselines
• Naive:
  - frames per word proportional to the word's character count
  - placed along the diagonal (example: "Austin is great")
• Neural [Duong et al.]:
  - DNN optimised for direct translation of speech
  - convert attention mechanism weights to alignments
11
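The naive baseline can be sketched as follows: each translation word receives a share of the speech frames proportional to its character count, laid out left to right along the diagonal. The function name and rounding policy are illustrative.

```python
def naive_alignment(n_frames, words):
    """Naive diagonal baseline: give each translation word a span of
    speech frames proportional to its character count, left to right."""
    total_chars = sum(len(w) for w in words)
    spans, start = [], 0
    for idx, w in enumerate(words):
        if idx == len(words) - 1:
            end = n_frames  # last word absorbs rounding leftovers
        else:
            end = start + round(n_frames * len(w) / total_chars)
        spans.append((w, start, end))
        start = end
    return spans
```

For "Austin is great" over 100 frames this yields contiguous spans of 46, 15, and 39 frames, roughly proportional to the 6, 2, and 5 characters of the words.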
Results
Alignment F-score
           Griko-Italian  Callhome (2k)  Callhome (17k)  Fisher (143k)
naive          27.8           35.7           35.8            46.7
neural         26.2           29.1           26.4            27.0
ours           30.8           38.6           38.8            53.8
12
Results
Alignment Precision
           Griko-Italian  Callhome (2k)  Callhome (17k)  Fisher (143k)
naive          24.0           31.8           31.9            42.2
neural         24.7           26.1           23.8            24.6
ours           33.3           38.4           38.8            56.6
13
Results
Word-level F-score
[Plot: average F-score by word length (frames)]
14
Example
15
Conclusion
Alignment model:
• extension of IBM Model 2 with fast-align distortion for speech-to-translation alignment
• k-means clustering with DTW and DBA
• improvements in F-score and particularly in Precision
https://bitbucket.org/ndnlp/speech2translation
16
Results
Alignment Recall
           Griko-Italian  Callhome (2k)  Callhome (17k)  Fisher (143k)
naive          33.2           40.7           40.8            52.2
neural         27.8           32.9           29.8            30.0
ours           28.7           38.8           38.9            51.2
17
Example
18
Example
19
Proper model
Deficient vs. proper (normalized) formulation [equations omitted]
The proper model performs much worse: it favours spans that are too long or too short.
20
Example
21
Background: DTW and DBA
Dynamic Time Warping (DTW)
DTW Barycenter Averaging (DBA)
22
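A minimal DTW sketch for 1-D sequences (real speech frames would be feature vectors, with e.g. a Euclidean frame distance in place of the absolute difference):

```python
def dtw(x, y):
    """Dynamic Time Warping distance between two sequences via the
    standard dynamic program over all monotonic warping paths."""
    INF = float("inf")
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])      # local frame distance
            D[i][j] = cost + min(D[i - 1][j],    # stretch x
                                 D[i][j - 1],    # stretch y
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]
```

Because the warping path may repeat elements, a sequence and its time-stretched copy have distance zero, which is what makes DTW suitable for comparing speech spans of different lengths.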
M-step:
Prototype estimation with DTW Barycenter Averaging
23
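The DBA idea can be sketched for 1-D sequences: repeatedly align every sequence to the current average with DTW, then reset each average element to the mean of all elements aligned to it. This is a simplified version (scalar values, average length fixed to the first sequence's length).

```python
def dba(seqs, iters=5):
    """DTW Barycenter Averaging sketch: iteratively refine an average
    sequence so it minimizes DTW distance to a set of sequences."""
    avg = list(seqs[0])  # initialize from the first sequence
    for _ in range(iters):
        buckets = [[] for _ in avg]
        for s in seqs:
            for i, j in dtw_path(avg, s):
                buckets[i].append(s[j])  # collect elements aligned to avg[i]
        avg = [sum(b) / len(b) for b in buckets]
    return avg

def dtw_path(x, y):
    """DTW alignment path between x and y as (i, j) index pairs."""
    INF = float("inf")
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # backtrack the cheapest predecessors from (n, m) to (0, 0)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return path[::-1]
```

Each iteration is one M-step refinement: under DTW alignment the average drifts toward the central tendency of the cluster's spans, which is how prototypes are re-estimated.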