
Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training∗

Renjie Zheng 1† Mingbo Ma 1† Baigong Zheng 1‡ Kaibo Liu 1 Jiahong Yuan 1 Kenneth Church 1 Liang Huang 1,2

1 Baidu Research, Sunnyvale, CA, USA
2 Oregon State University, Corvallis, OR, USA
{renjiezheng, mingboma}@baidu.com

    Abstract

Simultaneous speech-to-speech translation is widely useful but extremely challenging, since it needs to generate target-language speech concurrently with the source-language speech, with only a few seconds delay. In addition, it needs to continuously translate a stream of sentences, but all recent solutions merely focus on the single-sentence scenario. As a result, current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower. To overcome these issues, we propose Self-Adaptive Translation (SAT), which flexibly adjusts the length of translations to accommodate different source speech rates. At similar levels of translation quality (as measured by BLEU), our method generates more fluent target speech (as measured by the naturalness metric MOS) with substantially lower latency than the baseline, in both Zh↔En directions.

    1 Introduction

Simultaneous speech-to-speech translation, which mimics the human interpreter's practice of translating the source speech into a different language with a 3 to 5 second delay, has wide usage scenarios such as international conference meetings, traveling, and negotiations, as it provides a more natural communication process than simultaneous speech-to-text translation. This task has been widely considered one of the most challenging tasks in NLP for (but not limited to) the following reasons: on one hand, simultaneous translation is hard due to the word order difference between source and target languages, e.g., between SOV languages (German, Japanese, etc.) and SVO languages (English, Chinese, etc.).

∗ See our speech-to-speech simultaneous translation demos (including comparison with human interpreters) at https://sat-demo.github.io.

† Equal contribution.
‡ Work done at Baidu Research. Current address: Kwai Inc., Seattle, WA, USA.

Figure 1: Slower source speech causes unnatural pauses (↔) between words. Faster source speech propagates extra latencies (↔) to the following sentences.

On the other hand, simultaneous speech-to-speech translation escalates the challenge by requiring smooth cooperation among the speech recognition, translation, and speech synthesis modules.

In order to achieve simultaneous speech-to-speech translation (SSST), to the best of our knowledge, most recent approaches (Oda et al., 2014; Xiong et al., 2019) decompose the entire system into a three-step pipeline: streaming Automatic Speech Recognition (ASR) (Sainath et al., 2020; Inaguma et al., 2020; Li et al., 2020), simultaneous Text-to-Text translation (sT2T) (Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019; Ma et al., 2020b), and Text-to-Speech (TTS) synthesis (Wang et al., 2017; Ping et al., 2017; Oord et al., 2017). Most recent efforts mainly focus on sT2T, which is considered the key component for further reducing the translation latency and improving the translation quality of the entire pipeline. To achieve better translation quality and lower latency, there have been extensive research efforts concentrating on sT2T by introducing more robust models (Ma et al., 2019; Arivazhagan et al., 2019), better policies (Gu et al., 2017; Zheng et al., 2020a, 2019b,a), new decoding algorithms (Zheng et al., 2019c, 2020b), or multimodal information (Imankulova et al., 2019).


Figure 2: Average Chinese and English speech rate distribution for different speakers.

However, is it sufficient to only consider the effectiveness of sT2T and ignore the interactions between the other components?

Furthermore, in practice, when we need to translate multiple sentences continuously, it is not only important to consider the cooperation between sT2T and the other speech-related modules, but also essential to take the interaction between the current and later sentences into consideration, as shown in Fig. 1. Unfortunately, all the aforementioned techniques ignore the other speech modules and merely build their systems and analyses on the single-sentence scenario, which is not realistic. To achieve fluent and consistently low-latency SSST, we also need to consider the speech rate difference between the target and source speech.

Fig. 2 shows the speech rate distributions of both Chinese and English in our speech corpus. The speech rate varies considerably, especially across speakers. As shown in Fig. 1, when the source speech speed varies, the number of unnatural pauses and the latency vary dramatically. More specifically, when the speaker talks slowly, TTS often needs to make more pauses to wait for more tokens from sT2T, which usually does not output new translations with limited source information. These unnatural pauses lead to semantic and syntactic confusion (Lege, 2012; Bae, 2015). On the contrary, when the speaker talks fast, the target speech synthesized from previous sT2T models (e.g., wait-k) always introduces large latency, which accumulates through the entire paragraph and causes significant delays. Therefore, in realistic settings, the latency for the later sentences is far larger than the latency claimed by the original system. Fig. 3 supports this hypothesis: it shows how the latency and the number of unnatural pauses change as the source-side speech rate varies, using one wait-k translation model and one iTTS model.

To overcome the above problems, we propose Self-Adaptive Translation (SAT) for simultaneous speech-to-speech translation, which flexibly determines the length of the translation based on different source speech rates.

Figure 3: Relationship between source speech rate (words per second) and both the average latency (seconds) and the average number of unnatural pauses when naively using a wait-k model for simultaneous Chinese-to-English speech-to-speech translation on our dev set.

As shown in Figure 6, within this framework, when the speakers talk very fast, the model is encouraged to generate abbreviated but informative translations. As a result of the shorter translations, the previous translation's speech can finish earlier, alleviating its effect on the later sentences. Similarly, when the speakers have a slower speech rate, the decoder will generate more meaningful tokens until a natural speech pause. These speech pauses can be understood as natural boundaries between sentences or phrases which do not introduce ambiguity into the translation. In conclusion, we make the following contributions:

• We propose SAT to flexibly adjust translation length to generate fluent and low-latency target speech for the entire source speech (Sec. 3).

• We propose paragraph-based Boundary Aware Latency as the first latency metric suitable for simultaneous speech-to-speech translation (Sec. 4).

• We annotate a new simultaneous speech-to-speech translation dataset for Chinese-English translation, together with professional interpreters' interpretation (Sec. 5).

• Our system is the first simultaneous speech-to-speech translation system using iTTS (as opposed to full-sentence TTS) to further reduce the latency (Sec. 2).

• Experiments show that our proposed system can achieve higher speech fluency and lower latency with similar or even higher translation quality compared with baselines and even human interpreters (Sec. 5).

2 Preliminaries

In this section, we first introduce each component of the three-step pipeline: streaming ASR, simultaneous translation models, and incremental TTS techniques.

Figure 4: Pipeline of speech-to-speech simultaneous translation (source speech stream → streaming speech recognition, using cloud APIs → source text stream → simultaneous text-to-text translation → target text stream → incremental text-to-speech → target speech stream; e.g., 主席先生 … → Mr. Chairman …).

#   Incremental Transcription       Time (ms)
1   thank you                       960
2   thank you miss                  1120
3   thank you Mr chair              1600
4   Thank you , Mr chairman .       2040

Table 1: Example of English streaming ASR. Red words are revised in later steps. Punctuation only appears in the final step.


2.1 Streaming Automatic Speech Recognition

We use an anonymous real-time speech recognizer as the speech recognition module. As shown in Fig. 4, streaming ASR is the first step of the entire pipeline; it converts the growing source acoustic signal from the speaker into a sequence of tokens x = (x1, x2, ...) in a timely fashion, with about 1 second of latency. Table 1 shows one example of English streaming ASR, which generates the English outputs incrementally. Each row in the table represents the streaming ASR output at one step. Note that streaming ASR sometimes revises some tail outputs from the previous step (e.g., the 3rd and 4th steps in Table 1). To obtain more stable outputs, we exclude the last word of the ASR output (except at the final step) in our system.
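For concreteness, below is a minimal sketch of this last-word exclusion; the function name and the `is_final` flag are illustrative rather than part of the actual system.

```python
def stable_prefix(asr_tokens, is_final):
    """Drop the last (unstable) token of a streaming ASR hypothesis;
    the final hypothesis is kept in full."""
    if is_final or len(asr_tokens) <= 1:
        return asr_tokens
    return asr_tokens[:-1]

# The intermediate hypothesis "thank you miss" from Table 1 is truncated
# to "thank you"; the final hypothesis is passed through unchanged.
print(stable_prefix(["thank", "you", "miss"], is_final=False))
print(stable_prefix(["Thank", "you", ",", "Mr", "chairman", "."], is_final=True))
```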

2.2 Simultaneous Machine Translation

As the intermediate step between the source speech recognition and target speech synthesis modules, the goal of this step is to translate all the available source-language tokens from streaming ASR into another language. Many text-to-text simultaneous translation models (Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019; Ma et al., 2020b) have been proposed recently.

Different from the conventional full-sentence translation model, which encodes the entire source sentence $\mathbf{x} = (x_1, \ldots, x_m)$ into a sequence of hidden states and decodes sequentially conditioned on those hidden states and the previous predictions as

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}),$$

a simultaneous translation model starts decoding before the full source sentence is available.
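For reference, the prefix-to-prefix formulation of wait-k models (Ma et al., 2019), which the later sections build on, conditions each target word only on the first $g(t)$ source words; the block below restates that standard formulation and is not a new equation of this paper:

$$p_g(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p\big(y_t \mid \mathbf{x}_{\le g(t)},\, \mathbf{y}_{<t}\big), \qquad g_{\text{wait-}k}(t) = \min\{k + t - 1,\ |\mathbf{x}|\}.$$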

Speech Rate    MOS
0.5×           2.00 ± 0.08
0.6×           2.32 ± 0.08
0.75×          2.95 ± 0.07
Original       4.01 ± 0.08
1.33×          3.34 ± 0.08
1.66×          2.40 ± 0.09
2.0×           2.06 ± 0.04

Table 2: Mean Opinion Score (MOS) evaluations of naturalness for different speech speeds, changed with ffmpeg. The original English speeches are synthesized by our incremental text-to-speech system.

Figure 5: Tgt/src length ratio for the English-to-Chinese task in the training data (red) and in ideal testing cases (blue).

focus on the translated speech when we speed up the speech rate on the target side, and sometimes it will disrupt the audience's comprehension of the translation (Gordon-Salant et al., 2014). Similarly, slowing down the speech only creates overlong phoneme pronunciation, which is unnatural and leads to confusion.

Inspired by human interpreters (He et al., 2016; Al-Khanji et al., 2000), who often summarize the context in order to catch up with the speaker, or produce wordier translations to wait for the speaker, the optimal translation model should be able to adjust the length of the translated sentence, changing the speech duration on the target side to fundamentally avoid further delays or unnatural pauses.

    3.2 Self-Adaptive Training

Translation between different language pairs has various tgt/src length ratios; e.g., the English-to-Chinese ratio is roughly around 0.85 (with small variations between datasets). However, this length ratio merely reflects the average statistics over the entire dataset, and as shown by the red line in Fig. 5, the ratio distribution for individual sentences is quite wide around the average length ratio.

As shown in Fig. 6 and discussed in earlier sections, overly short and overly long translations are both undesirable in simultaneous speech-to-speech translation.

(a) slow source speech; (b) moderate source speech; (c) fast source speech — timelines of source ASR, wait-k + iTTS, and SAT + iTTS.

Figure 6: Illustration of the conventional wait-k (red) and SAT-k (yellow) training policy. In SAT, we force the length of the tail to be k, which equals the latency k. In the above example, we have k = 1.

Ideally, we prefer the system to have a similar amount of initial wait and tail delay during the translation of each sentence. Following this design, the translation tail of the previous sentence will fit perfectly into the initial delay window of the following sentence, and will not cause any extra latency or intermittent speech.

Based on the above observation, we propose to use different training policies for sentences with different tgt/src ratios. As shown in Fig. 6, we start from a fixed delay of k tokens and then force the model to have the same number of tokens in the initial wait and the final tail by amortizing the extra tokens over the middle steps. More specifically, when the tail is longer than the fixed initial wait, we move the extra words into earlier steps, so some steps before the tail decode more than one word at a time. As a result, there are some one-to-many policies between source and target, and the model learns to generate longer translations from shorter source text. On the contrary, when the tail is shorter, we perform extra reads on the source side, and the model learns to generate shorter translations through this many-to-one policy. Formally, we define our SAT training policy as follows:

$$g_{\text{wait-}k,\,c}(t) = \min\{\, k + t - 1 - \lfloor ct \rfloor,\ |\mathbf{x}| \,\} \qquad (3)$$

where c is the compensation rate, which is decided by the tgt/src length ratio after deducting the k tokens of the source-side initial wait and the target-side tail. For example, when the tgt/src length ratio is 1.25, then $c = \frac{|\text{tgt}|-k}{|\text{src}|-k} - 1 = 1.25 - 1 = 0.25$, which means decoding 5 target words for every 4 source words, so the model learns to generate wordier translations. When the target side is shorter than the source side, c becomes negative, and the model learns to decode fewer tokens than the source side.
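As a concrete illustration, here is a minimal sketch of Eq. 3 and of the per-sentence compensation rate; the function names and the toy sentence lengths are purely illustrative.

```python
import math

def compensation_rate(src_len, tgt_len, k):
    """Per-sentence compensation rate c: the tgt/src length ratio after
    deducting the k-token initial wait (source) and k-token tail (target),
    minus one."""
    return (tgt_len - k) / (src_len - k) - 1.0

def sat_policy(t, k, c, src_len):
    """g_{wait-k,c}(t) from Eq. 3: the number of source tokens available
    when the decoder predicts the t-th target token (t is 1-indexed)."""
    return min(k + t - 1 - math.floor(c * t), src_len)

# Toy example: a 12-token source sentence with a 13-token reference, k = 3.
src_len, tgt_len, k = 12, 13, 3
c = compensation_rate(src_len, tgt_len, k)   # ~0.11, i.e. a slightly wordy translation
print([sat_policy(t, k, c, src_len) for t in range(1, tgt_len + 1)])
```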


Figure 7: Different translation policies with different choices of c (here c = 1, c = 0, and c = −0.5). Green boxes represent the many-to-1 policy; yellow boxes denote the 1-to-1 policy; purple boxes show the 1-to-many translation policy.

Note that the tgt/src length ratio in our case is determined by the corresponding sentence itself, not by the corpus-level tgt/src length ratio. This is a crucial difference from the catchup algorithm of Ma et al. (2019), where some short translations are trained with an inappropriate positive c. It seems to be a minor difference, but it actually enables the model to learn something entirely different from catchup.

The blue line in Fig. 5 represents the tgt/src length ratio of the ideal simultaneous speech-to-speech translation examples in our training set, i.e., those with the same speech time on the source and target sides. When the source and target sides take the same speech time, there is no accumulated latency from previous sentences to the following ones. As we can see, our training data covers the entire tgt/src length ratio distribution of these ideal cases, indicating that by adjusting the compensation rate c over our training corpus, our model can learn to generate translations of appropriate length on the target side to avoid accumulated latency.

As shown in Fig. 5, there are many different choices of c for different sentences, and each sentence is trained with its own compensation rate, which makes its training policy different from those with other values of c. Hence, as shown in Fig. 7, our trained model has implicitly learned many different policies, and when we choose a compensation rate c during inference, the model will generate a translation whose length corresponds to that compensation rate c in training. More specifically, assume we have a source sentence, for example in Chinese, of length m; the conventional full-sentence or wait-k model would normally translate this into an English sentence of length 1.25 × m.

(a) Tail length vs. test-time compensation rate; (b) translation length |y| vs. test-time compensation rate.

Figure 8: Translation length analysis on the Chinese-to-English task using one SAT-3 and one wait-k model.

However, the output length of SAT can be changed by c, following the policy in Eq. 3, during decoding. When c is negative, SAT generates translations shorter than 1.25 × m. On the contrary, if we choose a positive c, SAT generates translations longer than 1.25 × m. The compensation rate c thus functions as the key of model selection for generating outputs of different lengths.

Fig. 8(a)–8(b) show the effectiveness of our proposed model, which has the ability to adjust the tail length of the entire translation with different c's.

    3.3 Self-Adaptive Inference

The above section discusses the importance of c, which is easy to obtain at training time; at inference time, however, we do not know the optimal choice of c in advance, since the fluency and latency criteria also rely on the finish time of each word on both sides. Therefore, streaming ASR and iTTS play important roles in determining the decoding policy that forms fluent and low-latency translation speech, and we use knowledge from streaming ASR and iTTS to select the appropriate policy on the fly.

When the speech is faster, streaming ASR will send multiple tokens to SAT at some steps, but SAT only generates one token at a time on the target side and passes it to iTTS instantly. This decoding policy has a similar function to a negative c, i.e., a many-to-one translation policy. In this case, SAT generates succinct translations, and iTTS can therefore finish the translation speech in a shorter time since there are fewer tokens.


Conversely, when the speaker talks at a slower pace, only one token is fed into SAT at a time; SAT translates it into a target token and delivers it to iTTS. This is a one-to-one translation policy. When iTTS is about to finish playing the newly generated speech and there is no new incoming token from streaming ASR, SAT forces the decoder to generate one extra token, which becomes a one-to-many translation policy (counting the token decoded in the previous step), and feeds it to iTTS. When the speaker makes a long pause and there are still no new tokens from streaming ASR, the SAT decoder continues to translate until a pause token (e.g., any punctuation) is generated. This pause token forms a necessary, natural pause on the speech side and does not change the understanding of the translated speech.
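A rough sketch of this inference-time policy is given below; `model.next_token`, the two boolean signals from iTTS and streaming ASR, and the set of pause tokens are all illustrative assumptions rather than the exact implementation.

```python
def sat_inference_step(model, src_prefix, tgt_prefix,
                       new_src_tokens, itts_nearly_idle, speaker_long_pause):
    """One step of the self-adaptive decoding policy of Sec. 3.3 (a sketch).
    `model.next_token(src, tgt)` stands for a hypothetical greedy single-token
    decode; emitted tokens would be passed to iTTS immediately."""
    PAUSE_TOKENS = {",", ".", "?", "!", ";"}   # punctuation read as natural pauses
    src_prefix.extend(new_src_tokens)
    emitted = []

    def emit_one():
        tok = model.next_token(src_prefix, tgt_prefix)
        tgt_prefix.append(tok)
        emitted.append(tok)
        return tok

    if new_src_tokens:
        # Fast speech: several source tokens may arrive at once, but only one
        # target token is emitted (a many-to-one policy, like a negative c).
        emit_one()
    elif itts_nearly_idle:
        # Slow speech: iTTS is about to finish its audio and no new source
        # token has arrived, so force one extra target token (one-to-many).
        emit_one()
    elif speaker_long_pause:
        # Long speaker pause: keep decoding until a pause token gives a
        # natural break in the synthesized speech.
        while emit_one() not in PAUSE_TOKENS:
            pass

    return emitted
```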

4 Paragraph-Based Boundary Aware Latency

As mentioned above, latency is another essential dimension of simultaneous speech-to-speech translation performance. However, measuring the speech delay from the source speech to each synthesized word in the target speech is challenging, and there is no existing metric directly suitable for simultaneous speech-to-speech translation.

Human interpreters use Ear-Voice Span (EVS) (Gumul, 2006; Lee, 2002) to calculate the translation delay of some landmark words from the source speech to the target speech. However, this requires the target-to-source word correspondence. In practice, the translation model sometimes makes errors, such as missing the translation of some words or over-translating words that the source does not include. Thus, a fully automatic word-to-word alignment between target and source is hard to make accurate, while human annotation is accurate but expensive and impractical.

Inspired by Ari et al. (2020), who proposed Translation Lag (TL) for simultaneous speech-to-text translation, which ignores the semantic correspondence between target and source words and simply matches each target word proportionally to a source word regardless of its actual meaning, we use a similar method to calculate the latency of each sentence.

Nevertheless, TL is only designed for single-sentence latency, while we need to measure the latency of a whole paragraph of speech.

Thus, we propose paragraph-based Boundary Aware Latency (pBAL) to compute the latency of long-speech simultaneous translation. In pBAL, we first align each sentence so that every word's correspondence stays within the sentence boundary. We then compute the time difference between the finish time of each target word's audio and the finish time of its proportionally corresponding source word. In our experiments, we determine the finish time of each source and target word with a forced aligner (Yuan and Liberman, 2008) and align the translation to the source speech using the corresponding streaming ASR output as a bridge.
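To make the computation concrete, here is a simplified sketch of pBAL under two explicit assumptions not spelled out above: target word j in a sentence is matched proportionally to source position ⌈j·|src|/|tgt|⌉ (the Translation Lag style of correspondence), and the per-word delays are aggregated by a simple average.

```python
import math

def pbal(src_finish_times, tgt_finish_times):
    """Paragraph-based Boundary Aware Latency, simplified.
    Both arguments are lists of sentences; each sentence is a list of word
    finish times (seconds from the start of the paragraph), e.g. produced by
    a forced aligner. Sentences are assumed to be already paired."""
    delays = []
    for src, tgt in zip(src_finish_times, tgt_finish_times):
        for j, tgt_finish in enumerate(tgt, start=1):
            # Proportional correspondence, kept within the sentence boundary.
            i = min(len(src), math.ceil(j * len(src) / len(tgt)))
            delays.append(tgt_finish - src[i - 1])
    return sum(delays) / len(delays)

# Toy paragraph: one 4-word source sentence and its 5-word translation.
src = [[1.0, 1.6, 2.3, 3.0]]
tgt = [[3.5, 4.0, 4.6, 5.2, 5.9]]
print(round(pbal(src, tgt), 2))   # average delay of about 2.46 seconds
```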

5 Experiments

5.1 Datasets and System Settings

We evaluate on two simultaneous speech-to-speech translation directions: Chinese↔English. For training, we use the text-to-text parallel corpora available from WMT18 1 (24.7M sentence pairs). We also annotate a portion of the Chinese and English speeches from the LDC United Nations Proceedings Speech corpus 2 (LDC-UN) as a speech-to-text corpus. This corpus includes speeches recorded in 2009–2012 at United Nations conferences in the six official UN languages. We transcribe the speeches and then translate the transcriptions as references. The speech recordings include not only the source speech but also the corresponding professional simultaneous interpreters' interpretation at the conference. Thus, we also transcribe the human simultaneous interpretation for the En→Zh direction, which is not used in our models but is compared against in the following experiments.

                             En→Zh     Zh→En
Train   # of speeches        58        119
        # of words           63650     61676
        Total time           6.81 h    9.68 h
Dev     # of speeches        3         6
        # of words           1153      2415
        Total time           0.27 h    0.35 h
Test    # of speeches        3         6
        # of words           3053      1870
        Total time           0.39 h    0.30 h

Table 3: Statistics of the LDC-UN dataset (source side).

Table 3 shows the statistics of our speech-to-text dataset. We train our models using both the WMT18 training set and the LDC-UN speech-to-text training set. We validate and test the models only on the LDC-UN dataset.

1 http://www.statmt.org/wmt18/translation-task.html
2 https://catalog.ldc.upenn.edu/LDC2014S08


For Chinese-side text, we use the jieba 3 Chinese segmentation tool. We apply BPE (Sennrich et al., 2015) to all texts in order to reduce the vocabulary sizes, which we set to 16K for both Chinese and English. Our Transformer is essentially the same as the base Transformer model (Vaswani et al., 2017).

As mentioned in Section 2.1, we use an anonymous real-time speech recognizer from a well-known cloud platform as the speech recognition module for both English and Chinese. During speech-to-speech simultaneous translation decoding, after receiving an ASR input, we first normalize the punctuation and tokenize the input (or perform Chinese segmentation for Zh→En translation). The last token is always removed from the input to the translation model's encoder because it is very unstable. For latency measurement, we use the Penn Phonetics Lab Forced Aligner (P2FA) (Yuan and Liberman, 2008) to automatically annotate the timestamps of both Chinese and English words on the source and target sides.

For the incremental text-to-speech system, we follow Ma et al. (2020a) and take the Tacotron 2 model (Shen et al., 2018) as our phoneme-to-spectrogram model, training it with an additional guided attention loss (Tachibana et al., 2018), which speeds up convergence. Our vocoder is the same as that in the Parallel WaveGAN paper (Yamamoto et al., 2020): 30 layers of dilated residual convolution blocks with three exponentially increasing dilation cycles, 64 residual and skip channels, and a convolution filter size of 3. For English, we use a proprietary speech dataset containing 13,708 audio clips (i.e., sentences) from a female speaker and the corresponding transcripts. For Chinese, we use a public speech dataset 4 containing 10,000 audio clips from a female speaker and the transcripts.

5.2 Speech-to-Speech Simul. Translation

Fig. 9 shows the final results of our proposed models and the baselines. For translation quality, we use the "multi-bleu.pl" 5 script to calculate BLEU scores. Since punctuation is soundless, we remove all punctuation from both hypotheses and references before BLEU evaluation. Following Xiong et al. (2019), we concatenate the translations of each talk into one sentence to measure BLEU scores.
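A small sketch of this preprocessing (punctuation removal plus talk-level concatenation) is shown below; it only prepares the text, and the actual scoring in the paper is done with the multi-bleu.perl script.

```python
import re

# Strip everything that is neither a word character nor whitespace,
# i.e. the soundless punctuation removed before BLEU evaluation.
PUNCT = re.compile(r"[^\w\s]", flags=re.UNICODE)

def talk_level_line(sentence_translations):
    """Concatenate the per-sentence translations of one talk into a single
    line with punctuation removed (hypotheses and references are treated
    the same way)."""
    joined = " ".join(sentence_translations)
    return re.sub(r"\s+", " ", PUNCT.sub(" ", joined)).strip()

print(talk_level_line(["Thank you , Mr chairman .", "China , as an observer"]))
```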

3 https://github.com/fxsjy/jieba
4 https://www.data-baker.com/open_source.html
5 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

(a) Chinese-to-English simultaneous translation; (b) English-to-Chinese simultaneous translation. Systems: SAT-k, wait-k, wait-k + SAT decoding, Segment, Human; axes: latency (seconds) vs. BLEU.

Figure 9: Translation quality and latency (pBAL) of the proposed simultaneous speech-to-speech translation systems compared with baselines. For all SAT-k and wait-k models, k = {3, 5, 7} from bottom to top.

For Chinese-to-English simultaneous translation, we compare our models with naive wait-k, wait-k with SAT decoding (using only the self-adaptive inference of Sec. 3.3), segment-based models (Oda et al., 2014; Xiong et al., 2019), and a full-sentence translation model. All these models share one iTTS system. For the segment-based model, since our streaming ASR API does not provide any punctuation before the final step, we use the final punctuation to segment the partial streaming input and then use a full-sentence translation model to translate each partial segment as a full sentence. The results show that our proposed SAT-k models achieve much lower latency without sacrificing quality compared with these baselines.

Fig. 9(b) shows the results of En→Zh simultaneous translation. Besides the baselines used in the Zh→En experiments, we also compare our system with professional human interpreters' translations. Our proposed models again outperform all the baselines and the human interpreters. Our models reduce more latency in Zh→En than in En→Zh compared with wait-k, because English sentences are always longer than Chinese ones, so latency accumulates more easily in Zh→En (also shown in Fig. 10).


Figure 11: Decoding results of the proposed simultaneous speech-to-speech Chinese-to-English translation system and baselines (rows: input with gloss, streaming ASR, SAT-3, wait-3, and wait-3 + SAT decoding; time axis 1.0 s to 16.0 s).

Figure 12: Decoding results of the proposed simultaneous speech-to-speech English-to-Chinese translation system and baselines (rows: input, streaming ASR, SAT-3, wait-3, wait-3 + SAT decoding, and human, each with gloss; time axis 1.0 s to 16.0 s).

Figure 10: Latency for sentences at different indices in the Chinese-to-English dev set.

    5.3 Human Evaluation on Speech Quality

Method                   En→Zh          Zh→En
wait-3                   3.56 ± 0.09    3.68 ± 0.08
wait-3 + SAT decoding    3.81 ± 0.08    3.96 ± 0.04
SAT-3                    3.83 ± 0.07    3.97 ± 0.07
Segment-based            3.79 ± 0.15    3.99 ± 0.07
Full sentence            3.98 ± 0.08    4.03 ± 0.03
Human                    3.85 ± 0.05    -

Table 4: MOS evaluations of fluency for target speeches generated by different methods.

In Table 4, we evaluate our synthesized speeches by Mean Opinion Score (MOS) with native speakers, a standard metric in TTS. Each speech received 10 human ratings on a scale from 1 to 5, with 5 being the best. For both Zh↔En directions, the wait-3 models have the lowest MOS due to their many unnatural pauses (see Sec. 3.1). Our proposed SAT-3 model and wait-3 with SAT decoding achieve fluency similar to the full-sentence models and even the human interpreters.

5.4 Examples

Fig. 11 shows a Zh→En decoding example. Here the wait-3 models' outputs have a much longer latency than SAT-3 because their beginnings are delayed by the translation of the previous sentence(s) and their tails are also very long. The En→Zh example in Fig. 12 is similar. While the streaming ASR has a very long delay, the SAT-3 model still keeps the latency at roughly 4.5 s, and all pauses on the target side are natural ones from punctuation. By contrast, the human interpreter's translation has the longest latency.

6 Conclusions

We proposed Self-Adaptive Translation for simultaneous speech-to-speech translation, which flexibly adjusts translation length to avoid latency accumulation and unnatural pauses. In both Zh↔En directions, our method generates fluent and low-latency target speech with high translation quality.

References

Raja Al-Khanji, Said Shiyab, and Riyad Hussein. 2000. On the use of compensatory strategies in simultaneous interpretation. Meta: Journal des traducteurs, 45:548.

Naveen Ari, Colin Andrew Cherry, Te I, Wolfgang Macherey, Pallavi Baljekar, and George Foster. 2020. Re-translation strategies for long form, simultaneous, spoken language translation. In ICASSP 2020.

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics.

Rebecca M. Bae. 2015. The effects of pausing on comprehensibility. PhD dissertation, Iowa State University.

Sandra Gordon-Salant, Danielle J. Zion, and Carol Espy-Wilson. 2014. Recognition of time-compressed speech does not predict recognition of natural fast-rate speech by older listeners. The Journal of the Acoustical Society of America, 136(4):EL268–EL274.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 1053–1062.

Ewa Gumul. 2006. Conjunctive cohesion and the length of ear-voice span in simultaneous interpreting. A case of interpreting students. Linguistica Silesiana, (27):93–103.

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 151–157. AAAI Press.

Aizhan Imankulova, Masahiro Kaneko, Tosho Hirasawa, and Mamoru Komachi. 2019. Towards multimodal simultaneous neural machine translation. arXiv preprint arXiv:2004.03180.

Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, and Yifan Gong. 2020. Minimum latency training strategies for streaming sequence-to-sequence ASR. ICASSP.

Tae-Hyung Lee. 2002. Ear voice span in English into Korean simultaneous interpretation. Meta: Journal des traducteurs / Meta: Translators' Journal, 47(4):596–606.

Ryan Lege. 2012. The Effect of Pause Duration on Intelligibility of Non-Native Spontaneous Oral Discourse. PhD dissertation, Brigham Young University.

Bo Li, Shuo-Yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, and Yonghui Wu. 2020. Towards fast and accurate streaming end-to-end ASR.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, and Liang Huang. 2020a. Incremental text-to-speech synthesis with prefix-to-prefix framework. Findings of EMNLP.

Xutai Ma, Juan Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020b. Monotonic multihead attention. 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Conference Track Proceedings.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In ACL.

Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, and Demis Hassabis. 2017. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML.

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John L. Miller. 2017. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. In ICLR.

Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-Yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David García, Alex Gruenstein, Ke Hu, Minho Jin, A. Kannan, Qiao Liang, I. Mcgraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Yinyong Zhang, and Ding Zhao. 2020. A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. ICASSP.


Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Jonathan Shen, Ruoming Pang, Ron Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, Rif Saurous, Yannis Agiomvrgiannakis, and Yonghui Wu. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Interspeech.

Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4784–4788. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30.

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. 2017. Tacotron: Towards end-to-end speech synthesis. In Interspeech.

Hao Xiong, Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Dutongchuan: Context-aware translation model for simultaneous interpreting.

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6199–6203. IEEE.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878.

Baigong Zheng, Kaibo Liu, Renjie Zheng, Mingbo Ma, Hairong Liu, and Liang Huang. 2020a. Simultaneous translation policies: from fixed to adaptive. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1349–1354.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019b. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019c. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402.

Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, and Liang Huang. 2020b. Opportunistic decoding with timely correction for simultaneous translation. In ACL.
