can automatic speech recognition enhance human communication? · 2019. 12. 18. · aalto asr group,...

Mikko Kurimo

Department of Signal Processing and Acoustics

Aalto University

Can automatic speech recognition enhance human communication?

Mikko Kurimo

1989-1997 MSc and PhD at Kohonen’s CIS lab: speech recognition with neural networks

1998-2012 Visiting research fellow in top speech labs:

– Research: IDIAP (CH), SRI (USA), ICSI (USA)

– University: Edinburgh, Cambridge, Colorado, Nagoya

2012 Associate professor in speech and language processing

– Teaching speech recognition and natural language processing

– Head of Aalto speech recognition group

Research topics: speech recognition, adaptation, translation, synthesis, diarization, language modeling, audio and video description

ASR (Automatic Speech Recognition) is already in our phones and homes

Including televisions, phones, new assistance devices, toys

ASR is applied to several everyday tasks

Including dictation, captioning, translation, interpretation, information retrieval, conversational assistants, language learning

Speech signal and speech-to-text

● Rich communication signal between humans

● The most complex of all biosignals

● Speech => text + emotion, loudness, speed, emphasis, ...

● How much speech “understanding” is needed depends on the task

● People perceive the level of spoken communication as a sign of “intelligence”

Trying to improve several aspects of human communication

For:

People who do not hear

People who do not speak the same language

People who do not see (all)

Big data that cannot be viewed

Extended assistants

Learning new languages

Online subtitles: for audiences who do not hear

Challenge: speed, slang, readability

https://www.youtube.com/watch?v=0neezwViIPE

Conversation assistant: for participants who do not hear

Challenge: speed, showing text and facial expressions

https://www.youtube.com/watch?v=f6rA_fcT5cY

Speech translation: ASR + text translation (+synthesis)

Challenge: errors accumulate in the pipeline / end-to-end training

End-to-end

1. Source language audio

2. Neural speech translation model

3. Target language text

4. (Target audio)

Audio-text alignment is particularly hard between languages

Pipeline

1. Source language audio

2. Neural acoustic and language models

3. Source language transcript

4. Neural translation models

5. Target language text

6. (Target audio)
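The two architectures listed above can be contrasted with a minimal sketch. All function names, the toy recognizer and the three-word lexicon are hypothetical stand-ins invented for illustration, not the models used in the talk. The toy recognizer deliberately mishears one word, so that the downstream translation inherits the error, which is the accumulation problem named in the challenge.

```python
from typing import Callable, Optional, Sequence, Tuple

Audio = Sequence[float]  # stand-in for a waveform; real systems use feature matrices


def translate_pipeline(
    audio: Audio,
    recognize: Callable[[Audio], str],
    translate: Callable[[str], str],
    synthesize: Optional[Callable[[str], Audio]] = None,
) -> Tuple[str, str, Optional[Audio]]:
    """Pipeline: audio -> source transcript -> target text (-> target audio)."""
    transcript = recognize(audio)        # acoustic and language models
    target_text = translate(transcript)  # translation model sees only the transcript
    target_audio = synthesize(target_text) if synthesize else None
    return transcript, target_text, target_audio


def translate_end_to_end(
    audio: Audio,
    speech_translate: Callable[[Audio], str],
    synthesize: Optional[Callable[[str], Audio]] = None,
) -> Tuple[str, Optional[Audio]]:
    """End-to-end: a single model maps source audio directly to target text."""
    target_text = speech_translate(audio)
    target_audio = synthesize(target_text) if synthesize else None
    return target_text, target_audio


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_recognize(audio: Audio) -> str:
    # Pretend the speaker said "close the window" and one word was misheard.
    return "close the widow"


# Tiny English -> Finnish word lexicon; the article "the" maps to nothing.
TOY_LEXICON = {"close": "sulje", "the": "", "window": "ikkuna", "widow": "leski"}


def toy_translate(text: str) -> str:
    words = [TOY_LEXICON.get(word, word) for word in text.split()]
    return " ".join(word for word in words if word)


transcript, target_text, _ = translate_pipeline([0.0], toy_recognize, toy_translate)
print(transcript)   # the recognition error is visible in the transcript
print(target_text)  # and is translated faithfully into the wrong word
```

The pipeline exposes an intermediate transcript that can be inspected or corrected, while the end-to-end variant has no such intermediate step and needs paired source audio and target text for training.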

EMIME: https://www.youtube.com/watch?v=wqv7uYAyAQ0

Games and language learning: “Say it again Kid”

Challenge: speed, children, training data

https://youtu.be/9wGUyMf87ag

https://www.youtube.com/watch?v=eI4De9Q_GYA

Learning languages

Production

Feedback

Picture: CC, Wikimedia Commons

Exposure

3. Listen to your own attempt and once more to the model.

4.

SIAK

Transparent feedback based on CTC decoding

Karhila, Smolander, Ylinen, Kurimo. Transparent pronunciation scoring using

articulatorily weighted phoneme edit distance. Interspeech 2019.
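As a rough illustration of the two ingredients named above, the sketch below first applies greedy CTC decoding (collapse repeated frame labels, then drop blanks) and then scores the decoded phonemes against a target with a weighted edit distance. The blank symbol, the tiny feature table and the cost function are assumptions made up for this example; they are not the weights or the scoring procedure from Karhila et al.

```python
from typing import Callable, List, Sequence

BLANK = "_"  # hypothetical blank symbol for the CTC output


def ctc_greedy_decode(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse consecutive repeated labels, then remove blank labels."""
    decoded: List[str] = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy articulatory features (place, manner, voicing); illustrative only.
TOY_FEATURES = {
    "p": ("bilabial", "stop", "voiceless"),
    "b": ("bilabial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "s": ("alveolar", "fricative", "voiceless"),
}


def toy_substitution_cost(a: str, b: str) -> float:
    """Fraction of differing features; unknown phonemes get the full cost 1.0."""
    if a not in TOY_FEATURES or b not in TOY_FEATURES:
        return 1.0
    differing = sum(x != y for x, y in zip(TOY_FEATURES[a], TOY_FEATURES[b]))
    return differing / len(TOY_FEATURES[a])


def weighted_edit_distance(
    reference: Sequence[str],
    hypothesis: Sequence[str],
    substitution_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Levenshtein distance where substitutions have phoneme-dependent cost."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            ref, hyp = reference[i - 1], hypothesis[j - 1]
            sub = 0.0 if ref == hyp else substitution_cost(ref, hyp)
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # phoneme missing from the attempt
                dist[i][j - 1] + indel_cost,  # extra phoneme in the attempt
                dist[i - 1][j - 1] + sub,     # match or weighted substitution
            )
    return dist[-1][-1]


frames = ["_", "b", "b", "_", "a", "a", "_", "t", "t"]
spoken = ctc_greedy_decode(frames)
score = weighted_edit_distance(["p", "a", "t"], spoken, toy_substitution_cost)
print(spoken, round(score, 3))
```

With such weights a near miss (here a voicing error) costs less than an unrelated phoneme, and each cost can be traced back to a specific substitution, which is one way feedback of this kind can stay transparent to the learner.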

Assessing oral skills: language testing and learning

Challenge: noise, irregular speech, human experts

Reduce the workload of reviewers with:

- trimming

- transcriptions

- statistics

- prosodic and phonetic grading

https://digitala.aalto.fi/is16

https://www.youtube.com/watch?v=p3_UTsjBYJY

Video description: for accessing big data

Challenge: combining picture, movement, audio, speech

Video + audio task:

• Describe what is going on

• Recognize people, places, events etc

Audio tasks:

• Diarization and speaker ID: who spoke when?

• Spoken language ID

• Named entity recognition (per, loc, time, org etc.)

• Audio event tagging (other than speech)

• Subtitling, speech translation, subtitle translation

https://www.youtube.com/watch?v=cvR3iA5and8

Conversational robots, toys, assistants

Challenge: speed, environment, dialog

• Vad tänker du? (Swedish: What do you think?)

• Jätte kiva show. (Swedish: Really nice show.)

• It was amazing.

• Pah, tylsää. (Finnish: Bah, boring.)

• Who is in the white house now?

• Close the window and return.

Future (unsolved) challenges

Including informal conversations, multimodality,

multilinguality, personalization, context sensitivity

Solutions for those challenges are

studied in my research group

• Contact: [email protected]

• Publications: http://research.aalto.fi

• Home page: (search: ”Aalto asr home”)

• Software: (search: ”Aalto asr github”)

• Demos: (search: ”Aalto asr video”)

Aalto ASR group, some key publications

https://research.aalto.fi/

• Modeling under-resourced languages for speech recognition, Kurimo, Enarvi, Tilk, Varjokallio, Mansikkaniemi & Alumäe, Language Resources and Evaluation 2017

• Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies, Enarvi, Smit, Virpioja & Kurimo, Trans. Audio, Speech, and Language Processing 2017

• The MeMAD submission to the WMT18 multimodal translation task, Grönroos, 9 others & Kurimo, Workshop on Machine Translation 2018

• Character-based units for Unlimited Vocabulary Continuous Speech Recognition, Smit, Gangireddy, Enarvi, Virpioja & Kurimo, Automatic Speech Recognition and Understanding 2018

• First-pass decoding with n-gram approximation of RNNLM: The problem of rare words, Singh, Smit, Virpioja & Kurimo, Machine Learning in Speech and Language Processing 2018

• First-Pass Techniques for Very Large Vocabulary Speech Recognition of Morphologically Rich Languages, Varjokallio, Virpioja & Kurimo, Spoken Language Technology 2018

• The MeMAD Submission to the IWSLT 2018 Speech Translation Task, Sulubacak, Tiedemann, Rouhe, Grönroos & Kurimo, Spoken Language Translation 2018

• The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging, Xu, Smit & Kurimo, Detection and Classification of Acoustic Scenes and Events 2018

• RL-KLM: Automating Keystroke-level Modeling with Reinforcement Learning, Leino, Oulasvirta & Kurimo, Intelligent User Interfaces 2019

https://research.aalto.fi/en/publications/modeling-underresourced-languages-for-speech-recognition(8e20ea9c-bc65-4d6d-8b92-a014aa5af908).html

https://research.aalto.fi/en/publications/automatic-speech-recognition-with-very-large-conversational-finnish-and-estonian-vocabularies(74066940-5e5d-4208-af53-e61615e0603c).html

https://research.aalto.fi/en/publications/the-memad-submission-to-the-wmt18-multimodal-translation-task(5e94eedd-4f1e-44a8-a49a-e222753fed29).html

https://research.aalto.fi/en/publications/characterbased-units-for-unlimited-vocabulary-continuous-speech-recognition(bf94112f-8c70-453d-ad69-021ccdd56e25).html

https://research.aalto.fi/en/publications/firstpass-decoding-with-ngram-approximation-of-rnnlm-the-problem-of-rare-words(36f2cb41-4cf4-452c-a083-e390999e5c05).html

https://research.aalto.fi/en/publications/firstpass-techniques-for-very-large-vocabulary-speech-recognition-of-morphologically-rich-languages(188a4dfc-c769-4bf4-a564-575c462167f7).html

https://research.aalto.fi/en/publications/the-memad-submission-to-the-iwslt-2018-speech-translation-task(69119a17-f74e-4267-957a-5dc5538ab80d).html

https://research.aalto.fi/en/publications/the-aalto-system-based-on-finetuned-audioset-features-for-dcase2018-task2--general-purpose-audio-tagging(5ce018ed-dfaa-4d8d-9966-9fc15f635869).html

https://research.aalto.fi/en/publications/rlklm-automating-keystrokelevel-modeling-with-reinforcement-learning(b0f0ba0f-78b9-4953-91a9-33958047f827).html

Aalto ASR group, online resources

Code: https://github.com/aalto-speech

• AaltoASR tools

• Kaldi ASR training scripts

• VariKN language models

• TheanoLM language models

• Morfessor segmentation

• SIAK pronunciation assessment

• Speaker diarization

• Audio tagger

• etc

Data: https://www.kielipankki.fi/

Demos: https://www.youtube.com/channel/UCY4NOvOgKz9-x7rR_kkb51Q/

PhD theses

2017 Andre Mansikkaniemi: Continuous unsupervised topic adaptation for morph-based speech recognition

2018 Seppo Enarvi: Modeling conversational Finnish for automatic speech recognition

2019 Reima Karhila: Building Personalised Speech Technology Systems with Sparse, Bad Quality or Out-of-domain Data

2019 Peter Smit: Modern subword-based models for automatic speech recognition

MSc theses

2018 Zhicun Xu: Audio event classification using deep learning methods

2018 Anja Virkkunen: Automatic speech recognition for the hearing impaired in an augmented reality application

2019 Tuomas Kaseva: Online speaker diarization

2019 Ekaterina Voskoboinik: Constructing word representations using subword embeddings

2019 Sara Klimko: Comparison of Speech-to-Text services

2019 Sujith Padaru: Synthetically generated data for pronunciation evaluator

2019 Vasumathi Neralla: Adaptation of character-based neural network language models

Performance depends on:

1. Speaking environment, microphone, speaker (acoustic modeling)

   1. Office, headset, close-talking

   2. Telephone speech, mobile

   3. Noise, outside, microphone far away

   4. Voice, accents

2. Style of speaking and language (language modeling)

   1. Isolated words, small vocabulary

   2. Continuous speech, read or planned, large vocabulary

   3. Spontaneous speech, open vocabulary, hesitations


New data: Parliament sessions 2008-2016

2269-hour annotated speech corpus, available for free

https://www.youtube.com/watch?v=UK_2dF9zXl4


Results in Finnish (word error rate%)

[1] Enarvi & Kurimo. Interspeech 2016.

[2] Enarvi, Smit, Virpioja & Kurimo.

IEEE/ACM TASLP 25:11, 2017.

[3] Kurimo et al. Language resources

and evaluation 51:4, 2017.

[4] Mansikkaniemi, Smit & Kurimo.

Interspeech, 2017.

[5] Smit, Gangireddy, Enarvi, Virpioja &

Kurimo. IEEE ASRU, 2017.

[6] Alumäe. Interspeech 2014.

[7] Smit, Virpioja & Kurimo. Interspeech,

2017.

[8] Smit, Virpioja & Kurimo. 2018 (in

review).

https://research.aalto.fi/en/publications/

2018:

11.4
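For readers unfamiliar with the metric, the sketch below shows the standard definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words and given as a percentage. The function name and the example sentences are illustrative assumptions. The cited papers may apply their own text normalization before scoring, which is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref_words)


# One substituted word out of five reference words.
print(word_error_rate("close the window and return", "close the widow and return"))
```

Because insertions count as errors, the value can exceed 100 when the output is much longer than the reference. For languages with long compound and inflected word forms, one wrong morpheme makes the whole word count as an error, which is worth keeping in mind when comparing figures across languages.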
