can automatic speech recognition enhance human communication? · 2019. 12. 18. · aalto asr group,...
Mikko Kurimo
Department of Signal
Processing and Acoustics
Aalto University
Can automatic speech
recognition enhance human
communication?
Mikko Kurimo
1989-1997 MSc and PhD at Kohonen’s CIS lab: speech recognition
with neural networks
1998-2012 Visiting research fellow in top speech labs:
– Research: IDIAP (CH), SRI (USA), ICSI (USA)
– University: Edinburgh, Cambridge, Colorado, Nagoya
2012 Associate professor in speech and language processing
- Teaching speech recognition and natural language processing
- Head of Aalto speech recognition group
Research topics: speech recognition, adaptation, translation, synthesis,
diarization, language modeling, audio and video description
ASR (Automatic Speech Recognition) is
already in our phones and homes
Including televisions, phones, new assistance devices, toys
ASR is applied to several everyday tasks
Including dictation, captioning, translation, interpretation,
information retrieval, conversational assistants, language learning
Speech signal and speech-to-text
● Rich communication signal between humans
● The most complex of all biosignals
● Speech => text + emotion, loudness, speed, emphasis, ...
● How much speech “understanding” is needed depends on the task
● People perceive the level of spoken communication as a sign of
“intelligence”
Trying to improve several aspects of human
communication
For:
People who do not hear
People who do not speak the
same language
People who do not see (all)
Big data that cannot be viewed
Extended assistants
Learning new languages
Online subtitles: for audiences who do not hear
Challenge: speed, slang, readability
https://www.youtube.com/watch?v=0neezwViIPE
Conversation assistant: for participants who do not hear
Challenge: speed, showing text and facial expressions
https://www.youtube.com/watch?v=f6rA_fcT5cY
Speech translation: ASR + text translation (+synthesis)
Challenge: Errors cumulate in pipeline / end-to-end training
End-to-end
1. Source language audio
2. Neural speech translation model
3. Target language text
4. (Target audio)
Audio-text alignment is particularly hard between languages
Pipeline
1. Source language audio
2. Neural acoustic and language models
3. Source language transcript
4. Neural translation models
5. Target language text
6. (Target audio)
EMIME: https://www.youtube.com/watch?v=wqv7uYAyAQ0
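The pipeline variant can be sketched as a chain of stages, which makes it easy to see why errors cumulate: the translator only ever sees the recognizer's output. This is a minimal illustration, not the system shown in the talk. The function names, the toy recognizer and the two-word lexicon are all assumptions made up for the example; real stages would wrap trained neural models.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stage signatures (assumptions for illustration only).
Recognizer = Callable[[bytes], str]    # source audio -> source transcript
Translator = Callable[[str], str]      # source transcript -> target text
Synthesizer = Callable[[str], bytes]   # target text -> target audio


def pipeline_translate(
    audio: bytes,
    recognize: Recognizer,
    translate: Translator,
    synthesize: Optional[Synthesizer] = None,
) -> Tuple[str, str, Optional[bytes]]:
    """Run the cascade: each stage consumes the previous stage's output,
    so a recognition error reaches the translator unchanged."""
    transcript = recognize(audio)
    target_text = translate(transcript)
    target_audio = synthesize(target_text) if synthesize else None
    return transcript, target_text, target_audio


# Toy stand-ins so the sketch runs without any models.
def toy_recognize(audio: bytes) -> str:
    # Pretends the "audio" is just UTF-8 text.
    return audio.decode("utf-8").lower()


TOY_LEXICON = {"hyvää": "good", "huomenta": "morning"}  # made-up mini lexicon


def toy_translate(text: str) -> str:
    # Word-by-word lookup; unknown words are marked instead of translated.
    return " ".join(TOY_LEXICON.get(w, f"<unk:{w}>") for w in text.split())
```

Swapping `toy_recognize` for a recognizer that gets one word wrong shows the effect: the wrong word comes out of the translator as an unknown token, even though the translation stage itself made no mistake.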
Games and language learning: “Say it again Kid”
Challenge: speed, children, training data
https://youtu.be/9wGUyMf87ag
https://www.youtube.com/watch?v=eI4De9Q_GYA
Learning languages
Production
Feedback
Picture CC Wikimedia Commons
Exposure
3. Listen to your own attempt and once more to the model.
4.
SIAK
Transparent feedback based on CTC decoding
Karhila, Smolander, Ylinen, Kurimo. Transparent pronunciation scoring using
articulatorily weighted phoneme edit distance. Interspeech 2019.
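The idea of an articulatorily weighted phoneme edit distance can be illustrated with a small sketch. This is not the scoring method of the cited paper: the feature table, the cost function and the insertion/deletion cost below are assumptions chosen only to show the principle that substituting a phoneme with an articulatorily similar one is penalised less than substituting it with a dissimilar one.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical articulatory features: (place, manner, voicing).
# Illustration only; a real system would use a full phoneme inventory.
FEATURES: Dict[str, Tuple[str, str, str]] = {
    "p": ("bilabial", "stop", "voiceless"),
    "b": ("bilabial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "d": ("alveolar", "stop", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "m": ("bilabial", "nasal", "voiced"),
}


def substitution_cost(a: str, b: str) -> float:
    """Fraction of articulatory features that differ (assumed weighting)."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # no feature information: treat as a full substitution
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def weighted_edit_distance(
    ref: Sequence[str], hyp: Sequence[str], indel_cost: float = 1.0
) -> float:
    """Levenshtein distance over phonemes with feature-based substitution costs."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,      # phoneme missing from hyp
                dist[i][j - 1] + indel_cost,      # extra phoneme in hyp
                dist[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1]),
            )
    return dist[-1][-1]
```

With this table, saying "b" for "p" (only voicing differs) costs less than saying "s" for "p" (place and manner differ), so the score stays interpretable: each point of distance can be traced back to a specific phoneme and feature.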
Assessing oral skills: language testing and learning
Challenge: noise, irregular speech, human experts
Reduce the workload of reviewers with:
- trimming
- transcriptions
- statistics
- prosodic and phonetic grading
https://digitala.aalto.fi/is16
https://www.youtube.com/watch?v=p3_UTsjBYJY
Video description: for accessing big data
Challenge: combining picture, movement, audio, speech
Video + audio task:
• Describe what is going on
• Recognize people, places, events etc
Audio tasks:
• Diarization and speaker ID: who spoke when?
• Spoken language ID
• Named entity recognition (per, loc, time, org etc.)
• Audio event tagging (other than speech)
• Subtitling, speech translation, subtitle translation
https://www.youtube.com/watch?v=cvR3iA5and8
Conversational robots, toys, assistants
Challenge: speed, environment, dialog
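• Vad tänker du? (Swedish: "What are you thinking?")
• Jätte kiva show. (mixed Swedish/Finnish: "Really nice show.")
• It was amazing.
• Pah, tylsää. (Finnish: "Bah, boring.")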
• Who is in the white house now?
• Close the window and return.
Future (unsolved) challenges
Including informal conversations, multimodality,
multilinguality, personalization, context sensitivity
Solutions for those challenges are
studied in my research group
• Contact: [email protected]
• Publications: http://research.aalto.fi
• Home page: (search: ”Aalto asr home”)
• Software: (search: ”Aalto asr github”)
• Demos: (search: ”Aalto asr video”)
Aalto ASR group, some key publications
https://research.aalto.fi/
• Modeling under-resourced languages for speech recognition,
Kurimo, Enarvi, Tilk, Varjokallio, Mansikkaniemi & Alumäe, Language resources and evaluation 2017
• Automatic Speech Recognition with Very Large Conversational Finnish and Estonian
Vocabularies, Enarvi, Smit, Virpioja & Kurimo, Trans. Audio, Speech, and Language Processing 2017
• The MeMAD submission to the WMT18 multimodal translation task, Grönroos, 9 others &
Kurimo, Workshop on Machine Translation 2018
• Character-based units for Unlimited Vocabulary Continuous Speech Recognition,
Smit, Gangireddy, Enarvi, Virpioja & Kurimo, Automatic Speech Recognition and Understanding 2018
• First-pass decoding with n-gram approximation of RNNLM: The problem of rare words,
Singh, Smit, Virpioja & Kurimo, Machine Learning in Speech and Language Processing 2018
• First-Pass Techniques for Very Large Vocabulary Speech Recognition of Morphologically Rich
Languages, Varjokallio, Virpioja & Kurimo, Spoken Language Technology 2018
• The MeMAD Submission to the IWSLT 2018 Speech Translation Task,
Sulubacak, Tiedemann, Rouhe, Grönroos & Kurimo, Spoken Language Translation 2018
• The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general
purpose audio tagging, Xu, Smit & Kurimo, Detection and Classification of Acoustic Scenes and
Events 2018
• RL-KLM: Automating Keystroke-level Modeling with Reinforcement Learning, Leino, Oulasvirta
& Kurimo, Intelligent User Interfaces 2019
Aalto ASR group, online resources
Code: https://github.com/aalto-speech
• AaltoASR tools
• Kaldi ASR training scripts
• VariKN language models
• TheanoLM language models
• Morfessor segmentation
• SIAK pronunciation assessment
• Speaker diarization
• Audio tagger
• etc
Data: https://www.kielipankki.fi/
Demos: https://www.youtube.com/channel/UCY4NOvOgKz9-x7rR_kkb51Q/
PhD theses
2017 Andre Mansikkaniemi: Continuous unsupervised
topic adaptation for morph-based speech recognition
2018 Seppo Enarvi: Modeling conversational Finnish for
automatic speech recognition
2019 Reima Karhila: Building Personalised Speech
Technology Systems with Sparse, Bad Quality or Out-of-domain Data
2019 Peter Smit: Modern subword-based models for
automatic speech recognition
MSc theses
2018 Zhicun Xu: Audio event classification using deep learning
methods
2018 Anja Virkkunen: Automatic speech recognition for the hearing
impaired in an augmented reality application
2019 Tuomas Kaseva: Online speaker diarization
2019 Ekaterina Voskoboinik: Constructing word representations
using subword embeddings
2019 Sara Klimko: Comparison of Speech-to-Text services
2019 Sujith Padaru: Synthetically generated data for pronunciation
evaluator
2019 Vasumathi Neralla: Adaptation of character-based neural
network language models
Performance depends on:
1. Speaking environment, microphone, speaker (acoustic modeling)
   1. Office, headset, close-talking
   2. Telephone speech, mobile
   3. Noise, outside, microphone far away
   4. Voice, accents
2. Style of speaking and language (language modeling)
   1. Isolated words, small vocabulary
   2. Continuous speech, read or planned, large vocabulary
   3. Spontaneous speech, open vocabulary, hesitations
https://www.youtube.com/watch?v=UK_2dF9zXl4
New data: Parliament sessions 2008-2016
2269 hours annotated speech corpus for free
https://www.youtube.com/watch?v=UK_2dF9zXl4
Results in Finnish (word error rate %)
[1] Enarvi & Kurimo. Interspeech 2016.
[2] Enarvi, Smit, Virpioja & Kurimo.
IEEE/ACM TASLP 25:11, 2017.
[3] Kurimo et al. Language resources
and evaluation 51:4, 2017.
[4] Mansikkaniemi, Smit & Kurimo.
Interspeech, 2017.
[5] Smit, Gangireddy, Enarvi, Virpioja &
Kurimo. IEEE ASRU, 2017.
[6] Alumäe. Interspeech 2014.
[7] Smit, Virpioja & Kurimo. Interspeech,
2017.
[8] Smit, Virpioja & Kurimo. 2018 (in
review).
https://research.aalto.fi/en/publications/
2018: 11.4
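For reference, word error rate is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch of that standard definition follows. The function name and the whitespace tokenisation are assumptions for illustration; published results normally apply text normalisation first, which is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 100 % when the hypothesis is much longer than the reference.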