"automatic speech recognition for mobile applications in yandex" — fran campillo,...
DESCRIPTION
This talk describes the work of the Yandex Speech Group over the last two years. Starting from scratch, we collected large amounts of voice recordings from the field of application and studied the most popular open source speech projects, both to get a thorough understanding of the problem and to gather ideas for building our own technology. The talk presents key experiments and their results, as well as our latest achievements in automatic speech recognition for Russian. Currently, the Yandex Speech Group provides three services in Russian (maps, navigation, and general search), with performance comparable to competitor products.
TRANSCRIPT
![Page 1: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/1.jpg)
1
![Page 2: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/2.jpg)
2
Automatic speech recognition for mobile applications in Yandex
Fran Campillo
![Page 3: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/3.jpg)
3
Outline
● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.
![Page 4: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/4.jpg)
4
Motivation
![Page 5: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/5.jpg)
5
Motivation
![Page 6: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/6.jpg)
6
Road map
![Page 7: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/7.jpg)
7
Road map
● Sep-2011: study of open source tools and data collection.
– HTK, Sphinx, Rasr, Kaldi,...
– Service provided by 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launching of own services.
![Page 8: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/8.jpg)
8
Automatic speech recognition
![Page 9: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/9.jpg)
9
ASR: complexity

| Factor | Simpler task | Harder task |
|---|---|---|
| Style | Planned | Spontaneous |
| Audio quality | CD | Telephone |
| Vocabulary size | Hundreds | Hundreds of thousands |
| Number of speakers | One | Many |
| Recognition rate | Better | Worse |
| Complexity | Smaller | Bigger |
![Page 10: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/10.jpg)
10
Word pronunciations
● ASR: sounds => words.
● How is a word pronounced?
– Line => /'laɪn/.
– Linear => /'lɪnɪɘʳ/.
● Need a mapping from writing to phonemes: grapheme-to-phoneme (G2P).
![Page 11: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/11.jpg)
11
Word pronunciations: dictionary
а a
аб a tc p
абад a dc b a tc t
абаза a dc b a z a
абакан a dc b a tc k ax n
абакана a dc b a tc k a n a
абакане a dc b a tc k a nj e
абаканская a dc b a tc k a n s tc k ax j a
абаканский a dc b a tc k a n s tc kj I j
абакумова a dc b a tc k u m ax v a
абанский a dc b a n s tc kj I j
абганеровская a dc b dc g ax nj I r ax f s tc k ax j a
абдулино a dc b dc dK& u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково a dc b z a tc k o v a
абзелиловский a dc b zj i lj i l ax f s tc kj I j
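A dictionary in this layout (one word per line, followed by its phone symbols) is easy to load into a lookup table. The snippet below is a minimal sketch that assumes whitespace-separated fields and allows several pronunciations per word; the function name `parse_lexicon` is hypothetical and not part of the Yandex tooling.

```python
from collections import defaultdict

def parse_lexicon(text):
    """Parse 'word phone phone ...' lines into {word: [pronunciation, ...]}."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and entries without phones
        word, phones = fields[0], tuple(fields[1:])
        lexicon[word].append(phones)  # a word may have several pronunciations
    return dict(lexicon)

# First entries of the dictionary shown on the slide
sample = """\
а a
аб a tc p
абад a dc b a tc t
"""
lexicon = parse_lexicon(sample)
print(lexicon["абад"])
```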
![Page 12: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/12.jpg)
12
Speech parametrization
[Figure panels: Phone /a/, Phone /i/]
![Page 13: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/13.jpg)
13
ASR: the problem
● We have a sequence of observations:
– $O = \{o_1, o_2, \ldots, o_T\}$
– $o_i$ is a feature vector representing a speech frame.
● Goal: finding the likeliest sequence of words $w_i$ for $O$:
$$\arg\max_i P(w_i \mid O)$$
![Page 14: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/14.jpg)
14
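ASR: the problem (II)
● We cannot compute $P(w_i \mid O)$ directly.
● Bayes:
$$P(w_i \mid O) = \frac{P(O \mid w_i)\,P(w_i)}{P(O)}$$
● $P(O)$ does not depend on $w_i$, so it can be dropped from the maximization:
$$\arg\max_i P(w_i \mid O) = \arg\max_i \{P(O \mid w_i)\,P(w_i)\}$$
● Acoustic model: $P(O \mid w_i)$. Language model: $P(w_i)$.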
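The decision rule can be illustrated with a few lines of Python. This is only a sketch: the two candidate transcriptions and all scores are made-up values, and the sum is done in the log domain, which is the usual way to avoid numerical underflow.

```python
import math

def best_hypothesis(acoustic_loglik, lm_logprob):
    """Return argmax_w [log P(O|w) + log P(w)] over the candidate hypotheses."""
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

# Hypothetical scores: the acoustic model slightly prefers the second candidate,
# but the language model finds the first one far more plausible.
acoustic_loglik = {"recognize speech": -120.0, "wreck a nice beach": -119.5}
lm_logprob = {"recognize speech": math.log(1e-6), "wreck a nice beach": math.log(1e-9)}

best = best_hypothesis(acoustic_loglik, lm_logprob)
print(best)
```

With these invented numbers the language model outweighs the small acoustic advantage of the second candidate, which is the situation the product $P(O \mid w_i)\,P(w_i)$ is meant to resolve.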
![Page 15: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/15.jpg)
15
Language model
● Probability of sequences of words:
– “We will rock you” => $P_1$.
– “Will will rock you” => $P_2$.
● Trained on large corpora.
● The closer to the application domain, the better.
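As an illustration of how such probabilities can be estimated, here is a minimal bigram model with add-one smoothing. It is a sketch under stated assumptions: the talk does not say which model order or smoothing is used, and the three-sentence corpus is invented for the example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    histories, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
        for prev, word in zip(tokens[:-1], tokens[1:])
    )

# Invented toy corpus; a real model is trained on large in-domain corpora.
model = train_bigram(["we will rock you", "we will win", "you will rock"])
p1 = log_prob("We will rock you", model)
p2 = log_prob("Will will rock you", model)
print(p1 > p2)
```

Even on this tiny corpus the model assigns a higher score to “We will rock you” than to “Will will rock you”, because the bigrams "<s> will" and "will will" were never observed.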
![Page 16: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/16.jpg)
16
Acoustic model: Hidden Markov Models
● First-order HMM: a sequence of states, each depending only on the previous state and associated with events we can observe.
● Typical layout for ASR: three states Q1 → Q2 → Q3, with self-loops $a_{11}$, $a_{22}$, $a_{33}$, forward transitions $a_{12}$, $a_{23}$, and observation probabilities $b_1(o)$, $b_2(o)$, $b_3(o)$.
● $a_{ij}$: transition probabilities.
● $b_j(o)$: probability of observation $o$ in state $j$.
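To make the roles of $a_{ij}$ and $b_j(o)$ concrete, the following sketch computes the likelihood of an observation sequence with the forward algorithm for a three-state left-to-right HMM. The transition values and per-frame observation probabilities are invented; the sketch also assumes that the sequence starts in the first state, ends in the last one, and that the last state simply loops on itself.

```python
def forward_likelihood(A, B):
    """Forward algorithm for a left-to-right HMM.

    A[i][j] is the transition probability a_ij.
    B[t][j] is b_j(o_t), the probability of frame t in state j.
    The sequence is assumed to start in state 0 and end in the last state.
    """
    n_states = len(A)
    alpha = [0.0] * n_states
    alpha[0] = B[0][0]
    for frame in B[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * frame[j]
            for j in range(n_states)
        ]
    return alpha[-1]

# Hypothetical parameters: self-loops a11, a22, a33 and forward transitions a12, a23
A = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# Hypothetical observation probabilities for four frames
B = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
print(forward_likelihood(A, B))
```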
![Page 17: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/17.jpg)
17
Acoustic model: HMM and speech
● Each state models a part of the phoneme:
– 1st: beginning of the phoneme.
– 2nd: stationary part.
– 3rd: end of the phoneme.
● $a_{ij}$: duration of each part.
● $b_j(o)$: probability of producing a vector of features $o$ in state $j$.
![Page 18: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/18.jpg)
18
Modeling probability of observation
● Gaussian mixtures:
$$b_j(x) = \sum_m c_{jm}\,\mathcal{N}(x, \mu_{jm}, \Sigma_{jm})$$
– $c_{jm}$ = weight of the $m$th Gaussian of state $j$.
– $\mu_{jm}$ => mean (vector) of the $m$th Gaussian of state $j$.
– $\Sigma_{jm}$ => covariance matrix of the $m$th Gaussian of state $j$.
● Neural networks.
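The mixture formula can be evaluated directly. The sketch below assumes diagonal covariance matrices, a common simplification that the slides do not confirm, so each $\Sigma_{jm}$ is represented by a vector of variances; all parameter values are invented.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def gmm_likelihood(x, weights, means, variances):
    """b_j(x) = sum_m c_jm * N(x, mu_jm, Sigma_jm) for one state j."""
    return sum(
        c * gaussian_diag(x, mu, var)
        for c, mu, var in zip(weights, means, variances)
    )

# Hypothetical two-component mixture over 2-dimensional feature vectors
weights = [0.7, 0.3]
means = [[0.0, 0.0], [2.0, 1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(gmm_likelihood([0.1, -0.2], weights, means, variances))
```

In practice the computation is usually kept in the log domain, since products of many small densities underflow quickly.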
![Page 19: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/19.jpg)
19
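Waveform, phonemes, frames, and states
[Figure: waveform of the phoneme /o/ split along the time axis t into frames $o_1$ to $o_{10}$, aligned to the HMM states Q1, Q2, Q3.]
● Q1 => $o_1, o_2$ (GMM parameters $\mu_{1m}$, $\Sigma_{1m}$, $c_{1m}$).
● Q2 => $o_3, o_4, o_5, o_6, o_7$ (GMM parameters $\mu_{2m}$, $\Sigma_{2m}$, $c_{2m}$).
● Q3 => $o_8, o_9, o_{10}$ (GMM parameters $\mu_{3m}$, $\Sigma_{3m}$, $c_{3m}$).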
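An alignment of frames to states like the one in the figure can be obtained with the Viterbi algorithm. The following is a minimal sketch, not the group's implementation: the transition matrix and the observation probabilities are invented so that the best path reproduces the split shown on the slide (two frames in Q1, five in Q2, three in Q3), with states numbered from 0.

```python
import math

def safe_log(p):
    """Natural log that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_align(A, B):
    """Most likely state per frame, starting in state 0 and ending in the last state."""
    n_states, n_frames = len(A), len(B)
    log_A = [[safe_log(a) for a in row] for row in A]
    log_B = [[safe_log(b) for b in row] for row in B]
    delta = [float("-inf")] * n_states
    delta[0] = log_B[0][0]
    backpointers = []
    for t in range(1, n_frames):
        new_delta, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for being in state j at frame t
            best_i = max(range(n_states), key=lambda i: delta[i] + log_A[i][j])
            new_delta.append(delta[best_i] + log_A[best_i][j] + log_B[t][j])
            pointers.append(best_i)
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the last state at the last frame
    state = n_states - 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical transition matrix and observation probabilities for 10 frames
A = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
intended = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
B = [[0.9 if j == s else 0.05 for j in range(3)] for s in intended]
alignment = viterbi_align(A, B)
print(alignment)
```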
![Page 20: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/20.jpg)
20
Block diagram for training
[Block diagram, reconstructed from the slide labels:]
● Inputs: prototype HMM and training sentences.
● Initialization: initial $\mu^0_{jm}$, $\Sigma^0_{jm}$, $c^0_{jm}$ for the GMMs.
● Baum-Welch: alignments of the training sentences (observations to states).
● HMM parameters update: new estimations for $\mu_{jm}$, $\Sigma_{jm}$, $c_{jm}$.
● Convergence? No: back to Baum-Welch. Yes: trained models.
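The parameter update step can be sketched with a deliberately simplified variant. Baum-Welch weights every frame by its state occupation probability; the code below instead assumes a hard alignment (each frame assigned to exactly one state, as in Viterbi training), a single Gaussian per state, and one-dimensional features. The function name, the variance floor, and the data are all hypothetical.

```python
def reestimate(frames, alignment, n_states, var_floor=1e-3):
    """Re-estimate (mean, variance) per state from hard-aligned 1-D frames."""
    params = []
    for j in range(n_states):
        assigned = [x for x, s in zip(frames, alignment) if s == j]
        if not assigned:
            raise ValueError(f"state {j} has no frames assigned")
        mean = sum(assigned) / len(assigned)
        var = sum((x - mean) ** 2 for x in assigned) / len(assigned)
        # The floor keeps the variance from collapsing when a state sees few frames
        params.append((mean, max(var, var_floor)))
    return params

# Hypothetical 1-D features and their alignment to three states
frames = [1.0, 1.2, 3.0, 3.1, 2.9, 5.0]
alignment = [0, 0, 1, 1, 1, 2]
params = reestimate(frames, alignment, n_states=3)
print(params)
```

In the full training loop this update alternates with a new alignment pass until the likelihood of the training data stops improving.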
![Page 21: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/21.jpg)
21
Decoding
● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.
[Block diagram: Speech signal → Parametrize → Decoder → Words; the decoder takes the lexicon, the acoustic models, and the language model as inputs.]
![Page 22: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/22.jpg)
22
Our decoder
● Based on weighted finite-state transducers (WFST).
● The lexicon, the language model, and the acoustic model are composed into a single structure.
– Same information, but more efficient.
[Diagram: Lexicon, Acoustic models, Language model → HCLG]
![Page 23: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/23.jpg)
23
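Composition of WFST: example
[Figure: lexicon and language model transducers. Recoverable from the lexicon: states 0 to 6 and arcs labeled (input:output) B:Bob, ah:, b:, l:likes, ay:, k:, s:. The output labels of the arcs after the first phone of each word did not survive extraction.]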
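The idea of composition can be sketched in a few lines. This is a toy, epsilon-free version in the tropical semiring (weights are added along a path): an arc of the first transducer is matched with an arc of the second when the output label of the former equals the input label of the latter. Real lexicon transducers contain epsilon labels, which toolkits such as OpenFst handle with dedicated composition filters; that part is omitted here. The dictionary-based representation and the symbols `bob_pron` and `likes_pron` (standing in for whole phone sequences) are assumptions made for the example.

```python
from collections import deque

def compose(t1, t2):
    """Epsilon-free composition of two weighted transducers.

    A transducer is a dict with 'start', 'finals' (set of states) and
    'arcs' (list of (src, input, output, weight, dst) tuples).
    """
    start = (t1["start"], t2["start"])
    arcs, finals = [], set()
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        s1, s2 = state
        if s1 in t1["finals"] and s2 in t2["finals"]:
            finals.add(state)
        for src1, in1, out1, w1, dst1 in t1["arcs"]:
            if src1 != s1:
                continue
            for src2, in2, out2, w2, dst2 in t2["arcs"]:
                if src2 == s2 and in2 == out1:
                    nxt = (dst1, dst2)
                    arcs.append((state, in1, out2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}

# Toy lexicon: maps a pronunciation symbol to a word (hypothetical symbols)
lexicon = {
    "start": 0,
    "finals": {0},
    "arcs": [(0, "bob_pron", "Bob", 0.0, 0), (0, "likes_pron", "likes", 0.0, 0)],
}
# Toy language model: accepts "Bob likes" with some costs
grammar = {
    "start": 0,
    "finals": {2},
    "arcs": [(0, "Bob", "Bob", 0.5, 1), (1, "likes", "likes", 1.0, 2)],
}
composed = compose(lexicon, grammar)
for arc in composed["arcs"]:
    print(arc)
```

The composed transducer only keeps the pronunciation sequences that the language model allows, which is why a single composed structure carries the same information as the separate models.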
![Page 24: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/24.jpg)
24
Data collection
![Page 25: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/25.jpg)
25
Data collection
● Speech samples taken from the field.
● Manual transcriptions:
– Speaker features: gender, native,...
– Anomalies in the pronunciation.
– Noises in the recording.
![Page 26: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/26.jpg)
26
Manual transcriptions
● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286,020 different speakers.

| Speaker feature | Percentage (%) |
|---|---|
| Native | 87.7 |
| Male | 83.3 |
| Female | 8.5 |
| Child | 8.2 |
![Page 27: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/27.jpg)
27
Manual transcriptions
● Percentage of records without anomalies: 7.4%

| Anomalies | Percentage (%) |
|---|---|
| side_speech | 14.4 |
| speech-in-noise | 71.5 |
| Indistinguishable | 3.7 |
| mouth_noise | 3.6 |
| breath_noise | 6.3 |
| Irregular pronunciations | 5.3 |
| Hesitations | 0.5 |
| Fragments | 5.5 |
| Transient noise | 14.0 |
| Foreign words | 0.1 |
![Page 28: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/28.jpg)
28
Manual transcriptions: examples
● марциальные воды male, native
● *трёx#пруд#ньій* male, native, speech-in-noise
● [side_speech] чкалова male, native, speech-in-noise, bad-audio тр
![Page 29: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/29.jpg)
29
VisualQA
![Page 30: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/30.jpg)
30
Experiments
![Page 31: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/31.jpg)
31
Grapheme-2-phoneme
● Sequitur:
– Based on joint-sequence models.
– Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
– WFST.
– Accuracy => 1.04% phoneme error rate.
● Special treatment for words written in Latin script:
– G2P trained on a transliterated version of the Russian pronunciation (for example: whatsapp => уотсап).
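Phoneme error rate is normally computed from the edit distance between the reference and the predicted phone sequences. The sketch below assumes the usual definition (substitutions, deletions, and insertions divided by the number of reference phones); the talk does not spell out the exact scoring, and the hypothesis sequences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def phoneme_error_rate(pairs):
    """Percentage of phone errors over all (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total

# References taken from the dictionary slide; hypotheses are made up
pairs = [
    ("a dc b a tc k ax n".split(), "a dc b a tc k a n".split()),  # one substitution
    ("a tc p".split(), "a tc p".split()),                         # exact match
]
per = phoneme_error_rate(pairs)
print(round(per, 2))
```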
![Page 32: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/32.jpg)
32
Noise models
![Page 33: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/33.jpg)
33
Experiments: acoustic model vs. language model
![Page 34: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/34.jpg)
34
Experiments: number of Gaussians
![Page 35: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/35.jpg)
35
Results
![Page 36: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/36.jpg)
36
Users: Navigator
![Page 37: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/37.jpg)
37
Results: relative word error rate
● Results relative to our WER in each experiment (in red on the slide: experiments in which our system is outperformed):

| System | Maps | Navigation | General search |
|---|---|---|---|
| Yandex-GMM | 1 | 1 | 1 |
| 3rd Party | 44.6% | 31.8% | 37.3% |
| Competitor | 1.9% | -9.7% | -23.4% |

| System | General search |
|---|---|
| Yandex-DNN | 1 |
| Competitor | 6.6% |
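One plausible reading of these tables is that each percentage is the difference between another system's WER and ours, divided by ours, so that positive values mean the other system makes more errors. That formula is an assumption, since the slides do not define it, and the absolute WER values below are invented purely to show the arithmetic; the talk does not report absolute numbers.

```python
def relative_wer_change(wer_other, wer_ours):
    """Relative WER of another system with respect to ours, in percent.

    Positive: the other system has a higher WER than ours.
    Negative: the other system outperforms ours.
    """
    if wer_ours <= 0:
        raise ValueError("the reference WER must be positive")
    return 100.0 * (wer_other - wer_ours) / wer_ours

# Hypothetical absolute WERs (percent)
print(relative_wer_change(25.0, 20.0))  # other system is worse than ours
print(relative_wer_change(18.0, 20.0))  # other system is better than ours
```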
![Page 38: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/38.jpg)
38
Thanks for your attention!
![Page 39: "Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс](https://reader033.vdocument.in/reader033/viewer/2022051313/548259b7b47959050d8b478f/html5/thumbnails/39.jpg)
39
Fran Campillo
Senior Software Engineer
Yandex Speech Group
[email protected]@yandex-team.ru
PhD