3P – Portuguese Pronunciation Professor
Mariana Sofia Pimenta Lopes
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Prof. Isabel Maria Martins Trancoso
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Isabel Maria Martins Trancoso
Members of the Committee: Professor Hugo Daniel dos Santos Meinedo
October 2014
To my parents,
Acknowledgements
I would like to take this opportunity to express my gratitude to everyone who supported me throughout the
course of this project. I am sincerely grateful for their guidance, constructive criticism and friendly
advice during the project work.
I would like to express my deepest appreciation to my advisor, Professor Isabel Trancoso, for
encouraging my research and providing priceless support when I most needed it.
I would also like to thank the L2F staff, especially Professor Hugo Meinedo, Professor
Alberto Gareta, Professor Thomas Pelligrini and PhD student Anna Pompili, for their immense
assistance and for providing the source materials essential to completing this project.
Furthermore, I would like to extend my thanks to all the people mentioned in the
references for making their work available, so that others can understand and adapt their research.
Finally, a special thanks to my family. Words cannot express how grateful I am to my mother
and father for all of the sacrifices they have made on my behalf and for their encouragement to strive towards
my goal.
Abstract
Oral proficiency is an important part of learning a foreign language. Yet students frequently
find it hard to obtain a reliable source with which they can practise their pronunciation intensively. An
automatic assessment system can reduce the cost and workload associated with this task. Tools of this
type are available for students of widely spoken languages such as American or British English;
however, few exist for students of European Portuguese (EP).
The research presented in this thesis investigates a solution for creating a computer assisted
language learning (CALL) system for EP, using as its base the work of Witt (1999) [1].
This thesis begins by outlining important aspects of computer-assisted language learning and
briefly analyses the EP phonemes, comparing them with those of the two other languages
present in the corpus, Spanish and Bulgarian. The steps of the method are then explained:
first the speech audio is digitized; then, using Audimus, posterior probabilities on 20 ms frames
are calculated from the extracted features. Subsequently, a GOP score is calculated for each frame
and for each phoneme. The GOP is then normalized and, starting from a threshold pre-established from
native speakers' data, the threshold is adapted in order to improve efficiency in classifying each
phoneme as correctly or incorrectly uttered. Finally, since the threshold is subjective to whoever
set it, its decisions are compared with those of three human judges in order to guarantee its quality.
Keywords
CAPT, GOP, normalization, pronunciation, natives, non-natives, European Portuguese
Resumo
A qualidade de proficiência oral constitui uma parte importante na aprendizagem de uma língua
estrangeira. No entanto, muitas vezes os alunos têm dificuldade em obter uma fonte fiável, onde
podem trabalhar intensamente a sua pronúncia. Um sistema de avaliação automática pode reduzir o
custo e a carga de trabalho associada a essa tarefa. Este tipo de ferramentas está disponível para
estudantes de línguas mais faladas, como o Inglês americano ou britânico, no entanto não há, em
grande parte investigação para estudantes de Português Europeu (PE).
A pesquisa apresentada nesta tese investiga uma solução para a criação de um sistema
assistido por computador para a aprendizagem de línguas (CALL) para o PE usando como base o
trabalho de Witt (1999) [1].
Esta tese começa por descrever aspetos importantes para a aprendizagem de línguas
assistida por computador, fazendo também uma breve análise dos fonemas do PE e uma
comparação com as outras duas línguas apresentadas no corpus, o espanhol e o búlgaro. Em
seguida, os vários passos do processo são explicados, ou seja, em primeiro lugar, o registo áudio da
fala é digitalizado, e, utilizando Audimus, as probabilidades posteriores em intervalos de 20 ms são
calculadas a partir das características extraídas. Subsequentemente, uma pontuação GOP é
calculada para cada intervalo e para cada fonema. Em seguida, o GOP é normalizado e usando um
limite pré-estabelecido, obtido a partir de dados de falantes nativos, o limite é adaptado, a fim de
melhorar a eficiência na classificação dos fonemas como correta ou incorretamente pronunciados.
Finalmente, uma vez que o limite é subjetivo para quem o executou, ele é comparado com o
julgamento de três juízes humanos, a fim de garantir a sua qualidade.
Palavras-chave
CALL, GOP, normalização, pronúncia, nativos, não nativos, Português Europeu.
Table of Contents
Acknowledgements
Abstract
Resumo
List of Figures
List of Tables
List of Acronyms
List of Software
1 Introduction
1.1 Overview
1.2 Motivation and problem specification
1.3 Innovations of the work
1.4 Thesis contents
2 Pronunciation
2.1 Learning word pronunciation
2.2 Automatic Speech Recognition
3 Phonology
3.1 European Portuguese
3.1.1 Brief description of EP
3.1.2 Phonology of EP
3.2 EP and foreign languages
3.2.1 Brief comparison with Spanish
3.2.2 Brief comparison with Bulgarian
4 System Design
4.1 State of the art
4.1.1 Scientific research
4.1.2 Existing tools
4.2 Method
4.2.1 Audimus
4.2.2 GOP
4.2.3 NGOP
4.2.4 Threshold
4.2.5 GOP for fluent speech
4.2.1 Overall score
4.2.2 Performance measure
4.3 Other classification methods
4.3.1 Likelihood Ratio
4.3.2 MFCC and DTW based evaluation
5 Experiments and results
5.1 Corpus
5.2 Implementation
5.3 Results
5.3.1 Threshold for Native speakers
5.3.1 SA results for all non-native
5.3.2 SA results for Spanish
5.3.3 SA results for Bulgarian
5.3.4 Comparison, Critic and Evaluation
6 Interface
6.1 VITHEA project
6.2 3P Interface
7 Conclusion
7.1 Conclusions
7.2 Future work
Annex 1 – Extra Tables
References
List of Figures
Fig. 1- Lexical Distance among the Languages of Europe
Fig. 2- Places of articulation for consonants [22]
Fig. 3- Vowel Triangle in relation to tongue position [22]
Fig. 4- Model’s Schematic.
Fig. 5- Audimus schematic.
Fig. 6- Visualization of the phoneme division in Wavform
Fig. 7- GOP and NGOP relation.
Fig. 8- GOP for fluent speech [1].
Fig. 9- Example of exact and approximated GOP.
Fig. 10- NGOP's and GOP's SA comparison for each phoneme.
Fig. 11- NGOP for each phoneme for native (blue) and non-native (red) with native mean and std above.
Fig. 12- SA for each Phoneme in Spanish
Fig. 13- SA for each Phoneme in Bulgarian
Fig. 14- 3P Interface
List of Tables
Table 1- Consonants in EP
Table 2- Vowels of EP
Table 3- Spanish consonants and vowels in comparison with EP
Table 4- Bulgarian consonants and vowels in comparison with EP
Table 5- Mean and number of occurrences by phoneme
Table 6- Threshold of each phoneme for
Table 7- Threshold of each phoneme
Table 8- Average performance measures
Table 9- Judges vs GOP scoring
Table 10- Example of words using the EP phoneme list
Table 11- Confusion matrices for Portuguese, Spanish and Bulgarian
List of Acronyms
EP European Portuguese
3P Portuguese Pronunciation Professor
GOP Goodness of Pronunciation
NGOP Normalized Goodness of Pronunciation
SA Score accuracy
PP Posterior Probability
std Standard deviation
CALL Computer assisted language learning
CAPT Computer assisted pronunciation training
List of Software
Audimus AUDIMUS is a speech recognition system that works offline with any audio/speech/video file, transcribing its content with the possibility of segmentation.
Matlab MATLAB is a high-level language and interactive environment for numerical computation, visualization, and programming.
WaveSurfer WaveSurfer is an open source tool for sound visualization and manipulation. Typical applications are speech/sound analysis and sound annotation/transcription.
3P Program developed in this thesis for EP pronunciation training.
1 Introduction
This chapter gives a brief overview of the work. It establishes work targets, original contributions and
the motivations. At the end of the chapter, the work structure is provided.
1.1 Overview
While pronunciation assessment is well established for English, it has been little studied for EP. 3P (Portuguese
Pronunciation Professor) guides learners of EP through several exercises, giving instructions
and feedback by means of text-to-speech synthesis. The research presented investigates a solution for
creating a computer assisted language learning (CALL) system for EP (European Portuguese), using
as its base the work of Witt (1999) [1]. The algorithm is based on Goodness of Pronunciation (GOP), a
measure that uses confidence scores drawn from automatic recognition and alignment results at
phone level. The GOP is computed for native data in order to select a threshold that separates good
from bad pronunciation. The method is then tested with a non-native corpus, and the results are
analyzed and adjusted using several performance measures. The GOP module has been integrated into
an existing web interface.
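As a rough illustration of the pipeline outlined above, the following sketch computes a per-phoneme GOP score from frame-level posterior probabilities and derives a threshold from native scores. It is a minimal sketch under stated assumptions, not the implementation used in this work: the GOP is taken here as the absolute mean log posterior over the frames aligned to a phoneme (the general idea behind the measure in [1]), the threshold rule (native mean plus one standard deviation) is hypothetical, and every function name and number is made up for illustration.

```python
import math
from statistics import mean, stdev


def gop_score(frame_posteriors):
    """Absolute mean log posterior over the frames aligned to one phoneme.

    0.0 means the recognizer is fully confident; larger values mean a
    poorer match between the audio and the expected phoneme.
    """
    eps = 1e-10  # guard against log(0)
    logs = [math.log(max(p, eps)) for p in frame_posteriors]
    return abs(sum(logs) / len(logs))


def native_threshold(native_scores, k=1.0):
    """Hypothetical rule: native mean plus k standard deviations."""
    return mean(native_scores) + k * stdev(native_scores)


def is_correct(score, threshold):
    """A phoneme is accepted when its GOP does not exceed the threshold."""
    return score <= threshold


# Made-up posteriors for three native realisations of one phoneme.
native_scores = [
    gop_score([0.90, 0.80, 0.95]),
    gop_score([0.70, 0.85]),
    gop_score([0.92, 0.88, 0.90]),
]
threshold = native_threshold(native_scores)

# Made-up posteriors for one non-native realisation of the same phoneme.
learner_score = gop_score([0.20, 0.10, 0.30])
print(is_correct(learner_score, threshold))
```

Keeping the threshold computation separate from the scoring function reflects the idea that the threshold is estimated once from native data and can later be adjusted without touching the scoring step.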
1.2 Motivation and problem specification
The thesis is motivated by the vision of an interactive system that helps people learn a new
language. One of the biggest difficulties in learning EP is the lack of material available to the general
public [2]. An interactive system that grades different phonemes and words in
European Portuguese gives students an opportunity to study and reach pronunciation proficiency
without the need for constant support from native speakers, and allows them to understand the phonetic
subtleties of the language.
1.3 Innovations of the work
This research has two main innovations: an adaptation of the GOP method for
European Portuguese phonemes, and a simple web interface where students can
practice their pronunciation, which can be extended with new sentences, games and better thresholds.
1.4 Thesis contents
This thesis is composed of 7 chapters.
Chapter 1 - Introduction - This chapter gives a brief overview of the work, establishing work
targets, original contributions and motivations.
Chapter 2 - Pronunciation - This chapter provides an overview of the difficulties and
foundations of learning pronunciation. It also briefly explains what CALL and CAPT
systems are and why they are used.
Chapter 3 - Phonology - This chapter provides a brief explanation of the phonology of EP and
its comparison with Spanish and Bulgarian.
Chapter 4 - System design - This chapter provides an overview of the existing tools and
current research, and explains the steps used to obtain the classification of the
pronunciation.
Chapter 5 - Experiments and results - This chapter provides the validation of the steps for the
implementation of the project and the results obtained.
Chapter 6 - Interface - This chapter provides an overview of the interface implemented.
Chapter 7 - Conclusions - This chapter finalises the work, summarising conclusions and
pointing out aspects to be developed in future work.
2 Pronunciation
This chapter provides an overview of the difficulties and foundations of learning pronunciation. It also
briefly explains what CALL and CAPT systems are and why they are used.
2.1 Learning word pronunciation
Learning a language can be described as the process of acquiring language competence through
planned, conscious study. However, acquiring a language as a child and
learning a new language as an adult are two different processes. As Krashen and Terrell
wrote, “[Acquiring] is the 'natural' way, paralleling first language development in children. Acquisition
refers to an unconscious process that involves the naturalistic development of language proficiency
through understanding language and through using language for meaningful communication (Krashen
and Terrell in Richards, 1987: 131)” [3]. In parallel to this process stands the learning of a second
language: “Learning, by contrast, refers to a process in which conscious rules about a language are
developed. It results in explicit knowledge about the forms of a language and the ability to verbalize
this knowledge” (Richards; 1987:131) [3]. Nevertheless, most adult learners of a foreign language, and
even those as young as 6 years old, retain some artifacts in their pronunciation that identify them as
non-native speakers [3].
Native-like intonation can be learned, but this is extremely difficult even for advanced
language learners. In addition to requiring a great deal of feedback to improve pronunciation, students cannot
attend to all aspects of pronunciation at the same time; e.g. attending to phonetic accuracy takes
processing time away from attending to intonation [4]. Repetition, however, helps: a person who listens to a
song on repeat tends to learn it more easily, and this effect of repetition on learning is
shared around the world. As described by the mere exposure effect, repetition works not only
with songs but also with advertising, shapes and pattern-based material such as new words.
Repeating a word many times makes someone attend less to what the word means and more to how it sounds,
making for a smoother transition in learning a language. Simply repeating a sentence a number of times
shifts the listener’s attention to the pitch and duration of the sounds, so that the repeated language
begins to sound like a repeated song. Therefore, one of the bases of acquiring oral proficiency is
repetition [5]. The other is identifying the learner's difficulties in perceiving the differences between
phonemes and correcting them. To attain pronunciation proficiency, a student thus needs two elements,
repetition and evaluation of their performance in order to correct systematic errors, which makes learning a
new language a process of training, repetition and memorization [6][7].
2.2 Automatic Speech Recognition
Learning to pronounce written words means learning the intricate relations between a language's writing
system and its speech sounds [8]. Children face this task when they learn to read and write in primary school,
and students face an analogous one when mastering the writing system, the speech sounds,
and the vocabulary of a language different from their mother tongue. Learning to pronounce words can
also be modeled on computers. In contrast with humans, machines can be configured in very specific
ways: for instance, a machine can be set up to hold a database of representations of
word-pronunciation knowledge without having learned any of those representations by itself; the knowledge is
hardwired in memory by the system’s designer [8] [9] [10] [11].
Computer Assisted Language Learning (CALL) is a cross-disciplinary field that includes the
subfields Foreign Language Learning (FLL), Foreign Language Teaching (FLT), Linguistics, and
Human Language Technologies (HLT). FLL research typically focuses on topics such as the learning
strategies employed by students and the effectiveness of environments designed to support learning. FLT
focuses on discovering and employing effective pedagogies to facilitate learning, as well as meaningful
performance measurements. Linguistics, specifically the subfield of Second Language Acquisition (SLA),
focuses on the process of learning a second language by investigating common patterns of mistakes
and progression in competence. Finally, Human Language Technologies encompasses the full range
of technologies, from audio recordings to dialogue systems, used to facilitate learning [12].
Researchers have investigated the use of computers for language learning since the 1960s,
and the field of CALL has seen an explosion of research over the past decade. One of the biggest
challenges in designing CALL applications that provide
automatic feedback on pronunciation errors is reliably detecting those errors at
such a level of detail that the information provided is useful to learners [13] [14] [15].
CALL systems are numerous and have diverse system configurations. On the simple end of the
spectrum, the systems can take the form of web pages with fill-in forms, online chat rooms, static
multimedia programs, modifications to popular games, or even simply a set of digital music files for
playback purposes. On the complex end, systems can have automatic speech recognition, voice
synthesis, and highly interactive 3D environments that teach cultural norms as well as language [12].
Modern systems tend to be much richer language learning environments that incorporate high
quality audio, graphics, and automated feedback. The content of the lessons is usually not static, and
is generated randomly or adaptively, in response to student actions. Many systems use some form of
Automatic Speech Recognition (ASR), speech synthesis, natural language understanding, or natural
language generation [12].
In addition, computer assisted language learning (CALL) applications and, more specifically,
computer assisted pronunciation training (CAPT) applications for language learning that make use of
automatic speech recognition (ASR) have been focused on pronunciation grading (or scoring), while
less attention has been paid to error detection (or localization).
But before CALL methods can be devised, it is important to recognize the specific difficulties
encountered in pronunciation teaching. First and foremost, explicit pronunciation teaching requires the
sole attention of the teacher to a single student, in order to analyse the student's speech and give notes
on how to improve, which poses a problem in a normal classroom environment. Second, learning a new
language involves extensive repetition of words, which is not only a mental task but also demands
coordination and control over many muscles to achieve proficiency. This may have social
implications for students who are afraid to perform in the presence of others. This approach is costly and
time-consuming, and its automation is therefore highly desirable for self-study.
On the other hand, these technologies do not take into account variations due to speaker
accent, and demand a strict distinction among the different sounds, unlike
human teachers. It can therefore be said that there are two strands in the area of pronunciation learning:
teaching the correct pronunciation of a foreign language to students, which requires precise phoneme
recognition and is more objective and easily computed; and assessing the pronunciation quality of a
speaker of a foreign language, which can tolerate more mispronunciations but is also more
correlated with what human teachers perceive as correct pronunciation [8] [9] [10] [11] [16].
3 Phonology
This chapter provides a brief explanation of the phonology of EP and its comparison with Spanish and
Bulgarian.
3.1 European Portuguese
3.1.1 Brief description of EP
Portuguese is a Romance language, i.e. it descends from Vulgar Latin (with influences from
Celtic, Germanic and Arabic languages), and is the official language of Portugal, Brazil, Angola,
Mozambique, Cape Verde, Guinea-Bissau, São Tomé and Principe, Macao, Equatorial Guinea and
East Timor. It has approximately 215-220 million native speakers and 260 million total speakers (as
first language (L1) and as a foreign language (L2)), with over 10 million speaking EP [17].
Portugal has three official languages: European Portuguese, Mirandese and Portuguese Sign
Language. The dialects of Portugal can be divided into two major groups.
The southern and central dialects are broadly characterized by preserving the distinction
between /b/ and /v/, and by the tendency to replace /ei/ and /ou/ with /6j/ and /o/. This group includes the
dialect of the capital, Lisbon, which nevertheless has some peculiarities of its own. Although the dialects of
the Atlantic archipelagos of the Azores and Madeira also have unique characteristics, they can
be grouped with the southern dialects.
The northern dialects are characterized by preserving the pronunciation of /ei/ and /ou/ as
diphthongs and by sometimes merging /v/ with /b/ (as in Spanish). This group includes the
dialect of Porto, Portugal’s second largest city.
In addition, in the Portuguese town of Barrancos (on the border between Extremadura, Andalucia and
Portugal), a dialect of Portuguese heavily influenced by Southern Spanish dialects is spoken, known
as barranquenho.
As for the dialects outside Portugal, in Brazil, Africa and Asia, it is usually believed that they
derive mostly from those of central and southern Portugal.
The Galician language, spoken in the region of Galicia, Spain, is considered by some of its
speakers to be a dialect of Portuguese or, more precisely, of Galician-Portuguese
(Galego-Português), while others believe it to be a different, if closely related, language. It is mainly
characterized by the lack of opposition between /b/ and /v/, the preservation of the “ei” and “ou”
diphthongs and, perhaps more characteristically, the devoicing of the consonant /Z/ into /S/ and the use
of /o~/ instead of /6~w~/. In addition, EP has a sister language, Spanish, with which it shares a lexical
similarity (a measure of the degree to which the word sets of two given languages are similar) of 89%;
outside the other Romance languages, its similarity with other languages is not
substantial [18] [19] [20].
Fig. 1 schematizes the lexical distance among the languages of Europe. As an
initial hypothesis for the two native languages in the non-native corpus (Bulgarian and Spanish),
Spanish speakers may obtain good results due to the proximity of Spanish to Portuguese, though mixing two
similar languages may create difficulties, while Bulgarian speakers will probably obtain worse results since
Bulgarian is in a different group [18].
Fig. 1- Lexical Distance among the Languages of Europe
3.1.2 Phonology of EP
European Portuguese is a West Iberian Indo-European language with 39
phonemes, including the pause, described in the tables below using the SAMPA (Speech
Assessment Methods Phonetic Alphabet) script [20].
3.1.2.1 Consonants
Place of articulation (left to right in each row): Labial, Coronal (Alveolar/Dental), Dorsal (Velar/Palatal)

Manner of articulation   Voicing     Phonemes
Plosive/Occlusive        voiced      b  d  g
                         unvoiced    p  t  k
Fricatives               voiced      v  z  Z
                         unvoiced    f  s  S
Nasals                               m  n  J
Lateral                              l/l~  L
Trill                                r  R
Semi-vowels                          w  j
                                     w~  j~
Table 1- Consonants in EP
The consonants in EP can be classified by manner of articulation (the configuration and
interaction of the articulators) and place of articulation (the point of contact where an obstruction occurs
in the vocal tract between an articulatory gesture, an active articulator and a passive location, typically
some part of the roof of the mouth) [21].
Regarding place of articulation, a consonant can be labial, either bilabial, articulated
with both lips, or labiodental, articulated with the lower lip and the upper teeth; coronal, either
dental, articulated with the tongue against the upper teeth, or alveolar,
articulated with the tongue against or close to the superior alveolar ridge; or dorsal,
either palatal, articulated with the body of the tongue raised against the hard palate (the
middle part of the roof of the mouth), or velar, articulated with the back part of the tongue
(the dorsum) against the soft palate, the back part of the roof of the mouth (also known as the velum).
Regarding manner of articulation, a consonant can be a nasal, produced with a lowered velum, allowing air to
escape freely through the nose; a plosive, in which the vocal tract is blocked so that all airflow ceases;
a fricative, produced by forcing air through a narrow channel made by placing
two articulators close together; a lateral, in which the airstream proceeds along the sides of the tongue
but is blocked by the tongue from going through the middle of the mouth; or a trill, produced by vibrations
between the articulator and the place of articulation.
There are also semi-vowels, which are phonetically similar to a vowel sound but function as a
syllable boundary [21].
Fig. 2- Places of articulation for consonants [22]
3.1.2.2 Vowels
Vowels
             Oral                        Nasal
             Front   Central   Back      Front   Central   Back
close        i                 u         i~                u~
close-mid    e                 o         e~                o~
mid                  @
open-mid     E       6 (1)     O                 6~ (2)
open                 a
Table 2- Vowels of EP
Vowels can be classified according to the position of the tongue along two dimensions: front, central or back, from the front of the mouth to the back; and close, close-mid, mid, open-mid or open, from as close as possible to the roof of the mouth to as far from it as possible [21].
Fig. 3- Vowel Triangle in relation to tongue position [22]
1 The phoneme /6/ may appear as /A/ in some parts of this document, due to the Audimus configuration.
2 The same applies to the phoneme /6~/.
3.2 EP and foreign languages
3.2.1 Brief comparison with Spanish
Spanish, or Castilian, also belongs to the West Iberian branch of the Romance languages, but to the Castilian subdivision, whereas EP belongs to the Galician-Portuguese one. Both languages derive from Latin and have a lexical similarity (a measure of the degree to which two given languages are similar) of 0.89. The following table shows the EP phonemes, with those that do not exist in Spanish marked in bold [23].
Consonants
Labial Coronal Alveolar Dental Velar Palatal/
Dorsal
Plosive/
Occlusive
voiced b d g
unvoiced p t k
Fricatives voiced v z Z
unvoiced f s S
Nasals m n J
Lateral l/l~ L
Trill r R
Semi-vowels w j
w~ j~
Vowels
Oral Nasals
Back Central Front Back Central Front
close i u i~ u~
close-mid e o e~ o~
mid @
open-mid E 6 O 6~
open a
Table 3- Spanish consonants and vowels in comparison with EP
The major difference between EP and Spanish is that the latter has no nasal vowels or nasal semi-vowels, and no open-mid or mid vowels. The voiced fricatives /v/, /z/ and /Z/ and the unvoiced fricative /S/ do not exist in Spanish either. However, Spanish has the additional fricatives /T/, as in cinco, and /x/, as in mujer, and the affricates /tS/, as in mucho, and /jj/, as in hielo.
3.2.2 Brief comparison with Bulgarian
Bulgarian is also an Indo-European language, but it belongs to the Slavic subgroup, whereas EP belongs to the Italic one. The following table shows the EP phonemes, with those that do not exist in Bulgarian marked in bold [24].
Consonants
Labial Coronal Alveolar Dental Velar Palatal/
Dorsal
Plosive/
Occlusive
voiced b d g
unvoiced p t k
Fricatives voiced v z Z
unvoiced f s S
Nasals m n J
Lateral l/l~ L
Trill r R
Semi-vowels w j
w~ j~
Vowels
Oral Nasals
Back Central Front Back Central Front
close i u i~ u~
close-mid e o e~ o~
mid @
open-mid E 6 O 6~
open a
Table 4- Bulgarian consonants and vowels in comparison with EP
In phonetic terms, it can be observed that EP and Bulgarian are similar. Most EP consonants are present in Bulgarian, with the exception of /J/, /l~/ and /R/. The only semi-vowel is /j/ and there are no nasalized vowels. As for the other vowels, only /E/, /6/ and /o/ are not present. In addition to these phonemes, Bulgarian has a further 17 palatalized consonants and 5 non-palatalized ones.
4 System Design
This chapter provides an overview of the existing tools and of the research being done, followed by an explanation of the steps required to obtain the classification of the pronunciation.
4.1 State of the art
4.1.1 Scientific research
Over the last few years, several groups have developed interactive language teaching systems based on speech recognition techniques (CAPT).
One of the first working projects was the SPELL project [1], which concentrated on specific phonemes. Other projects focus on scoring complete sentences rather than phonemes. The standard method, which is also the one used in this research, was first analyzed by Witt (1999) in “Use of Speech Recognition in computer assisted language training”. This method uses a measure called GOP to score the pronunciation. There are other possible measures, such as those based on MFCC [25] or on the likelihood [26] [27], but to date there is no record of a method that is more effective than the one described by Witt or some of its modifications [14] [28] [29].
Another project worth mentioning is PLASER (Pronunciation Learning via Automatic SpEech Recognition), a multimedia tool created by the Hong Kong University of Science and Technology to teach American English pronunciation to high school students. In its word exercises, PLASER computes a score based on the confidence of each phoneme in a word and paints the phoneme with a 3-color scheme according to the accuracy of its pronunciation [29]. Besides an overall pronunciation score, it also gives a schematic explanation of how to pronounce the phonemes.
4.1.2 Existing tools
Free software and applications available on the market mostly focus on listening to and repeating words or sentences, without giving any feedback on how well they were pronounced. One of the most complete on the web is forvo, which contains over 1,749,117 words and 1,856,029 pronunciations in 299 languages [29], including pronunciations in EP with translations. Several pronunciation exercises are also available for the English language, for instance in learnersdictionary [31], where the user practices how to pronounce similar-sounding words, sentences and syllable stress, but these are also devoid of any interaction. As for EP, besides the type of application described above, there are only sites with written explanations of how to pronounce the different phonemes, such as learningportuguese [32].
The most interesting free tool found is bonjourdefrance [33], where users read a limited list of French sentences into a microphone and receive feedback on how well they were pronounced. Yet it only gives an overall score and does not indicate precisely which phonemes were poorly pronounced and which were well pronounced.
We have not found any paid interactive applications for EP, but there are several for more widely studied languages such as English. One example is englishlearning [34], which cost, at the time of the survey, between US$ 77.95 and US$ 125.95. It has several thousand words and sentences divided into different levels and classifies the pronunciation.
Overall, most of these systems only provide a general score for a word or utterance and do not indicate where to improve or how to correct the mispronunciations.
4.2 Method
Measuring the quality of the pronunciation requires an audio file and its transcription. The speech is digitized and then, using Audimus, the in-house recognizer, posterior probabilities are calculated on 20 ms frames from the extracted features. Subsequently, a GOP score is calculated for each frame, and a score is given to each phoneme and word by averaging the GOP of its frames. Afterwards the GOP is normalized. Starting from a pre-established threshold obtained from native speakers’ data, the threshold is then adapted in order to obtain the maximum efficiency in scoring each phoneme as a correct or incorrect utterance, taking care not to raise it so much that all phonemes are considered correct. Finally, given the subjective nature of this threshold, the scores of the system are compared with those of three human judges in order to compute their correlation.
Fig. 4- Model’s Schematic.
4.2.1 Audimus
Audimus is an Automatic Speech Recognition system customized for the European Portuguese language and developed by the Spoken Language Systems Laboratory (L2F) of INESC-ID [35]. The system is based on a hybrid automatic speech recognizer that combines the temporal modeling capabilities of Hidden Markov Models (HMMs) with the discriminative pattern classification capabilities of Multi-Layer Perceptrons (MLPs) [36]. As output, Audimus gives, for every frame, the posterior probability of each of the SAMPA phonemes and the identification of the phoneme.
The system starts by dividing the audio file into 20 ms frames and extracts three types of features from each frame, in three separate branches. The first branch extracts 26 PLP (Perceptual Linear Prediction) features, the second 26 Log-RASTA (log-RelAtive SpecTrAl) features and the third 28 MSG (Modulation SpectroGram) coefficients. Each branch then uses an MLP classifier to estimate the phoneme probabilities from its own features. Each MLP has the same basic structure: an input layer with 9 context frames, a non-linear hidden layer with over 1000 sigmoidal units, and 40 softmax outputs. Lastly, the MLP/HMM acoustic model combines the posterior phone probabilities generated by the three phonetic classification branches using an average in the log-probability domain [35] [36].
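The combination step described above can be illustrated with a minimal sketch. It is not the Audimus implementation: the function name `merge_streams`, the probability floor and the final renormalization are assumptions made for illustration, and the posterior values are invented. Averaging in the log-probability domain is equivalent to taking the geometric mean of the branch posteriors.

```python
import math

def merge_streams(stream_posteriors, floor=1e-12):
    """Average class posteriors from several branches in the log domain.

    stream_posteriors: one list of class posteriors per branch, all of the
    same length. The floor avoids log(0) and the result is renormalized to
    sum to one; both choices are assumptions of this sketch.
    """
    n_streams = len(stream_posteriors)
    n_classes = len(stream_posteriors[0])
    merged = []
    for k in range(n_classes):
        # Mean of the log-probabilities of class k over all branches
        mean_log = sum(math.log(max(s[k], floor)) for s in stream_posteriors) / n_streams
        merged.append(math.exp(mean_log))
    total = sum(merged)
    return [m / total for m in merged]

# Hypothetical posteriors of one frame for three classes
plp = [0.7, 0.2, 0.1]
rasta = [0.6, 0.3, 0.1]
msg = [0.8, 0.1, 0.1]
merged = merge_streams([plp, rasta, msg])
```

A class that receives a very low probability in any one branch is strongly penalized by this kind of average, which is one reason a floor is used here.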
Fig. 5- Audimus schematic.
The accuracy presented by Audimus relies on the phonetic transcription, by a system of rules monitored by a linguistic specialist, of 27833 different words, which produced a set of pronunciations. It uses the 39 phonemes defined previously and is speaker-independent. The language models used in training were obtained using the models from CMU_Cambridge. The database used corresponds to approximately 46 million words, of which 321 thousand are different, that appeared in the on-line Portuguese newspaper Público. Of all the words, 80% were used for training, 10% for development and the remaining 10% for evaluation.
Fig. 6- Visualization of the phoneme division in Wavesurfer.
This tool was employed in this project not only to calculate the posterior probabilities, i.e. the probability that a frame of an input audio file corresponds to each of the phonemes described previously, but also as a means of identifying when each phoneme is uttered, given the transcription of the audio.
4.2.2 GOP
The GOP (goodness of pronunciation) method was introduced by Witt and Young (1999) and is one of the most widely used methods for scoring the articulation of words. Its popularity is due to its low computational complexity and its independence from the language to which it is applied. This means that the same method can be used for different languages and dialects, as long as the posterior probabilities and the segmentation of the utterance into phonemes are available. Although it has been shown that the method can yield satisfactory results [37], it requires a threshold that defines the boundary between a good and a bad pronunciation. Thus the quality of the GOP scoring depends on the models used and on the native speakers employed. Nonetheless, the GOP is calculated in the same way for both accurate and inaccurate utterances.
For each phoneme in an utterance, the GOP algorithm calculates the likelihood ratio that the recognized phoneme corresponds to the phoneme that should have been spoken. The GOP score of phoneme p is defined as the frame-normalized logarithm of the posterior probability P(p|O(p)), where O(p) is the acoustic segment uttered by the speaker and NF(p) is the number of frames in O(p) [1].
\[
GOP(p) = \frac{\left|\log P(p \mid O(p))\right|}{NF(p)} = \frac{1}{NF(p)} \left|\log\left(\frac{P(O(p) \mid p)\,P(p)}{\sum_{q \in Q} P(O(p) \mid q)\,P(q)}\right)\right| \tag{4.1}
\]
The posterior probability can then be decomposed, using Bayes' rule, into the ratio between the probability of the observation vector sequence O(p) given the phoneme p times its prior, and the sum, over every phoneme q in the set Q of all phonemes, of the probability of O(p) given q times its prior.
Assuming that all phonemes are equally likely, so that the priors P(p) and P(q) are the same and cancel out, and that the sum can be approximated by its maximum term, the derived GOP can be written as in Eq. 4.2.
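\[
\begin{aligned}
GOP(p) &= \frac{\left|\log P(p \mid O(p))\right|}{NF(p)} = \frac{1}{NF(p)} \left|\log\left(\frac{P(O(p) \mid p)\,P(p)}{\sum_{q \in Q} P(O(p) \mid q)\,P(q)}\right)\right| \\
&\approx \frac{1}{NF(p)} \left|\log\left(\frac{P(O(p) \mid p)}{\sum_{q \in Q} P(O(p) \mid q)}\right)\right| \approx \frac{1}{NF(p)} \left|\log\left(\frac{P(O(p) \mid p)}{\max_{q \in Q} P(O(p) \mid q)}\right)\right|
\end{aligned}
\tag{4.2}
\]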
The GOP(p) value is always greater than or equal to zero. The greater the value, the more likely it is that there is a mispronunciation. In contrast, the nearer it is to zero, the more probable it is that the pronunciation is that of a native speaker [38].
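The approximated GOP of Eq. 4.2 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `gop_score` is hypothetical, the segment log-likelihoods log P(O(p)|q) are assumed to be already available for every phoneme q, and the numbers are invented.

```python
def gop_score(log_likelihoods, target, n_frames):
    """Approximated GOP (last form of Eq. 4.2) for one phoneme segment.

    log_likelihoods: dict mapping each phoneme q to log P(O(p) | q)
    target: phoneme p expected from the transcription
    n_frames: NF(p), the number of frames of the segment
    """
    if n_frames <= 0:
        raise ValueError("a segment must contain at least one frame")
    # The sum over Q is approximated by its largest term
    best = max(log_likelihoods.values())
    return abs(log_likelihoods[target] - best) / n_frames

# Hypothetical segment of 6 frames, with SAMPA labels as keys
segment = {"a": -12.0, "6": -15.0, "E": -20.0}
gop_expected_a = gop_score(segment, "a", 6)  # target is also the best match
gop_expected_6 = gop_score(segment, "6", 6)  # another phoneme fits better
```

When the expected phoneme is also the most likely one the score is zero, and it grows as a competing phoneme explains the segment better.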
4.2.3 NGOP
In order to reduce the influence of extreme values or outliers in the data set without having to remove them, sigmoidal normalization was applied. This way all the data is included and, since this normalization is almost linear near the mean value, the spread of the values around the mean is preserved. The normalized data lies in the range between 0.0 and 1.0 [15].
This normalization takes the raw GOP score and maps it to a score in that range, called NGOP (normalized GOP). That is,
\[
NGOP = \mathrm{sigmoid}(s_u) = \frac{1}{1 + \exp(-\alpha s_u + \beta)} \tag{4.3}
\]
where the parameters alpha and beta are found empirically: alpha sets how rapidly the maximum value is reached, and beta sets the value of the abscissa at which the scale starts to rise. This way it is also easier to visualize the boundary between a good and a bad score [15] [29].
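A minimal sketch of the normalization follows, implementing Eq. 4.3 exactly as printed. The function name `ngop` is hypothetical. The parameter values in the usage lines are an assumption: Section 5.2 reports alpha=10 and beta=-10, but with the sign convention of Eq. 4.3 as printed it is an offset of +10 that places the midpoint of the curve at GOP = 1, which is the shape suggested by the axis ranges of Fig. 7, so +10 is used here for illustration only.

```python
import math

def ngop(gop, alpha, beta):
    """Sigmoidal normalization of a raw GOP score, Eq. 4.3 as printed."""
    return 1.0 / (1.0 + math.exp(-alpha * gop + beta))

# Illustrative parameters (assumed sign convention, see the note above)
ALPHA, BETA = 10.0, 10.0
low = ngop(0.0, ALPHA, BETA)   # close to 0
mid = ngop(1.0, ALPHA, BETA)   # midpoint of the sigmoid
high = ngop(1.4, ALPHA, BETA)  # close to 1
```

With these values the midpoint lies at GOP = beta / alpha, so the ratio of the two parameters controls where the curve starts to rise and alpha alone controls how steep it is.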
Fig. 7- GOP and NGOP relation (horizontal axis: GOP, from 0 to 1.4; vertical axis: NGOP, from 0 to 1).
4.2.4 Threshold
A threshold is needed in order to decide at which point a GOP score starts to indicate an incorrect utterance. The threshold is different for each phoneme and can be calculated in two different ways. First, given a training corpus of native speakers, the threshold can be calculated as the mean of the GOP of each phoneme. Second, it can use not only the mean but also the standard deviation (std) of the GOP, through the expression:
\[
T_{p1} = \mu_p + \alpha\,\mathrm{std}_p + \beta \tag{4.4}
\]
where α and β are empirically determined scaling constants [1].
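Both ways of computing the threshold can be sketched as follows. The function name `phoneme_thresholds` and the example scores are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def phoneme_thresholds(native_scores, alpha=None, beta=None):
    """Per-phoneme thresholds from native speakers' scores.

    native_scores: dict mapping a phoneme to its list of GOP (or NGOP) values.
    With alpha and beta left as None the threshold is the mean of the scores;
    otherwise Eq. 4.4 is applied: mean + alpha * std + beta.
    """
    thresholds = {}
    for phoneme, scores in native_scores.items():
        mu = mean(scores)
        if alpha is None or beta is None:
            thresholds[phoneme] = mu
        else:
            thresholds[phoneme] = mu + alpha * pstdev(scores) + beta
    return thresholds

# Hypothetical native scores for two phonemes
native = {"a": [0.0, 0.1, 0.2], "R": [0.2, 0.4]}
mean_thresholds = phoneme_thresholds(native)
scaled_thresholds = phoneme_thresholds(native, alpha=0.5, beta=0.1)
```

The second form always yields a more tolerant threshold than the mean when alpha and beta are positive, which is how it is used later to relax the decision for poorly scored phonemes.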
Another phoneme-dependent threshold, proposed in [27], uses data from human labeling and is determined by averaging the normalized rejection counts over all speakers:
\[
T_{p2} = \log\left(\frac{1}{N} \sum_{n=1}^{N} \frac{c_n(p)}{\sum_{m=1}^{M} c_n(m)}\right) \tag{4.5}
\]
where c_n(p) is the total number of times that phoneme p, uttered by speaker n, was marked as mispronounced by one of the human judges in the database, M is the total number of phonemes and N is the total number of speakers [27]. However, this introduces another level of subjectivity, which depends on how the human judges decide to label; consequently, the former estimate was selected [1] [37].
4.2.5 GOP for fluent speech
The previous expression can be satisfactory for a single phoneme, but its effectiveness can be limited for fluent speech. A different approach is to determine the acoustic boundaries and the corresponding likelihoods from Viterbi decoding. First, the numerator of the GOP equation is calculated using a forced alignment network, in which the sequence of phoneme models is fixed by the known transcription. Second, the denominator is calculated using an unconstrained phoneme loop network [1].
\[
GOP(p) \approx \left| \frac{\log P(O(p) \mid q_i)}{f_e - f_s} - \sum_{j=1}^{N} \frac{\log P(O(p) \mid q_{ij})}{f_{je} - f_{js}} \right| \tag{4.6}
\]
where f_js and f_je denote the start and end frame numbers of the j-th phone occurring during the current interval from f_s to f_e, and N is the number of phonemes that contribute to this likelihood. This way, the alignment of the phoneme loop will differ from the forced alignment when there is an incorrect utterance. This method is preferred for long texts. However, since the quality of the corpus (see section 5.1) is not high and the audio recordings had to be divided into short sentences, it does not bring an advantage large enough to justify the increase in the number of computations [1] [37].
Fig. 8- GOP for fluent speech [1].
4.2.6 Overall score
There are two ways to obtain the overall word score, both based on a weighted sum of the NGOP of each of its phonemes: using different weights or equal weights. The latter amounts to an arithmetic mean which, despite simplifying the calculations, may not take into account characteristics of the data, e.g. the different thresholds or the fact that certain NGOP values are not as precise as others [1] [37].
\[
PS(word) = \sum_{k=1}^{N} \omega_k \cdot NGOP(phoneme_k) \tag{4.7}
\]
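The word score of Eq. 4.7 can be sketched as below. The function name `word_score` and the example values are hypothetical; the sketch assumes that the weights sum to one, so that equal weights reproduce the arithmetic mean mentioned above.

```python
def word_score(ngop_scores, weights=None):
    """Weighted sum of the NGOP of the phonemes of a word (Eq. 4.7).

    ngop_scores: NGOP of each phoneme of the word, in order.
    weights: one weight per phoneme, assumed to sum to one. When omitted,
    equal weights 1/N are used, which gives the arithmetic mean.
    """
    n = len(ngop_scores)
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("one weight per phoneme is required")
    return sum(w * s for w, s in zip(weights, ngop_scores))

# Hypothetical NGOP values for a three-phoneme word
scores = [0.1, 0.3, 0.5]
equal = word_score(scores)
weighted = word_score(scores, [0.5, 0.25, 0.25])
```

Unequal weights would be one way to give less influence to phonemes whose NGOP is known to be less reliable.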
4.2.7 Performance measure
To analyse the performance of the NGOP classification algorithm for a given threshold, four decision types can be defined: correctly accepted (CA), when phonemes that were pronounced correctly are also judged as correct; correctly rejected (CR), when phonemes that were pronounced incorrectly are judged as incorrect; falsely accepted (FA), when phonemes that were mispronounced are erroneously judged as correct; and falsely rejected (FR), when phonemes that were pronounced correctly are judged as incorrect [1] [39].
To achieve good performance, the algorithm has to be able not only to detect mispronunciations but also to avoid classifying them as correct articulations. As a result, the performance of the scoring can be defined by the scoring accuracy (SA):
\[
SA = \frac{CA + CR}{CA + CR + FA + FR} \times 100 \tag{4.8}
\]
where the objective is to achieve optimal performance by maximizing the scoring accuracy while minimizing the false acceptances. Other useful performance measures are the precision (the number of correct results divided by the number of all returned results), the recall (the number of correct results divided by the number of results that should have been returned) and the F-measure (the harmonic mean of the precision and the recall) of the correctly accepted or correctly rejected phoneme realizations [1]:
\[
\text{Precision of } CA = \frac{CA}{CA + FA} \times 100 \tag{4.9}
\]
\[
\text{Precision of } CR = \frac{CR}{CR + FR} \times 100 \tag{4.10}
\]
\[
\text{Recall of } CA = \frac{CA}{CA + FR} \times 100 \tag{4.11}
\]
\[
\text{Recall of } CR = \frac{CR}{CR + FA} \times 100 \tag{4.12}
\]
\[
F_{measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.13}
\]
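The measures of Eq. 4.8 to Eq. 4.13 can be computed together from the four decision counts. The sketch below is illustrative: the function names and the counts are hypothetical, and returning 0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def _percentage(numerator, denominator):
    """Ratio as a percentage; 0.0 when the denominator is zero (assumed)."""
    return 100.0 * numerator / denominator if denominator else 0.0

def _f_measure(precision, recall):
    """Harmonic mean of precision and recall (Eq. 4.13)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0

def scoring_metrics(ca, cr, fa, fr):
    """SA, precision, recall and F-measure from the counts CA, CR, FA, FR."""
    metrics = {
        "SA": _percentage(ca + cr, ca + cr + fa + fr),  # Eq. 4.8
        "precision_CA": _percentage(ca, ca + fa),       # Eq. 4.9
        "precision_CR": _percentage(cr, cr + fr),       # Eq. 4.10
        "recall_CA": _percentage(ca, ca + fr),          # Eq. 4.11
        "recall_CR": _percentage(cr, cr + fa),          # Eq. 4.12
    }
    metrics["F_CA"] = _f_measure(metrics["precision_CA"], metrics["recall_CA"])
    metrics["F_CR"] = _f_measure(metrics["precision_CR"], metrics["recall_CR"])
    return metrics

# Hypothetical counts for one phoneme
m = scoring_metrics(ca=70, cr=15, fa=5, fr=10)
```

Reporting the measures separately for accepted and rejected realizations makes it visible when a high SA is obtained mostly by accepting everything.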
4.3 Other classification methods
4.3.1 Likelihood Ratio
This classification method, proposed in [26] [27], uses a likelihood ratio instead of the GOP score. Let x be the intended phoneme, obtained from the forced alignment, and y the phoneme resulting from the free alignment. Then
\[
LR(x, y, O) = \log\left(\frac{P(O \mid x)}{P(O \mid y)}\right) \tag{4.14}
\]
The Likelihood Ratio (LR) is useful if information about y is available as well. The LR is based on binary classification, determining whether O is more like x or y. As with the GOP, the score is compared with a threshold: if the LR score is higher than 0, the segment O is judged as correct, and otherwise it is not. This score shows good results [26] [27], but it also implies that more scores have to be calculated, which may not be viable in a web application.
4.3.2 MFCC and DTW based evaluation
This scoring method, proposed in [25], begins by analyzing the waveform and calculating the Mel-Frequency Cepstrum Coefficients (MFCC), since they are the most commonly used acoustic features in speech processing systems. Then the Euclidean distance between the MFCCs of the student's pronunciation and those of the standard pronunciation is calculated using the Dynamic Time Warping (DTW) algorithm. This score, together with a comparison with a standard length of the phoneme, gives an evaluation of the pronunciation. While it is interesting that this scoring method uses the length of the phonemes, it also means that the standard lengths and MFCC data have to be stored, which can be computationally heavy. Furthermore, this scoring method only helps students learn isolated phonemes and not fluent speech.
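The DTW step of this method can be illustrated with a textbook dynamic-programming sketch. It is not the implementation of [25]: the function name `dtw_distance` is hypothetical, the feature vectors are assumed to be already extracted (MFCC computation is not shown), and the one-dimensional example values are invented.

```python
import math

def dtw_distance(reference, student):
    """Accumulated Euclidean distance along the best DTW alignment.

    reference, student: sequences of feature vectors (lists of floats),
    e.g. one MFCC vector per frame. The sequences may differ in length.
    """
    n, m = len(reference), len(student)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning the first i and j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(reference[i - 1], student[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch student frame
                                     cost[i][j - 1],      # stretch reference frame
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Hypothetical one-dimensional features
ref = [[0.0], [1.0], [2.0]]
slower = [[0.0], [1.0], [1.0], [2.0]]  # same sounds, one frame repeated
different = [[0.0], [2.0], [3.0]]
same_cost = dtw_distance(ref, slower)
diff_cost = dtw_distance(ref, different)
```

Because the alignment absorbs differences in speaking rate, the distance itself carries no duration information, which is why the method compares phoneme lengths separately.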
5 Experiments and results
This chapter describes the validation of the steps in the implementation of the project and the results obtained.
5.1 Corpus
The native corpus is composed of 15 people from the Lisbon area, 7 males and 8 females, aged between 22 and 24. The non-native corpus is composed of one 23-year-old Venezuelan female, whose native language is Spanish, and a group of 11 Bulgarians, 6 males and 5 females, aged between 27 and 42. The speakers were asked to read several sentences, yielding over 14000 phonemes for the native and 9600 for the non-native speakers.
The sentences were recorded using a high-quality head-mounted microphone, in mono, with 16-bit resolution and a 16 kHz sampling rate. The text prompts were:
1. “Os industriais preveem uma diminuição da produção. Os empresários da
construção apontam para um recuo da procura. No comércio a retalho
espera-se uma evolução desfavorável do volume de negócios. A
confiança entre os consumidores vem conhecendo um forte recuo desde
Abril”;
2. “Na noite de segunda-feira por motivos ainda não totalmente apurados
os companheiros fizeram barulho antes de tempo o jovem acabou por ser
enleado na corda e arrastado por vários quilómetros”;
3. “O projecto está avaliado em cerca de um milhão de contos e pretende
evitar a entrada na lagoa de são martinho de grandes quantidades de
poluição evitam-se desta forma os transtornos verificados numa das
zonas mais importantes da região sobretudo na época estival”;
4. “O vento norte e o sol discutiam qual dos dois era o mais forte
quando passou um viajante envolto num casaco. Ao vê-lo apostaram que
aquele que primeiro conseguisse obrigar o viajante a tirar o casaco
seria considerado o mais forte. O vento norte começou a soprar com
muita força mas quanto mais soprava mais o viajante se embrulhava no
seu casaco até que o vento norte desistiu. O sol brilhou então com
toda a intensidade. E imediatamente o viajante tirou o casaco O vento
norte teve assim de reconhecer a superioridade do sol”.
The transcriptions of the phrases were altered whenever a speaker misread a prompt and substituted one word for another.
The native speakers' corpus was used to compute the threshold. The non-native corpus, separated into groups by native language, was used to test and adjust that threshold. It should be noted that the quality of the non-native corpus is not the best: many sentences contain pauses and gasps, because the degree of difficulty of the sentences was not in accordance with the level of the students. In many cases the alignment was difficult, especially from the middle to the end of each track. To improve the alignment, the tracks were divided into simple sentences, which provided a slight improvement; nonetheless, the disparities still caused several misalignments.
5.2 Implementation
Using Audimus, the posterior probabilities are computed in forced alignment mode for the native speakers' speech. These are used in the approximated expression of Eq. 4.2 to calculate the GOP score. The normalization is then applied with alpha=10 and beta=-10, giving the NGOP. For comparison, the threshold was estimated for both the GOP and the NGOP scores.
The first part of the implementation is repeated for the non-native corpus, but this time each phoneme resulting from the alignment is classified by a human jury as either a good or a bad pronunciation. This classification is compared with the one obtained by comparing the score with the predefined threshold, and the SA is calculated.
For the first tests, the mean of the native scores was used as the threshold. After testing the scores on the non-native data, the threshold of each phoneme with an SA below 70% was raised using Eq. 4.4 with alpha=0.5 and beta=0.1. Since the corpus is limited, the results can be inexact for certain phonemes.
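The threshold adjustment described above can be sketched for a single phoneme as follows. This is an illustration, not the code used in the project: the function names and data are hypothetical, a score is assumed to be accepted when it does not exceed the threshold, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

def scoring_accuracy(scores, judged_correct, threshold):
    """SA in percent: agreement between the threshold decision and the jury.

    A score is accepted when it does not exceed the threshold (assumed rule).
    judged_correct holds the jury labels, True for a good pronunciation.
    """
    hits = sum((s <= threshold) == ok for s, ok in zip(scores, judged_correct))
    return 100.0 * hits / len(scores)

def adjusted_threshold(native, nonnative, judged_correct,
                       alpha=0.5, beta=0.1, min_sa=70.0):
    """Start from the native mean; apply Eq. 4.4 if the SA is below min_sa."""
    threshold = mean(native)
    if scoring_accuracy(nonnative, judged_correct, threshold) < min_sa:
        threshold = mean(native) + alpha * pstdev(native) + beta
    return threshold

# Hypothetical NGOP values for one phoneme
native = [0.0, 0.1, 0.2]
nonnative = [0.15, 0.18, 0.9, 0.05]
labels = [True, True, False, True]
initial_sa = scoring_accuracy(nonnative, labels, mean(native))
threshold = adjusted_threshold(native, nonnative, labels)
final_sa = scoring_accuracy(nonnative, labels, threshold)
```

In this invented example the native mean rejects two utterances that the jury accepted, so the SA starts at 50% and the relaxed threshold recovers them.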
5.3 Results
5.3.1 Threshold for Native speakers
Before computing the threshold, it is crucial to validate the approximation in Eq. 4.2 and to check whether NGOP actually represents an improvement over GOP.
5.3.1.1 Approximated vs exact GOP
First of all, it is important to verify the accuracy of the approximation in Eq. 4.2. For that purpose, the GOP score was calculated with and without the approximation for several sentences (native and non-native). The approximation is valid, since the two scores differ on average by 0.0668 for GOP and 0.0521 for NGOP, with the maximum difference occurring for higher scores. The figure below illustrates the comparison for some phonemes.
Fig. 9- Example of exact and approximated GOP
5.3.1.2 Comparison between GOP and NGOP
To compare the values of GOP and NGOP, the average standard deviation was first calculated, yielding 0.476 for GOP and 0.213 for NGOP. This means that there is a smaller spread between values in NGOP, allowing a better classification. Secondly, it is important to verify whether the normalization improves the scoring. The GOP scores are more widely distributed, while the NGOP values are concentrated at the extremes. Furthermore, the NGOP values are confined to the interval between 0 and 1, while GOP scores have no upper bound, and there are fewer borderline cases in NGOP.
As seen in Figure 10, the SA improves substantially when the normalized version is used instead of the non-normalized one.
Fig. 10- NGOP's and GOP's SA comparison for each phoneme (two panels, each with the series NGOP and GOP).
Since both assumptions hold, the mean of the NGOP of each phoneme uttered by all native speakers was computed and established as a preliminary threshold, to be adjusted later with the tests on the non-native data.
Phoneme Mean N. occurrences Phoneme Mean N. occurrences
u 0.140049 1416 o 0.02793 214
6 0.08742 1092 w~ 0.098509 213
r 0.166272 962 w 0.334389 145
t 0.005636 880 l~ 0.142358 143
d 0.127353 745 f 0.006546 142
@ 0.230987 727 o~ 0.057922 142
S 0.005135 612 e 0.053197 136
k 0.015536 567 z 0.252077 131
i 0.040606 527 E 0.019217 128
interwordpause 0.005403 526 Z 0.016086 117
a 0.004149 525 l 0.039551 116
s 0.014778 504 u~ 0.315518 112
6~ 0.065146 449 b 0.050564 101
v 0.044119 389 g 0.051099 87
j 0.043568 369 R 0.089092 84
m 0.017172 345 i~ 0.06171 72
p 0.005864 340 L 0.040425 72
O 0.052711 266 j~ 0.195488 72
n 0.086556 259 J 0.089502 57
e~ 0.058275 245
Table 5- Mean and number of occurrences by phoneme
The values vary between 0.33 for /w/ and 0.005 for the interword pause. There are problems with vowel reduction, such as with /@/, /u/ and /6/, and some problems with word co-articulation, which result from the training of Audimus not taking these phenomena into account.
To better examine the results, and since a perfect pronunciation implies that the score is 0, i.e. the PP of the phoneme is the maximum PP, a confusion matrix was created in order to count which phoneme has the maximum PP (presented in Annex 1). It is noticeable that for natives the maximum is in most cases the phoneme itself, with the exception of some /z/ being pronounced as /S/ and some /u/ and /@/ being deleted, not only within words but also at word boundaries, as predicted in 3.1.1.
5.3.1 SA results for all non-native
Despite the larger number of occurrences of each phoneme in the native corpus, Figure 11 shows that there are more scores near one (mispronunciations) for the non-native speakers (red) than for the native speakers (blue). The scores are also more spread out over the [0,1] interval in the non-native data. The average SA is 77.04% and the F-measure is 83.30%. These results are not very informative since, as explained before, the difficulties are different for each language. Starting from the threshold found previously, the SA was measured and, for the phonemes with low scores (SA < 70%), Eq. 4.4 was applied with alpha = 0.5 and beta = 0.1.
Fig. 11- NGOP for each phoneme for native (blue) and non-native (red) with native mean and std above.
One panel per phoneme, with a horizontal axis from 0 to 1. Panel titles (phoneme, mean, std):
A 0.186 0.373; a 0.195 0.380; A~ 0.198 0.376; b 0.371 0.470; S 0.188 0.375; d 0.281 0.436; E 0.257 0.399; e 0.245 0.413;
@ 0.230 0.401; e~ 0.290 0.443; f 0.320 0.460; g 0.399 0.481; i 0.326 0.458; i~ 0.286 0.445; ip 0.024 0.146; Z 0.341 0.468;
k 0.193 0.382; l 0.357 0.469; l~ 0.317 0.453; L 0.416 0.494; m 0.219 0.401; n 0.239 0.420; J 0.545 0.496; O 0.189 0.367;
o 0.146 0.328; o~ 0.230 0.415; p 0.186 0.377; r 0.300 0.456; R 0.423 0.480; s 0.240 0.415; t 0.167 0.363; u 0.222 0.390;
u~ 0.259 0.429; v 0.425 0.486; w 0.487 0.489; w~ 0.221 0.408; j 0.356 0.460; j~ 0.464 0.479; z 0.496 0.497
5.3.2 SA results for Spanish
The confusion matrix for Spanish (Annex 1.2) shows, as expected, a difficulty in the pronunciation of nasalized vowels, since these do not exist in Spanish. Likewise, open-mid and mid vowels are often replaced by the closest-sounding phoneme. There were no considerable difficulties in the pronunciation of /v/, but there was significant mislabeling of /S/, /z/ and /Z/.
Fig. 12- SA for each Phoneme in Spanish
For the phonemes /e/, /@/, /o/, /u/, /u~/, /l~/ and /w/ the threshold was modified. Although this improved the SA in the majority of cases, it still did not surpass 70%. Also, although some phonemes have an accuracy of 1 (perfect accuracy), this does not necessarily mean that the pronunciation is perfect; it can also mean that there are not enough occurrences of the phoneme. The average SA is 82.53% and the average F-measure is 88.45%.
With these results, the thresholds for Spanish are:
Phoneme Threshold
6 0.08742
a 0.104149
6~ 0.165146
b 0.150564
S 0.105135
d 0.227353
E 0.119217
e 0.248972
@ 0.528522
e~ 0.158275
f 0.106546
g 0.151099
i 0.140606
Phoneme Threshold
i~ 0.16171
interwordpause 0.105403
Z 0.116086
k 0.115536
l 0.139551
l~ 0.399668
L 0.140425
m 0.117172
n 0.186556
J 0.189502
O 0.152711
o 0.189265
o~ 0.157922
Phoneme Threshold
p 0.105864
r 0.266272
R 0.189092
s 0.114778
t 0.105636
u 0.397042
u~ 0.631825
v 0.144119
w 0.653362
w~ 0.198509
j 0.143568
j~ 0.295488
z 0.352077
Table 6- Threshold of each phoneme for Spanish
5.3.3 SA results for Bulgarian
The confusion matrix for Bulgarian shows that, despite reasonable results for nasalised phonemes, there is confusion in the vowels, as expected. For coronal consonants, the maxima are more widely distributed among other phonemes.
Fig. 13– SA for each Phoneme in Bulgarian
Here, 21 phonemes had an SA lower than 70%, and their thresholds were also modified. In this case too, although the SA improved for the majority of them, it did not surpass 70%. Moreover, since there were more phoneme samples in this case, the SA decreased, and no phoneme reached an accuracy of 1. The average SA is 73.23% and the average F-measure is 76.66%. The lower scores can also be explained by the quality of the audio tracks provided.
With these results, the thresholds for Bulgarian are:
Phoneme Threshold
6 0.315838
a 0.004149
6~ 0.273898
b 0.050564
S 0.005135
d 0.127353
E 0.169588
e 0.248972
@ 0.528522
e~ 0.058275
f 0.006546
g 0.051099
i 0.2325
Phoneme Threshold
i~ 0.276879
interwordpause 0.140755
Z 0.016086
k 0.166636
l 0.225398
l~ 0.142358
L 0.040425
m 0.017172
n 0.322884
J 0.089502
O 0.259354
o 0.02793
o~ 0.260756
Phoneme Threshold
p 0.005864
r 0.450067
R 0.328808
s 0.170876
t 0.005636
u 0.397042
u~ 0.315518
v 0.044119
w 0.653362
w~ 0.340577
j 0.232914
j~ 0.485334
z 0.563233
Table 7- Threshold of each phoneme for Bulgarian
5.3.4 Comparison, Critique and Evaluation
As the mathematical approximation is viable and the normalization reduces the outliers and confines the values to the interval [0,1], the computation of NGOP for speech evaluation is validated. The SA values for Spanish and Bulgarian are satisfactory, despite the bad recording conditions of the corpora, but they can also be influenced by how many times a phoneme occurs: a phoneme that occurs more frequently is better evaluated. It was also noted that, despite the good phoneme-based results, the fact that some phonemes were not uttered with the same duration as a native speaker's makes the complete word or sentence sound unnatural. This implies that further improvements should take duration into account. This was not implemented for two reasons. First, Audimus already measures the duration of each phoneme: if a segment does not last at least a minimum time interval, computed by averaging the duration of each phoneme in the corpus on which Audimus was trained, it does not count as that phoneme. Second, even if the interval used by Audimus is incorrect, a duration variable would require scrutinizing every single phoneme in the text and storing the average duration of that phoneme in that particular place in the sentence, making the interface less adaptable to further updates.
5.3.4.1 Performance measures
With the threshold established, and considering that this is a subjective technique, three performance measures were also computed to compare the scoring between the transcriptions of two judges, or of one judge and the automatic NGOP. This allows a cross-validation between judges of the number of errors that each one finds.
The first measure is strictness, which measures how strict a judge is. It also shows how subjective judgment interferes with the borderline cases between correct and incorrect. The strictness of a judge's labelling can be defined as the overall fraction of phones which are rejected, i.e. the relative strictness.
$$S = \frac{\text{Count of Rejected Phonemes}}{\text{Total Count of Phonemes}} \qquad \text{(Eq 5.1)}$$
To compare two judges, J1 and J2, it suffices to compute the difference between their strictness values.
$$\delta S = |S_{J1} - S_{J2}| \qquad \text{(Eq 5.2)}$$
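As an illustration of Eq 5.1 and Eq 5.2, the strictness of a judge and the strictness difference between two judges could be computed as in the following minimal sketch. The function names and the label vectors are hypothetical and serve only as an example; each label is assumed to be 1 for a rejected phoneme and 0 for an accepted one.

```python
def strictness(labels):
    """Fraction of rejected phonemes (1 = rejected, 0 = accepted), as in Eq 5.1."""
    if not labels:
        raise ValueError("labels must contain at least one phoneme")
    return sum(labels) / len(labels)


def strictness_difference(labels_j1, labels_j2):
    """Absolute difference between the strictness of two judges, as in Eq 5.2."""
    return abs(strictness(labels_j1) - strictness(labels_j2))


# Hypothetical labels for an eight-phoneme utterance (illustrative only).
j1 = [0, 1, 0, 0, 1, 0, 0, 0]  # judge 1 rejects 2 of 8 phonemes
j2 = [0, 1, 0, 1, 1, 0, 0, 1]  # judge 2 rejects 4 of 8 phonemes

print(strictness(j1))                 # 0.25
print(strictness(j2))                 # 0.5
print(strictness_difference(j1, j2))  # 0.25
```

Note that strictness only counts rejections; two judges can have the same strictness while rejecting entirely different phonemes.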
The second measure is the agreement, which takes into account whether each phoneme is considered mispronounced or not by two different judges.
$$A_{J1,J2} = 1 - \frac{1}{N}\,\|X_{J1} - X_{J2}\|_C \qquad \text{(Eq 5.3)}$$
where $\|X\|_C = \sum_{i=0}^{N-1} |x(i)|$, $x$ is a vector of size $N$ (the total number of phonemes in the sentence) and $x(i) \in \{0, 1\}$, with 0 for phonemes classified as correct and 1 for the others.
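The agreement of Eq 5.3 can be sketched in a few lines. The function name and the example vectors below are hypothetical; the only assumption is that both judges label the same N phonemes in the same order.

```python
def agreement(x_j1, x_j2):
    """Agreement between two binary label vectors, as in Eq 5.3."""
    n = len(x_j1)
    if n == 0 or n != len(x_j2):
        raise ValueError("both vectors must be non-empty and of equal length")
    # City-block norm of the difference: number of phonemes labelled differently.
    city_block = sum(abs(a - b) for a, b in zip(x_j1, x_j2))
    return 1 - city_block / n


# Hypothetical labels for an eight-phoneme utterance (illustrative only).
x_j1 = [0, 1, 0, 0, 1, 0, 0, 0]
x_j2 = [0, 1, 0, 1, 1, 0, 0, 1]

# The judges disagree on 2 of the 8 phonemes.
print(agreement(x_j1, x_j2))  # 0.75
```

Because most phonemes are usually accepted by both judges, the agreement tends to be high even when the judges rarely reject the same phonemes.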
The last one, the cross-correlation, measures the overall agreement between the reference and the detected errors, i.e. the similarity between all segments which contain rejections in the transcriptions. It is measured by
$$CC_{J1,J2} = \frac{X_{J1}^{T} X_{J2}}{\|X_{J1}\|_E \, \|X_{J2}\|_E} \qquad \text{(Eq 5.4)}$$
where $\|X\|_E = \sqrt{\sum_{i=0}^{N-1} x(i)^2}$ is the standard Euclidean norm [1].
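A minimal sketch of the cross-correlation of Eq 5.4 is given below. The function name and the example vectors are hypothetical. The measure is undefined when one of the judges rejects no phoneme at all, since the corresponding norm is zero; raising an error in that case is a choice made for this sketch.

```python
import math


def cross_correlation(x_j1, x_j2):
    """Normalised cross-correlation of two binary rejection vectors, as in Eq 5.4."""
    if len(x_j1) != len(x_j2):
        raise ValueError("both vectors must have the same length")
    dot = sum(a * b for a, b in zip(x_j1, x_j2))
    norm_j1 = math.sqrt(sum(a * a for a in x_j1))
    norm_j2 = math.sqrt(sum(b * b for b in x_j2))
    if norm_j1 == 0 or norm_j2 == 0:
        raise ValueError("undefined when a judge rejects no phoneme")
    return dot / (norm_j1 * norm_j2)


# Hypothetical labels for an eight-phoneme utterance (illustrative only).
x_j1 = [0, 1, 0, 0, 1, 0, 0, 0]  # 2 rejections
x_j2 = [0, 1, 0, 1, 1, 0, 0, 1]  # 4 rejections, 2 of them shared with judge 1

print(round(cross_correlation(x_j1, x_j2), 4))  # 0.7071
```

Unlike the agreement, this measure ignores phonemes accepted by both judges, so it reflects only how well the rejections coincide.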
With the performance measures described above, a small study was conducted to compare the ratings between human judges. The inter-judge correlation was measured on 20 calibration sentences (8 from native and 12 from non-native speakers) rated by 3 native judges. The results were calculated by averaging A, CC and PC for each judge in relation to all the others.
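The averaging of a measure "for each judge in relation to all the others" can be read as follows: for every judge, the measure is computed against each of the other judges and the values are averaged. The sketch below illustrates this reading with the agreement measure only; the judge names, the labels and the helper names are hypothetical, and the same helper could be applied to any other pairwise measure.

```python
def agreement(x_a, x_b):
    """Agreement between two binary label vectors (1 = rejected, 0 = accepted)."""
    differing = sum(abs(a - b) for a, b in zip(x_a, x_b))
    return 1 - differing / len(x_a)


def mean_against_others(labels_by_judge, measure):
    """Average a pairwise measure for each judge against all the other judges."""
    result = {}
    for judge, labels in labels_by_judge.items():
        scores = [
            measure(labels, other_labels)
            for other, other_labels in labels_by_judge.items()
            if other != judge
        ]
        result[judge] = sum(scores) / len(scores)
    return result


# Hypothetical labels from three judges for the same eight phonemes.
labels_by_judge = {
    "J1": [0, 1, 0, 0, 1, 0, 0, 0],
    "J2": [0, 1, 0, 1, 1, 0, 0, 1],
    "J3": [0, 1, 0, 0, 1, 0, 0, 1],
}

per_judge = mean_against_others(labels_by_judge, agreement)
print(per_judge)  # {'J1': 0.8125, 'J2': 0.8125, 'J3': 0.875}
```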
Performance measure   AA     CC     PC
Average               0.87   0.45   0.74
Table 8 - Average performance measures
As for the comparison with the NGOP, after adjusting the threshold to give the best possible score, each judge obtained the AA, CC and PC values below.
Judge AA CC PC
J1 vs NGOP 0.86 0.52 0.79
J2 vs NGOP 0.88 0.40 0.67
J3 vs NGOP 0.87 0.43 0.69
Table 9 - Judges vs GOP scoring
Evaluating these results, we find that the judges' views did not differ greatly from the NGOP; the differences depended on each judge's subjective interpretation. A stricter judge would scale down the results.
6 Interface
This chapter provides an overview of the Interface where the GOP module was integrated.
6.1 VITHEA project
VITHEA (Virtual Therapist for Aphasia treatment - Terapeuta Virtual para o tratamento da Afasia) is a software program for the treatment of aphasic patients, particularly those who show difficulties when recalling words, that incorporates recent advances in speech and language technology. It was created by L2F (Spoken Language Systems Lab - Laboratório de sistemas de Língua Falada), part of INESC (Institute for Systems and Computer Engineering - Instituto de Engenharia de Sistemas e Computadores), and by LEL (Language Research Laboratory - Laboratório de Estudos de Linguagem), part of the Department of Clinical Neurosciences of the Lisbon Faculty of Medicine and the Santa Maria Hospital.
The software acts as a "virtual therapist", asking the patient to recall the content of a photo or picture that is shown. Using automatic speech recognition (ASR) technology, the program is able to recognize what the patient said and to validate whether it was correct. The "virtual therapist" is able to provide help whenever the user asks for it, both semantically and phonologically, either as a written solution or as synthesized speech based on text-to-speech (TTS) technology [40].
6.2 3P Interface
The VITHEA project interface was adapted to create the 3P Interface; however, due to time restrictions, it was not possible to test the Interface as a whole. Nevertheless, the VITHEA interface itself was extensively tested and showed good performance. It uses a JSP/Servlet server, connected to Audimus, a Database Management System and the internet, to host a Flash application available in a Web browser.
Fig. 14 - 3P Interface
On the left side, the interface shows what the student is supposed to repeat; on the right side, it shows the sound recording and the feedback instructions player.
7 Conclusion
This chapter finalises this work, summarising conclusions and pointing out aspects to be developed in
future work.
7.1 Conclusions
This thesis explored an approach for an interactive system that helps people learn a new language. It was motivated by the fact that EP learners find it difficult to find material for learning the language. Hence, creating an interactive system that scores the different phonemes/words of European Portuguese gives students an opportunity to study and reach pronunciation proficiency through self-study, as opposed to requiring the sole attention of a human teacher.
Gaining pronunciation proficiency as an adult is a difficult task that requires a great deal of training, repetition and memorization, so it makes sense to use automatic systems to aid learning. CALL explores techniques for developing an automatic professor for pronunciation learning. This helps students to train by repeating the same words or phrases while receiving a classification of their performance.
Portuguese is a Romance language with approximately 215-220 million native speakers and 260 million speakers in total, of whom over 10 million speak EP. Within EP there are two main dialects: northern and central/southern. Here we explored the central/southern dialect, spoken in the capital, Lisbon. This dialect is characterized mainly by the substitution of /ei/ and /ou/ by /6j/ and /o/. Spoken EP has the particularity of having vowel reductions and several co-articulations. EP is composed phonetically of 39 phonemes.
The languages of the non-native corpus are Spanish and Bulgarian, which are characterized, in comparison with EP, by the lack of nasalised sounds, open or semi-vowels and by the absence of certain consonants.
This thesis used one of the most standard methods, first analyzed by Witt (1999), which relies on a pronunciation performance measure named GOP. A sigmoidal transformation was applied to reduce the outliers and confine the values between 0 and 1, as opposed to between 0 and infinity. The resulting measure is referred to as NGOP. There are also other projects that focus on scoring phonemes using, for example, MFCC and DTW, but according to their documentation these did not surpass the GOP in either efficiency or ease of computation.
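To illustrate how a sigmoidal transformation can confine a non-negative score to the interval between 0 and 1, the sketch below uses one possible sigmoid-shaped mapping, 2 / (1 + exp(-g / scale)) - 1. This particular formula, the function name and the `scale` parameter are assumptions made for this example only; they are not necessarily the transformation used to obtain the NGOP.

```python
import math


def squash_score(gop, scale=1.0):
    """Map a non-negative score to [0, 1) with a sigmoid-shaped curve.

    Hypothetical choice for illustration: 2 / (1 + exp(-gop / scale)) - 1.
    """
    if gop < 0:
        raise ValueError("the score is assumed to be non-negative")
    return 2.0 / (1.0 + math.exp(-gop / scale)) - 1.0


# Small scores stay close to 0; large outliers saturate close to 1.
for raw in (0.0, 0.5, 1.0, 5.0, 50.0):
    print(raw, round(squash_score(raw), 4))
```

With such a mapping, very large outliers are compressed towards 1 instead of dominating the mean and standard deviation computed over a corpus.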
Therefore, the system calculates the NGOP for the audio, segmented into phonemes, using the probabilities computed by Audimus. Then, using a corpus of native speakers, the mean and standard deviation of the NGOP are calculated. Using the non-native corpus, divided by language, a threshold was then established. This threshold was constructed by first applying the native mean as a threshold to the non-native data to obtain the SA of each phoneme. For the phonemes with less than 70% SA, the threshold was increased using the formula in Eq. 4.4.
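The first steps of this procedure can be sketched as follows. Several points are assumptions made for the example: a phoneme is taken to be rejected when its NGOP exceeds the threshold, the SA is taken to be the fraction of samples for which this decision matches the human label, and all names and data are hypothetical. The adjustment formula of Eq. 4.4 is not reproduced here; the sketch only identifies the phonemes that would need it.

```python
from statistics import mean, pstdev


def native_stats(native_scores):
    """Mean and standard deviation of the native NGOP values of each phoneme."""
    return {p: (mean(v), pstdev(v)) for p, v in native_scores.items()}


def scoring_accuracy(samples, threshold):
    """Fraction of samples where the threshold decision matches the human label.

    samples: list of (ngop, rejected_by_human) pairs for one phoneme.
    Assumption: a phoneme is rejected when its NGOP is above the threshold.
    """
    hits = sum((ngop > threshold) == rejected for ngop, rejected in samples)
    return hits / len(samples)


def phonemes_needing_adjustment(native_scores, non_native_samples, min_sa=0.70):
    """Apply the native mean as threshold and list phonemes with SA below min_sa."""
    stats = native_stats(native_scores)
    flagged = {}
    for phoneme, samples in non_native_samples.items():
        threshold = stats[phoneme][0]  # native mean used as the first threshold
        sa = scoring_accuracy(samples, threshold)
        if sa < min_sa:
            flagged[phoneme] = sa
    return flagged


# Hypothetical data (illustrative only).
native = {"a": [0.1, 0.2, 0.3], "R": [0.2, 0.4]}
non_native = {
    "a": [(0.05, False), (0.5, True), (0.1, False), (0.6, True)],
    "R": [(0.5, False), (0.6, False), (0.1, False), (0.9, True)],
}

flagged = phonemes_needing_adjustment(native, non_native)
print(flagged)  # {'R': 0.5}
```

In this example the phoneme "R" is flagged because two correct pronunciations score above the native mean, which is the situation in which raising the threshold helps.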
Although the results were satisfactory, the quality of the non-native corpus is not ideal: the sentences contain several pauses and gasps, which raises some doubts about the reliability of the results.
For Spanish, since the corpus was too small, some SA values are 1. The threshold was recalculated for seven phonemes, most of them vowels that do not exist in the speakers' native language. The average SA is 82.53% and the average F-measure is 88.45%.
For Bulgarian, the scoring was worse, with 21 phonemes having an SA lower than 0.7. The threshold was recalculated for these cases, but there was no major improvement. The threshold was not shifted to even higher values, since this would increase the number of falsely accepted scores. The average SA is 73.23% and the average F-measure is 76.66%. The lower scores can also be explained by the quality of the audio tracks provided.
Overall, the SA values for Spanish and Bulgarian are satisfactory, but when listening to whole sentences or words, some did not sound correct. One of the reasons was that the phonemes did not have the duration of native speech. A variable measuring time could solve this problem, but it would also diminish the simplicity of the NGOP.
Using the thresholds, performance measures were applied with three native judges. Evaluating the results, the judges' views did not differ greatly from the NGOP; however, since everything depended on each judge's subjective interpretation, a stricter judge could scale down the results.
7.2 Future work
The principal flaw in this thesis was the lack of good non-native audio files, so a major improvement can be made by recording simpler sentences in accordance with the level of the students. Only two languages were analysed, Spanish and Bulgarian, so a wider and more differentiated corpus would help to accommodate a more varied set of native languages. Another enhancement can be made by retraining Audimus taking into consideration the vowel reductions and the co-articulations. Finally, although the VITHEA interface was heavily tested, it would be good to test the 3P interface with native and non-native subjects.
Annex 1 – Extra Tables
This annex presents additional tables that support the explanation of the research.
A.1 Example of words using the EP phoneme list
Class                  Symbol  Word      Transcription
Consonants
  plosives             p       pai       p"aj
                       b       barco     b"arku
                       t       tenho     t"6Ju
                       d       doce      d"os@
                       k       com       ko~
                       g       grande    gr"6~d@
  fricatives           f       falo      f"alu
                       v       verde     v"erd@
                       s       céu       s"Ew
                       z       casa      k"az6
                       S       chapéu    S6p"Ew
                       Z       jóia      Z"Oj6
  nasals               m       mar       m"ar
                       n       nada      n"ad6
                       J       vinho     v"iJu
  liquids              l       lanche    l"6~S@
                       L       trabalho  tr6b"aLu
                       r       caro      k"aru
                       R       rua       R"u6
Vowels and diphthongs  i       vinte     v"i~t@
                               lápis     l"apiS
                       e       fazer     f6z"er
                       E       belo      b"Elu
                       a       falo      f"alu
                       6       cama      k"6m6
                               madeira   m6d"6jr6
                       O       ontem     "O~t6~j~
                       o       lobo      l"obu
                       u       jus       Z"uS
                               futuro    fut"uru
                       @       felizes   f@l"iz@S
                       i~      fim       f"i~
                       e~      emprego   e~pr"egu
                       6~      irmã      irm"6~
                       o~      bom       b"o~
                       u~      um        u~
                       aw      mau       m"aw        etc.: iw, ew, Ew, (ow)
                       aj      mais      m"ajS       etc.: ej, Ej, Oj, oj,
                       6~j~    têm       t"6~j~6~j   etc.: e~j~, o~j~, u~j~
Taken from [20]
A.2 Confusion matrices for Portuguese, Spanish and Bulgarian
In this section, the number of times each phoneme is identified as another phoneme is presented. The cases in which a phoneme is identified as itself are shown in purple, and some cases in which a large portion of a phoneme's occurrences were mistaken for another phoneme are shown in orange.
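As a reading aid, the sketch below shows how such a matrix can be summarised. It uses only the 3x3 sub-block for the plosives b, d and g taken from the confusion matrix for natives, and it assumes that each row corresponds to the phoneme being evaluated and each column to the phoneme it was identified as; this orientation is an assumption. The shares are relative to this sub-block only, not to the full rows, and the function name is hypothetical.

```python
# Sub-block of the confusion matrix for natives (rows and columns: b, d, g).
labels = ["b", "d", "g"]
counts = [
    [62, 6, 1],
    [2, 444, 7],
    [2, 1, 58],
]


def row_summary(labels, counts):
    """For each row: diagonal share and the most frequent off-diagonal confusion."""
    summary = {}
    for i, row in enumerate(counts):
        total = sum(row)
        off_diagonal = [(n, labels[j]) for j, n in enumerate(row) if j != i]
        worst_count, worst_label = max(off_diagonal)
        summary[labels[i]] = (row[i] / total, worst_label, worst_count)
    return summary


summary = row_summary(labels, counts)
for phoneme, (share, confused_with, n) in summary.items():
    print(f"{phoneme}: {share:.3f} on the diagonal, most confused with {confused_with} ({n})")
```

Applied to a full matrix, the same computation highlights the cells that the original colouring marks: the diagonal entries and the largest off-diagonal confusions.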
Confusion matrix for natives
ip b d g p t k s z f v S Z l l~ L r R m n
ip 456 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0
b 0 62 6 1 20 1 1 0 0 1 2 0 0 0 0 0 0 0 1 0
d 14 2 444 7 5 122 4 10 0 0 6 17 1 3 3 2 10 2 3 15
g 1 2 1 58 0 0 17 0 0 1 0 0 0 0 0 0 0 0 0 0
p 18 1 0 0 286 21 5 0 0 1 0 0 0 0 0 0 2 0 0 0
t 21 0 10 0 9 735 36 10 0 5 0 10 0 0 0 1 7 1 0 1
k 24 0 1 0 4 5 508 1 0 0 0 2 0 0 0 0 2 2 0 0
s 10 0 0 0 2 5 2 440 2 13 0 26 0 0 0 0 0 1 0 0
z 1 0 7 2 0 4 0 7 53 1 7 23 13 0 0 0 1 0 0 0
f 2 0 0 0 12 2 3 1 0 120 0 2 0 0 0 0 0 0 0 0
v 3 2 17 5 4 2 3 0 1 4 306 3 1 2 0 0 1 11 6 1
S 15 0 5 0 0 2 1 5 19 3 2 533 16 0 0 0 3 0 0 0
Z 0 0 1 0 0 0 0 0 0 0 1 19 94 0 0 0 1 0 0 0
l 1 0 1 1 0 0 0 0 0 0 0 0 0 83 2 0 0 0 2 1
l~ 4 1 2 0 0 1 0 0 0 0 0 0 0 0 69 0 1 0 2 0
L 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 51 0 0 0 1
r 9 2 5 1 16 9 11 5 0 1 3 21 2 1 1 0 734 3 1 7
R 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 69 0 1
m 2 4 2 0 1 0 0 0 0 1 0 0 0 2 1 0 2 0 270 4
n 4 0 5 0 1 1 1 0 0 2 0 0 2 3 0 0 4 0 13 199
J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 1
j 1 0 1 2 0 1 2 6 0 0 0 20 2 2 0 3 14 1 0 1
w 0 0 1 1 0 0 12 0 0 0 1 0 0 0 2 0 0 0 1 4
i~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2
o~ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
A~ 1 0 0 0 0 1 0 1 0 3 0 0 0 0 0 0 4 1 0 1
e~ 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
u~ 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 10 1
j~ 0 0 0 0 0 3 0 2 0 0 0 1 0 0 0 0 0 0 0 0
w~ 1 0 2 0 1 0 3 3 0 0 0 0 0 1 5 0 1 1 8 0
E 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 3 0 1 0
O 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
u 81 0 28 2 24 40 33 22 5 7 10 62 9 20 6 5 56 1 25 24
@ 53 2 12 1 2 28 11 42 1 6 6 66 9 2 1 0 29 5 4 1
i 2 0 1 1 0 0 3 0 0 0 0 0 0 0 0 8 5 0 8 1
e 0 0 0 0 0 0 0 1 0 0 5 0 0 0 0 0 0 0 0 0
A 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1
a 7 1 0 0 1 1 5 4 0 2 0 3 2 2 0 0 22 7 4 3
o 0 0 0 0 2 0 1 0 0 0 0 0 5 0 0 0 0 0 0 0
J j w i~ o~ A~ e~ u~ j~ w~ E O u @ i e A a o
ip 0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0
b 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 3
d 0 1 0 0 1 2 0 1 0 2 2 0 11 19 2 0 15 4 1
g 0 2 0 1 0 0 0 0 0 0 0 0 2 1 1 0 0 0 0
p 0 0 0 0 0 0 0 0 0 0 1 0 3 0 0 1 0 1 0
t 0 0 0 0 0 1 0 0 1 3 6 0 5 2 1 0 1 3 0
k 0 0 0 0 2 0 2 1 0 0 0 0 6 0 0 0 1 5 0
s 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
z 0 0 0 0 0 0 0 0 0 0 0 0 2 7 2 0 1 0 0
f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v 0 0 0 0 0 0 1 1 0 1 0 0 6 1 6 0 0 1 0
S 0 0 0 1 0 0 0 0 0 0 0 0 4 0 1 0 0 2 0
Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
l 0 0 1 0 1 0 0 2 0 1 0 1 14 2 0 0 0 0 3
l~ 0 0 7 0 2 0 1 0 0 3 0 22 17 0 0 2 1 1 7
L 2 8 0 0 1 0 0 0 0 0 0 0 2 1 3 1 0 0 0
r 0 6 0 0 2 1 0 0 1 1 4 5 42 24 9 7 9 16 3
R 0 0 0 0 0 0 0 0 0 0 0 0 5 1 0 0 0 2 0
m 0 0 0 9 7 0 0 13 0 4 0 1 14 2 3 0 2 1 0
n 1 0 0 1 0 0 0 3 0 0 2 1 8 1 1 2 1 2 1
J 25 0 0 12 0 0 0 1 1 0 0 0 6 1 3 0 0 1 0
j 0 204 0 0 0 2 2 0 2 0 15 0 1 6 32 20 6 23 0
w 2 0 43 0 19 0 0 0 1 0 0 4 45 0 5 0 2 0 2
i~ 1 3 0 39 0 0 6 0 8 0 0 0 2 0 7 1 0 2 0
o~ 0 0 0 0 112 0 1 1 0 0 0 1 10 0 0 0 0 4 11
A~ 0 1 2 4 23 243 50 5 3 1 14 4 8 1 7 3 6 55 7
e~ 0 3 0 8 2 0 158 2 9 0 14 0 2 1 16 14 0 12 0
u~ 0 0 0 2 6 0 0 33 0 18 0 0 33 2 0 0 0 1 3
j~ 2 3 0 2 1 0 6 2 30 0 7 0 2 1 6 4 0 0 0
w~ 0 0 0 0 3 1 9 6 0 153 0 4 9 0 0 0 0 1 1
E 0 1 0 0 0 1 0 0 0 0 86 0 0 0 5 14 0 16 0
O 0 0 0 0 1 2 0 0 0 1 0 210 3 0 0 0 15 6 24
u 1 9 4 3 2 2 4 16 2 8 0 7 678 119 32 2 0 36 13
@ 1 1 0 7 3 0 1 5 0 1 1 0 87 294 20 9 0 16 0
i 1 18 0 13 0 0 3 3 0 0 2 0 7 11 404 22 0 1 0
e 0 1 0 2 0 1 3 0 0 0 7 0 11 4 7 85 0 9 0
A 0 1 2 0 1 5 0 0 0 0 16 25 0 0 0 0 436 33 2
a 7 2 1 8 2 38 14 3 3 0 31 6 41 37 35 26 34 709 17
o 0 0 1 0 5 5 0 1 0 1 1 3 46 1 0 0 1 29 112
Confusion matrix for Spanish
ip b d g p t k s z f v S Z l l~ L r R m n
ip 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b 0 3 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d 1 0 28 1 0 3 0 0 0 0 1 1 0 0 0 0 3 0 0 8
g 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
p 1 0 0 0 20 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0
t 0 0 3 0 0 55 1 0 0 0 0 0 0 0 0 0 0 0 0 0
k 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 0 0 0 0 34 0 0 0 1 0 0 0 0 0 0 0 0
z 0 0 0 0 0 0 0 3 3 0 0 2 0 0 0 0 0 0 0 0
f 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0
v 1 0 4 0 0 0 0 0 0 0 18 0 0 0 0 0 1 1 1 1
S 1 0 0 0 0 0 0 0 3 0 0 38 0 0 0 0 0 0 0 0
Z 0 0 0 0 0 0 0 0 0 0 0 1 7 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 1
l~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 1
L 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2
r 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 57 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0
m 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 19 1
n 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15
J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
j 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
w 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
i~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
o~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A~ 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
e~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2
u~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
j~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w~ 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
u 6 0 1 0 0 0 1 0 0 0 1 3 1 3 0 0 1 0 3 1
@ 2 0 0 0 0 0 0 1 0 0 0 2 1 0 0 0 1 0 0 0
i 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
e 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
o 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
J j w i~ o~ A~ e~ u~ j~ w~ E O u @ i e A a o
ip 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
b 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0
d 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 2 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
p 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0
l~ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0
L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r 0 1 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 5 0
R 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
n 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
J 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
j 0 16 0 0 0 0 0 0 1 0 5 0 0 0 0 0 0 1 0
w 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 2 2
i~ 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 1 0 0 0
o~ 0 0 0 0 4 2 0 0 0 0 0 0 1 0 0 0 0 2 1
A~ 0 0 0 0 0 12 0 0 0 0 1 1 0 0 0 0 14 1 0
e~ 0 0 0 0 0 3 4 0 0 0 5 0 0 0 0 0 0 2 0
u~ 0 0 0 0 0 1 0 1 0 3 0 0 1 0 0 0 0 1 0
j~ 1 0 0 0 0 0 0 0 3 0 0 0 1 0 0 0 0 0 0
w~ 0 0 0 0 0 1 0 0 0 8 0 0 0 0 0 0 2 0 0
E 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 2 1
O 0 0 0 0 0 1 0 0 0 0 0 12 0 0 0 0 3 2 1
u 0 1 1 0 2 3 0 0 0 0 0 3 15 9 2 1 0 41 1
@ 0 1 0 0 0 0 1 0 0 0 1 0 1 10 5 2 0 24 0
i 1 1 0 1 0 0 1 0 0 0 2 0 1 1 20 2 0 2 0
e 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 3 0
A 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 31 2 0
a 1 1 0 0 0 4 2 0 0 0 3 2 0 0 0 0 12 50 0
o 0 0 0 0 0 1 0 0 0 0 0 2 1 0 0 0 2 6 0
Confusion matrix for Bulgarian
ip b d g p t k s z f v S Z l l~ L r R m n
ip 796 0 0 1 4 15 13 9 0 6 1 17 1 0 1 0 0 2 1 0
b 3 17 7 0 6 1 1 0 0 1 1 1 1 1 0 1 1 2 3 0
d 21 1 214 2 1 45 10 6 3 0 6 9 0 0 4 1 7 2 3 11
g 0 0 3 15 0 1 10 2 0 3 2 0 0 1 0 0 1 1 1 0
p 14 2 6 0 116 9 10 2 0 1 2 2 0 1 0 0 0 2 1 1
t 76 0 20 1 10 281 22 15 1 5 1 14 2 1 3 1 9 1 2 5
k 19 0 6 0 13 24 216 6 0 3 2 5 3 0 1 1 5 0 9 0
s 30 0 3 1 1 5 4 168 3 15 1 32 1 0 1 0 2 1 5 3
z 3 0 2 0 0 3 0 17 16 4 4 5 6 0 0 1 1 0 1 1
f 5 0 0 0 3 5 4 9 0 37 0 2 0 0 0 0 2 1 2 1
v 10 6 23 2 5 5 6 7 3 14 54 3 1 5 2 1 7 8 4 5
S 19 0 2 0 2 5 2 22 2 2 2 212 8 1 7 0 6 2 3 2
Z 2 0 0 1 1 0 0 4 0 0 1 9 31 3 1 0 0 0 3 1
l 0 0 3 0 1 0 0 1 0 0 0 0 2 19 0 4 2 1 3 7
l~ 4 0 0 0 0 3 1 0 0 0 0 0 0 1 27 0 6 1 1 0
L 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 9 1 0 0 2
r 25 0 10 0 2 10 8 8 0 3 4 24 3 4 1 2 313 14 4 4
R 8 0 0 0 0 0 5 0 0 0 0 3 0 0 0 0 17 3 0 0
m 10 1 7 0 3 2 4 0 0 1 2 1 0 1 1 1 8 0 109 10
n 5 0 6 0 1 4 5 1 0 1 1 1 1 3 0 1 5 1 16 70
J 3 0 0 0 0 0 0 1 0 0 0 2 0 1 1 1 1 0 2 8
j 17 0 4 0 3 1 0 4 0 1 0 4 1 0 1 7 5 0 1 3
w 2 0 0 0 0 1 2 1 0 0 1 3 0 3 4 0 1 0 1 2
i~ 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 2 0
o~ 3 0 0 0 2 0 2 3 0 0 0 2 0 0 0 0 2 0 1 0
A~ 10 0 1 0 0 0 1 2 0 0 0 5 0 2 2 0 6 1 8 4
e~ 10 0 2 0 1 2 3 1 0 0 0 4 0 1 0 0 3 0 4 4
u~ 2 0 2 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 5 1
j~ 1 0 0 0 2 5 0 0 0 0 0 1 1 0 0 0 0 0 3 2
w~ 9 0 1 0 0 1 0 0 0 0 0 4 0 1 18 0 1 0 8 0
E 3 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0
O 0 0 1 0 0 0 0 2 0 0 0 1 4 1 0 0 1 2 1 0
u 70 1 11 0 6 15 15 14 2 3 3 36 5 4 7 3 15 6 20 9
@ 32 0 7 0 1 10 9 25 1 7 5 33 2 2 2 1 13 2 6 2
i 11 1 4 0 2 6 5 2 1 0 0 10 2 2 1 4 3 1 6 2
e 2 0 0 0 1 2 1 1 0 0 0 2 0 0 0 0 2 0 0 2
A 13 0 1 0 1 4 3 2 0 2 0 3 1 3 1 0 3 0 3 1
a 29 1 3 1 4 6 10 10 0 2 2 8 0 6 2 1 17 10 5 3
o 4 0 0 0 0 2 1 1 0 0 0 0 0 1 1 0 4 1 4 0
J j w i~ o~ A~ e~ u~ j~ w~ E O u @ i e A a o
ip 1 1 0 0 0 0 0 0 0 2 0 0 2 2 1 1 8 5 2
b 1 1 0 0 1 0 0 0 0 0 0 1 2 1 2 0 1 1 3
d 3 1 0 1 5 4 1 1 1 1 4 3 9 8 6 2 9 18 7
g 0 1 0 0 0 0 0 0 0 0 2 0 1 1 2 1 0 0 2
p 0 1 2 0 1 0 1 1 0 0 0 2 3 1 4 1 7 6 0
t 2 3 0 0 2 3 0 0 1 3 1 1 6 2 11 1 9 7 0
k 0 3 1 1 0 2 0 1 0 2 2 4 5 1 7 1 6 4 2
s 0 1 0 4 3 1 1 1 0 0 2 2 4 4 6 3 5 4 4
z 0 0 0 0 0 1 1 0 0 0 0 0 0 2 4 2 4 2 1
f 0 0 0 0 0 1 1 0 0 0 2 1 2 2 0 0 2 0 2
v 0 3 1 4 1 1 2 1 1 4 7 1 8 6 4 3 7 5 7
S 0 0 1 0 0 1 0 2 0 4 3 0 9 2 4 0 6 5 4
Z 0 0 0 0 1 0 2 0 0 0 1 1 0 1 3 1 3 0 0
l 0 2 0 0 1 1 2 3 0 0 1 2 6 3 4 0 4 4 3
l~ 0 0 0 0 1 0 1 0 0 5 1 2 8 1 0 0 4 6 12
L 0 4 0 0 1 0 0 1 0 2 1 0 3 0 2 1 1 1 0
r 1 4 0 3 4 11 2 6 0 1 3 5 20 14 13 4 20 14 11
R 0 0 0 0 1 0 0 0 0 1 0 2 2 0 0 0 3 1 3
m 0 2 0 1 5 0 0 2 0 4 0 1 6 4 3 2 3 1 2
n 2 2 0 0 2 2 0 0 0 0 1 0 2 5 1 0 5 4 3
J 1 1 0 0 1 0 1 0 0 1 0 1 1 2 3 0 0 1 0
j 0 55 0 3 1 4 1 2 3 1 17 3 6 11 19 20 10 16 1
w 0 1 13 1 7 1 2 1 0 2 1 6 21 2 4 0 5 5 5
i~ 0 1 0 10 1 2 5 1 5 0 0 0 1 1 5 0 3 1 0
o~ 0 1 0 0 43 0 0 1 0 0 2 2 10 1 2 0 1 3 6
A~ 1 0 0 1 14 65 11 2 3 5 8 6 7 5 6 7 45 26 10
e~ 0 2 0 6 0 12 44 1 3 1 13 0 1 5 7 1 3 9 1
u~ 0 0 0 0 4 3 1 12 2 9 0 0 14 2 1 0 3 1 2
j~ 0 6 0 1 0 0 2 0 4 4 2 1 2 1 4 1 1 4 1
w~ 0 0 1 0 1 0 0 2 0 34 1 3 12 1 0 0 6 6 11
E 0 3 0 1 0 0 0 1 1 3 16 2 1 6 4 7 4 18 1
O 0 0 0 0 3 4 1 0 2 1 1 50 6 3 0 1 12 12 49
u 3 7 3 5 26 1 1 5 2 17 6 21 320 37 25 10 15 50 56
@ 2 4 0 6 5 4 2 1 2 7 17 2 32 143 26 23 9 68 6
i 1 10 0 1 2 1 4 1 2 0 10 4 21 25 93 18 5 18 6
e 0 1 0 2 1 1 4 0 0 1 16 0 2 12 8 18 2 7 0
A 1 3 0 1 2 10 2 0 0 1 7 12 4 2 4 4 162 37 17
a 1 7 2 3 5 15 5 3 2 10 11 19 48 32 28 14 45 248 32
o 0 0 0 0 7 2 0 1 0 1 2 18 14 0 2 3 5 5 50
References
[1] WITT, S., Use of Speech Recognition in Computer-assisted Language Learning, November 1999
[2] TABILO, L., O ensino do português como língua estrangeira por professores não nativos,
Setembro 2011
[3] CHARRUA, C., Aquisição Fonética-Fonológica do Português Europeu dos 18 aos 36 meses,
Setembro 2011
[4] BOSCH, A., Learning to pronounce written words: A study in inductive language learning,
December 1997
[5] TAYLOR, E., Why we love repetition in music, Ted Talks, 2014
[6] RODRIGUES, S., Fonética e Fonologia no ensino da língua materna: modos de
operacionalização, Setembro 2005
[7] ARIZA, E., HANCOCK, S., Second Language Acquisition Theories as a framework for Creating
Distance Learning Courses, Florida Atlantic University, USA, 2003, online version:
(http://www.irrodl.org/index.php/irrodl/article/view/142/222)
[8] VILLALOBOS, O., Reflections on the connection between computer-assisted language learning
and second language acquisition, February 2013
[9] LEVIS, J., Computer technology in teaching and researching pronunciation, 2007
[10] NECIBI, K., An ASR-based System for Arabic Mispronunciation Detection, December, 2012
[11] NERI, A., The pedagogical effectiveness of ASR-based Computer Assisted Pronunciation
Training, 2007
[12] PEABODY, M., Methods for Pronunciation Assessment in Computer Aided Language Learning,
Massachusetts Institute of Technology, 2011
[13] ESKENAZI, M., Using a Computer in Foreign Language Pronunciation Training: What
Advantages?
[14] LEE, A., GLASS, J., A Comparison-based Approach to Mispronunciation Detection
[15] PRIDDY, K., KELLER, E., Artificial Neural Networks: An Introduction, SPIE Press, 2005
[16] BAHI, H., Hybrid ASR system for teaching pronunciation, September 2008
[17] LEWIS, M., SIMONS, G., FENNIG, C., Ethnologue: Languages of the World, Seventeenth edition,
Texas, 2014, online version: (http://www.ethnologue.com)
[18] A Pronúncia do Português Europeu, Instituto da cooperação e da Língua de Portugal
[19] MARQUILHAS, R., Gramática Histórica do Português, online version: (http://cvc.instituto-camoes.pt/hlp/gramhist/index.html)
[20] WELLS, J., UCL Phonetics and Linguistics, University College London, 1997, online version:
(http://www.phon.ucl.ac.uk/home/sampa/portug.htm)
[21] Dicionário Terminológico para consulta em linha, online version: (http://dt.dgidc.min-edu.pt/)
[22] The Phonetic Framework, online version: (https://www.uni-due.de/DI/Phonetic_Framework.htm)
[23] WELLS, J., SAMPA home page, UCL Phonetics and Linguistics and University College London,
1996, online version: (http://www.phon.ucl.ac.uk/home/sampa/spanish.htm)
[24] WELLS, J., SAMPA home page, UCL Phonetics and Linguistics and BABEL, 1998, online version:
(http://www.phon.ucl.ac.uk/home/sampa/bulgar.htm)
[25] YAO, M., et al., The Implementation of an Evaluation System of English Phoneme Pronunciation Quality
[26] WET, F., CUCCHIARINI, C., STRIK, H., BOVES, L., Using likelihood ratios to perform utterance
[27] NERI, A., CUCCHIARINI, C., STRIK, H., Pronunciation training in Dutch as a second language on
the basis of automatic speech recognition
[28] ZHAO, T., et al., Automatic Chinese pronunciation error detection using SVM trained with structural features
[29] MAK, B., et al., PLASER: Pronunciation Learning via Automatic Speech Recognition
[30] Forvo Media SL, 2014, online version: (forvo.com)
[31] WEBSTER, M., Learners’ Dictionary, online version: (http://www.learnersdictionary.com/)
[32] WALKER, R., TAVARES, R., The Language Lover's Guide to Learning Portuguese, online version: (http://www.learningportuguese.co.uk/guide/pronunciation/introduction)
[33] Bonjour de France, online version: (http://www.bonjourdefrance.com/)
[34] English Computerized, inc, online version: (http://www.englishlearning.com/)
[35] MEINEDO, H., ABAD, A., PELLEGRINI, T., NETO, J., TRANCOSO, I., The L2F Broadcast News
Speech Recognition System
[36] MEINEDO, H., CASEIRO, D., NETO, J., TRANCOSO, I., A Broadcast News Speech Recognition
System for the European Portuguese Language
[37] WET, F., VAN DER WALT, C., NIESLER, T.R., Automatic assessment of oral language proficiency and listening comprehension, Speech Communication 51, 864-874, March 2009
[38] STRIK, H., TRUONG, K., WET, F., CUCCHIARINI, C., Comparing different approaches for
automatic pronunciation error detection, Speech Communication 51, May 2009
[39] KANTERS, S., CUCCHIARINI, C., STRIK, H., The Goodness of Pronunciation Algorithm: a Detailed Performance Study
[40] L2F (Spoken Language Systems Lab) and LEL (Language Research Laboratory), VITHEA project, Virtual Therapist for Aphasia treatment, 2014, available online: (https://vithea.l2f.inesc-id.pt/wiki/index.php/Main_Page)