
Native Language Identification from i-vectors and Speech Transcriptions

Ben Ulmer
email@domain

Aojia Zhao
email@domain

Nolan Walsh
[email protected]

Abstract

Native Language Identification (NLI) is the task of identifying a speaker’s native language (L1) from their speech or writing in a secondary language (L2). This paper details a novel approach to NLI. The authors propose several independent methods for approaching the task, along with an ensemble method that achieves greater performance by combining results from individual models. The models in this paper were trained and tested using data from the Native Language Identification Shared Task data set.

1 Introduction

Native Language Identification is a quickly growing subfield in NLP and speech analysis. NLI research is motivated by helping Second Language Acquisition researchers identify L1 teaching and learning issues (Tetreault et al., 2013).

The 2017 NLI shared task concerns identifying a speaker’s L1 from text written in English (the L2). The task is posed as a multi-class classification problem: from text and speech features, task participants must design and train a model that can identify the speaker’s L1 from a set of 11 different languages.

The models described in this paper were all trained and evaluated on a data set provided by the Native Language Shared Task group. The data set contains examples from 11 different L1 languages: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu and Turkish. There are 11,000 training examples, 1,000 for each L1 language. The data set also includes a separate dev set with 100 examples for each language, for a total of 1,100 examples. Each dev or training example contains a speech transcription of a speaker’s response to a verbal question, a written response to an essay question, and an i-vector (the result of speech being compressed through a linear-Gaussian model to a lower dimensional space) containing featurized information about the speaker’s speech.
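For reference, the i-vector extraction mentioned above follows the total variability model of Dehak et al. (2011): a session-dependent GMM supervector M is modeled as

M = m + Tw,

where m is the universal background model (UBM) mean supervector, T is a low-rank total variability matrix, and w is a latent variable with a standard normal prior; the i-vector used as a feature is the posterior mean of w. (Notation follows the cited paper, not this one.)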

In this paper, the authors describe several approaches to the NLI task. These approaches include: a Support Vector Machine (SVM) trained using i-vectors as input, a Random Forest (RF) using unigram features from speech transcriptions, a Deep Neural Network (DNN) using i-vectors, and a Bidirectional Long Short-Term Memory model (BiLSTM) trained using word embeddings from speech transcriptions. The authors also describe an ensemble model that uses the predictions of the SVM, RF, and DNN models to achieve greater performance.

The models mentioned above are all evaluated on the separate dev set of 1,100 examples. Since neither the train nor the dev set suffers from class imbalance, a simple accuracy score is used to evaluate each model and compare performance across models. The models are compared against two baselines: first, an SVM using only features from the speech transcriptions, which achieves an accuracy of 0.52 on the dev set; and second, an SVM using speech transcription features along with i-vectors, which achieves an accuracy of 0.76 on the dev set.

2 Related Work

Prior literature has often focused on traditional methods like support vector machines and maximum entropy for Native Language Identification. In a 2013 paper, Joel Tetreault, Daniel Blanchard, and Aoife Cahill analyzed NLI systems submitted by 29 teams (Tetreault et al., 2013). Out of these, 25 teams used solely traditional methods, while 4 others experimented with ensemble approaches. Furthermore, most teams used similar linguistic features, with word n-grams being the most common. The top 5 results from 2013 are shown in Figure 1.

Figure 1: Top 5 Performances - 2013

Though these results point to some promising models for our project, note that the features and data are essay components, making this an NLP task rather than a purely speech one. In our project, we implement some of the models detailed here with unigrams as well as i-vectors.

Beyond traditional methods, Malmasi and Dras investigate two sets of neural network models in a 2017 paper on the NLI task (Malmasi and Dras, 2017). First, they detail previous work that showed the effectiveness of ensemble classifier architectures in NLI tasks, with input features spanning character n-grams, word n-grams, part-of-speech dependencies, context-free grammars, etc., and a variety of algorithms including linear support vector machines, logistic regression, perceptrons, decision trees, and linear discriminant analysis. Using primarily the TOEFL11 foreign English test dataset, which contains essay responses from speakers of 11 different languages, they ran a fusion method that combined the output probabilities from the above algorithms to predict class labels. The best accuracy resulted from the mean class probability metric, with 82.6% correct on the training set and 83.3% correct on the test set. This is already very close to the 2013 NLI competition winner’s score of 83.6% on the test set. Subsequently, the authors pursued stacked generalization, with a multi-layered meta-classifier taking the output vectors from the previous algorithms as well as the gold label and outputting a class label. This is akin to a multilayered neural network, where the meta-classifier learns the optimal fusion structure for combining the various SVM, logistic regression, LDA, etc. outputs. Furthermore, to combine the advantages of both models, the authors utilized a hybrid network, with the outputs of the classification algorithms each fed to many meta-classifiers that output label probability vectors. A final layer that fuses the results from the meta-classifiers outputs the class label. With this setup, they achieved 85.3% on train and 87.1% on test, surpassing the previous state-of-the-art performance. Though these results seem impressive, the paper notes that compared to the oracle accuracy of 96.1% there is still much work to do. In our project, we utilize a stacked generalization structure, but with DNN components instead of traditional classification algorithms. Furthermore, we look at different features in the domain of speech analysis, using i-vector features instead of written linguistic components such as n-grams and CFGs.

Lastly, we look at a 2016 paper from Jiao and Tu on an end-to-end neural network model incorporating both DNN and RNN elements (Jiao et al., 2016). The experiment trains a classic DNN on whole speech clips, while also training a parallel RNN model that takes segmented sound blurbs of the speech clips as input. The end results are then linearly combined to form the final predictions. In addition, they discuss ways to fuse these neural models with baseline SVMs to achieve better performance. The key conclusion of this paper was that the parallel setup worked well in separating similar languages like Spanish and Telugu that singular models failed to distinguish. In this paper, we use an ensemble and parallel structure to incorporate multiple standalone models for better classification prediction.

3 Approach

3.1 Support Vector Machine

Due to the prevalence of Support Vector Machines in the literature surrounding the task of NLI, a baseline SVM with a linear kernel was trained on unigrams from essays. To improve upon this model, we experimented with a variety of kernels and of features proposed in the literature (Tetreault et al., 2012), and compared their performance through 10-fold cross validation. An SVM with a linear kernel trained on raw i-vectors resulted in the best performance.
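As a rough illustration, the winning configuration could be reproduced with a scikit-learn sketch along these lines (the toy data stands in for the real i-vectors; nothing here is the authors’ actual code):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-ins: 20 examples per L1 class of 800-dim "i-vectors".
rng = np.random.default_rng(0)
X = rng.normal(size=(220, 800))
y = np.repeat(np.arange(11), 20)

# Linear-kernel SVM scored with 10-fold cross validation,
# mirroring the model selection described above.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=10)
print("mean 10-fold CV accuracy:", scores.mean())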

3.2 Random Forest

A Random Forest was trained to predict the L1 of the speaker. 10-fold cross validation was used to compare the performance of i-vectors, unigrams from essays, and a combination of i-vectors and unigrams as features. The best results came from using unigrams from essays as the features.
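A comparable sketch for the unigram Random Forest (again with toy stand-ins; the forest size is an assumption, as the paper does not report hyperparameters):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for essay texts and L1 labels.
essays = ["i am agree that life in the city is more better",
          "in my country we are usually eat rice every day"]
labels = ["KOR", "JPN"]

# Unigram counts as features, as described above.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(essays)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, labels)
print(forest.predict(X))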

3.3 Deep Neural Network

We applied a Deep Neural Network (DNN) model to preprocessed, 800-dimensional i-vectors as inputs, outputting the 11-language class prediction probabilities. Because the original audio clips of foreign speakers were not provided, aural feature encoding i-vectors were used as a similar alternative (Dehak et al., 2011). The model is a 4-hidden-layer DNN, taking minibatches of 25x100 inputs at each training iteration. Layers 1 and 2 are feature gathering layers, with larger dimensions than the original i-vector. This allows coarse features to be separated out into individual nodes in the larger hidden layers. Layer 3 is a bottleneck layer, with a much smaller set of parameters designed to concentrate and extract only the vital information needed for output-layer classification. The new, selective features out of layer 3 are then fed to the larger layer 4, which serves as a decoder along with the fully connected output layer, translating the encoded information into the language to which it belongs.

Figure 2: DNN Model Layout

As this was a classification task, the traditional Softmax activation function was applied to the output layer to bound the probability vector. For the hidden layers, the Rectified Linear Unit (ReLU) was used on layers 1, 2, and 4. This prevents the vanishing gradient occurrences common to DNN training over large iteration cycles. Additionally, Max Gradient Clipping was implemented to upper bound gradient magnitudes at 5; this prevents the opposing factor, exploding gradients, from generating infinite loss and killing the training process. Because the bottleneck layer, layer 3, significantly compresses the information flow during both forward and backward propagation, it is important to compress the activation output. Therefore, the hyperbolic tangent is used for this layer, as it bounds output to (−1, 1). Lastly, to prevent overfitting to the training set, Dropout with keep probability 0.8 was added to layers 1 and 4. This allows the model to switch off responses periodically so that each epoch can vary in training.
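Taken together, a minimal Keras sketch of this architecture might look as follows; the layer widths are assumptions (the paper specifies only relative sizes), while the activations, dropout, and clipping follow the description above:

import tensorflow as tf

# Sketch of the 4-hidden-layer DNN described above. Widths
# (1024/1024/128/1024) are assumptions; only the 800-dim input,
# the activations, the dropout keep probability of 0.8, and the
# 11-way softmax come from the text.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(800,)),                     # i-vector input
    tf.keras.layers.Dense(1024, activation="relu"),   # layer 1: feature gathering
    tf.keras.layers.Dropout(0.2),                     # keep prob 0.8 -> drop rate 0.2
    tf.keras.layers.Dense(1024, activation="relu"),   # layer 2: feature gathering
    tf.keras.layers.Dense(128, activation="tanh"),    # layer 3: bottleneck, tanh
    tf.keras.layers.Dense(1024, activation="relu"),   # layer 4: decoder
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(11, activation="softmax"),  # 11 L1 classes
])

# SGD with gradient clipping; clipnorm=5.0 mirrors the Max Gradient
# Clipping bound of 5 described above.
opt = tf.keras.optimizers.SGD(learning_rate=0.1, clipnorm=5.0)
model.compile(optimizer=opt, loss="categorical_crossentropy",
              metrics=["accuracy"])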

3.4 Bidirectional LSTM

The Bidirectional LSTM model was trained using only the text of the speech transcriptions. First, each word was fed through an embedding layer to retrieve the GloVe representation of that word (Pennington et al., 2014). The GloVe embeddings came from the Common Crawl 42B tokens collection, and the 300-dimensional embeddings were used (Pennington et al., 2014). If no corresponding GloVe representation could be found, then a random embedding was generated for that word.
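This lookup-with-fallback step might be sketched as follows (assuming the standard glove.42B.300d.txt distribution file; caching the random vectors so a word keeps the same embedding is an implementation assumption):

import numpy as np

EMBED_DIM = 300
rng = np.random.default_rng(0)

def load_glove(path):
    # Parse the standard space-separated GloVe text format
    # into a {word: vector} dictionary.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.42B.300d.txt")

def embed(word):
    # Random embedding fallback for out-of-vocabulary words.
    if word not in glove:
        glove[word] = rng.normal(size=EMBED_DIM).astype(np.float32)
    return glove[word]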

After feeding the speech transcriptions through the embedding layer, each embedded transcription was put into a single-layer BiLSTM. The last state in each direction was then concatenated to form the output of the BiLSTM layer.

Finally, the output from the BiLSTM layer was fed through a dropout layer, and then into a single densely connected layer to produce a score for each of the 11 potential language classes. A softmax was applied to these scores in order to approximate a probability distribution over the classes.
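A minimal Keras sketch of this classifier, using h=50 and dropout=0.5 from the settings evaluated later in Table 1 (the optimizer is an assumption, as the paper does not name one; the embedding layer would be initialized from GloVe in practice):

import tensorflow as tf

VOCAB_SIZE = 16000  # roughly the unique-word count reported in Section 4.4
EMBED_DIM = 300     # GloVe dimension used above
HIDDEN = 50         # h=50, one of the settings in Table 1

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(HIDDEN)),  # concatenated final states
    tf.keras.layers.Dropout(0.5),                                 # dropout layer
    tf.keras.layers.Dense(11, activation="softmax"),              # scores -> class distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")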

Figure 3: The BiLSTM Model.


3.5 Ensemble Model

To combine the results from our different models, we tuned linear weight multipliers on the probability predictions from the SVM, Random Forest, and DNN. The BiLSTM model was excluded from the ensemble since on average it performed worse than the other three, due to its higher complexity and the limited transcription samples available for training. Note that the ensemble is a pure linear combination, with no class-specific rules for generating overall predictions. The language is chosen based on the highest weighted-average probability across the three models.
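In code, the combination rule amounts to a weighted average of the three probability matrices followed by an argmax; a sketch with placeholder weights (the tuned values are not reported in the paper):

import numpy as np

# p_svm, p_rf, p_dnn: (n_examples, 11) class-probability matrices
# from the three models; random stand-ins here for illustration.
rng = np.random.default_rng(0)
p_svm, p_rf, p_dnn = (rng.dirichlet(np.ones(11), size=5) for _ in range(3))

# Tuned linear weight multipliers (placeholder values).
w_svm, w_rf, w_dnn = 0.5, 0.2, 0.3

combined = w_svm * p_svm + w_rf * p_rf + w_dnn * p_dnn
predictions = combined.argmax(axis=1)  # highest weighted-average probability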

4 Experiments

4.1 Support Vector Machine Experiments

Figure 4: Confusion matrix for the SVM Model.

The SVM was trained on several sets of features: i-vectors, unigrams, and a combination of i-vectors and unigrams, with both linear and Gaussian kernels, and its performance was evaluated through 10-fold cross validation. The best performance came from the feature set of i-vectors with a linear kernel, and the performance of this model on the dev set is detailed above in figure 4 (76% accuracy).

The success of this model is probably due to its resilience against overfitting, owing to its firmer decision boundary and relatively low number of features (800). By comparison, the SVM trained on a unigram model with a linear kernel averaged roughly 50% over its cross-validation experiments, and had x features. This issue is not necessarily a problem with the model, but rather due to the relative lack of data (11,000 training examples) in comparison with the feature space.

Upon evaluating the performance of the model, it was discovered that, although the model performs well on languages such as Arabic and Italian, it struggled to distinguish languages that share similar roots, such as Hindi and Telugu (see figure 4). To address this, we tried training a separate SVM on just these languages to better separate the Hindi and Telugu predictions made by the main SVM, but this approach was found not to improve performance.

4.2 Random Forest Experiments

Figure 5: Confusion matrix for the Random Forest Model.

Similar to the SVM, the Random Forest model was trained on feature sets of i-vectors, unigrams, and a combination of i-vectors and unigrams, but the best results were from unigrams of essays. The Random Forest model achieved a final accuracy of 50% on the dev set, and its confusion matrix is presented above in figure 5.

The Random Forest suffered many of the same issues as the SVM model, in that it struggled with overfitting and with confusing languages with similar roots, such as Japanese and Korean.

Due to the black-box nature of a Random Forest, it is hard to analyze the cause of its poor performance, but we hypothesize that it struggled due to the noisiness of the data, the large number of classes from which it was asked to make a prediction, and the fact that the relationship between the number of occurrences of a word and the probability of a language isn’t linear.

4.3 Deep Neural Network Experiments

The training set data consists of 11,000 i-vectors, one per 45-second speech clip read by non-native English speakers. The 11 language classes each contain 1,000 i-vectors of 800 dimensions. The DNN model was trained on minibatches of 25 i-vectors, with 440 iterations completing one epoch of the training set. To ensure convergence, the model was run for 20 epochs on the TensorFlow framework with a CPU. For convergence smoothing, the learning rate started at 0.1, with an exponential decay factor of 0.99 per 100 iterations. The backward propagation optimizer was Stochastic Gradient Descent on the cross entropy between the prediction probability vector and the one-hot true label vector. For testing, note that because the Native Language Identification committee has not yet released the test set, we use the dev set as this project’s test set; the dev set consists of 1,100 i-vectors, 100 per language.
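The schedule described above corresponds to a learning rate of 0.1 * 0.99^(t/100) at iteration t. In current TensorFlow it could be written as follows (a sketch; the authors’ original graph-mode code is not shown in the paper):

import tensorflow as tf

# Start at 0.1 and multiply by 0.99 every 100 iterations,
# matching the schedule described above.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    decay_rate=0.99,
    staircase=True,  # assumption: decay in discrete 100-iteration steps
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)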

Figure 6: Loss Curve for DNN Training

In the loss curve across 4,000 iterations (roughly 9 epochs), convergence can be seen around iteration 2,500. The curve starts with a very narrow plateau until iteration 200, at which point the loss rapidly drops as the model learns the i-vector feature structure.

It is clear from the training set confusion matrix that overfitting has indeed occurred, in spite of the Dropout layers attempting to include regularization. The overall accuracy here, defined by correct language labels, is 99.3%. As for outstanding hotspots, the biggest false positive is predicting Hindi when the true label is Telugu. We see that this is also the case in the dev set results.

Figure 7: Confusion Matrix - DNN Train

Figure 8: Confusion Matrix - DNN Dev

On the dev set, accuracy dropped to 76.5% overall, with major hotspots at HIN-TEL, TEL-HIN, JPN-KOR, and ARA-TUR. In all four cases, the confusion makes linguistic sense, as these pairs are all similar languages in their existing phones. In particular, both Hindi and Telugu are Indian subcontinent languages, with Hindi from northern India and Telugu from the southern Deccan Plateau. In the case of Arabic and Turkish, both share similar vocabulary and pronunciations as languages with Asia Minor origins. For Japanese and Korean, one concrete example is the lack of the ’th’ sound, as in ’truth’. In various transcriptions with ’th’-ending words, the audio features in the i-vectors would capture the ’s’ sound, which does exist in both languages. Therefore, although the model can successfully determine whether or not an i-vector is in the Japanese/Korean set, it has a hard time making a binary classification between the two due to the similar aural features of both.

4.4 BiLSTM Experiments

Three separate models were trained using the BiLSTM framework described above. The purpose of these experiments was to determine the optimal size and architecture for the BiLSTM model on the NLI task. Each model was trained for 10 epochs (around 5 hours on a standard CPU). The results reported in Table 1 are from the epoch with the highest validation score (accuracy on the dev set).

Model                Train acc  Dev acc
h=100, dropout=0.0   0.70       0.41
h=50, dropout=0.0    0.87       0.43
h=50, dropout=0.5    0.61       0.42

Table 1: BiLSTM Results.

Figure 9: Confusion matrix for the final BiLSTM Model.

The BiLSTM model has several issues. First, the models trained without dropout over-trained very quickly. These models achieved nearly one hundred percent accuracy on the train set, while still performing below baseline on the dev set. Using dropout slowed the over-training issue, but failed to provide any additional generalization power to the model.

One reason for the problems listed above is the relatively small size of the data set. With only 1,000 examples for each L1 language, it is difficult to train more complex neural models. Another potential issue is the use of GloVe representations as the word embedding approach for this problem. GloVe embeddings are good at representing the meanings of words, but do little to capture variability in speech patterns. Additionally, GloVe embeddings were only available for around 14,000 of the 16,000 unique words spoken by speakers in the train and dev sets. Many of the missing words are partially spoken words (e.g. impriso- as the first half of imprisoned) that have the potential to reveal more information about an individual speaker. The embeddings for these missing words or partial words were random, which could lead to poorer performance.

Another issue that plagues the current BiLSTM model is a lack of information. Trying to perform NLI using only speech transcriptions is a difficult task (the baseline achieves only 52% accuracy). A second model was also tried, which concatenated the outputs from the BiLSTM to the 800-dimensional i-vector for that speaker. However, this model suffered even more from over-training and did little better than random on the dev set (11% accuracy). More work needs to be done on the BiLSTM model in order to see better performance in the face of the little available training data.
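For reference, the concatenation variant might be wired as in the following hypothetical Keras sketch (the exact architecture of this second model is not given in the paper):

import tensorflow as tf

# Two inputs: the token-id sequence of a transcription and the
# speaker's 800-dim i-vector.
tokens = tf.keras.Input(shape=(None,), dtype="int32")
ivector = tf.keras.Input(shape=(800,))

x = tf.keras.layers.Embedding(16000, 300)(tokens)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50))(x)

# Concatenate the BiLSTM output with the i-vector before classification.
merged = tf.keras.layers.Concatenate()([x, ivector])
out = tf.keras.layers.Dense(11, activation="softmax")(merged)

model = tf.keras.Model(inputs=[tokens, ivector], outputs=out)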

4.5 Ensemble Experiments

Figure 10: Confusion matrix for the Ensemble Model.

Our final model achieved an accuracy of 78.9%. This model was the result of combining the class probabilities from the Random Forest, SVM, and Deep Neural Network models.

5 Conclusion

In this paper, we explored the task of Native Language Identification (NLI) and approached it through an ensemble method comprised of an SVM, a Random Forest, and a Deep Neural Net. We also experimented with a BiLSTM model, but found that it did not improve performance. Our paper demonstrates the success of an ensemble method, which we hope will help push forward other NLI experiments; in the last iteration of the native language identification task, the vast majority of teams used an SVM, and we showed that performance can be increased by several percentage points when this model is combined with deep-learning approaches. We also highlighted the difference in relative difficulty between predicting languages that share similar roots, such as Hindi and Telugu, which are both offspring of Sanskrit, and predicting languages such as German, which does not share a close resemblance to another language in the classification task.

Our greatest challenge was overcoming the lack of data, and many of our models needed to be simplified to prevent overfitting. In the future, we want to experiment with ways of artificially increasing the amount of data available to us, whether through splicing together essays written by the students or through techniques for combining i-vectors. Furthermore, we want to explore the process of feature selection more thoroughly, perhaps by experimenting with normalizing i-vectors.

References

Yishan Jiao et al. 2016. Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features. In Interspeech, pages 2388–2392. https://asu.pure.elsevier.com/en/publications/accent-identification-by-combining-deep-neural-networks-and-recur.

N. Dehak, P. Kenny, R. Dehak, P. Ouellet, and P. Dumouchel. 2011. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pages 788–798.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.

Shervin Malmasi and Mark Dras. 2017. Native language identification using stacked generalization. https://arxiv.org/abs/1703.06541.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. 2012. Native tongues, lost and found: Resources and empirical evaluations in native language identification. In Proceedings of the 24th International Conference on Computational Linguistics, pages 2585–2602.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A report on the first native language identification shared task.