
Building ASR corpora using Eyra

Jón Guðnason, Matthías Pétursson, Róbert Kjaran, Simon Klüpfel, Anna Björk Nikulásdóttir
Center for Analysis and Design of Intelligent Agents, Language and Voice Lab (https://lvl.ru.is/)
Reykjavik University, Iceland

Overview and Background
• Eyra is a system for recording prompts for automatic speech recognition (ASR) development.
• It is designed to be easy to use, configurable, and to minimize reading errors.
• Eyra has been used to record Icelandic and Javanese. Other collections are being planned.
• Collaborative effort with Google: experience with Datahound and Woefzela.
• The work has been extended with a quality control module that gives reading error feedback based on forced alignment.

Evaluation of scoring
• The Málrómur database [2] was used to evaluate the scoring.
• The figure to the right shows the average score of each of the 570 recording sessions in the entire database.
• A listening test of 3000 utterances resulted in 2079 utterances rated as very good by four annotators.
• The text of these prompt-recording pairs was modified with a substitution, a deletion, and an insertion to create three new prompt-recording pairs.
• The resulting 8018 prompts (some prompts are only one word) were then tested with the error feedback system.
• The scores are shown in the figure on the right.

References
[1] M. Petursson, S. Klupfel, and J. Gudnason, "Eyra - speech data acquisition system for many languages," in SLTU, 2016.
[2] J. Gudnason et al., "Almannaromur: an open Icelandic speech corpus," in SLTU, 2012, pp. 80–83. http://malfong.is

Forced alignment scoring
• The initial phase of recording is used to create an acoustic model.
• The decoding graph allows generic bi-phone models to be inserted between words.
• Phone error rate (PER) is used to score all utterances after the initial phase.
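The poster does not spell out how PER is computed. A minimal sketch, assuming the usual definition (the number of phone substitutions, deletions, and insertions divided by the number of phones in the reference), is given below. The function names and the toy phone symbols are hypothetical, and the decoder that would produce the hypothesis phone sequence is not shown.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference sequence into the hypothesis."""
    # prev[j] holds the distance between the first i-1 reference
    # phones and the first j hypothesis phones.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference phones into nothing
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete reference phone r
                curr[j - 1] + 1,     # insert hypothesis phone h
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference phone sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

With the made-up reference `["h", "e", "l", "o"]` and hypothesis `["h", "a", "l", "o"]`, one substitution over four reference phones gives a PER of 0.25. Under this definition the PER can exceed 1 when the hypothesis contains many inserted phones.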
[Figures: Examples of scores (three sessions); Distribution of scores; Scores for simulated errors; Finite state transducer with biphone models]

Recording ASR data with Eyra
• The participant reads a list of prompts stored on the client.
• The client is normally a smartphone but can be any browser.
• The figures on the right show a recording session.
• Each participant records between 400 and 600 sentences.
• The aim is to collect 150,000 read utterances per language.
• Recordings are uploaded to a server that can be set up on a local area network (see figure to the right/below).
• Each client can also store the recordings locally.
• Recordings, prompts, sessions, etc. are stored in a relational database on the back-end server [1].

Annotating recordings
• Prompts can be graded and annotated during recording sessions or afterwards.
• The annotator chooses a grade between one (very poor) and four (very good) and selects a comment from a drop-down menu.
• Ratings and comments are stored in the database for further analysis.

[Figures: Recording modules; Data handling and storage]

Error feedback system
• The example below shows how a reference and a hypothesis can differ.
• Each recording is evaluated using: Score = 1 - min(PER, 1)
• Scores for all utterances in three sessions are shown in the figure on the right.
• A running average over five utterances is kept and reported back to the user in the quality bar (see the green and yellow bars in the figures on the left).

Javanese and Icelandic data collection
• A total of 160,266 Javanese prompts were recorded by 772 participants.
• The total duration of the Javanese data is 257.7 hours.
• A total of 32,929 Icelandic prompts were recorded by 203 participants.
• The total duration of the Icelandic data is 32.2 hours. This is in addition to the Málrómur database [2] already collected.
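For the error feedback system, the per-utterance score Score = 1 - min(PER, 1) and the five-utterance running average reported in the quality bar might be computed as in the sketch below. The class and function names are hypothetical. The poster does not say what is shown before five utterances have been recorded, so averaging over the utterances available so far is an assumption.

```python
from collections import deque


def utterance_score(per: float) -> float:
    """Score = 1 - min(PER, 1): clipping keeps the score within [0, 1]."""
    return 1.0 - min(per, 1.0)


class QualityBar:
    """Running average of the most recent utterance scores."""

    def __init__(self, window: int = 5) -> None:
        # A deque with maxlen drops the oldest score automatically.
        self._scores = deque(maxlen=window)

    def update(self, per: float) -> float:
        """Add one utterance's PER and return the current running average.

        Assumption: with fewer than `window` utterances so far, the
        average is taken over the utterances recorded up to now.
        """
        self._scores.append(utterance_score(per))
        return sum(self._scores) / len(self._scores)
```

Calling `update` once per recorded utterance yields the value that a client could map onto the quality bar. A PER of 1 or more gives a score of 0, so a single badly misread prompt lowers the average without driving it negative.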

