multimedia database (ac5210)
Post on 28-Jan-2022
International Research Institute MICA (Multimedia, Information, Communication & Applications)
UMI 2954
Hanoi University of Science and Technology
1 Dai Co Viet - Hanoi - Vietnam
Multimedia Database
(AC5210)
Mac Dang Khoa, SpeechCom department
Le Thi Lan, ComVis department
Audio and Speech database
Classification
Audio
Sound
Environment
Animal
Noise
Synthetic sound
Music
Sound effect
Speech
Language achievement
Speech processing technology
ASR, TTS, others:
Identification
Verification
Authentication
Examples of sound banks
Speech database
Speech and language achievement and documentation
Languages in the world
Distribution of speakers across languages:
50% of the population speaks 10 languages
95% of the population speaks 5% of the (6500) languages
Languages in the world
Endangered languages
Languages are disappearing [1]
Since 1950, 421 languages have disappeared
“1/3 of the world’s languages are in danger of disappearing in the next few decades”.
One language dies every 14 days
Languages are changing and mixing
[1] http://www.sil.org, https://www.ethnologue.com
Saving languages => documentation
Language documentation
What
“the methods, tools, and theoretical underpinnings for compiling a
representative and lasting multipurpose record of a natural language or one
of its varieties” (Himmelmann 1998)
To preserve languages (endangered languages)
Material for language study
Material for natural language and speech processing
Tasks
Collecting: recording, taking pictures, gathering written documents, ...
Processing: analysing, systematizing, transcribing, translating, ...
Archiving: storing, publishing
Among 7000 languages:
< 600 well-documented languages
3,349 unwritten languages
Community
Vietnamese minority languages
54 ethnic groups
Vietnamese (Kinh): 87%
5 ethnic groups with < 1000 people
MICA’s AuCo collection
ÂuCơ: Audio Corpora
Since 2007
Language documentation:
Vietnam and neighbors
Minorities
Tasks
Collection
Digitization
Documentation: annotation, transcription
Analysis
Archiving: online access
DoReMiFa project
Données des Recherches linguistiques de Michel Ferlus en Asie du sud-est (linguistic research data of Michel Ferlus in Southeast Asia)
Digitizing the collections of Michel Ferlus (1963-2003)
2014 – 2015
Data
> 40 ethnic languages
> 200 cassette tapes, recorded from 1963 to 2003
Project groups
6 linguists + IT expert
> 20 linguistics students
Data collection
Available recorded data
Fieldwork recording
Data collection
Wordlist
Common/standard wordlist for fieldwork
> 2000 words
Available on HAL
Digitization
Audio tapes -> digital signal (WAV, 24-bit, 48,000 Hz)
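A minimal sketch of how a digitized file could be checked against the target format named on this slide (WAV, 24-bit, 48,000 Hz). It uses only Python's standard `wave` module. The function names and the in-memory test file are assumptions made for illustration, not part of the project's tooling.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_s) of a PCM WAV file."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def matches_archive_format(source, bits=24, rate=48000):
    """True if the file uses the bit depth and sampling rate given on the slide."""
    _, file_bits, file_rate, _ = describe_wav(source)
    return file_bits == bits and file_rate == rate


def make_silent_wav(seconds=1, bits=24, rate=48000):
    """Build an in-memory mono WAV of silence, only to have something to test against."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (bits // 8) * rate * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
    print(matches_archive_format(make_silent_wav(bits=16, rate=44100)))
```

At 24 bits and 48,000 Hz, one mono channel takes 3 x 48,000 = 144,000 bytes per second, i.e. about 518 MB per hour of tape, which is worth knowing when planning storage.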
Processing
Transcription: Manually
By linguists / phoneticians
Time-consuming:
> 10 h of work for 1 hour of speech transcribed at word level
> 100 h of work for 1 hour of speech transcribed at phone level
Processing (2)
Transcription: Semi-automatically
Multilingual speech recognition
Input speech -> acoustic-phonetic recognizer (multilingual AM) -> phone sequence (hypothesis)
Phone sequence -> X-SAMPA TextGrid -> TextGrid conversion -> IPA TextGrid
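The conversion step between the recognizer's phone labels and IPA can be sketched as a simple label mapping. This is only an illustration. The table below covers a handful of well-known X-SAMPA symbols, not the full standard. The tier is assumed to be already loaded as a list of (start, end, label) tuples, since reading and writing the actual TextGrid files is outside the sketch.

```python
# Partial X-SAMPA -> IPA table: a few well-known symbols only, not the full standard.
XSAMPA_TO_IPA = {
    "N": "ŋ", "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "J": "ɲ", "?": "ʔ",
    "@": "ə", "E": "ɛ", "O": "ɔ", "M": "ɯ", "7": "ɤ",
}


def xsampa_label_to_ipa(label):
    """Convert one space-separated X-SAMPA label to IPA; unknown symbols pass through."""
    return " ".join(XSAMPA_TO_IPA.get(symbol, symbol) for symbol in label.split())


def convert_tier(intervals):
    """intervals: list of (start_s, end_s, label) tuples, assumed read from a TextGrid tier."""
    return [(start, end, xsampa_label_to_ipa(label)) for start, end, label in intervals]


if __name__ == "__main__":
    tier = [(0.00, 0.12, "N"), (0.12, 0.30, "a"), (0.30, 0.41, "?")]
    for interval in convert_tier(tier):
        print(interval)
```

Symbols that are identical in both alphabets (such as "a") pass through unchanged, so the table only needs the entries that differ.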
Processing (3)
Transcription: Semi-automatically
Experiments on phone-level transcription
Green Mong (Mo Piu) language [1]
• < 500 speakers, unwritten language, little linguistic study
• Method: 5 languages supply acoustic models: Vietnamese (VN), Mandarin (CH), Khmer (KH), French (FR), English (EN), each trained on a large corpus
Na language [2]
• Acoustic models from 5 languages: 40 English phones, 43 French phones, 34 Mandarin phones, 41 Vietnamese phones, and 36 Khmer phones
[1] Caelen-Haumont G, Sam S, Castelli E (2011) Automatic Labeling and Phonetic Assessment for an Unknown Asian Language: The Case of the "Mo Piu" North Vietnamese Minority (early results). In: Asian Language Processing (IALP), 2011 International Conference on. IEEE, pp 260–263
[2] Thi-Ngoc-Diep DO, Alexis M, Eric C (2015) Towards the Automatic Processing of Yongning Na (Sino-Tibetan): Developing a "Light" Acoustic Model of the Target Language and Testing "Heavyweight" Models from Five National Languages. In: The 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'14), St Petersburg, Russia
Processing (4)
Transcription: Semi-automatically
Phone transcription result for Mo Piu
Automatic labeling
Manual labeling
Processing (5)
Transcription: Semi-automatically
Phone transcription result for Mo Piu
[Figure: bar chart, y-axis in % (0 to 50), legend "good + close acoustic wrong", one group of bars per combination of source-language acoustic models (CH, FR, VN, KH, EN)]
Archiving (1)
Open Language Archives Community (OLAC)
Archiving (2)
Participating Archives
Archiving (3)
Pangloss collection:
Archiving (5)
Data format standard:
Data packaging
Annotation
TextGrid
Wordlist
XML generation
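A rough sketch of the XML generation step, assuming the annotation has already been read from the TextGrid as (start, end, label) intervals. The element and attribute names used here (`document`, `unit`, `audio`, `form`) are hypothetical placeholders chosen for illustration. They are not the schema of any particular archive, which would have to be taken from that archive's own format specification.

```python
import xml.etree.ElementTree as ET


def intervals_to_xml(doc_id, language, intervals):
    """Serialize annotated intervals to a small XML document (hypothetical element names)."""
    root = ET.Element("document", {"id": doc_id, "lang": language})
    for index, (start, end, label) in enumerate(intervals, start=1):
        if not label.strip():
            continue  # skip silent or unlabeled intervals
        unit = ET.SubElement(root, "unit", {"id": f"{doc_id}_U{index}"})
        # Time codes link each unit back to the audio recording.
        ET.SubElement(unit, "audio", {"start": f"{start:.3f}", "end": f"{end:.3f}"})
        ET.SubElement(unit, "form").text = label
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(intervals_to_xml("demo", "und", [(0.0, 0.5, "ba"), (0.5, 0.7, ""), (0.7, 1.2, "má")]))
```

Keeping the time codes next to each transcribed unit is what lets an online archive play the matching audio segment for a word or sentence.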
Publishing
Current status:
- 42 languages/dialects
- 120 hours of recordings
- 100 annotated documents
Publishing (2)
Examples: http://lacito.vjf.cnrs.fr/archivage/index_en.htm
Speech database for speech processing
The task-specific voice control and dialog system
Speech recognizer (vocabulary & grammar model): converts spoken input (speech) into grammatically correct text -> (text)
Language analyzer (semantic rules): extracts meaning from text -> (meaning)
Expert system: selects desired action, issues commands to system, constructs reply in text form -> (reply text), output action
Text-to-speech synthesizer (pronunciation rules): converts text reply into machine-generated speech -> (speech), voice output
System under voice control: executes commands, reports status
Corpora shown in the diagram: transcribed speech corpus, text corpus
Figure: position of text corpus and speech corpus
Corpus building
Recording
High quality
Well controlled
Not natural
Expensive
=> Specific purpose, text-to-speech
Crowdsourcing
Source: available speech sources
Size: huge
Different types/quality
Natural quality
Not expensive for a big corpus
=> ASR
Evaluation problems
How to evaluate an ASR (or ASPR) system?
Tests on common databases (benchmarks)
Evaluation campaigns (DARPA for ASR and NIST for ASPR)
For French: AUPELF
Common databases
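The slide does not name an evaluation metric. As an assumption, the sketch below uses word error rate (WER), a metric commonly reported when ASR systems are compared on a shared test set: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c", "a x c d"))
```

Because every system is scored against the same reference transcripts, the numbers are only comparable when the test database is shared, which is the point of the benchmark campaigns listed above.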
Some speech databases
TIMIT: 630 American speakers, recorded in good conditions, 1 recording session
Bref80: 80 francophone speakers reading texts from the newspaper Le Monde (5330 sentences)
M2VTS: multimodal database (voice + face images)
Switchboard: English speech, telephone quality, several recording sessions
There are very few databases for mobile phone speech:
CTIMIT (TIMIT corpus re-recorded through a cellular mobile phone inside a moving truck)
Cellular Switchboard
VNSpeechCorpus
Text corpus - Vietnamese: Data collection and normalization (1/2)
Remark:
A large text corpus can be collected from the Web. It contains a large number of words in different contexts and domains.
It is very useful for analyzing acoustic units, because it can represent the statistical distribution of the Vietnamese language in general.
Text corpus - Vietnamese: Data collection and normalization (2/2)
Web page collection and data preparation:
Documents were gathered from the Internet by web robots
Constructing the text corpus from HTML pages
Normalizing or rewriting non-standard words
Each page mixes main contents with redundancy (menus, links, references, advertisements): redundancy must be removed!
www -> data collection -> HTML (2.5 GB) -> normalization -> normalized text (868 MB)
Text corpus: Data collection and normalization (2/3)
Web page collection and data preparation:
Documents were gathered from the Internet by a web robot
Constructing the text corpus from HTML pages
Normalizing or rewriting non-standard words
All characters were converted to Unicode (UTF-8) by our tools
Processing chain: www -> data collecting -> html -> html2text -> txt -> data preparation -> sent
Data preparation modules:
1. token normalizing
2. character converting
3. sentence splitting
4. word splitting
5. case changing
6. lexicon constructing
7. number2text
8. sentence filtering
The diagram distinguishes variable modules from fixed modules
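A simplified sketch of a few of the preparation steps named on this slide (html2text, character converting, sentence splitting, number2text, case changing, word splitting, sentence filtering). It is not the authors' toolchain. The regular expressions, the digit-by-digit number reading and the `min_words` threshold are assumptions chosen to keep the example short.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html2text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def number2text(text):
    """Spell out each digit separately (a simplification of real number reading)."""
    return re.sub(r"[0-9]", lambda m: f" {DIGITS[int(m.group())]} ", text)


def prepare_sentences(html, min_words=3):
    # Character converting: one canonical Unicode form for all accented letters.
    text = unicodedata.normalize("NFC", html2text(html))
    prepared = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):   # sentence splitting
        sentence = number2text(sentence).lower()          # number2text, case changing
        words = re.findall(r"\w+", sentence)              # word splitting
        if len(words) >= min_words:                       # sentence filtering
            prepared.append(" ".join(words))
    return prepared


if __name__ == "__main__":
    page = "<body><p>Tôi có 2 con mèo. Xin chào!</p><script>x = 1</script></body>"
    print(prepare_sentences(page))
```

Removing menus, links and advertisements, which the slides stress, needs page-specific rules and is not covered by this sketch.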
Text corpus: Data collection and normalization (3/3)
All redundancy removed
Menus
Links
References
Advertisements
Vietnamese: 2.5 GB of HTML pages → 868 MB of
text corpus = 10,020,267 sentences
Text corpus - Vietnamese: Text corpus evaluation
Perplexity of the language models
Perplexity is used to evaluate the quality of a language model built from a text corpus.
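As an illustration of the measure, perplexity over a test text of N words is 2 raised to the negative average log2 probability that the model assigns to each word, so a lower value means the model predicts the text better. The sketch below assumes the simplest possible model, a unigram model with add-one smoothing. The slides do not say which model order or smoothing was actually used.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word occurrences; returns (counts, total number of words)."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return counts, sum(counts.values())


def perplexity(model, sentences):
    """Perplexity of an add-one-smoothed unigram model on the given sentences."""
    counts, total = model
    vocabulary_size = len(counts) + 1  # one extra slot shared by all unseen words
    log_prob_sum = 0.0
    word_count = 0
    for sentence in sentences:
        for word in sentence.split():
            probability = (counts[word] + 1) / (total + vocabulary_size)
            log_prob_sum += math.log2(probability)
            word_count += 1
    if word_count == 0:
        raise ValueError("no words to evaluate")
    return 2 ** (-log_prob_sum / word_count)


if __name__ == "__main__":
    model = train_unigram(["a b", "a c"])
    print(perplexity(model, ["a b"]), perplexity(model, ["x y"]))
```

Text that resembles the training corpus gets a lower perplexity than text full of unseen words, which is why the measure says something about how well the corpus covers the target domain.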
Speech corpus - Vietnamese: Data collection
Text corpus for recording:
Goal: cover the most frequently used words and sentences, with sufficient variation to support flexible and natural spoken language generation.
Content:
Phonemes, digits and strings of digits, application words.
Sentences, short paragraphs.
Domains:
Law, culture and society, sports, science and technology, policy, medicine, business, weather.
Everyday conversations.
Source: web, books, newspapers.
Quiet studio for recording
Speech corpus - Vietnamese: Requirements of speech corpus
Phonemes: study the acoustic and spectral characteristics of Vietnamese phonemes. Ex: a /a/ , a/a/ - a ha /a - aha/
Tones: study the acoustic and spectral characteristics of the six tones. Ex: ba, bá, bà, bã, bả, bạ
Digits and strings of digits: to build isolated/connected digit recognition/synthesis systems.
Digits: 0-9
Strings of digits: telephone numbers, credit card numbers
Application words: used in speech-controlled systems such as telephone services, human-machine interfaces...
Ex: đóng - close, mở - open, đo - measure...
Sentences and paragraphs: used for training and testing continuous speech recognition systems:
Dialogs, short paragraphs
Speech corpus - Vietnamese: Speaker selection
The most important speaker characteristics:
sex
regional dialectal background
level of education
age and physical health
Our speakers:
Age: from 15 to 45 years old
Of the 50 speakers, 25 are female and 25 are male
From 4 big cities and provinces (Hanoi, Nghe An, Ha Tinh, Ho Chi Minh City), representing the 3 major dialect regions: the North, the Middle and the South
Speech corpus - Vietnamese: VNSpeechCorpus
Common part: 45 minutes of signal covering phonemes, tones, digits and strings of digits, application words, and common sentences and paragraphs:
Contains 4955 isolated words, of which 1257 are different words.
840 mono-words of the text corpus are among the 1327 most used mono-words of the web corpus
Private part: 15 minutes of signal, about 40 short paragraphs
2000 short paragraphs in total (70-80 words each) / 20 subjects
40 short paragraphs per speaker
Speech corpus - Vietnamese: Speech database evaluation (1/6)
Evaluating the corpus by analyzing the distributions of acoustic units, including:
mono-words, base syllables,
Initial-Final parts,
phonemes, di-phones, tri-phones and tones
Compare these distributions with the distributions obtained from the text corpus (web corpus)
Speech corpus - Vietnamese: Speech database evaluation (2/6)
Distribution of mono-phones in common part, private part and Web corpora
Speech corpus - Vietnamese: Speech database evaluation (3/6)
Distribution of six tones in common part, private part and Web corpora
The distributions of acoustic units correspond with the distributions of acoustic units of a huge text corpus (Web)
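A small sketch of how a tone distribution could be counted from written Vietnamese text. It relies on the fact that, after Unicode NFD decomposition, each of the five marked tones appears as its own combining character, and a syllable with no tone mark carries the level tone. The ASCII tone names and function names are assumptions for illustration, and the slides do not describe how the authors computed their counts.

```python
import unicodedata
from collections import Counter

# Combining marks that encode the five marked tones after NFD decomposition.
TONE_MARKS = {
    "\u0301": "sac",    # acute accent
    "\u0300": "huyen",  # grave accent
    "\u0309": "hoi",    # hook above
    "\u0303": "nga",    # tilde
    "\u0323": "nang",   # dot below
}


def syllable_tone(syllable):
    """Return the tone of one written syllable; no tone mark means the level tone."""
    for char in unicodedata.normalize("NFD", syllable):
        if char in TONE_MARKS:
            return TONE_MARKS[char]
    return "ngang"


def tone_distribution(text):
    """Relative frequency of each tone over the whitespace-separated syllables of text."""
    counts = Counter(syllable_tone(s) for s in text.split())
    total = sum(counts.values())
    return {tone: count / total for tone, count in counts.items()}


if __name__ == "__main__":
    print(tone_distribution("ba bá bà bã bả bạ"))
```

Vowel-quality marks such as the circumflex, breve and horn decompose to other combining characters, so they are not mistaken for tones.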
Speech corpus - Vietnamese: Speech database evaluation (4/6)
We calculated the correlation coefficients between the
distributions of the common part and the private part with
the web reference corpora
vector x: occurrence frequency of acoustic units of
VNSpeechCorpus
vector y: occurrence frequency of acoustic units of Text
Corpus.
corr(x,y): correlation coefficient between x and y.
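\[ \mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \]
\[ \mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
\[ \sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]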
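The correlation coefficient corr(x, y) between two frequency vectors can be computed directly from its definition: the covariance of x and y divided by the product of their standard deviations. The sketch below assumes the 1/n (population) form of covariance and standard deviation. The frequency values in the usage example are made up for illustration and are not taken from the corpus.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x, mean_y = mean(x), mean(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)


def std(x):
    """Population standard deviation, i.e. sqrt(cov(x, x))."""
    return math.sqrt(covariance(x, x))


def corr(x, y):
    """Correlation coefficient between two occurrence-frequency vectors."""
    return covariance(x, y) / (std(x) * std(y))


if __name__ == "__main__":
    speech_freq = [0.50, 0.30, 0.20]   # made-up unit frequencies in a speech corpus
    text_freq = [0.45, 0.35, 0.20]     # made-up unit frequencies in a text corpus
    print(corr(speech_freq, text_freq))
```

A value close to 1 means the two corpora rank and weight the acoustic units in nearly the same way.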
Speech corpus - Vietnamese: Speech database evaluation (5/6)
Correlation coefficients of acoustic units between common
part, private part and Web data :
Our corpus is acceptable and correctly balanced in
terms of acoustic units and tones
Speech corpus - Vietnamese: Speech database evaluation (6/6)
Database | No. of speakers | Recording time | Purpose
SPEECHDAT | 5000 | 10'/speaker | Training speech recognition systems via telephone network
BREF | 90 | 40'-70'/speaker | Training and testing speech recognition systems for French
SESP | 45 | 10'/speaker | Speaker recognition
Korean Speech Database | 50; 150 | 10 h office; 10 h studio | Training and testing speech recognition systems for Korean
VNSpeechCorpus | 50 | 50 h office; 50 h studio | Training and testing speech recognition/synthesis systems for Vietnamese
Comparison of VNSpeechCorpus with other speech databases
VNSpeechCorpus +
Goals
100 h of transcribed studio speech
> 100 speakers
> 100 h of natural speech recorded by smartphone
> 100 h of crowdsourced speech
Homework
Next week: presentation
Sound bank examples
Speech Database for ASR
Course projects
Speech/audio crowdsourcing
Read and summarize: Eskenazi, Maxine, Gina-Anne Levow, Helen Meng, Gabriel Parent, and David Suendermann. Crowdsourcing for speech processing: Applications to data collection, transcription and assessment. John Wiley & Sons, 2013
Tools for speech/audio crowdsourcing
Demo: develop a tool for crowdsourcing online audio news (VOV, VTV online, etc.)
Speech/language online collection
OLAC collection: overview, standards
Speech segmentation techniques
Demo: speech segmentation (sentence level)