
International Research Institute MICA: Multimedia, Information, Communication & Applications

UMI 2954

Hanoi University of Science and Technology

1 Dai Co Viet - Hanoi - Vietnam

Multimedia Database

(AC5210)

Mac Dang Khoa, SpeechCom department

Le Thi Lan, ComVis department


Audio and Speech database

AC5210 2017

Classification

Environment

Animal

Synthetic sound

Sound effect

Speech

Language archiving

Speech processing technology

ASR, TTS, and others: identification, verification, authentication

Example of Sound banks


Speech database

Speech and language archiving and documentation

Languages in the world

Distribution of language speakers:

50% of the population speaks 10 languages

95% of the population speaks 5% of the (6,500) languages

Languages in the world

Endangered languages

Languages disappearing ¹

Since 1950, 421 languages have disappeared

“1/3 of the world’s languages are in danger of disappearing in the next few decades.”

One language dies every 14 days

Language changing, mixing

¹ http://www.sil.org, https://www.ethnologue.com

Languages saving => Documentation

Language documentation

“the methods, tools, and theoretical underpinnings for compiling a representative and lasting multipurpose record of a natural language or one of its varieties” (Himmelmann 1998)

To preserve languages (endangered languages)

Material for language study

Material for natural language and speech processing

Collecting: recording, taking pictures, gathering written documents, ...

Processing: analysing, systematizing, transcribing, translating, ...

Archiving: storing, publishing

Among 7000 languages:

< 600 well-documented languages

3,349 unwritten languages

Community

Vietnamese minority languages

54 ethnic groups

Vietnamese (Kinh): 87%

5 ethnic groups with fewer than 1,000 people

MICA’s AuCo collection

ÂuCơ: Audio Corpora

Since 2007

Language documentation:

Vietnam and neighbors

Minorities

Collection

Digitization

Documentation: annotation, transcription

Analysis

Archiving: online access

DoReMiFa project

Données des Recherches linguistiques de Michel Ferlus en Asie du sud-est (linguistic research data of Michel Ferlus in Southeast Asia)

Digitizing the collections of Michel Ferlus (1963-2003)

2014 – 2015

> 40 ethnic languages

> 200 cassette tapes, recorded from 1963 to 2003

Project group

6 linguists + IT expert

> 20 linguistics students

Data collection

Available recorded data

Fieldwork recording

Data collection

Wordlist

Common/standard wordlist for fieldwork

> 2000 words

Available: HAL

Digitization

Digital signal (WAV, 24-bit, 48,000 Hz)
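A minimal sketch of checking that a digitized file matches the target format above, using Python's standard `wave` module. The function names and the silent test signal are assumptions made for illustration; they are not part of the project's tooling.

```python
import wave

TARGET_SAMPLE_WIDTH = 3      # bytes per sample, i.e. 24-bit
TARGET_FRAME_RATE = 48_000   # Hz


def write_silence(path, seconds=1, channels=1):
    """Write a silent 24-bit / 48 kHz WAV file (illustrative test signal only)."""
    n_frames = TARGET_FRAME_RATE * seconds
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(TARGET_SAMPLE_WIDTH)
        w.setframerate(TARGET_FRAME_RATE)
        # All-zero bytes are digital silence in signed PCM.
        w.writeframes(bytes(n_frames * channels * TARGET_SAMPLE_WIDTH))


def matches_target(path):
    """Return True if the WAV file is 24-bit at 48,000 Hz."""
    with wave.open(path, "rb") as w:
        return (w.getsampwidth() == TARGET_SAMPLE_WIDTH
                and w.getframerate() == TARGET_FRAME_RATE)


def bytes_per_hour(channels=1):
    """Uncompressed audio payload for one hour of recording, in bytes."""
    return TARGET_FRAME_RATE * TARGET_SAMPLE_WIDTH * channels * 3600
```

At these settings one hour of mono audio takes 518,400,000 bytes (about 0.5 GB) before any header overhead, which gives a rough idea of the storage a tape collection needs.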

Processing

Transcription: manual

Done by linguists / phoneticians

Time-consuming:

> 10 hours of work to transcribe 1 hour of speech (word level)

> 100 hours of work to transcribe 1 hour of speech (phone level)
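To make the cost concrete, here is a small worked calculation based on the lower bounds above. The 8-hour working day is an assumption, and the 120 hours of recordings is the collection size quoted on the Publishing slide.

```python
# Lower bounds from the slide: hours of work per hour of recorded speech.
HOURS_OF_WORK_PER_AUDIO_HOUR = {"word": 10, "phone": 100}


def min_transcription_hours(audio_hours, level):
    """Minimum manual effort, in hours, to transcribe `audio_hours` of speech."""
    return audio_hours * HOURS_OF_WORK_PER_AUDIO_HOUR[level]


def person_days(work_hours, hours_per_day=8):
    """Convert work hours to person-days (8-hour day is an assumption)."""
    return work_hours / hours_per_day


word_hours = min_transcription_hours(120, "word")    # 1200 hours
phone_hours = min_transcription_hours(120, "phone")  # 12000 hours
print(person_days(word_hours), person_days(phone_hours))  # 150.0 1500.0
```

Even at the lower bound, a phone-level transcription of the whole collection would take more than 1,500 person-days, which motivates the semi-automatic approach.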

Processing (2)

Transcription: semi-automatic

Multilingual speech recognition

[Diagram: input speech → multilingual acoustic-phonetic recognizer → phone sequence (hypothesis) → TextGrid conversion → multilingual TextGrid and X-SAMPA TextGrid]
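The TextGrid conversion step can be sketched as follows. This assumes the recognizer's hypothesis is available as a list of `(start, end, label)` tuples in seconds; that input format, the function name, and the tier name are illustrative assumptions, not the project's actual interface. The output follows Praat's long TextGrid text format with a single interval tier.

```python
def phones_to_textgrid(phones, tier_name="phones"):
    """Serialize contiguous, time-sorted (start, end, label) phone intervals
    as a Praat TextGrid (long text format, one IntervalTier)."""
    if not phones:
        raise ValueError("need at least one interval")
    xmin, xmax = phones[0][0], phones[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(phones)}",
    ]
    for i, (start, end, label) in enumerate(phones, start=1):
        escaped = label.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


# Hypothetical two-phone hypothesis, for illustration only.
print(phones_to_textgrid([(0.0, 0.12, "t"), (0.12, 0.30, "a")]))
```

A linguist can then open the resulting file in Praat next to the recording and correct the automatic labels instead of transcribing from scratch.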

Processing (3)

Experiments on phone-level transcription

Green Mong (Mo Piu) language [1]

• < 500 speakers, unwritten language, little prior linguistic study

• Method: acoustic models are supplied by 5 languages: Vietnamese (VN), Mandarin (CH), Khmer (KH), French (FR), English (EN), each trained on a large corpus

Na language [2]

• Acoustic models from 5 languages: 40 English phones, 43 French phones, 34 Mandarin phones, 41 Vietnamese phones, and 36 Khmer phones

1. Caelen-Haumont G, Sam S, Castelli E (2011) Automatic Labeling and Phonetic Assessment for an Unknown Asian Language: The Case of the “Mo Piu” North Vietnamese Minority (early results). In: Asian Language Processing (IALP), 2011 International Conference on. IEEE, pp 260–263

2. Thi-Ngoc-Diep DO, Alexis M, Eric C (2015) Towards the Automatic Processing of Yongning Na (Sino-Tibetan): Developing a “Light” Acoustic Model of the Target Language and Testing “Heavyweight” Models from Five National Languages. In: The 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU’14), St Petersburg, Russia

Processing (4)

Phone transcription result for Mo Piu

[Figure: automatic labeling compared with manual labeling]

Processing (5)

Phone transcription result for Mo Piu

[Figure: phone labels by source language (CH, FR, VN, KH, EN), marked as good, close acoustic, or wrong]

Archiving (1)

Open Language Archives Community (OLAC)

Archiving (2)

Participating Archives

Archiving (3)

Pangloss collection:

Archiving (5)

Data format standard:

Data packaging

Annotation

TextGrid

Wordlist

XML generation
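As a rough sketch of the XML generation step, the snippet below packages a time-aligned wordlist with Python's standard `xml.etree.ElementTree`. The element and attribute names (`WORDLIST`, `W`, `AUDIO`, `FORM`, `TRANSL`) and the entry fields are assumptions chosen for illustration; the archive's real schema should be taken from its own documentation.

```python
import xml.etree.ElementTree as ET


def wordlist_to_xml(entries, lang_code):
    """Build an XML string from time-aligned wordlist entries.

    entries: list of dicts with 'start', 'end' (seconds), 'form', 'gloss'.
    The element names used here are hypothetical, not a confirmed schema.
    """
    root = ET.Element("WORDLIST", {"lang": lang_code})
    for i, entry in enumerate(entries, start=1):
        word = ET.SubElement(root, "W", {"id": f"w{i}"})
        # Time codes link each word to its position in the recording.
        ET.SubElement(word, "AUDIO", {
            "start": f"{entry['start']:.3f}",
            "end": f"{entry['end']:.3f}",
        })
        ET.SubElement(word, "FORM").text = entry["form"]
        ET.SubElement(word, "TRANSL", {"lang": "en"}).text = entry["gloss"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical entry; "und" is the ISO 639 code for an undetermined language.
print(wordlist_to_xml(
    [{"start": 0.0, "end": 0.85, "form": "ma", "gloss": "example"}], "und"))
```

In practice the `start`, `end`, and `form` values would come from the TextGrid annotation, and the glosses from the fieldwork wordlist.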

Publishing

Current status:

- 42 languages/dialects

- 120 hours of recordings

- 100 annotated documents

Publishing (2)

Examples: http://lacito.vjf.cnrs.fr/archivage/index_en.htm


Speech database

For speech processing

The task-specific voice control and dialog system

Speech recognizer (uses vocabulary & grammar): converts spoken input into grammatically correct text

Language analyzer (uses semantics): extracts meaning from text

Expert system: selects desired action, issues commands to system, constructs reply in text form

Text-to-speech synthesizer (uses pronunciation): converts text reply into machine-generated speech

System under voice control: executes commands, reports status

Data flow: (speech) → speech recognizer → (text) → language analyzer → (meaning) → expert system → (reply text) → text-to-speech synthesizer → (speech) output; the expert system also outputs the action

Resources shown in the diagram: transcribed speech corpus, text corpus

Position of text corpus and speech corpus

Corpus building

Recording

High quality

Well controlled

Not natural

Expensive

=> Specific purposes, text-to-speech

Crowdsourcing

Source: available speech sources

Size: huge

Different types/quality

Natural quality

Not expensive for a big corpus

=> ASR

Evaluation problems

How to evaluate an ASR (or ASPR) system?

Tests on common databases (benchmarks)

Evaluation campaigns (DARPA for ASR and NIST for

For French: AUPELF

Common databases
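The slide does not name a scoring metric. A common choice for benchmark tests of ASR is word error rate (WER), used here as an assumption: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))  # 0.5 (one substitution, one deletion)
```

Scoring every system on the same test set with the same metric is what makes results from different laboratories comparable.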

Some speech databases

TIMIT: 630 American speakers, recorded in good conditions, 1 recording session

Bref80: 80 French-speaking speakers reading texts from the newspaper Le Monde (5,330 sentences)

M2VTS: multimodal database (voice + face images)

Switchboard: English speech, telephone quality, several recording sessions

There are very few databases for mobile phone speech:

CTIMIT (TIMIT corpus re-recorded through a cellular mobile phone inside a moving truck)

Cellular Switchboard


VNSpeechCorpus

Text corpus - Vietnamese: data collection and normalization (1/2)

Remark:

A large text corpus can be collected from the Web. Such a corpus contains a large number of words in different contexts and from different domains.

It is very useful for analyzing acoustic units, because it can represent the statistical distribution of the Vietnamese language in general.

Text corpus - Vietnamese: data collection and normalization (2/2)

Web page collection and data preparation:

Documents were gathered from the Internet by web robots

Constructing the text corpus from HTML pages.

Normalizing or rewriting non-standard words.

[Figure: a web page consists of main contents plus menus, links, references, and advertisements; this redundancy must be removed]

[Diagram: data collection (HTML, 2.5 GB) → normalization → normalized text (TXT, 868 MB)]

Text corpus: data collection and normalization (2/3)

Web page collection and data preparation:

Documents were gathered from the Internet by a web robot

Constructing the text corpus from HTML pages.

Normalizing or rewriting non-standard words.

All characters were converted to Unicode (UTF-8) by our tools.

[Diagram: www → data collecting → html2text → data preparation; the toolchain is divided into fixed modules and variable modules]

Data preparation steps:

1. token normalizing

2. character converting

3. sentence splitting

4. word splitting

5. case changing

6. lexicon constructing

7. number2text

8. sentence filtering
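A few of the modules listed above (html2text, character converting, sentence splitting, case changing, lexicon constructing) can be sketched with the Python standard library. This is an illustrative approximation, not the authors' tools: the function names other than `html2text` are assumptions, and splitting on whitespace yields Vietnamese syllables rather than true multi-syllable words.

```python
import re
import unicodedata
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html2text(html):
    """Strip markup, normalize to Unicode NFC, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # NFC composes base letters and tone marks into single code points.
    text = unicodedata.normalize("NFC", " ".join(parser.chunks))
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_lexicon(sentences):
    """Count lower-cased, whitespace-delimited tokens (syllables in Vietnamese)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts


page = "<p>Xin chào. Tôi là sinh viên!</p><script>var x = 1;</script>"
sentences = split_sentences(html2text(page))
print(sentences)
print(build_lexicon(sentences).most_common(3))
```

Removing menus, links, and advertisements, as well as number2text and sentence filtering, would need site-specific rules and language-specific resources, so they are left out of this sketch.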

Text corpus: data collection and normalization (3/3)

All redundancy removed
