amsp : advanced methods for speech processing an expression of interest to set up a network of...

AMSP : Advanced Methods for Speech Processing

An expression of Interest to set up a Network of Excellence in FP6

Prepared by members of COST-277 and colleagues

Submitted by Marcos FAUNDEZ-ZANUY

Presented here by Gérard [email protected] GET-ENST/CNRS-LTCI

http://www.tsi.enst.fr/~chollet

Outline

Rationale of the proposition Objectives Approaches Modeling Recognition by synthesis Robustness to environmental conditions Evaluation paradigm Excellence Integration and structuring effect

Rationale for the NoE-AMSP

The areas of Automatic Speech Processing (recognition, synthesis, coding, language identification, speaker verification) should be better integrated

Better models of Speech Production and Perception

Investigate Nonlinear Speech Processing Understanding, Semantic interpretation

Integrated platform for Automatic Speech Processing

DISCRETEMODELS

SYN

TH

ET

IC SP

EE

CH

HU

MA

N S

PE

EC

HCODED SPEECH

WRITTEN SPEECH

TtSStT

StCCtS

Analysis Synthesis

Recogn.

Cod

ing

Levels of representationsONE-LAYER CODES

MULTI-LAYER CODES

PCM

LPC,CELP

PLP,WLP

DiscreteModels

Orthography,IPA

No Models, One Quality Layer

Source-Filter Model (SFM)Two Quality Layers

SFM + Perceptual Aspects (PA)Two Quality Layers

SFM + PA + ArticulatoryAspects & Dynamics (AA)Three or more Quality Layers

SFM + PA + AA +Language Specific AspectsMany Quality Layers

Features of Speech Models

Reflect auditory properties of human perception Explain articulatory movements Surpass the limitations of the source-filter model Capture the dynamics of speech Capable of natural speech restitution Be discriminant for segmental information Robust to noise and channel distortions Adaptable to new speakers and new

environments

Time – Frequency distributions

Short Time Fourier Transform Non-linear frequency scale (PLP, WLP), mel-

cepstrum Wavelets, FAMlets Bilinear distributions (Wigner-Ville, Choi-Williams,...) Instantaneous frequency, Teager operator Time – dependent representations (parametric and

non parametric) Vector quantisation Matrix quantisation, non linear prediction

Time-dependent Spectral Models

Temporal Decomposition (B. Atal, 1983)

Vectorial Autoregressive models with detection of model ruptures (A. DeLima, Y. Grenier)

Segmental parameterisation using a time-dependent polynomial expansion (Y. Grenier)

Modeling of segmental units

Hidden Markov Model Markov Fields Bayesian Networks, Graphical ModelsOR Production models Synthesis (concatenative or rule based) with voice transformationAND / OR Non linear predictor

Expected achievements in Speech Coding and Synthesis

Modeling the non-linearities in Speech Production and Perception will lead to more accurate and/or compact parametric representations.

Integrate segmental recognition and synthesis techniques in the coding loop to achieve bit rates as low as a few 100's bps with natural quality

Develop voice transformation techniques in order to : Adapt segmental coders to new speakers, Modify the characteristics of synthetic voices

Expected achievements inSpeech Synthesis

Self-excited nonlinear feedback oscillators will allow to better match synthetic and human voices.

Current concatenative techniques should be supplemented (or replaced) by (nonlinear) model based generative techniques to improve quality, naturalness, flexibility, training and adaptation.

Model-based voice mimicry controled by textual, phonetic and/or parametric input should not only improve synthesis but also coding, recognition and speaker characterisation.

Automatic Speech Recognition

Limitations of the HMM and hybrid HMM-ANN approaches

Keyword spotting (detection with SVM), noise robustness, adaptation

Large Vocabulary Speech Recognition (SIROCCO) http://perso.enst.fr/~sirocco/index-en.html

Markov Random Fields, Bayesian Networks and Graphical Models

Markov Random Fields Bayesian Networks

and Graphical Models

• Speech modelling with state constrained Markov Random Field over Frequency bands (Guillaume Gravier and Marc Sigelle) http://perso.enst.fr/~ggravier/recherche.html#these

• Comparative framework to study MRF, Bayesian Networks and Graphical Models. http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html

Recognition by Synthesis

If we could drive a synthesizer with meaningful units (phone sequences, words,...) to produce a speech signal that mimics the one to recognize, we may come close to transcription.

Analysis by Synthesis (which is in fact modeling) is a powerful tool in recognition and coding.

A trivial implementation is indexing a labelled speech memory

A L I S P

Automatic Language Independent Speech Processing

Automatic discovery of segmental units for speech coding, synthesis, recognition, language

identification and speaker verification.

The robustness issue :

Mismatch between training and testing conditions

High Order Statistics are less sensitive to environment and transmission noise than autocorrelation

CMS, RASTA filtering Independent Component Analysis

From Speaker Independent to Speaker Dependent recognition (Personalisation)

Expected achievements inAutomatic Speech Recognition

Dynamic nonlinear models should allow to merge feature extraction and classification under a common paradigm

Such models should be more robust to noise, channel distortions and missing data (transmission errors and packet losses)

Indexing a speech memory may help in the verification of hypotheses (a technique shared with Very Low Bit Rate Coders)

Statistical language models should be supplemented with adapted semantic information (conceptual graphs)

Voice technology in Majordome

Server side background tasks:continuous speech recognition applied to voice messages upon reception Detection of sender’s name and subject

User interaction: Speaker identification and verification Speech recognition (receiving user

commands through voice interaction) Text-to-speech synthesis (reading text

summaries, E-mails or faxes)

Collaboration with COST-278

COST-278: Vocal Dialogue is a continuation of COST-249 High interest in Robust Speech Recognition,

Word spotting, Speech to actions, Speaker adaptation,...

Some members contribute to the Eureka-MAJORDOME project

Could be the seed for a Network of Excellence in FP6

Evaluation paradigm

DARPA NIST

http://www.nist.gov/speech/tests/spk/index.htm

Could we organize evaluation campaigns in Europe ?

The 6th program of the EU is trying to promote Networks of Excellence.

How should excellence be evaluated ?Should financial support be correlated with

evaluation results ?

amsp : advanced methods for speech processing an expression of interest to set up a network of...

Documents