A Recognition Model for Speech Coding
Wendy Holmes, 20/20 Speech Limited, UK
A DERA/NXT Joint Venture
Introduction
• Speech coding at low data rates (a few hundred bits/s) requires a compact, low-dimensional representation.
  => code variable-length speech “segments”.
• Automatic speech recognition is potentially a powerful way to identify useful segments for coding.
• BUT: HMM-based coding has limitations:
  – shortcomings of HMMs as production models
  – typical recognition feature sets (e.g. cepstral coefficients) impose limits on coded speech quality
  – difficult to retain speaker characteristics (at least for speaker-independent recognition).
A “unified” model for speech coding
A simple coding scheme
• Demonstrate principles of coding using the same model for both recognition and synthesis.
• Model represents linear formant trajectories.
• Recognition: linear trajectory segmental HMMs of formant features.
• Synthesis: JSRU parallel-formant synthesizer.
• Coding is applied to analysed formant trajectories
=> relatively high bit-rate (typically 600-800 bits/s).
• Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.
Segment coding scheme overview
Formant analyser (EUROSPEECH’97)
– Each formant frequency estimate is assigned a value representing confidence in its measurement accuracy. When formants are indistinct, confidence is low.
– In cases of ambiguity, the analyser offers two alternative sets of formant trajectories for resolution in the recognition process.
[Figure: example utterance “four seven”]
Linear formant trajectory recognition
• Feature set: formant frequencies plus mel-cepstrum coefficients and overall energy feature.
• Confidences: represent as variances: low confidence => large variance. Add confidence variance to model variance, so low-confidence features have little influence.
• Formant alternatives: choose one giving highest probability for each possible data segment and model state.
• Numbers of segments depend on phone identity: e.g. 1 segment for fricatives; 3 for voiceless stops.
• Range of durations: segment-dependent minimum and maximum segment duration.
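The confidence handling and the choice between formant alternatives can be illustrated with a much simplified sketch. It assumes independent 1-D Gaussians for a single frame with a fixed mean, whereas the recognizer described here uses linear-trajectory segmental HMMs and scores whole segments. The function names and the numbers in the usage note are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def frame_log_prob(obs, conf_vars, means, model_vars):
    """Sum per-feature log densities for one frame.

    The confidence variance of each feature is added to the model
    variance, so a low-confidence (large-variance) feature changes the
    score very little whether or not it matches the model mean.
    """
    return sum(
        log_gauss(x, m, mv + cv)
        for x, cv, m, mv in zip(obs, conf_vars, means, model_vars)
    )

def pick_alternative(alternatives, means, model_vars):
    """Return (index, scores) for the best-scoring formant alternative.

    `alternatives` is a list of (obs, conf_vars) pairs, one per
    candidate set of formant values offered by the analyser.
    """
    scores = [frame_log_prob(o, c, means, model_vars) for o, c in alternatives]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

For instance, with hypothetical model means of 500, 1500 and 2500 Hz, an alternative whose second formant sits at 1480 Hz scores higher than one placing it at 1900 Hz, so `pick_alternative` returns index 0 when the former is listed first.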
Frame-by-frame synthesizer controls
• Values for each of 10 synthesizer control parameters are obtained at 10 ms intervals:
  – Voicing and fundamental frequency from excitation analysis program.
  – 3 formant frequency controls from formant analyser.
  – 5 formant amplitude controls from FFT-based method.
• With 6 bits assigned to each of the 10 controls, the baseline data rate is 6000 bits/s.
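The baseline figure follows directly from the frame rate and the bit allocation. The short calculation below reproduces it; the split of the two excitation controls into one voicing and one fundamental-frequency control is an assumption inferred from the total of 10, and the constant names are illustrative.

```python
FRAME_INTERVAL_MS = 10      # one set of control values every 10 ms
BITS_PER_CONTROL = 6

# Assumed breakdown of the 10 controls: voicing, fundamental frequency,
# 3 formant frequencies, 5 formant amplitudes.
N_CONTROLS = 1 + 1 + 3 + 5

frames_per_second = 1000 // FRAME_INTERVAL_MS
baseline_bits_per_second = frames_per_second * N_CONTROLS * BITS_PER_CONTROL
print(baseline_bits_per_second)  # 6000
```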
Segment coding
– Segments identified by recognizer are coded using straight-line fits to observed formant parameters.
– Use a least mean square error criterion. For formant frequencies, frame error is weighted by confidence variance. Thus the more reliable frames have more influence.
– To code a segment, represent value at start, and difference of end value from start value.
– Force continuity across segment boundaries where smooth changes are required for naturalness (e.g. semivowel-vowel boundaries).
– When there are formant alternatives, use those selected by recognizer.
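The straight-line fit and the start-plus-difference representation can be sketched as follows. This is a minimal illustration, not the coder itself: it assumes each frame is weighted by the inverse of its confidence variance (with a small floor to avoid division by zero), it handles one parameter in one segment, and it omits quantization and the continuity constraint across segment boundaries. The function names and the `floor` parameter are hypothetical.

```python
def fit_segment(values, conf_vars, floor=1.0):
    """Weighted least-squares straight-line fit over one segment.

    Assumption: weight = 1 / (confidence variance + floor), so reliable
    frames (small variance) have more influence on the fit.
    Returns (start, delta): the fitted value at the first frame and the
    difference of the fitted end value from that start value.
    """
    n = len(values)
    w = [1.0 / (v + floor) for v in conf_vars]
    t = range(n)
    sw = sum(w)
    st = sum(wi * ti for wi, ti in zip(w, t))
    sy = sum(wi * yi for wi, yi in zip(w, values))
    stt = sum(wi * ti * ti for wi, ti in zip(w, t))
    sty = sum(wi * ti * yi for wi, ti, yi in zip(w, t, values))
    denom = sw * stt - st * st
    if n < 2 or denom == 0.0:
        return sy / sw, 0.0          # single frame: flat trajectory
    slope = (sw * sty - st * sy) / denom
    start = (sy - slope * st) / sw   # fitted value at frame 0
    return start, slope * (n - 1)

def decode_segment(start, delta, n):
    """Rebuild the n-frame linear trajectory from (start, delta)."""
    if n == 1:
        return [start]
    return [start + delta * i / (n - 1) for i in range(n)]
```

With this weighting, a frame whose formant estimate is wildly off but carries a very large confidence variance barely shifts the fitted line, which matches the intent that the more reliable frames dominate.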
Coding experiments
– Tested on 2 tasks: speaker-independent connected digit recognition and speaker-dependent recognition of airborne reconnaissance reports (500-word vocabulary).
– Frame-by-frame analysis-synthesis (at 6000 bits/s) generally produced a close copy of original speech.
– Segment-coded versions preserved main characteristics.
– There were some instances of formant analysis errors.
– In some cases, using the recognizer to select between alternative formant trajectories improved segment coding quality.
– In general, coding still works well even if there are recognition errors, as main requirement is to identify suitable segments for linear trajectory coding.
Speech Coding results
Coded at about 600 bits/s
– Speaker 1: digits
– Speaker 2: digits
– Speaker 3: digits
– Speaker 1: ARM report
Natural
– Speaker 1: digits
– Speaker 2: digits
– Speaker 3: digits
– Speaker 1: ARM report
Achievements of study: Established principle of using formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.
Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.