Automatic Lip-Synchronization Using Linear Prediction of Speech (Christopher Kohnert, SK Semwal)


Automatic Lip-Synchronization Using Linear Prediction of Speech

Christopher Kohnert, SK Semwal

University of Colorado, Colorado Springs

Topics of Presentation

Introduction and Background
Linear Prediction Theory
Sound Signatures
Viseme Scoring
Rendering System
Results
Conclusions

Justification

Need: existing methods are labor intensive, give poor results, and are expensive
Solution: an automatic method with "decent" results

Applications of Automatic System

Typical applications benefiting from an automatic method:
Real-time video communication
Synthetic computer agents
Low-budget animation scenarios (e.g. the video game industry)

Automatic Is Possible

Spoken word is broken into phonemes
Phonemes are comprehensive
Visemes are their visual correlates
Used in lip-reading and traditional animation

Existing Methods of Synchronization

Text based: analyze text to extract phonemes
Speech based: volume tracking, speech-recognition front end, linear prediction
Hybrids: text & speech, image & speech

Speech Based Is Best

Doesn't need a script
Fully automatic
Can use the original sound sample (best quality)
Can use the source-filter model

Source-Filter Model

Models a sound signal as a source passed through a filter
Source: lungs & vocal cords
Filter: vocal tract
Implemented using Linear Prediction

Speech Related Topics

Phoneme recognition: how many to use?
Mapping phonemes to visemes: use visually distinctive ones (e.g. vowel sounds)
Coarticulation effect

The Coarticulation Effect

The blending of sounds based on adjacent phonemes (common in everyday speech)
An artifact of discrete phoneme recognition
Causes poor visual synchronization (transitions are jerky and unnatural)

Speech Encoding Methods

Pulse Code Modulation (PCM)
Vocoding
Linear Prediction

Pulse Code Modulation

Raw digital sampling
High quality sound
Very high bandwidth requirements

Vocoding

Stands for VOice enCODing
Origins in military applications
Models physical entities (tongue, vocal cords, jaw, etc.)
Poor sound quality ("tin can" voices)
Very low bandwidth requirements

Linear Prediction

A hybrid of PCM and vocoding
Models sound source and filter separately
Uses the original sound sample to calculate recreation parameters (minimum error)
Low bandwidth requirements
Pitch and intonation independence

Linear Prediction Theory

Based on the source-filter model: a source signal driving a filter
P coefficients are calculated, predicting each sample from the P before it:
s_t ≈ a_1·s_{t-1} + a_2·s_{t-2} + … + a_P·s_{t-P}

[Diagram: source -> filter]

Linear Prediction Theory (cont.)

The a_k coefficients are found by minimizing the error between the original sound (s_t) and the reconstructed sound (ŝ_t)
Can be solved using Levinson-Durbin recursion
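The deck doesn't spell the recursion out; a minimal sketch of the standard Levinson-Durbin solution to the LPC normal equations (the function name and NumPy usage are illustrative, not from the paper):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the LPC polynomial
    A(z) = 1 + a_1 z^-1 + ... + a_P z^-P, given autocorrelations r[0..P]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                      # prediction error; shrinks at each order
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err              # reflection coefficient for order i
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    # Return predictor form: s_t ≈ sum_k coef[k-1] * s_{t-k}
    return -a[1:], err
```

For an AR(1)-like autocorrelation [1, 0.9, 0.81], the order-2 predictor comes out as [0.9, 0.0]: the second coefficient is zero because one lag already explains the signal.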

Linear Prediction Theory (cont.)

The coefficients represent the filter part
The filter is assumed constant over small "windows" of the original sample (10-30 ms each)
Each window has its own coefficients
The sound source is either a pulse train (voiced) or white noise (unvoiced)
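The per-window analysis above can be sketched as follows; the slide gives only the 10-30 ms range, so the 20 ms default, non-overlapping windows, and the direct Toeplitz solve are assumptions for illustration:

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=20.0):
    """Split a mono signal into consecutive, non-overlapping windows."""
    win = int(sample_rate * win_ms / 1000.0)
    n_frames = len(signal) // win
    return signal[:n_frames * win].reshape(n_frames, win)

def lpc_per_frame(frames, order=16):
    """Per-window LPC coefficients via the autocorrelation method,
    solving the Toeplitz normal equations directly."""
    coeffs = []
    for f in frames:
        # Autocorrelation lags 0..order for this window
        r = np.array([np.dot(f[:len(f) - k], f[k:]) for k in range(order + 1)])
        # Toeplitz system R @ a = r[1:]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        coeffs.append(np.linalg.solve(R, r[1:]))
    return np.array(coeffs)
```

At an 8 kHz sample rate a 20 ms window is 160 samples, so 0.2 s of audio yields 10 coefficient sets.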

Linear Prediction for Recognition

Recognition on the raw coefficients is poor
Better to FFT the values
Take only the first "half" of the FFT'd values
This is the "signature" of the sound

Sound Signatures

16 values represent the sound
Speaker independent
Unique for each phoneme
Easily recognized by machine
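The slides don't give the exact FFT size or the distance metric used for matching; a hypothetical sketch, assuming a 32-point FFT of the coefficient vector (so the first half is 16 values) and Euclidean nearest-neighbour matching:

```python
import numpy as np

def sound_signature(lpc_coeffs, n_fft=32):
    """Turn raw LPC coefficients into a 16-value 'signature': FFT the
    coefficient vector and keep the first half of the magnitude spectrum
    (the second half mirrors the first for real input)."""
    spectrum = np.abs(np.fft.fft(lpc_coeffs, n=n_fft))
    return spectrum[: n_fft // 2]

def closest_phoneme(signature, reference_set):
    """Nearest-neighbour match against stored reference signatures,
    using Euclidean distance (an assumption; the deck names no metric)."""
    names = list(reference_set)
    dists = [np.linalg.norm(signature - reference_set[n]) for n in names]
    return names[int(np.argmin(dists))]
```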

Viseme Scoring

Phonemes were chosen judiciously to map one-to-one to visemes
Visemes are scored independently using history:
V_i = 0.9 * V_{i-1} + 0.1 * {1 if matched at window i, else 0}
The score ramps up and down with successive matches/mismatches
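The scoring update on this slide can be written directly (the dictionary-of-scores representation is an illustrative choice, not from the deck):

```python
def update_viseme_scores(prev_scores, matched, decay=0.9):
    """One step of the slide's smoothing: V_i = 0.9 * V_{i-1} + 0.1 * m,
    where m is 1 if this viseme matched the current window, else 0.
    `prev_scores` maps viseme name -> score; `matched` names the viseme
    recognized in the current window (or None for no match)."""
    return {v: decay * s + (1.0 - decay) * (1.0 if v == matched else 0.0)
            for v, s in prev_scores.items()}
```

Starting from 0, one match lifts a score to 0.1, and repeated matches ramp it asymptotically toward 1, which is what makes transitions less jerky than raw per-window matching.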

Rendering System

Uses Alias|Wavefront's Maya package
Built-in support for "blend shapes": very expressive and flexible
Blend-shape weights are mapped directly to viseme scores
A script is generated and later read in
Rendered to a movie; QuickTime is used to add in the original sound and produce the final movie

Results (Timing)

Precise timing can be achieved
Smoothing introduces "lag"

[Video clips]

Results (Other Examples)

A female speaker using the male phoneme set
[Video clip]

Slower speech, male speaker
[Video clip]

Results (Other Examples) (cont.)

Accented speech at a fast pace
[Video clip]

Results (Summary)

Good with basic speech
Good speaker independence (for normal speech)
Poor performance when speech:
  is too fast
  is accented
  contains phonemes not in the reference set (e.g. "w" and "th")

Conclusion

Linear Prediction provides several benefits:
  speaker independence
  easy to recognize automatically
Results are reasonable, but can be improved

Future Work

Identify the best set of phonemes and visemes
Phoneme classification could be improved with a better matching algorithm (neural net?)
A larger phoneme reference set for more robust matching

Results

Simple cases work very well
Timing is good and very responsive
Robust with respect to speaker: cross-gender, multiple male speakers
Fails on accents, fast speech, and unknown phonemes
Problems with noisy samples
Can be smoothed, but smoothing introduces "lag"

End

Automatic Is Possible

Spoken word is broken into phonemes
Phonemes are comprehensive
Visemes are visual correlates
Used in lip-reading and traditional animation
Physical speech (vocal cords, vocal tract) can be modeled
Source-filter model

Sound Signatures (Speaker Independence)

Sound Signatures (For Phonemes)

[Figure]

Results (Normal Speech)

Normal speech, moderate pace
[Video clip]
