Video Rewrite: Driving Visual Speech with Audio

Christoph Bregler

Michele Covell

Malcolm Slaney

Interval Research Corporation


Goal: Photo-realistic Talking Face

Handcoded 3D Model

Video Rewrite

OR


Facial Animation History:

• Parke (1972)

• Cohen & Massaro, Benoit et al. (1993)

• Waters & Terzopoulos (1990), DEC-Face

• Lewis (1991)

• Litwinowicz & Williams (1994)

• Chen, Graf, Petajan, et al. (1995)

• Scott et al. (1994)

• Ezzat & Poggio (1997)

• Pighin et al. + Guenter et al. (1998)

• Brand (1999)

• Cosatto, Graf (2000)

Video Rewrite: Overview

Analysis

/D/ /IY/ /P/ /AH/

Synthesis



Annotation

• Phonetic

• Head Pose

• Mouth Shape

/D/ /OH/ /N/ /AH/

Phonetic Annotation

HMM Labels: /D/ /IY/ /P/ /AH/

/D-IY-P/ /IY-P-AH/
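The step from phone labels to overlapping triphones can be sketched as a sliding window of width three. This is only an illustration of the labeling shown on the slide; the function name `triphones` is hypothetical and not taken from the original system.

```python
def triphones(phones):
    """Return every run of three consecutive phones as a 'left-center-right' label."""
    return ["-".join(phones[i:i + 3]) for i in range(len(phones) - 2)]


if __name__ == "__main__":
    # The HMM labels from the slide yield the two triphones shown beneath them.
    print(triphones(["D", "IY", "P", "AH"]))
```

Neighboring triphones share two phones, which is what later allows adjacent clips to overlap.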


• Acoustic Front-End: RASTA-PLP (Channel Invariant)

• HMM Models / Gaussian Mixture Models (HTK)

• Phoneme Set: 56 categories (CMU)

• Triphone models trained on TIMIT

• Annotation using Forced-Viterbi (and CMU pronunciation dictionary)
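Forced-Viterbi annotation can be illustrated with a toy dynamic program: the phone sequence is known in advance (here it would come from the pronunciation dictionary), and the only unknown is which frames belong to which phone. The sketch below assumes one state per phone, no transition probabilities, and a precomputed matrix of per-frame log-likelihoods; the real system used HTK triphone models, which this does not reproduce. The name `forced_align` is hypothetical.

```python
import numpy as np


def forced_align(loglik, phones):
    """Align frames to a known phone sequence.

    loglik: array (T, P) of per-frame log-likelihoods for each phone id (assumed given).
    phones: list of phone ids in the order they are spoken.
    Returns a list of length T with the index into `phones` assigned to each frame.
    """
    T, K = loglik.shape[0], len(phones)
    if K > T:
        raise ValueError("need at least one frame per phone")
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0, 0] = loglik[0, phones[0]]          # alignment must start in the first phone
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1, k]
            advance = score[t - 1, k - 1] if k > 0 else -np.inf
            if advance > stay:                  # move on to the next phone
                score[t, k] = advance + loglik[t, phones[k]]
                back[t, k] = k - 1
            else:                               # remain in the current phone
                score[t, k] = stay + loglik[t, phones[k]]
                back[t, k] = k
    k = K - 1                                   # alignment must end in the last phone
    path = [k]
    for t in range(T - 1, 0, -1):
        k = int(back[t, k])
        path.append(k)
    path.reverse()
    return path


if __name__ == "__main__":
    # Six frames, three phones; each frame strongly favors one phone.
    ll = np.log(np.full((6, 3), 0.05))
    ll[[0, 1], 0] = np.log(0.9)
    ll[[2, 3, 4], 1] = np.log(0.9)
    ll[5, 2] = np.log(0.9)
    print(forced_align(ll, [0, 1, 2]))
```

The frame indices at which the returned value changes are the phone boundaries used to cut the video into labeled segments.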


Head Pose Annotation

match planar template


Mouth / Chin Annotation

Eigenpoints

Eigenpoints - Training -

Graylevel + XY Control Points

Eigenpoints - Mapping -

Graylevel + XY Control Point Space
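One simplified reading of the training and mapping slides is a joint principal component analysis: stack each training image's graylevels with its hand-labeled XY control points, learn a shared basis, then recover control points for a new image from its graylevels alone. The sketch below assumes exactly that and is not necessarily the estimator used in the published eigenpoints method; `train_eigenpoints` and `map_to_points` are hypothetical names.

```python
import numpy as np


def train_eigenpoints(images, points, n_components):
    """images: (n, d_img) flattened graylevel patches; points: (n, d_pts) flattened XY coordinates."""
    joint = np.hstack([images, points])
    mean = joint.mean(axis=0)
    # Rows of vt are principal directions of the joint graylevel + XY space.
    _, _, vt = np.linalg.svd(joint - mean, full_matrices=False)
    basis = vt[:n_components]
    d_img = images.shape[1]
    return mean, basis[:, :d_img], basis[:, d_img:]


def map_to_points(image, mean, basis_img, basis_pts):
    """Estimate control points for one flattened image using the coupled basis."""
    d_img = basis_img.shape[1]
    # Least-squares coefficients that explain the graylevel part of the vector.
    coeffs, *_ = np.linalg.lstsq(basis_img.T, image - mean[:d_img], rcond=None)
    # The same coefficients, applied to the XY part of the basis, give the points.
    return mean[d_img:] + coeffs @ basis_pts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(50, 2))            # synthetic mouth "shape" parameters
    to_img, to_pts = rng.normal(size=(2, 10)), rng.normal(size=(2, 4))
    mean, b_img, b_pts = train_eigenpoints(latent @ to_img, latent @ to_pts, 2)
    z = rng.normal(size=2)
    print(np.allclose(map_to_points(z @ to_img, mean, b_img, b_pts), z @ to_pts))
```

The synthetic data here is linear by construction, so recovery is exact; real mouth images would only be approximated.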


Synthesis - Overview -

background face

Synthesis:

• Transcribe

• Find Lip Clips

• Stitch Together

/J/ /EH/ /L/ /IY/

Matching:

/T/ /AA/ /AA/

Matching: Co-Articulation

/T/ /AA/ /AA/

?

/UW-T-UW/

match /AA-T-AA/

Co-Articulation: Tri-Phones

/AA-S-AA/

/AA-T-AA/

/UW-T-UW/

….

More than 20,000 tri-phones in English

Viseme-based Perceptual Match

Owens (1985) Confusion Matrix (rows and columns: P, B, S, T, K, …)

11 Consonant Clusters:

- CH, JH, SH, ZH
- K, G, N, L
- T, D, S, Z
- P, B, M
- F, V
- TH, DH

McGurk Effect -- Baldy by Cohen & Massaro



Matching: Viseme-Distance

/T/ /AA/ /AA/

correct phone, wrong context: /UW-T-UW/

correct viseme, correct context: /AA-S-AA/

approximate match /AA-S-AA/
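The preference for /AA-S-AA/ over /UW-T-UW/ can be reproduced with a small categorical distance. The sketch below uses only the six consonant clusters listed earlier in the deck; the cost values (0 for the same phone, 0.5 for the same viseme cluster, 1 otherwise) and the treatment of vowels as matching only themselves are assumptions made for illustration, not values from the slides.

```python
# Consonant clusters as listed on the slide (six of the eleven are shown there).
VISEME_CLUSTERS = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]


def phone_cost(a, b, same_cluster_cost=0.5):
    """0 for identical phones, a reduced cost inside one viseme cluster, 1 otherwise (assumed values)."""
    if a == b:
        return 0.0
    if any(a in cluster and b in cluster for cluster in VISEME_CLUSTERS):
        return same_cluster_cost
    return 1.0


def triphone_cost(target, candidate):
    """Sum of per-position phone costs between two triphones."""
    return sum(phone_cost(t, c) for t, c in zip(target, candidate))


def best_match(target, candidates):
    """Pick the candidate triphone with the lowest viseme-based cost."""
    return min(candidates, key=lambda c: triphone_cost(target, c))


if __name__ == "__main__":
    target = ("AA", "T", "AA")
    library = [("UW", "T", "UW"), ("AA", "S", "AA")]
    print(best_match(target, library))
```

With these assumed costs the candidate with the right context and a visually similar center phone wins over the one with the exact phone in the wrong context.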

Matching: Overlapping Triphones

Shape Distance

Matching: Trade-Offs

/T/ /AA/ /AA/ /P/ /IY/

Shape Distance

N-Viseme Distance

Rate of Speech Distance


Matching: N-Best Dynamic Programming

Error = V(t) + R(t) + S(t-1,t)

t

N-best
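A minimal sketch of the dynamic program, reading V, R and S as the N-viseme, rate-of-speech and shape distances named on the trade-offs slide (an inference from slide order, not stated explicitly). It assumes each target triphone t has a short list of candidate clips, that V and R are per-candidate costs, and that S scores how well consecutive candidates join. The name `select_clips` and the array layout are hypothetical.

```python
import numpy as np


def select_clips(V, R, S):
    """Choose one candidate per time step minimizing the sum of V(t) + R(t) + S(t-1, t).

    V, R: lists of length T; entry t is an array of costs, one per candidate at step t.
    S: list of length T-1; entry t-1 is an array (n_prev, n_cur) of join costs.
    Returns (indices of the chosen candidates, total error).
    """
    cost = np.asarray(V[0], dtype=float) + np.asarray(R[0], dtype=float)
    back = []
    for t in range(1, len(V)):
        # total[i, j]: best error so far if step t-1 uses candidate i and step t uses j
        total = cost[:, None] + np.asarray(S[t - 1], dtype=float)
        back.append(np.argmin(total, axis=0))
        cost = total.min(axis=0) + np.asarray(V[t], dtype=float) + np.asarray(R[t], dtype=float)
    path = [int(np.argmin(cost))]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    path.reverse()
    return path, float(cost.min())


if __name__ == "__main__":
    V = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    R = [[0.0, 0.0]] * 3
    S = [[[0.0, 5.0], [5.0, 0.0]]] * 2          # switching source clips is expensive
    print(select_clips(V, R, S))
```

In the example the per-step optimum would alternate between candidates, but the join cost makes staying with one source clip cheaper overall, which is the trade-off the error function expresses.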

Stitching

Morphing

Affine-Warp + Beier-Neely
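The core of Beier-Neely morphing is a coordinate mapping relative to a control line: a destination pixel is described by its position along and its distance from a line, and the same description is replayed against the corresponding source line. The sketch below shows that mapping for a single line pair only; a full morph would blend many line pairs with distance-based weights and resample the image, which is omitted here. The function names are hypothetical.

```python
import numpy as np


def perp(v):
    """Rotate a 2D vector by 90 degrees counter-clockwise."""
    return np.array([-v[1], v[0]])


def beier_neely_map(x, p, q, p_src, q_src):
    """Map destination point x, given destination line p->q, to the source image with line p_src->q_src."""
    x, p, q, p_src, q_src = (np.asarray(a, dtype=float) for a in (x, p, q, p_src, q_src))
    d, d_src = q - p, q_src - p_src
    u = np.dot(x - p, d) / np.dot(d, d)             # fractional position along the line
    v = np.dot(x - p, perp(d)) / np.linalg.norm(d)  # signed pixel distance from the line
    return p_src + u * d_src + v * perp(d_src) / np.linalg.norm(d_src)


if __name__ == "__main__":
    # A control line rotated by 90 degrees rotates the mapped point with it.
    print(beier_neely_map([0.5, 0.2], [0, 0], [1, 0], [0, 0], [0, 1]))
```

In a lip-stitching setting the control lines could follow the mouth and chin contours, so that pixels near those contours move with them.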

Simple Lighting Correction

Alpha Blending

[Figure: Intensity plotted against X, steps 1.) and 2.)]
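Alpha blending itself is a per-pixel weighted average of the inserted mouth image and the background face. The sketch below assumes both images are already aligned and the same size, and that the alpha mask is supplied; how the original system shaped its mask or corrected lighting is not shown on the slide. `alpha_blend` is a hypothetical name.

```python
import numpy as np


def alpha_blend(foreground, background, alpha):
    """Blend two aligned images: alpha = 1 keeps the foreground, alpha = 0 keeps the background."""
    fg = np.asarray(foreground, dtype=float)
    bg = np.asarray(background, dtype=float)
    a = np.clip(np.asarray(alpha, dtype=float), 0.0, 1.0)
    return a * fg + (1.0 - a) * bg


if __name__ == "__main__":
    mouth = np.full((1, 5), 200.0)                # one row of bright mouth pixels
    face = np.full((1, 5), 100.0)                 # one row of darker background pixels
    ramp = np.linspace(0.0, 1.0, 5)[None, :]      # mask fading in from left to right
    print(alpha_blend(mouth, face, ramp))
```

A mask that ramps smoothly toward zero at the edge of the mouth region avoids a visible seam where the two images meet.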


Video Rewrite Results

JFK - Video Model

2 minutes data

Ellen - Video Model

8 minutes data

Contributions

• Data-driven lip animation

• Automatic, using vision and speech recognition

• Photo-realistic: implicitly captures specific appearance + dynamics


Video Rewrite

Thanks !

Acknowledgments: S. Ahmad, M. Bajura, F. Crow, T. Darrell, M. Davis, G. Gordon, K. Force, B. Fuson, B. Lassiter, J. Lewis, K. Rahardja, S. Snibbe, C. Sequine, E. Tauber, B. Verplank, S. White, J. Woodfill

John F. Kennedy

1994: Scott et al (JPL + Graphco Technologies)

/o/

/n/

/e/




Matching Video-Snippets with Context

/AA-S-AA/

/AA-T-AA/

/UW-T-UW/

….

“Video Model”

N-phone context

/T/ /AA/ /UW/ /S/

2000: Cosatto, Graf, AT&T Research





Rewrite Techniques -- Future --

Model Data

Video Rewrite
