Video Rewrite: Driving Visual Speech with Audio
1
Video Rewrite: Driving Visual Speech with Audio
Christoph Bregler
Michele Covell
Malcolm Slaney
Interval Research Corporation
2
Goal: Photo-realistic Talking Face
Handcoded 3D Model
OR
Video Rewrite
2
Facial Animation History:
• Parke (1972)
• Cohen & Massaro, Benoit et al. (1993)
• Waters & Terzopoulos (1990), DEC-Face
• Lewis (1991)
• Litwinowicz & Williams (1994)
• Chen, Graf, Petajan, et al. (1995)
• Scott et al. (1994)
• Ezzat & Poggio (1997)
• Pighin et al. + Guenter et al. (1998)
• Brand (1999)
• Cosatto, Graf (2000)
3
Video Rewrite: Overview
Analysis
/D/ /IY/ /P/ /AH/
Synthesis
5
Annotation
• Phonetic
• Head Pose
• Mouth Shape
/D/ /OH/ /N/ /AH/
6
Phonetic Annotation
HMM Labels: /D/ /IY/ /P/ /AH/
Triphones: /D-IY-P/ /IY-P-AH/
• Acoustic Front-End: RASTA-PLP (Channel Invariant)
• HMM Models / Gaussian Mixture Models (HTK)
• Phoneme Set: 56 categories (CMU)
• Triphone models trained on TIMIT
• Annotation using Forced-Viterbi
(and CMU pronunciation dictionary)
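The forced-Viterbi step above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the per-frame log-likelihoods are taken as given (in the talk they would come from the HMM / Gaussian mixture models), each phoneme of the known transcript is treated as a single left-to-right state, and the function name `forced_align` and the array layout are hypothetical.

```python
import numpy as np

def forced_align(loglik):
    """Align T frames to a known sequence of K phonemes.

    loglik[t, k] is the log-likelihood of frame t under the k-th phoneme
    of the transcript. Each phoneme is one left-to-right state, which is
    a simplification of real multi-state triphone HMMs.
    Returns K (start, end) frame ranges, end exclusive.
    """
    T, K = loglik.shape
    if T < K:
        raise ValueError("need at least one frame per phoneme")
    score = np.full((T, K), -np.inf)
    came_from_prev = np.zeros((T, K), dtype=bool)
    score[0, 0] = loglik[0, 0]  # the path must start in the first phoneme
    for t in range(1, T):
        for k in range(K):
            stay = score[t - 1, k]
            advance = score[t - 1, k - 1] if k > 0 else -np.inf
            if advance > stay:
                score[t, k] = advance + loglik[t, k]
                came_from_prev[t, k] = True
            else:
                score[t, k] = stay + loglik[t, k]
    # Backtrack from the last frame, which must lie in the last phoneme.
    k = K - 1
    boundaries = [T]
    for t in range(T - 1, 0, -1):
        if came_from_prev[t, k]:
            boundaries.append(t)  # phoneme k starts at frame t
            k -= 1
    boundaries.append(0)
    boundaries.reverse()
    return list(zip(boundaries[:-1], boundaries[1:]))
```

The returned frame ranges are what a phonetic annotation of the footage amounts to: each phoneme label paired with the video frames it spans.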
7
Head Pose Annotation
Match planar template
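As a rough illustration of what a planar pose estimate looks like numerically, the sketch below fits a 2D affine transform by least squares. Note the substitution: it works from a handful of corresponding points, which is a stand-in chosen for brevity, not the template matching named on the slide. The helper names `fit_affine` and `apply_affine` are hypothetical.

```python
import numpy as np

def fit_affine(template_pts, image_pts):
    """Least-squares 2x3 affine map A such that image_pt ~ A @ [x, y, 1]."""
    src = np.asarray(template_pts, dtype=float)
    dst = np.asarray(image_pts, dtype=float)
    ones = np.ones((src.shape[0], 1))
    design = np.hstack([src, ones])                 # N x 3
    coeffs, *_ = np.linalg.lstsq(design, dst, rcond=None)  # 3 x 2
    return coeffs.T                                 # 2 x 3

def apply_affine(A, pts):
    """Map N x 2 points through the 2x3 affine transform A."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]
```

At least three non-collinear correspondences are needed for the fit to be well determined.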
8
Mouth / Chin Annotation
Eigenpoints
8
Eigenpoints - Training -
Gray-level + XY control points
8
Eigenpoints - Mapping -
Gray-level + XY control point space
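The training and mapping steps can be sketched with a joint principal-component model. This is a simplified illustration, not the published Eigenpoints estimator: it assumes each training example is a flattened gray-level image paired with a flattened vector of XY control points, learns a shared linear subspace, and then maps a new image into control-point space. The function names and the dictionary layout are hypothetical.

```python
import numpy as np

def train_coupled_model(images, points, n_components):
    """images: N x P gray-level vectors; points: N x 2C control-point coordinates."""
    images = np.asarray(images, dtype=float)
    points = np.asarray(points, dtype=float)
    img_mean = images.mean(axis=0)
    pts_mean = points.mean(axis=0)
    joint = np.hstack([images - img_mean, points - pts_mean])
    # Principal directions of the joint (gray-level, control-point) data.
    _, _, vt = np.linalg.svd(joint, full_matrices=False)
    basis = vt[:n_components]
    n_pixels = images.shape[1]
    return {
        "img_mean": img_mean,
        "pts_mean": pts_mean,
        "img_basis": basis[:, :n_pixels],   # image half of each direction
        "pts_basis": basis[:, n_pixels:],   # control-point half
    }

def estimate_points(model, image):
    """Map a new gray-level vector to estimated control points."""
    centered = np.asarray(image, dtype=float) - model["img_mean"]
    # Subspace coefficients that best explain the image half.
    coeffs, *_ = np.linalg.lstsq(model["img_basis"].T, centered, rcond=None)
    return model["pts_mean"] + coeffs @ model["pts_basis"]
```

The same coefficients drive both halves of the basis, which is what couples appearance to control-point location in this sketch.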
9
Video Rewrite: Overview
Analysis
/D/ /IY/ /P/ /AH/
Synthesis
11
Synthesis - Overview -
background face
12
Synthesis:
• Transcribe
• Find Lip Clips
• Stitch Together
/J/ /EH/ /L/ /IY/
13
Matching:
/T/ /AA/ /AA/
14
Matching: Co-Articulation
/T/ /AA/ /AA/
?
/UW-T-UW/
15
Matching: Co-Articulation
/UW-T-UW/
/T/ /AA/ /AA/
match: /AA-T-AA/
16
Co-Articulation: Triphones
/AA-S-AA/
/AA-T-AA/
/UW-T-UW/
…
More than 20,000 triphones in English
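For concreteness, a triphone sequence is a three-phoneme window slid over the phoneme labels, so neighbouring triphones overlap by two phonemes. A minimal sketch follows; the helper names `to_triphones` and `format_triphone` are hypothetical.

```python
def to_triphones(phonemes):
    """Slide a three-phoneme window over a phoneme sequence."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

def format_triphone(tri):
    """Render a triphone in the slash notation used on the slides."""
    return "/" + "-".join(tri) + "/"

labels = ["D", "IY", "P", "AH"]
print([format_triphone(t) for t in to_triphones(labels)])
# prints ['/D-IY-P/', '/IY-P-AH/']
```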
16
Viseme-based Perceptual Match
Owens (1985) Confusion Matrix (rows and columns: P, B, S, T, K, …)
11 Consonant Clusters:
- CH, JH, SH, ZH
- K, G, N, L
- T, D, S, Z
- P, B, M
- F, V
- TH, DH
McGurk Effect -- Baldy by Cohen & Massaro
17
Matching: Viseme-Distance
/T/ /AA/ /AA/
correct phone, wrong context: /UW-T-UW/
correct viseme, correct context: /AA-S-AA/
18
Matching: Viseme-Distance
/UW-T-UW/
/T/ /AA/ /AA/
approximate match: /AA-S-AA/
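A viseme-aware triphone distance along these lines might look like the sketch below. It uses only the consonant clusters listed on the earlier slide; the cost values, the context weight, and the rule that unlisted phonemes form their own class are assumptions made for illustration, not figures from the talk.

```python
# Consonant clusters as listed on the slide (only these six are given).
VISEME_CLUSTERS = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]

def viseme_class(phoneme):
    """Return the cluster index, or the phoneme itself if it is unlisted."""
    for idx, cluster in enumerate(VISEME_CLUSTERS):
        if phoneme in cluster:
            return idx
    return phoneme  # assumption: unlisted phonemes are their own class

def phoneme_distance(a, b, same_viseme_cost=0.5):
    """0 for identical phonemes, a reduced cost within a viseme, else 1."""
    if a == b:
        return 0.0
    if viseme_class(a) == viseme_class(b):
        return same_viseme_cost
    return 1.0

def triphone_distance(target, candidate, context_weight=0.5):
    """Centre-phoneme distance plus weighted left and right context distances."""
    left = phoneme_distance(target[0], candidate[0])
    center = phoneme_distance(target[1], candidate[1])
    right = phoneme_distance(target[2], candidate[2])
    return center + context_weight * (left + right)
```

With these illustrative weights, the target /AA-T-AA/ is closer to /AA-S-AA/ than to /UW-T-UW/, which mirrors the approximate match shown on the slide.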
18
Matching: Overlapping Triphones
Shape Distance
18
Matching: Trade-Offs
/T/ /AA/ /AA/ /P/ /IY/
Shape Distance
N-Viseme Distance
Rate of Speech Distance
18
Matching: N-Best Dynamic Programming
Error = V(t) + R(t) + S(t-1,t)
[Figure: N-best candidates at each step t]
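The error above has the form of a chain cost: per-step terms V(t) and R(t) plus a term S(t-1,t) linking neighbouring choices. Read against the preceding slide, these are presumably the viseme, rate-of-speech, and shape distances, though the transcript does not say so explicitly. A minimal dynamic-programming sketch under that reading follows; the array layout and the name `select_clips` are hypothetical.

```python
import numpy as np

def select_clips(unary, pairwise):
    """Pick one of N candidate clips per step to minimise the total error.

    unary[t, i]       : V(t) + R(t) for candidate i at step t (T x N).
    pairwise[t-1, i, j]: S(t-1, t) between candidate i at step t-1 and
                         candidate j at step t ((T-1) x N x N).
    Returns (best candidate index per step, total error).
    """
    unary = np.asarray(unary, dtype=float)
    pairwise = np.asarray(pairwise, dtype=float)
    T, N = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise[t - 1]   # rows: previous i, cols: current j
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + unary[t]
    # Trace the cheapest path backwards from the final step.
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(cost.min())
```

Keeping only the N best candidates per step bounds the work at roughly T × N² comparisons.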
19
Stitching
21
Stitching
Morphing
21
Morphing
Affine Warp + Beier-Neely
21
Simple Lighting Correction
Alpha Blending
[Figure: two plots of intensity against X, labeled 1.) and 2.)]
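The two steps can be illustrated on gray-level arrays in the range 0 to 1. Both are sketched here under assumptions, since the slide gives no formulas: the lighting correction is taken to be a single gain that matches mean intensity, and the alpha mask is taken to ramp linearly from the patch border. All function names are hypothetical.

```python
import numpy as np

def match_brightness(patch, background_region):
    """Scale the patch so its mean intensity matches the background region."""
    gain = background_region.mean() / max(patch.mean(), 1e-8)
    return np.clip(patch * gain, 0.0, 1.0)

def feather_mask(height, width, border):
    """Alpha mask: 0 at the patch edge, rising linearly to 1 after `border` pixels."""
    ys = np.minimum(np.arange(height), np.arange(height)[::-1])
    xs = np.minimum(np.arange(width), np.arange(width)[::-1])
    dist = np.minimum(ys[:, None], xs[None, :])   # distance to nearest edge
    return np.clip(dist / float(border), 0.0, 1.0)

def alpha_blend(patch, background, alpha):
    """Per-pixel blend: alpha selects the patch, 1 - alpha the background."""
    return alpha * patch + (1.0 - alpha) * background
```

A typical use would correct the mouth patch first, then blend it into the background face with the feathered mask so that no hard seam is visible.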
22
Video Rewrite Results
JFK - Video Model: 2 minutes of data
Ellen - Video Model: 8 minutes of data
23
Contributions
• Data-driven lip animation
• Automatic, using vision and speech recognition
• Photo-realistic: implicitly captures specific appearance + dynamics
24
Video Rewrite
Thanks!
Acknowledgments: S. Ahmad, M. Bajura, F. Crow, T. Darrell, M. Davis, G. Gordon, K. Force, B. Fuson, B. Lassiter, J. Lewis, K. Rahardja, S. Snibbe, C. Sequine, E. Tauber, B. Verplank, S. White, J. Woodfill, John F. Kennedy
1994: Scott et al (JPL + Graphco Technologies)
/o/
/n/
/e/
Matching Video-Snippets with Context
/AA-S-AA/
/AA-T-AA/
/UW-T-UW/
…
“Video Model”
N-phone context
/T/ /AA/ /UW/ /S/
2000: Cosatto, Graf, AT&T Research
24
Rewrite Techniques -- Future --
Model Data
Video Rewrite