
MakeItTalk: Speaker-Aware Talking Head Animation

YANG ZHOU, University of Massachusetts Amherst
DINGZEYU LI, Adobe Research
XINTONG HAN, Huya Inc
EVANGELOS KALOGERAKIS, University of Massachusetts Amherst
ELI SHECHTMAN, Adobe Research
JOSE ECHEVARRIA, Adobe Research


Fig. 1. Given an unseen audio speech signal and a single unseen portrait image as input (left), our model generates speaker-aware talking head animations (right). Our method is able to create both non-photorealistic cartoon images (top) and photorealistic human face images (bottom). Please also see our video https://youtu.be/OU6Ctzhpc6s.

We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with a full range of motion, and can also animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to prior state-of-the-art.

1 INTRODUCTION

Animating expressive talking heads is essential for filmmaking, virtual avatars, video streaming, computer games and mixed realities. Despite recent advances, generating realistic facial animation with little or no manual labor remains an open challenge in computer graphics. Several key factors contribute to this challenge. First, the synchronization between speech and facial movement is hard to achieve manually: facial dynamics lie on a high-dimensional manifold, making it nontrivial to find a mapping from audio/speech [Edwards et al. 2016]. Secondly, different talking styles in multiple talking heads can convey different personalities and lead to better viewing experiences [Walker et al. 1997]. Last but not least, handling lip syncing and facial animation alone is not sufficient for the perception of realism of talking heads. The entire facial expression, considering the correlation between all facial elements and head pose, also plays an important role [Faigin 2012; Greenwood et al. 2018]. These, however, are less constrained by the audio and thus harder to estimate.

In this paper, we propose a new method based on a deep architecture, called MakeItTalk, that addresses the above challenges. Our method generates talking heads from a single facial image and audio as the only input. At test time, the method is able to generalize to new audio clips and portrait images not seen during training. The key insight of our approach is to disentangle the content and speaker in the input audio signals, and drive the animation from the resulting disentangled representations. The content is used for robust synchronization of lips and nearby facial regions. The speaker information is used to capture the rest of the facial expressions and head motion dynamics that are important for generating expressive talking head animation. We demonstrate that this disentanglement leads to significantly more plausible and believable head animations.

Another key component of our method is the prediction of facial landmarks as an intermediate representation incorporating speaker-aware dynamics. This is in contrast to previous approaches to talking face generation that attempt to directly generate raw pixels or points from audio. Leveraging facial landmarks as the intermediate representation between audio and visual animation has several advantages. First, based on our disentangled representations, our model learns to generate landmarks that capture subtle, speaker-dependent dynamics, sidestepping the learning of low-level pixel appearance that tends to miss those. Second, the degrees of freedom (DoFs) for landmarks are on the order of tens (68 in our implementation), as opposed to the millions of pixels in raw video generation methods. As a result, our learned model is also compact, making it possible to train it from short video segments. Last but not least, the landmarks can easily be used to drive a wide range of animation content, including photorealistic videos, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures. Specifically, in the case of photorealistic animation, we found that an image-to-image translation framework [Isola et al. 2017; Ronneberger et al. 2015] converting landmarks to realistic image sequences can produce plausible videos. In the case of non-photorealistic outputs (e.g., cartoon images), we found that a Delaunay triangulation warping approach can work surprisingly well and preserve feature curves common in vector art.

Method                     | animation format | beyond lip animation | head pose | speaker-awareness | handles unseen images
Suwajanakorn et al. [2017] | image            | ✓                    | ✓         | ×                 | ×
Taylor et al. [2017]       | rigged model     | ×                    | ×         | ×                 | ✓
Karras et al. [2017]       | 3D mesh          | ✓                    | ×         | ×                 | ×
Zhou et al. [2018]         | rigged model     | ×                    | ×         | ×                 | ×
Zhou et al. [2019]         | image            | ✓                    | ×         | ×                 | ✓
Vougioukas et al. [2019]   | image            | ✓                    | ×         | ×                 | ×
Chen et al. [2019]         | image            | ✓                    | ×         | ×                 | ✓
Thies et al. [2019]        | image            | ✓                    | ×         | ✓                 | ✓
Ours                       | image            | ✓                    | ✓         | ✓                 | ✓
Table 1. A comparison of the most closely related prior work across key features.

In summary, given an unseen audio signal and a single unseen portrait image as input, our method generates expressive talking head animations. We highlight the following contributions:

• We introduce a new deep-learning based architecture to predict facial landmarks, capturing lips, jaw, eyebrows, nose, and head pose, from only speech signals.

• We generate animations incorporating facial expressions and head motion dynamics based on disentangled speech content and speaker representations.

• We present two image synthesis methods that work for vectorized cartoon images and photorealistic natural human face images. Our methods can handle new faces and cartoon characters not observed during training.

• We propose a set of quantitative metrics and user studies for numerical evaluation of head animation methods.

2 RELATED WORK

In computer graphics, we have a long history of cross-modal synthesis. More than two decades ago, Brand [1999] pioneered Voice Puppetry, which generates full facial animation from an audio track. Music-driven dance generation from Shiratori et al. [2006] matched the rhythm and intensity of the input melody to full-body dance motions. Recently, Ginosar et al. [2019] predicted conversational gestures and skeleton movements from speech signals. We focus on audio-driven facial animation, which can supplement other body and gesture prediction methods. In the following paragraphs, we overview prior work on facial landmark synthesis, facial animation, and video generation. Table 1 summarizes the most important feature differences of the methods that are most related to ours.

Audio-driven facial landmark synthesis. Eskimez et al. [2018, 2019] generated synchronized facial landmarks with robust noise resilience using deep neural networks. Later, Chen et al. [2019] trained decoupled blocks to obtain landmarks first and then generate rasterized videos. Attention masks are used to focus on the parts of the face that change the most, especially the lips. Most previous audio-to-face-animation work focused on matching speech content and left out style/identity information, since the identity is usually bypassed due to mode collapse or averaging during training. In contrast, our approach disentangles audio content and speaker information, and drives landmarks capturing speaker-dependent dynamics.

Lip-sync facial animation. With the increasing power of GPUs, we have seen prolific progress on end-to-end learning from audio to video frames, such as predicting lip movement [Chen et al. 2018; Taylor et al. 2017], generating full faces with GANs [Song et al. 2019; Vougioukas et al. 2019] or encoder-decoder CNNs [Chung et al. 2017], recognizing visemes [Zhou et al. 2018], and estimating blendshape parameters [Pham et al. 2018]. However, the above methods do not capture speaker identity or style. As a result, if the same sentence is spoken by two different voices, they will tend to generate the same facial animation, lacking the dynamics required to make it more expressive and realistic.

“Style”-aware facial head animation. Suwajanakorn et al. [2017] used a re-timing dynamic programming method to reproduce speaker motion dynamics, but it was specific to a single subject (Obama). Cudeiro et al. [2019] attempt to model speaker style in a latent representation. Concurrently, Thies et al. [2019] encode personal style in static blendshape bases. Both methods, however, focus on lower facial animation, especially lips, and do not predict head pose. More similar to ours, Zhou et al. [2019] learned a joint audio-visual representation to disentangle the identity and content from the image domain. However, their identity information primarily focuses on static facial appearance and not the speaker dynamics. As we demonstrate in §6.3, speaker awareness encompasses many aspects beyond mere static appearance. For example, the action units which describe the correlation between different facial elements are unique to each individual [Agarwal et al. 2019; Ekman and Friesen 1978]. Head movement is also an important factor for speaker-aware animations. Our method addresses speaker identity by jointly learning the static appearance and head motion dynamics, to deliver faithfully animated talking heads in terms of facial action units and head poses.

Evaluation metrics. Quantitatively evaluating the learned identity/style is crucial for validation; at the same time, it is nontrivial to set up an appropriate benchmark. Much prior work resorted to subjective user studies [Cudeiro et al. 2019; Karras et al. 2017]. Agarwal et al. [2019] visualized the style distribution via action units. Existing quantitative metrics primarily focus on pixel-level artifacts, since a majority of the network capacity is used to learn pixel generation rather than high-level dynamics [Chen et al. 2019]. Action units proved to be an effective measure of expression in the context of GAN-based approaches [Pumarola et al. 2018]. We propose a new collection of action unit metrics to evaluate the high-level dynamics that matter for facial expression and head motion.


Fig. 2. Pipeline of our method (“MakeItTalk”). Given an input audio signal along with a single portrait image (cartoon or real photo), our method animates the portrait driven by disentangled content and speaker embeddings. [Diagram: the input audio passes through voice conversion to produce a content embedding and a speaker identity embedding (reduced by an MLP); the speech content animation branch (LSTM + MLP) predicts content displacements, and the speaker-aware animation branch (LSTM + self-attention encoder + MLP) predicts speaker-aware displacements; added to the facial landmarks of the single portrait image, the predicted landmarks drive either face warping (cartoon image) or image-to-image translation (photorealistic image) to produce the talking head animation.]

Image-to-image translation. Image-to-image translation is a very common approach to modern talking face synthesis and editing. Face2Face and VDub are among the early explorers to demonstrate robust appearance transfer between two talking-head videos [Garrido et al. 2015; Thies et al. 2016]. Later, adversarial training was adopted to improve the quality of the transferred results. For example, Kim et al. [2019] used cycle-consistency loss to transfer styles and showed promising results on one-to-one transfers. Zakharov et al. [2019] developed a few-shot learning scheme that leveraged landmarks to generate realistic faces. Based on these prior works, we also employ an image-to-image translation network to generate realistic talking head animations. Unlike Zakharov et al. [2019], our model has better generalization to unseen images without the need of fine-tuning, since we already take the identity into account when predicting the landmark dynamics. Additionally, we are able to generate non-photorealistic cartoon images through a face morphing module.

Disentangled learning. Disentanglement of content and style in audio has been widely studied in the voice conversion community. Without diving into its long history (see [Stylianou 2009] for a detailed survey), here we only discuss recent methods that fit into our deep learning pipeline. Wan et al. [2018] developed Resemblyzer as a speaker identity embedding for verification purposes across different languages. Qian et al. [2019] proposed AutoVC, a few-shot voice conversion method to separate the audio into the speech content and the identity information. As a baseline, we use AutoVC for voice content extraction and Resemblyzer for identity embedding extraction. We introduce the idea of voice conversion to audio-driven animation and demonstrate the advantages of speaker-aware talking head generation.

3 METHOD

Overview. As summarized in Figure 2, given an audio clip and a single facial image, our architecture, which we call “MakeItTalk”, generates a speaker-aware talking head animation synchronized with the audio. In the training phase, we use an off-the-shelf face landmark detector to preprocess the input videos to extract the landmarks [Bulat and Tzimiropoulos 2017]. A baseline model to animate the speech content can be trained from the input audio and the extracted landmarks directly. To achieve high-fidelity dynamics, we investigate the prediction of landmarks from disentangled content and speaker embeddings of the input audio signal.

Specifically, we use a voice conversion neural network to disentangle the speech content and identity information [Qian et al. 2019]. The content is speaker-agnostic and captures the general motion of lips and nearby regions (Figure 2, Speech Content Animation, §3.1). The identity of the speaker determines the specifics of the motions and the rest of the talking head dynamics (Figure 2, Speaker-Aware Animation, §3.2). For example, no matter who speaks the word ‘Ha!’, the lips are expected to be open, which is speaker-agnostic and only dictated by the content. As for the exact shape and size of the opening, as well as the motion of nose, eyes and head, these will depend on who speaks the word, i.e., identity. Conditioned on the content and speaker identity information, our deep model outputs a sequence of predicted landmarks for the given audio.

To generate rasterized images, we developed two algorithms for landmark-to-image synthesis (§3.3). For non-photorealistic images like paintings or vector art (Fig. 9), we use a simple image warping method based on Delaunay triangulation (Figure 2, Face Warp). For photorealistic ones (Fig. 8), we devised an image-to-image translation network (similar to pix2pix [Isola et al. 2017]) to animate the given natural human face image with the underlying landmark predictions (Figure 2, Image2Image Translation). Combining all the image frames and input audio together gives us the final talking head animations. In the following sections, we describe each module of our architecture.

3.1 Speech Content Animation

To extract the speaker-agnostic content representation of the audio, we use the AutoVC encoder from Qian et al. [2019]. The AutoVC network utilizes an LSTM-based encoder that compresses the input audio into a compact representation (bottleneck) trained to abandon the original speaker identity but preserve the content. In our case, we extract a content embedding A ∈ R^{T×D} from the AutoVC network, where T is the total number of input audio frames, and D is the content dimension.

Fig. 3. Landmark prediction for different speaker identities. Left: static facial landmarks from a given portrait image. Right-top: predicted landmark sequence from a speaker who tends to be conservative in terms of head motion. Right-bottom: predicted landmark sequence from another speaker who tends to be more active.

The goal of the content animation component is to map the content embedding A to facial landmark positions with a neutral style. In our experiments, we found that recurrent networks are much better suited for the task than feedforward networks, since they are designed to capture such sequential dependencies between the audio content and landmarks. We experimented with vanilla RNNs and LSTMs [Graves and Jaitly 2014], and found that LSTMs offered better performance. Specifically, at each frame t, the LSTM module takes as input the audio content A within a window [t → t + τ]. We set τ = 18 frames (a window size of 0.3 s in our experiments). To animate any input static landmarks q ∈ R^{68×3}, extracted using a landmark detector, the output of the LSTM layers is fed into a Multi-Layer Perceptron (MLP) that predicts displacements Δq_t, which put the input landmarks in motion at each frame.

To summarize, the speech content animation module models sequential dependencies to output landmarks based on the following transformations:

c_t = LSTM_c(A_{t→t+τ}; w_{lstm,c}),   (1)
Δq_t = MLP_c(c_t, q; w_{mlp,c}),   (2)
p_t = q + Δq_t,   (3)

where {w_{lstm,c}, w_{mlp,c}} are learnable parameters for the LSTM and MLP networks respectively. The LSTM has three layers of units, each having an internal hidden state vector of size 256. The decoder MLP network has three layers with internal hidden state vector sizes of 512, 256 and 204 (68 × 3), respectively.
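To make the shapes and data flow concrete, here is a minimal PyTorch sketch of this branch, not the authors' released code: the layer sizes follow the text above (a 3-layer LSTM with hidden size 256 and an MLP with sizes 512, 256 and 204), while the content dimension D and the exact way the window is consumed are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ContentAnimation(nn.Module):
    """Sketch of the speech content branch (Eqs. 1-3): an audio content
    window is encoded by LSTM_c, and MLP_c maps the encoding plus the
    static landmarks q to per-frame displacements."""
    def __init__(self, content_dim=80, hidden=256, n_landmarks=68):
        super().__init__()
        self.lstm = nn.LSTM(content_dim, hidden, num_layers=3, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + n_landmarks * 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 3))

    def forward(self, a_win, q):
        # a_win: (B, tau, D) content embedding window A_{t -> t+tau}
        # q:     (B, 68, 3) static landmarks of the input face
        _, (h, _) = self.lstm(a_win)
        c_t = h[-1]                                        # Eq. (1)
        dq = self.mlp(torch.cat([c_t, q.flatten(1)], 1))   # Eq. (2)
        return q + dq.view(-1, 68, 3)                      # Eq. (3)

# Example: one 18-frame window for a batch of 2 clips.
model = ContentAnimation()
p_t = model(torch.randn(2, 18, 80), torch.randn(2, 68, 3))  # (2, 68, 3)
```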

3.2 Speaker-Aware Animation

Matching just the lip motion to the audio content is not sufficient. The motion of the head and the subtle correlation between mouth and eyebrows are also crucial cues to generate realistic talking heads. For example, Figure 3 shows our speaker-aware predictions for two different speaker embeddings: one originates from a speaker whose head motion tends to be more static, and another that is more active. Our method successfully differentiates the head motion dynamics between these two speakers.

To achieve this, we extract the speaker identity embedding with a speaker verification model [Wan et al. 2018], which maximizes the embedding similarity among different utterances of the same speaker, and minimizes the similarity among different speakers. The original identity embedding vector s has a size of 256. We found that reducing its dimensionality from 256 to 128 via a single-layer MLP improved the generalization of facial animations, especially for speakers not observed during training. Given the extracted identity embedding s, we further modulate the per-frame landmarks p_t such that they reflect the speaker’s identity. More specifically, the landmarks are perturbed to match the head motion distribution and facial expression dynamics observed in speakers during training.

As shown in the bottom stream of Figure 2, we first use an LSTM to encode the content representation within time windows, which has the same network architecture and time window length as the LSTM used in the speech content animation module. We found, however, that it is better to have different learned parameters for this LSTM, such that the resulting representation c_t is more tailored for capturing head motion and facial expression dynamics:

c_t = LSTM_s(A_{t→t+τ}; w_{lstm,s}),   (4)

where w_{lstm,s} are trainable parameters. Then, the following model takes as input the speaker embedding s, the audio content representation c_t, and the initial static landmarks q to generate speaker-aware landmark displacements. Notably, producing coherent head motions and facial expressions requires capturing longer time-dependencies compared to the speech content animation module. While phonemes typically last for a few tens of milliseconds, head motions, e.g., a head swinging left-to-right, may last for one or a few seconds, several magnitudes longer. To capture such long and structured dependencies, we adopted a self-attention network [Devlin et al. 2018; Vaswani et al. 2017]. The self-attention layers compute an output expressed as a weighted combination of learned per-frame representations, i.e., transformations of audio content, concatenated with the speaker embedding and initial landmarks in our case. The weights assigned to each frame are computed by a compatibility function comparing all-pairs frame representations within a window. We set the window size to τ′ = 256 frames (4 s) in all experiments. The output from the last self-attention layer is fed into an MLP that predicts the final per-frame landmarks.

Mathematically, our speaker-aware animation models structural dependencies to perturb landmarks so that they capture head motion and personalized expressions, which can be formulated as follows:

h_t = Attn_s(c_{t→t+τ′}, s; w_{attn,s}),   (5)
Δp_t = MLP_s(h_t, q; w_{mlp,s}),   (6)
y_t = p_t + Δp_t,   (7)

where {w_{attn,s}, w_{mlp,s}} are trainable parameters of the self-attention encoder and MLP decoder, p_t is computed by Eq. (3), and y_t are the final per-frame landmarks capturing both speech content and speaker identity. In our implementation, the attention network follows the encoder block in Vaswani et al. [2017]. More details about its architecture are provided in the appendix.
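A minimal PyTorch sketch of how this branch could be wired is shown below; it is not the authors' implementation. We stand in for the Vaswani-style encoder with nn.TransformerEncoder, and the content dimension, number of attention layers/heads and the concatenation scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpeakerAwareAnimation(nn.Module):
    """Sketch of the speaker-aware branch (Eqs. 4-7): an LSTM encodes the
    audio content, a self-attention encoder over a long (4 s) window mixes
    it with the speaker embedding s and static landmarks q, and an MLP
    predicts per-frame perturbations of the content-only landmarks p_t."""
    def __init__(self, content_dim=80, hidden=256, spk_dim=128, n_landmarks=68):
        super().__init__()
        self.lstm = nn.LSTM(content_dim, hidden, num_layers=3, batch_first=True)
        d_model = hidden + spk_dim + n_landmarks * 3        # 256 + 128 + 204 = 588
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=2)
        self.mlp = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                                 nn.Linear(256, n_landmarks * 3))

    def forward(self, a_long, s, q, p):
        # a_long: (B, T', D) content window; s: (B, 128) speaker embedding
        # q: (B, 68, 3) static landmarks; p: (B, T', 68, 3) from Eq. (3)
        c, _ = self.lstm(a_long)                            # Eq. (4)
        T = c.shape[1]
        feat = torch.cat([c,
                          s.unsqueeze(1).expand(-1, T, -1),
                          q.flatten(1).unsqueeze(1).expand(-1, T, -1)], dim=-1)
        h = self.attn(feat)                                 # Eq. (5)
        dp = self.mlp(h).view(-1, T, 68, 3)                 # Eq. (6)
        return p + dp                                       # Eq. (7)
```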

3.3 Single-Image Animation

The last step of our model creates the final animation of the input portrait. Given an input image Q and the predicted landmarks {y_t} for each frame t, we produce a sequence of images {F_t} representing the facial animation. The input portrait might either depict a cartoon face or a photorealistic human face image. We use different implementations for each of these two types of portraits. In the next paragraphs, we explain the variant used for each type.

Cartoon Images (non-photorealistic). These images usually have sharp feature edges, e.g., from vector art or flat-shaded drawings.


Fig. 4. Cartoon image face warping through facial landmarks and Delaunay triangulation. Left: given cartoon image and its facial landmarks. Middle: Delaunay triangulation. Right: warped image guided by the displaced landmarks.

To preserve these sharp features, we propose a morphing-based method to animate them, avoiding pixel-level artifacts. From the input image, we extract the facial landmarks using [Yaniv et al. 2019]. We then run Delaunay triangulation on these landmarks to generate semantic triangles. By mapping the initial pixels as texture maps to the triangles, the subsequent animation process becomes straightforward: as long as the landmark topology remains the same, the texture on each triangle transfers naturally. An illustration is shown in Figure 4. One analogy is the vertex and fragment shaders in the GPU rendering pipeline: the textures are bound to each fragment at the very beginning, and from then on only the vertex shader changes the locations of these vertices (landmark positions). In practice, we implement GLSL-based C++ code that uses vertex/fragment shaders and runs in real time.
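The following CPU-side Python/OpenCV sketch illustrates the same idea for readers who do not want to touch GLSL; the function name and the per-triangle full-image warp are purely illustrative (a practical version would warp only each triangle's bounding box), and boundary handling is omitted.

```python
import numpy as np
import cv2
from scipy.spatial import Delaunay

def warp_cartoon(img, src_pts, dst_pts):
    """Triangulate the source landmarks once, then copy each triangle's
    texture to its displaced position with a per-triangle affine warp."""
    tris = Delaunay(src_pts).simplices        # fixed triangle topology
    out = np.zeros_like(img)
    for t in tris:
        src_t = np.float32(src_pts[t])        # (3, 2) source triangle
        dst_t = np.float32(dst_pts[t])        # (3, 2) displaced triangle
        M = cv2.getAffineTransform(src_t, dst_t)
        warped = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
        mask = np.zeros(img.shape[:2], np.uint8)
        cv2.fillConvexPoly(mask, np.int32(dst_t), 1)
        out[mask == 1] = warped[mask == 1]    # keep pixels inside the triangle
    return out
```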

Photorealistic Images. The goal here is to synthesize a sequence of frames given the input photo and the predicted animated landmarks from our model. Inspired by the landmark-based facial animation of Zakharov et al. [2019], we first create an image representation Y_t of the predicted landmarks y_t by connecting consecutive facial landmarks and rendering them as line segments of predefined color (Figure 2). The image Y_t is concatenated channel-wise with the input portrait image Q to form a 6-channel image of resolution 256 × 256. The image is passed to an encoder-decoder network that performs image translation to produce the image F_t per frame. Its architecture follows the generators proposed in Esser et al. [2018] and Han et al. [2019]. Specifically, the encoder employs 6 convolutional layers, where each layer contains one 2-strided convolution followed by two residual blocks, and produces a bottleneck, which is then decoded through symmetric upsampling decoders. Skip connections are utilized between symmetric layers of the encoder and decoder, as in U-net architectures [Ronneberger et al. 2015]. The generation proceeds frame by frame. Since the landmarks change smoothly over time, the output images formed as an interpolation of these landmarks exhibit temporal coherence. Examples of generated image sequences are shown in Figure 8.
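As a rough guide to this architecture, here is a PyTorch sketch of such a translator under our own assumptions: channel widths, ReLU activations and transposed-convolution upsampling are choices we made for illustration, while the 6-channel input, six 2-strided convolutions each followed by two residual blocks, and U-net style skip connections follow the description above.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class Landmark2Face(nn.Module):
    """Sketch: the rasterized landmark image Y_t is concatenated channel-wise
    with the portrait Q (6 channels), encoded by six 2-strided convolutions
    each followed by two residual blocks, and decoded with U-net skips."""
    def __init__(self, base=64, n_down=6):
        super().__init__()
        chs = [6] + [min(base * 2 ** i, 512) for i in range(n_down)]
        self.enc = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 4, stride=2, padding=1),
                          nn.ReLU(), ResBlock(chs[i + 1]), ResBlock(chs[i + 1]))
            for i in range(n_down)])
        self.dec = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(chs[i + 1] * (1 if i == n_down - 1 else 2),
                                             chs[i], 4, stride=2, padding=1),
                          nn.ReLU())
            for i in reversed(range(n_down))])
        self.to_rgb = nn.Conv2d(chs[0], 3, 3, padding=1)

    def forward(self, portrait, lmk_img):
        x = torch.cat([portrait, lmk_img], dim=1)   # (B, 6, 256, 256)
        skips = []
        for enc in self.enc:
            x = enc(x)
            skips.append(x)
        for i, dec in enumerate(self.dec):
            x = dec(x if i == 0 else torch.cat([x, skips[-1 - i]], dim=1))
        return torch.sigmoid(self.to_rgb(x))        # frame F_t in [0, 1]
```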

4 TRAINING

Voice Conversion Training. We follow the training setup described in [Qian et al. 2019], with the speaker embedding initialized by the pretrained model provided by Wan et al. [2018]. A training source speech from each speaker is processed through the content encoder. Then another utterance of the same source speaker is used to extract the speaker embedding, which is passed to the decoder along with the audio content embedding to reconstruct the original source speech. The content encoder, decoder, and MLP are trained to minimize the self-reconstruction error of the source speech spectrograms [Qian et al. 2019]. Training is performed on the VCTK corpus [Veaux et al. 2016], which is a speech dataset including utterances by 109 native English speakers with various accents.

Fig. 5. Graph Laplacian coordinates illustration. Left: 8 facial parts that contain subsets of landmarks. Right: zoom-in on the graph Laplacian vector and related neighboring landmark points.

4.1 Speech Content Animation Training

Dataset. To train a content-based animation model, we use an audio-visual dataset that provides landmarks and corresponding audio. To this end, we use the Obama Weekly Address dataset [Suwajanakorn et al. 2017] containing 6 hours of video featuring various speeches by Obama. Due to its high resolution and relatively consistent front-facing camera angle, we can obtain accurate facial landmark detection results using [Bulat and Tzimiropoulos 2017]. We also register the facial landmarks to a front-facing standard facial template [Beeler and Bradley 2014] using a best-estimated affine transformation [Segal et al. 2009]. This also factors out the speaker-dependent head pose motion, which we will address in §4.2. We note that one registered speaker is enough to train this module, since our goal here is to learn a mapping from audio content to facial landmarks.

Loss function. To learn the parameters {w_{lstm,c}, w_{mlp,c}} used in the LSTM and MLP, we minimize a loss function that evaluates (a) the distance between the predicted landmark positions p_t and the registered reference ones p̂_t, and (b) the distance between their respective graph Laplacian coordinates, which promotes correct placement of landmarks with respect to each other and preserves facial shape details [Sorkine 2006]. Specifically, our loss is:

L_c = Σ_{t=1}^{T} Σ_{i=1}^{N} ||p_{i,t} − p̂_{i,t}||_2^2 + λ_c Σ_{t=1}^{T} Σ_{i=1}^{N} ||L(p_{i,t}) − L(p̂_{i,t})||_2^2,   (8)

where i is the index for each individual landmark, and λ_c weighs the second term (λ_c = 1 in our implementation, tuned via hold-out validation). We use the following graph Laplacian L(p_t):

L(p_{i,t}) = p_{i,t} − (1 / |N(p_i)|) Σ_{p_j ∈ N(p_i)} p_{j,t},   (9)

where N(p_i) includes the landmark neighbors connected to p_i within a distinct facial part (Figure 5). We use 8 facial parts that contain subsets of landmarks predefined for the facial template.
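For concreteness, a small PyTorch sketch of Eqs. (8)-(9) is given below (not the authors' code); the per-part neighbor lists are assumed to be provided, and the sketch averages over batch, frames and landmarks instead of summing, which only rescales the loss.

```python
import torch

def laplacian(p, neighbors):
    """Graph Laplacian coordinates of Eq. (9).
    p: (..., 68, 3) landmarks; neighbors[i]: indices of the landmarks
    connected to landmark i within the same facial part."""
    cols = [p[..., i, :] - p[..., nbr, :].mean(dim=-2)
            for i, nbr in enumerate(neighbors)]
    return torch.stack(cols, dim=-2)

def content_loss(p_pred, p_ref, neighbors, lam_c=1.0):
    """Position + Laplacian loss of Eq. (8)."""
    pos = ((p_pred - p_ref) ** 2).sum(dim=-1).mean()
    lap = ((laplacian(p_pred, neighbors) - laplacian(p_ref, neighbors)) ** 2
           ).sum(dim=-1).mean()
    return pos + lam_c * lap
```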

4.2 Speaker-Aware Animation Training

Dataset. To learn the speaker-aware dynamics of head motion and facial expressions, we need an audio-visual dataset featuring a diverse set of speakers. We found that the VoxCeleb2 dataset is well-suited for our purpose, since it contains video segments from a variety of speakers [Chung et al. 2018]. VoxCeleb2 was originally designed for speaker verification. Since our goal is different, i.e., capturing speaker dynamics for talking head synthesis, we chose a subset of 67 speakers with a total of 1,232 video clips from VoxCeleb2. On average, we have around 5-10 minutes of video for each speaker.


The criterion for selection was accurate landmark detection in the videos, with manual verification. Speakers were selected based on Poisson disk sampling in the speaker representation space. We split this dataset 60% / 20% / 20% for training, hold-out validation and testing, respectively. In contrast to the content animation step, we do not register the landmarks to a front-facing template, since here we are interested in learning the overall head motion.

Adversarial network. Apart from capturing landmark positions, we also aim to match the speaker’s head motion and facial expression dynamics during training. To this end, we follow a GAN approach. Specifically, we create a discriminator network Attn_d which follows a structure similar to the self-attention generator network of §3.2. More details about its architecture are provided in the appendix. The goal of the discriminator is to find out whether the temporal dynamics of the speaker’s facial landmarks look “realistic” or fake. It takes as input the sequence of facial landmarks within the same window used in the generator, along with the audio content and the speaker’s embedding. It returns an output characterizing the “realism” r_t per frame t:

r_t = Attn_d(y_{t→t+τ′}, c_{t→t+τ′}, s; w_{attn,d}),   (10)

where w_{attn,d} are the parameters of the discriminator. We use the LSGAN loss function [Mao et al. 2017] to train the discriminator parameters, treating the training landmarks as “real” and the generated ones as “fake” for each frame:

L_gan = Σ_{t=1}^{T} (r̂_t − 1)² + r_t²,   (11)

where r̂_t denotes the discriminator output when the training landmarks ŷ_t are used as its input.

Loss function. To train the parameters w_{attn,s} of the self-attention generator network, we attempt to maximize the “realism” of the output landmarks, and also consider the distance to the training ones in terms of absolute position and Laplacian coordinates:

L_s = Σ_{t=1}^{T} Σ_{i=1}^{N} ||y_{i,t} − ŷ_{i,t}||_2^2 + λ_s Σ_{t=1}^{T} Σ_{i=1}^{N} ||L(y_{i,t}) − L(ŷ_{i,t})||_2^2 + μ_s Σ_{t=1}^{T} (r_t − 1)²,   (12)

where λ_s = 1 and μ_s = 0.001 are set through hold-out validation. We alternate training between the generator (minimizing L_s) and the discriminator (minimizing L_gan) to improve each other, as done in GAN approaches [Mao et al. 2017].
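The alternating optimization can be summarized with the following PyTorch-style sketch, which reuses the laplacian helper from §4.1; the module interfaces (gen, disc), argument names and batching are assumptions made for illustration rather than the authors' training code.

```python
def speaker_aware_step(gen, disc, opt_g, opt_d, a, s, q, p, y_ref,
                       neighbors, lam_s=1.0, mu_s=1e-3):
    """One alternating LSGAN step for the speaker-aware branch (Eqs. 10-12).
    a: audio content window, s: speaker embedding, q: static landmarks,
    p: content-only landmarks from Eq. (3), y_ref: reference landmarks."""
    # Discriminator: push real sequences towards 1 and generated ones towards 0.
    y_fake = gen(a, s, q, p).detach()
    loss_d = ((disc(y_ref, a, s) - 1) ** 2).mean() + (disc(y_fake, a, s) ** 2).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: position + Laplacian terms plus the realism term of Eq. (12).
    y = gen(a, s, q, p)
    pos = ((y - y_ref) ** 2).sum(-1).mean()
    lap = ((laplacian(y, neighbors) - laplacian(y_ref, neighbors)) ** 2).sum(-1).mean()
    adv = ((disc(y, a, s) - 1) ** 2).mean()
    loss_g = pos + lam_s * lap + mu_s * adv
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```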

4.3 Image-to-Image Translation Training

For the final animation step, we train the photorealistic image-to-image translation network; for non-photorealistic images we use a morphing approach without neural networks. The encoder/decoder pair used for image translation is trained on paired video frames from the VoxCeleb2 dev split, which contains 145,569 video clips of 5,994 speakers, and its high-quality splits according to the preprocessing in Siarohin et al. [2019]. In particular, based on a video of a talking person in the dataset, we randomly sample a frame pair: a source training frame Q_src and a target frame Q_trg of this person. The facial landmarks of the target face are extracted and rasterized into an RGB image Y_trg based on the approach described in §3.3. The encoder/decoder network takes as input the concatenation of Q_src and Y_trg and outputs a reconstructed face Q̂_trg. The loss function aims to minimize the L1 per-pixel distance and perceptual feature distance between the reconstructed face Q̂_trg and the training target face Q_trg, as in [Johnson et al. 2016]:

L_a = Σ_{(src,trg)} ||Q̂_trg − Q_trg||_1 + λ_a Σ_{(src,trg)} ||φ(Q̂_trg) − φ(Q_trg)||_1,

where λ_a = 1, and φ concatenates feature map activations from the pretrained VGG19 network [Simonyan and Zisserman 2014].
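A compact PyTorch sketch of this combined loss is shown below; the choice of VGG19 layers used for the perceptual term is our own assumption, as the paper does not specify it.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class TranslationLoss(nn.Module):
    """Sketch of the image translation loss: L1 pixel distance plus an L1
    perceptual distance on VGG19 feature maps (the layer indices below are
    our choice, not specified in the paper)."""
    def __init__(self, lam_a=1.0, layers=(3, 8, 17, 26)):
        super().__init__()
        # On older torchvision, use vgg19(pretrained=True) instead.
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        for prm in self.vgg.parameters():
            prm.requires_grad_(False)
        self.layers, self.lam_a = set(layers), lam_a

    def _feats(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, pred, target):
        pixel = (pred - target).abs().mean()
        perceptual = sum((fp - ft).abs().mean()
                         for fp, ft in zip(self._feats(pred), self._feats(target)))
        return pixel + self.lam_a * perceptual
```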

4.4 Implementation Details

All videos in our dataset are converted to 62.5 frames per second and audio waveforms are sampled at 16 kHz. We trained both the speech content animation and speaker-aware animation modules with the Adam optimizer using PyTorch. The learning rate was set to 10^-4, and weight decay to 10^-6. The speech content animation module contains 1.9M parameters and took 12 hours to train on an Nvidia 1080Ti GPU. The speaker-aware animation module has 3.8M parameters and took 30 hours to train on the same GPU.

The single-face animation module for generating realistic human faces was also trained with the Adam optimizer, a learning rate of 10^-4, and a batch size of 16. The network contains 30.7M parameters and was trained for 20 hours on 8 Nvidia 1080Ti GPUs.

5 RESULTS

With all the pieces in place, we can generate talking head videos from a single input image and any given audio file. Our full pipeline also enables full facial expression and head pose animation. Figure 6 shows a gallery of our generated animations for cartoon and photorealistic images. For best results, please watch our video https://youtu.be/OU6Ctzhpc6s.

5.1 Animating Photorealistic Images

We show how the predicted landmarks, driven by the audio content and speaker identity, are used to synthesize video frames by the networks discussed in §3.3. Figure 8 shows comparisons of synthesized videos featuring talking people. These generated videos capture head motion and facial expressions well.

Synthesizing realistic talking heads has numerous applications [Kim et al. 2019; Zhou et al. 2019]. One common application is dubbing for audio spoken in a different language. In Figure 7-a, the original video is in English with professionally dubbed audio in Spanish, so the facial expressions no longer match the dubbed audio. We can generate corresponding video frames to achieve audiovisual synchronization and preserve the specific talking style from the original video. This is validated in §6.3.

Another application is bandwidth-limited video conferencing. In scenarios where the visual frames cannot be delivered with high fidelity and frame rate, we can make use of the audio signal to drive the talking head video. The audio signal can be preserved under much lower bandwidth compared with its visual counterpart. On the other hand, visual facial expressions, especially lip motions, contribute heavily to understanding in communication [McGurk and MacDonald 1976]. Figure 7-b shows we can robustly synthesize talking heads with only the audio and an initial high-quality frame.

We also compare with results from state-of-the-art video generation methods [Chen et al. 2019; Vougioukas et al. 2019] in Figure 8. GT and our results are cropped for a better visualization of the lip region (see rows 2 and 5). Based on the results, we observe that in the videos generated by Chen et al. [2019] and Vougioukas et al. [2019], the face is cropped from the head (i.e., head pose is missing) and primarily the lip region is predicted. Vougioukas et al. [2019] overfits its dataset and does not work on the unseen images. Chen et al. [2019] lacks synchronization with the input audio, especially for side-facing portraits (see the left example with the red box). Compared to these methods, our method predicts facial expressions more accurately and also captures plausible head motion (see the right example with the red box).

Fig. 6. Predicted cartoon animations (left) and natural human face animations (right). Our method synthesizes not only facial expressions, but also different head poses.

5.2 Animating Non-Photorealistic Images

Figure 9 shows a gallery of our generated animations from challenging non-photorealistic images. One advantage of our method is that, despite having been trained only on human facial landmarks, it successfully generalizes to a large variety of stylized cartoon faces. This is because we learn relative positions (i.e., landmark displacements) instead of absolute positions. We demonstrate our method’s capability to animate artistic paintings, random sketches, 2D cartoon characters, Japanese manga, stylized caricatures and casual photos.

6 EVALUATION

We evaluated MakeItTalk and related methods quantitatively and with user studies. For testing we created a test split from the VoxCeleb2 subset, containing 268 video segments from 67 speakers. The speaker identities were observed during training; however, their test speech and video are different from the training ones.

Fig. 7. Applications. (a) Dubbing across different languages (source video and dubbed video). (b) Bandwidth-limited video conferencing (initial frame and generated frames).

Each video clip lasts 5 to 30 seconds. Landmarks were extracted from the test clips using [Bulat and Tzimiropoulos 2017] and their quality was manually verified. We call these “reference landmarks” and use them in the evaluation metrics explained below.

6.1 Evaluation Metrics

To evaluate how well the synthesized landmarks represent accurate lip movements, we use the following metrics (a Python sketch of the jaw-lip metrics follows the list below):

• Landmark distance for jaw-lips (D-LL): the average Euclidean distance between predicted facial landmark locations of the jaw and lips and the reference ones. The landmark positions are normalized according to the maximum width of the reference lips for each test video clip.

• Landmark velocity difference for jaw-lips (D-VL): the average Euclidean distance between reference landmark velocities of the jaw and lips and the predicted ones. Velocity is computed as the difference of landmark locations between consecutive frames. This metric captures differences in first-order jaw-lip dynamics.

• Difference in open mouth area (D-A): the average difference between the area of the predicted mouth shape and the reference one. It is expressed as a percentage of the maximum area of the reference mouth for each test video clip.

To evaluate how well the landmarks produced by our method and others reproduce overall head motion, facial expressions, and their dynamics, we use the following metrics:

• Landmark distance (D-L): the average Euclidean distance between all predicted facial landmark locations and the reference ones (normalized by the width of the face).

• Landmark velocity difference (D-V): the average Euclidean distance between reference landmark velocities and the predicted ones (again normalized by the width of the face). Velocity is computed as the difference of landmark locations between consecutive frames. This metric serves as an indicator of landmark motion dynamics.

• Head rotation and position difference (D-Rot/Pos): the average difference between the reference and predicted head rotation angles (measured in degrees) and head positions (again normalized by the width of the face). This measure indicates head pose differences, like nods and tilts.
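To make the jaw-lip metrics reproducible, here is a small NumPy sketch of D-LL and D-VL; the exact jaw/lip index set and the way the maximum reference lip width is measured (distance between the two mouth-corner landmarks) are our assumptions about details the text leaves open.

```python
import numpy as np

# Assumed jaw + lip indices of the 68-point landmark layout.
JAW_LIPS = list(range(0, 17)) + list(range(48, 68))

def d_ll(pred, ref):
    """D-LL: mean distance between predicted and reference jaw/lip landmarks,
    normalized by the maximum reference lip width. pred, ref: (T, 68, 3)."""
    lip_width = np.linalg.norm(ref[:, 54] - ref[:, 48], axis=-1).max()
    dist = np.linalg.norm(pred[:, JAW_LIPS] - ref[:, JAW_LIPS], axis=-1)
    return dist.mean() / lip_width

def d_vl(pred, ref):
    """D-VL: the same distance computed on frame-to-frame landmark velocities."""
    lip_width = np.linalg.norm(ref[:, 54] - ref[:, 48], axis=-1).max()
    vp, vr = np.diff(pred, axis=0), np.diff(ref, axis=0)
    dist = np.linalg.norm(vp[:, JAW_LIPS] - vr[:, JAW_LIPS], axis=-1)
    return dist.mean() / lip_width
```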


Fig. 8. Comparison with state-of-the-art methods for video generation of talking people (rows: GT, GT (cropped), Vougioukas et al. [2019], Chen et al. [2019], Ours, Ours (cropped)). The compared methods crop the face and predict primarily the lip region, while ours generates both facial expression and head motion. GT and our results are full faces and are cropped for a better visualization of the lip region. Left example: Chen et al. [2019] has worse prediction for side-faces (see the red box). Right example: our method predicts speaker-aware head pose dynamics (see the red box). Note that the head pose is different than the one in the ground-truth video, but it shares similar statistics.

Fig. 9. Our model works for a variety of non-photorealistic (cartoon) portrait images. Top: single input cartoon images (artistic painting, 2D cartoon character, random sketch, Japanese manga, stylized caricature, casual photo). Bottom: generated talking face samples produced by non-photorealistic face warping.


Fig. 10. Facial expression landmark comparison for the uttered phonemes /er/, /m/ and /ah/. Each row shows an example frame prediction for different methods ([Zhou et al. 2018], [Eskimez et al. 2018], [Chen et al. 2019], Ours). The ground-truth landmarks and uttered phonemes are shown on the left.

6.2 Content Animation Evaluation

We begin our evaluation by comparing MakeItTalk with state-of-the-art methods for synthesis of facial expressions driven by landmarks. Specifically, we compare with Eskimez et al. [2018], Zhou et al. [2018], and Chen et al. [2019]. All these methods attempt to synthesize facial landmarks, but cannot produce head motion; head movements are either generated procedurally or copied from a source video. Thus, to perform a fair evaluation, we factor out head motion from our method, and focus only on comparing predicted landmarks under an identical “neutral” head pose for all methods. For the purpose of this evaluation, we focus on the lip synchronization metrics (D-LL, D-VL, D-A), and ignore head pose metrics. Quantitatively, Table 2 reports these metrics for the above-mentioned methods and ours. We include our full method along with two reduced variants: (a) “Ours (no speaker ID)”, where we train and test the speech content branch alone without the speaker-aware branch, and (b) “Ours (no content)”, where we train and test the speaker-aware branch alone without the speech content branch. We discuss these two variants in more detail in our ablation study (Section 6.4). The results show that our method achieves the lowest errors for all measures. In particular, our method has 2x less D-LL error in lip landmark positions compared to [Eskimez et al. 2018], and 2.5x less D-LL error compared to [Chen et al. 2019].

Figure 10 shows characteristic examples of facial landmark outputs for the above methods and ours from our test set. Each row shows one output frame. Zhou et al. [2018] is only able to predict the lower part of the face and cannot reproduce closed mouths accurately (see second row). Eskimez et al. [2018] and Chen et al. [2019], on the other hand, tend to favor conservative mouth opening. In particular, Chen et al. [2019] predicts bottom and upper lips that sometimes overlap with each other (see second row, red box). In contrast, our method captures facial expressions that better match the reference ones. Ours can also predict subtle facial expressions, such as the lip-corner lifting (see first row, red box).

Methods                D-LL ↓   D-VL ↓   D-A ↓
[Zhou et al. 2018]     6.2%     0.85%    15.2%
[Eskimez et al. 2018]  4.0%     0.46%    7.5%
[Chen et al. 2019]     5.0%     0.49%    5.0%
Ours (no speaker ID)   2.2%     0.45%    5.9%
Ours (no content)      3.1%     0.61%    10.2%
Ours (full)            2.0%     0.44%    4.2%
Table 2. Quantitative comparison of facial landmark predictions of MakeItTalk versus state-of-the-art methods.

6.3 Speaker-Aware Animation Evaluation

Head Pose Prediction and Speaker Awareness. Existing speech-driven facial animation methods do not synthesize head motion. Instead, a common strategy is to copy head poses from another existing video. Based on this observation, we evaluate our method against two baselines: “retrieve-same ID” and “retrieve-random ID”. These baselines retrieve the head pose and position sequence from another video clip randomly picked from our training set. Then the facial landmarks are translated and rotated to reproduce the copied head poses and positions. The first baseline, “retrieve-same ID”, uses a training video with the same speaker as in the test video. This strategy makes this baseline stronger, since it re-uses dynamics from the same speaker. The second baseline, “retrieve-random ID”, uses a video from a different, random speaker. This baseline is useful to examine whether our method and alternatives produce head pose and facial expressions better than random or not.

Table 3 reports the D-L, D-V, and D-Rot/Pos metrics.

Our full method achieves much smaller errors compared to both baselines, indicating that our speaker-aware prediction is more faithful than merely copying head motion from another video. In particular, we observe that our method produces 2.7x less error in head pose (D-Rot) and 1.7x less error in head position (D-Pos) compared to using a random speaker identity (see “retrieve-random ID”). This result also confirms that the head motion dynamics of random speakers largely differ from ground-truth ones. Compared to the stronger baseline of re-using video from the same speaker (see “retrieve-same ID”), our method still produces 1.3x less error in head pose (D-Rot) and 1.5x less error in head position (D-Pos). This result confirms that re-using head motion from a video clip, even from the right speaker, still results in significant discrepancies, since the copied head pose and position do not necessarily synchronize well with the audio. Our full method instead captures the head motion dynamics and facial expressions more consistently w.r.t. the input audio and speaker identity.

Methods                D-L ↓    D-V ↓    D-Rot/Pos ↓
retrieve-same ID       17.1%    1.1%     10.3 / 8.1%
retrieve-random ID     20.8%    1.2%     21.4 / 9.2%
Ours (random ID)       33.0%    1.8%     28.7 / 12.3%
Ours (no speaker ID)   13.8%    0.7%     12.6 / 6.9%
Ours (no content)      12.5%    0.8%     8.6 / 5.7%
Ours (full)            12.3%    0.7%     8.0 / 5.4%
Table 3. Head pose prediction comparison with the baseline methods in §6.3 and our variants, based on our head pose metrics.

Figure 6 shows a gallery of our generated cartoon images and natural human faces under different predicted head poses. The corresponding generated facial landmarks are also shown in the bottom-right corner of each image. The demonstrated examples show that our method is able to synthesize head pose well, including nods and swings. Figure 11 shows another qualitative validation of our method’s ability to capture personalized head motion dynamics. The figure embeds 8 representative speakers from our dataset based on the variance of their AUs, head pose and position. The embedding is performed through t-SNE [Maaten and Hinton 2008]. These 8 representatives were selected using furthest sampling, i.e., their AUs, head pose and position differ most from the rest of the speakers in our dataset. We use different colors for different speakers, solid dots for embeddings produced from the reference videos, and stars for embeddings resulting from our method. The visualization demonstrates that our method produces head motion dynamics that tend to be located more closely to the reference ones.

Fig. 11. t-SNE visualization of AU, head pose and position variance, based on 8 reference speaker videos (solid dots) and our predictions (stars). Different speakers are marked with different colors as shown in the legend.

6.4 Ablation Study

Individual branch performance. We performed an ablation study by training and testing reduced variants of our network: (a) the first variant, called “Ours (no speaker ID)”, uses only the speech content animation branch, i.e., it omits the speaker-aware animation branch during training/testing; (b) the second variant, called “Ours (no content)”, uses only the speaker-aware animation branch, i.e., it omits the speech content animation branch during training/testing. The results of these two variants and our full method are shown in Table 2 and Table 3.

The variant “Ours (no speaker ID)” only predicts facial landmarks from the audio content without considering the speaker identity. It performs well in terms of capturing the lip landmarks, since these are synchronized with the audio content. The variant is slightly worse than our full method based on the lip evaluation metrics (see Table 2). However, it results in 1.6x larger errors in head pose and 1.3x larger errors in head position (see Table 3), since head motion is a function of both speaker identity and content.

The variant “Ours (no content)” has the opposite behaviour: it performs well in terms of capturing head pose and position (it is slightly worse than our full method, see Table 3). However, it has 1.6x higher error in jaw-lip landmark difference and 2.4x higher error in open mouth area difference (see Table 2), which indicates that the lower part of the face dynamics is not synchronized well with the audio content. Figure 12 demonstrates that using the speaker-aware animation branch alone, i.e., the “Ours (no content)” variant, results in noticeable artifacts in the jaw-lip landmark displacements. Using both branches in our full method offers the best performance according to all evaluation metrics.

Fig. 12. Comparison of different variants on the phonemes /w/, /e/ and /silence/: speaker-aware branch only (right-top) and our full model (right-bottom). The full model result has much better articulation in the lower part of the face.

Random speaker ID injection. We tested one more variant of our method, called “Ours (random ID)”. For this variant, we use our full network; however, instead of using the correct speaker embedding, we inject another, random speaker identity embedding. The result of this variant is shown in Table 3. Again we observe that the performance is significantly worse (3.6x more error for head pose). This indicates that our method successfully splits the content and speaker-aware motion dynamics, and captures the correct speaker head motion dynamics (i.e., it does not reproduce random ones).

6.5 User Studies

We also evaluated our method through perceptual user studies via the Amazon Mechanical Turk (MTurk) service. We obtained 6480 query responses from 324 different MTurk participants in our two user studies described below.

User study for speaker awareness. To evaluate the speakerawareness of different variants of our method, we assembled a poolof 300 queries displayed on different webpages. On top of thewebpage, we showed a reference video of a real person talking, andon the bottom we showed two cartoon animations: one cartoonanimation generated using our full method and another cartoonanimation based on one of the two variants discussed above: (“Ours(random ID)” and “Ours (no speaker ID)”. The two cartoon videoswere placed in randomly picked left/right positions for eachwebpage. Each query included the following question: “Supposethat you want to see the real person of the video on the top as thecartoon character shown on the bottom. Which of the two cartoonanimations best represents the person’s talking style in terms offacial expressions and head motion?” The MTurk participants wereasked to pick one of the following choices: “left animation”, “rightanimation”, “can’t tell - both represent the person quite well”, “can’ttell - none represent the person well”. Each MTurk participant wasasked to complete a questionnaire with 20 queries randomly pickedfrom our pool. Queries were shown at a random order. Each querywas repeated twice (i.e., we had 10 unique queries perquestionnaire), with the two cartoon videos randomly flipped eachtime to detect unreliable participants giving inconsistent answers.We filtered out unreliable MTurk participants who gave twodifferent answers to more than 5 out of the 10 unique queries in thequestionnaire, or took less than a minute to complete it. Eachparticipant was allowed to answer one questionnaire maximum to


Fig. 13. User study results for speaker awareness (top: Ours (full) vs. Ours (no speaker ID), and Ours (full) vs. Ours (random ID)) and photorealistic animation (bottom: Ours (full) vs. Chen et al. [2019], and Ours (full) vs. Vougioukas et al. [2019]). Choices: prefer left, prefer right, both are good, none are good.

This restriction ensured participant diversity. We had 90 different, reliable MTurk participants for this user study. For each of our 300 queries, we got votes from 3 different MTurk participants. Since each MTurk participant voted on 10 unique queries twice, we gathered 1800 responses (300 queries × 3 votes × 2 repetitions) from our 90 MTurk participants. Figure 13 (top) shows the study results. We see that the majority of MTurk participants preferred our full method over either of the two variants.
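For concreteness, the following is a minimal sketch of the reliability filter and response counting described above. The response-record fields (worker, query_id, answer, duration_sec) are hypothetical names chosen for illustration, not the actual MTurk export format.

from collections import defaultdict

def reliable_participants(responses, min_seconds=60, max_inconsistent=5):
    # responses: list of dicts with keys 'worker', 'query_id', 'answer', 'duration_sec'.
    # A participant is dropped if they answered more than `max_inconsistent` of the
    # 10 unique (repeated) queries inconsistently, or finished in under a minute.
    by_worker = defaultdict(lambda: defaultdict(list))
    duration = {}
    for r in responses:
        by_worker[r["worker"]][r["query_id"]].append(r["answer"])
        duration[r["worker"]] = r["duration_sec"]
    keep = set()
    for worker, answers in by_worker.items():
        inconsistent = sum(1 for a in answers.values() if len(set(a)) > 1)
        if inconsistent <= max_inconsistent and duration[worker] >= min_seconds:
            keep.add(worker)
    return keep

# Response count for the speaker-awareness study:
print(300 * 3 * 2)   # 300 queries x 3 votes x 2 repetitions = 1800 responses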

User study for photorealistic videos. To validate our landmark-driven photorealistic animation method, we conducted one more user study. Each MTurk participant was shown a questionnaire with 20 queries involving random pairwise comparisons drawn from a pool of 780 queries we generated. For each query, we showed a single frame of a real person’s head on top, and two generated videos below (randomly placed at left/right positions): one video synthesized by our method, and another by either Vougioukas et al. [2019] or Chen et al. [2019]. The participants were asked which person’s facial expression and head motion look more realistic and plausible. We also explicitly instructed them to ignore the particular camera position or zoom factor and to focus on the face. Participants were asked to pick one of four choices (“left”, “right”, “both”, or “none”), as in the previous study. We also employed the same random and repeated query design and the same participant consistency and reliability checks to filter out unreliable answers. We had 234 different, reliable MTurk participants for this user study. As in the previous study, each query received votes from 3 different, reliable MTurk participants. As a result, we gathered 780 queries × 3 votes × 2 repetitions = 4680 responses from our 234 participants. Figure 13 (bottom) shows the study results. Our method was voted the most “realistic” and “plausible” by a large majority, when compared to Chen et al. [2019] or Vougioukas et al. [2019].

7 CONCLUSION
We have introduced a deep-learning-based approach to generate speaker-aware talking head animations from an audio clip and a single image. Our method is able to handle new audio clips and new portrait images not seen during training. Our key insight of predicting landmarks from disentangled audio content and speaker information, such that they capture personalized facial expressions and head motion dynamics, led to much more expressive animations with higher overall quality compared to the state of the art.

Limitations and Future Work. There are still several avenues for future research. Although our method captures aspects of the speaker’s style, e.g., predicting head swings reflecting active speech, there are several other factors that can influence head motion dynamics. For example, the speaker’s mood can also play a significant role in determining head motion and facial expressions.

Our speech content animation currently does not always map the bilabial and fricative sounds, i.e., /b/, /m/, /p/, /f/, /v/, to the corresponding visemes well. We suspect this is due to a limitation of the voice conversion module we currently employ: it tends to miss these short phoneme representations when reconstructing the voice spectrum.

Improving the translation from landmarks to videos can also be further investigated. The current image-to-image translation network takes only the 2D facial landmarks as input; incorporating more phoneme- or viseme-related features may further improve the articulation of the generated video. In addition, the image translation network may sometimes alter fine-grained face details, such as eye color, wrinkles, or eyebrows. We suspect this is due to training dataset biases and mode collapse in the generator. The outputs of our image translation network are raw pixels that are hard to edit and interpret; more interpretable controls would be desirable, especially from an animator’s point of view. In terms of the predictions themselves, capturing long-range temporal and spatial dependencies between pixels could further reduce artifacts.

REFERENCES
Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. 2019. Protecting World Leaders Against Deep Fakes. In Proc. CVPRW.
Thabo Beeler and Derek Bradley. 2014. Rigid stabilization of facial expressions. ACM Trans. Graphics (2014).
Matthew Brand. 1999. Voice puppetry. In Proc. SIGGRAPH.
Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proc. ICCV.
Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2018. Lip movements generation at a glance. In Proc. ECCV.
Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proc. CVPR.
Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? In Proc. BMVC.
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In Proc. INTERSPEECH.
Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. 2019. Capture, Learning, and Synthesis of 3D Speaking Styles. In Proc. CVPR.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018).
Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Trans. Graphics (2016).
Paul Ekman and Wallace V Friesen. 1978. Facial action coding system: a technique for the measurement of facial movement. (1978).
Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. 2018. Generating talking face landmarks from speech. In Proc. LVA/ICA.
Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. 2019. Noise-Resilient Training Method for Face Landmark Generation From Speech. IEEE/ACM Trans. Audio, Speech, and Language Processing (2019).
Patrick Esser, Ekaterina Sutter, and Björn Ommer. 2018. A variational u-net for conditional appearance and shape generation. In Proc. CVPR.
Gary Faigin. 2012. The artist’s complete guide to facial expression. Watson-Guptill.
Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Perez, and Christian Theobalt. 2015. VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum.
Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning Individual Styles of Conversational Gesture. In Proc. CVPR.
Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML.
David Greenwood, Iain Matthews, and Stephen Laycock. 2018. Joint learning of facial expression and head pose from speech. In Proc. INTERSPEECH.
Xintong Han, Zuxuan Wu, Weilin Huang, Matthew R Scott, and Larry S Davis. 2019. FiNet: Compatible and Diverse Fashion Image Inpainting. In Proc. ICCV.


Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proc. CVPR.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV.
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graphics (2017).
Hyeongwoo Kim, Mohamed Elgharib, Michael Zollhöfer, Hans-Peter Seidel, Thabo Beeler, Christian Richardt, and Christian Theobalt. 2019. Neural style-preserving visual dubbing. ACM Trans. Graphics (2019).
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. JMLR (2008).
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proc. ICCV.
Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature (1976).
Hai Xuan Pham, Yuting Wang, and Vladimir Pavlovic. 2018. End-to-end learning for 3d facial animation from speech. In Proc. ICMI.
Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. GANimation: Anatomically-aware facial animation from a single image. In Proc. ECCV.
Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. 2019. AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss. In Proc. ICML. 5210–5219.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proc. MICCAI.
Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. 2009. Generalized-ICP. In Proc. Robotics: Science and Systems.
Takaaki Shiratori, Atsushi Nakazawa, and Katsushi Ikeuchi. 2006. Dancing-to-music character animation. In Computer Graphics Forum.
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First Order Motion Model for Image Animation. In Proc. NeurIPS.
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR.
Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, and Hairong Qi. 2019. Talking Face Generation by Conditional Recurrent Adversarial Network. In Proc. IJCAI.
Olga Sorkine. 2006. Differential representations for mesh processing. In Computer Graphics Forum.
Yannis Stylianou. 2009. Voice transformation: a survey. In Proc. ICASSP.
Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: learning lip sync from audio. ACM Trans. Graphics (2017).
Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Trans. Graphics (2017).
Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2019. Neural Voice Puppetry: Audio-driven Facial Reenactment. arXiv:1912.05566 (2019).
Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In Proc. CVPR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.
Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al. 2016. Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. (2016).
Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. Realistic Speech-Driven Facial Animation with GANs. IJCV (2019).
Marilyn A Walker, Janet E Cahn, and Stephen J Whittaker. 1997. Improvising linguistic style: Social and affective bases for agent personality. In Proc. IAA.
Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized end-to-end loss for speaker verification. In Proc. ICASSP.
Jordan Yaniv, Yael Newman, and Ariel Shamir. 2019. The face of art: landmark detection and geometric style in portraits. ACM Trans. Graphics (2019).
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. In Proc. ICCV.
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2019. Talking face generation by adversarially disentangled audio-visual representation. In Proc. AAAI.
Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. VisemeNet: Audio-driven animator-centric speech animation. ACM Trans. Graphics (2018).

A SPEAKER-AWARE ANIMATION NETWORK
The attention network follows the encoder block structure of Vaswani et al. [2017]. It is composed of a stack of N = 2 identical layers. Each layer has (a) a multi-head self-attention mechanism with N_head = 2 heads and dimensionality d_model = 32, and (b) a fully connected feed-forward layer whose size is the same as the input size. We also use the embedding layer (a one-layer MLP with hidden size 32) and the positional encoding layer described in Vaswani et al. [2017].

The discriminator network has an architecture similar to that of the attention network. The difference is that the discriminator also includes a three-layer MLP with hidden sizes of 512, 256, and 1, respectively.
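As an unofficial reading of this description, the sketch below instantiates the attention module and discriminator with standard PyTorch layers. The hyperparameters (N = 2 layers, 2 heads, d_model = 32, feed-forward size equal to the input size, MLP head sizes 512, 256, 1) follow the text; the input feature dimension, sequence length, temporal pooling, and sinusoidal positional encoding are assumptions.

import math
import torch
import torch.nn as nn

D_MODEL, N_LAYERS, N_HEADS = 32, 2, 2

def positional_encoding(seq_len, d_model):
    # Sinusoidal position encoding as in Vaswani et al. [2017].
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class SpeakerAwareAttention(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.embed = nn.Linear(in_dim, D_MODEL)          # one-layer MLP embedding, hidden size 32
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                           dim_feedforward=D_MODEL, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)

    def forward(self, x):                                # x: (batch, frames, in_dim)
        h = self.embed(x) + positional_encoding(x.shape[1], D_MODEL)
        return self.encoder(h)

class Discriminator(nn.Module):
    # Same attention backbone plus a three-layer MLP head (512, 256, 1).
    def __init__(self, in_dim):
        super().__init__()
        self.backbone = SpeakerAwareAttention(in_dim)
        self.head = nn.Sequential(nn.Linear(D_MODEL, 512), nn.ReLU(),
                                  nn.Linear(512, 256), nn.ReLU(),
                                  nn.Linear(256, 1))

    def forward(self, x):
        return self.head(self.backbone(x).mean(dim=1))   # average over frames (assumed pooling)

features = torch.randn(4, 64, 80)                        # (batch, frames, feature dim) - assumed
print(Discriminator(in_dim=80)(features).shape)          # torch.Size([4, 1])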

Output resolution   Layer
256 × 256           Landmark representation Y_t, input image Q (inputs)
128 × 128           ResBlock down (3 + 3) → 64
64 × 64             ResBlock down 64 → 128
32 × 32             ResBlock down 128 → 256
16 × 16             ResBlock down 256 → 512
8 × 8               ResBlock down 512 → 512
4 × 4               ResBlock down 512 → 512
8 × 8               ResBlock up 512 → 512
16 × 16             Skip + ResBlock up (512 + 512) → 512
32 × 32             Skip + ResBlock up (512 + 512) → 256
64 × 64             Skip + ResBlock up (256 + 256) → 128
128 × 128           Skip + ResBlock up (128 + 128) → 64
256 × 256           Skip + ResBlock up (64 + 64) → 3
256 × 256           Tanh

Table 4. Generator architecture for synthesizing photorealistic images.

B IMAGE-TO-IMAGE TRANSLATION NETWORK
The layers of the network architecture used to generate photorealistic images are listed in Table 4. In this table, the left column indicates the spatial resolution of the output feature map. “ResBlock down” denotes a 2-strided convolutional layer with a 3 × 3 kernel followed by two residual blocks; “ResBlock up” denotes nearest-neighbor upsampling with a scale of 2, followed by a 3 × 3 convolutional layer and then two residual blocks; and “Skip” denotes a skip connection that concatenates the feature maps of an encoding layer and a decoding layer with the same spatial resolution.
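A rough PyTorch sketch of the Table 4 generator follows. The internals of each residual block (normalization, activation) are not specified above and are assumed here, so the code is meant only to clarify the down/up-sampling and skip-connection structure rather than reproduce the released model.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Assumed residual block: two 3x3 convolutions with a ReLU, plus identity shortcut.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

def resblock_down(cin, cout):
    # 2-strided 3x3 convolution followed by two residual blocks.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         ResBlock(cout), ResBlock(cout))

def resblock_up(cin, cout):
    # Nearest-neighbor 2x upsampling, 3x3 convolution, two residual blocks.
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                         nn.Conv2d(cin, cout, 3, padding=1),
                         ResBlock(cout), ResBlock(cout))

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.d1 = resblock_down(3 + 3, 64)     # rendered landmark image + portrait image
        self.d2 = resblock_down(64, 128)
        self.d3 = resblock_down(128, 256)
        self.d4 = resblock_down(256, 512)
        self.d5 = resblock_down(512, 512)
        self.d6 = resblock_down(512, 512)
        self.u1 = resblock_up(512, 512)
        self.u2 = resblock_up(512 + 512, 512)  # skip with d5 output
        self.u3 = resblock_up(512 + 512, 256)  # skip with d4 output
        self.u4 = resblock_up(256 + 256, 128)  # skip with d3 output
        self.u5 = resblock_up(128 + 128, 64)   # skip with d2 output
        self.u6 = resblock_up(64 + 64, 3)      # skip with d1 output

    def forward(self, landmarks, image):
        e1 = self.d1(torch.cat([landmarks, image], dim=1))   # 64  @ 128x128
        e2 = self.d2(e1)                                     # 128 @ 64x64
        e3 = self.d3(e2)                                     # 256 @ 32x32
        e4 = self.d4(e3)                                     # 512 @ 16x16
        e5 = self.d5(e4)                                     # 512 @ 8x8
        e6 = self.d6(e5)                                     # 512 @ 4x4
        x = self.u1(e6)                                      # 512 @ 8x8
        x = self.u2(torch.cat([x, e5], dim=1))               # 512 @ 16x16
        x = self.u3(torch.cat([x, e4], dim=1))               # 256 @ 32x32
        x = self.u4(torch.cat([x, e3], dim=1))               # 128 @ 64x64
        x = self.u5(torch.cat([x, e2], dim=1))               # 64  @ 128x128
        x = self.u6(torch.cat([x, e1], dim=1))               # 3   @ 256x256
        return torch.tanh(x)

out = Generator()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(out.shape)   # torch.Size([1, 3, 256, 256])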