

End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech

Hai X. Pham, Yuting Wang, Vladimir Pavlovic
Department of Computer Science, Rutgers University

{hxp1,yw632,vladimir}@cs.rutgers.edu

Abstract— We present a deep learning framework for real-time speech-driven 3D facial animation from raw waveforms alone. Our deep neural network directly maps an input sequence of speech audio to a series of micro facial action unit activations and head rotations that drive a 3D blendshape face model. In particular, our deep model learns latent representations of the time-varying contextual information and affective states within the speech. Hence, at inference our model not only activates the appropriate facial action units to depict utterance-generating actions in the form of lip movements, but also, without any assumption, automatically estimates the emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting the strength of the facial unit activations. For example, in a happy speech the mouth opens wider than normal while other facial units are relaxed, and in a surprised state both eyebrows are raised higher. Experiments on a diverse audiovisual corpus of different actors across a wide range of emotional states show promising results for our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.

I. INTRODUCTION

Face synthesis is essential to many applications, such as computer games, animated movies, teleconferencing, and talking agents. Traditional facial capture approaches have achieved tremendous success, reconstructing a high level of realism. Yet active face capture rigs that use motion sensors or markers are expensive and time-consuming to use. Alternatively, passive techniques that capture facial transformations from cameras, although less accurate, have achieved very impressive performance.

One problem with vision-based facial capture approaches, however, arises when part of the face is occluded, e.g. when a person is wearing a mixed reality visor, or in the extreme situation where the visual appearance is entirely unavailable. In such cases, other input modalities, such as audio, may be exploited to infer facial actions. Indeed, research on speech-driven face synthesis has regained the attention of the community in recent years. Recent works [16], [21], [30], [31] employ deep neural networks to model the highly non-linear mapping from the input speech domain, either as audio or phonemes, to visual facial features. In particular, the approaches of Karras et al. [16] and Pham et al. [21] also take the reconstruction of facial emotion into account to generate fully transformed 3D facial shapes. The method in [16] explicitly specifies the emotional state as an additional input beside the waveforms, whereas [21] implicitly infers affective states from acoustic features and represents emotions via blendshape weights [6].

In this work, we further improve the approach of [21] in several ways, in order to recreate a better 3D talking avatar that can naturally rotate and perform micro facial actions to represent the time-varying contextual information and emotional intensity of speech in real time. Firstly, we forgo handcrafted, high-level acoustic features such as chromagrams or mel-frequency cepstral coefficients (MFCC), which, as conjectured in [21], may cause the loss of information important for identifying some specific emotions, e.g. happiness. Instead, we directly use the Fourier-transformed spectrogram as input to our neural network. Secondly, we employ convolutional neural networks (CNN) to learn meaningful acoustic feature representations, taking advantage of the locality and shift invariance of the audio signal in the frequency domain. Thirdly, we combine these convolutional layers with a recurrent layer in an end-to-end framework, which learns the temporal transitions of facial movements, as well as spontaneous actions and varying emotional states, from speech sequences alone. Experiments on the RAVDESS audiovisual corpus [19] demonstrate promising results for our approach in real-time speech-driven 3D facial animation.

The organization of the paper is as follows. Section II summarizes other studies related to our work. Our approach is explained in detail in Section III. Experiments are described in Section IV, before Section V concludes our work.

II. RELATED WORK

"Talking head" is a research topic in which an avatar is animated to imitate human talking. Various approaches have been developed to synthesize a face model driven by either speech audio [13], [37], [28] or transcripts [33], [9]. Essentially, every talking-head animation technique develops a mapping from input speech to visual features, and can be formulated as a classification or a regression task. Classification approaches usually identify phonetic units (phonemes) in the speech and map them to visual units (visemes) based on specific rules, and animation is generated by morphing these key images. Regression approaches, on the other hand, can directly generate visual parameters and their trajectories from input features. Early research on talking heads used Hidden Markov Models (HMMs) with some success [34], [35], despite certain limitations of the HMM framework, such as trajectory oversmoothing.

In recent years, deep neural networks have been successfully applied to speech synthesis [24], [38] and facial animation [11], [39], [13] with superior performance. This is because deep neural networks (DNN) are able to learn the correlations of high-dimensional input data and, in the case of recurrent neural networks (RNN), long-term relations, as well as the highly non-linear mapping between input and output features. Taylor et al. [31] propose a system that uses a DNN to estimate active appearance model (AAM) coefficients from input phonemes; it generalizes well to different speeches and languages, and the resulting face shapes can be retargeted to drive 3D face models. Suwajanakorn et al. [30] use a long short-term memory (LSTM) RNN to predict 2D lip landmarks from input acoustic features, which are used to synthesize lip movements. Fan et al. [13] use both acoustic and text features to estimate AAM coefficients of the mouth area, which are then grafted onto an actual image to produce a photo-realistic talking head. Karras et al. [16] propose a deep convolutional neural network (CNN) that jointly takes audio autocorrelation coefficients and an emotional state to output an entire 3D face shape.

In terms of the underlying face model, these approaches can be categorized into image-based [5], [9], [12], [34], [37], [13] and model-based [4], [3], [29], [36], [11], [7] approaches. Image-based approaches compose photo-realistic output by concatenating short clips or by stitching different regions from a sample database together. However, their performance and quality are limited by the number of samples in the database; it is therefore difficult to generalize to a large corpus of speech, which would require a tremendous number of image samples to cover all possible facial appearances. In contrast, although lacking in photo-realism, model-based approaches enjoy the flexibility of a deformable model, which is controlled by only a small set of parameters, and more straightforward modeling. Pham et al. [21] propose a mapping from acoustic features to the blending weights of a blendshape model [6]. This face model allows an emotional representation that can be inferred from speech, without explicitly defining the emotion as an input or artificially adding emotion to the face model in postprocessing. Our approach also enjoys the flexibility of the blendshape model for 3D face reconstruction from speech.

CNN-based speech modeling. Convolutional neural networks [18] have achieved great success in many vision tasks, e.g. image classification and segmentation. Their efficient filter design allows deeper networks and enables learning features directly from data while remaining robust to noise and small shifts, and thus usually yields better performance than prior modeling techniques. In recent years, CNNs have also been employed in speech recognition tasks that directly model raw waveforms by taking advantage of locality and translation invariance in the time [32], [20], [15] and frequency domains [10], [2], [1], [25], [26], [27]. In this work, we also employ convolutions in the time-frequency domain and formulate an end-to-end deep neural network that directly maps input waveforms to blendshape weights.

III. DEEP END-TO-END LEARNING FOR 3D FACE SYNTHESIS FROM SPEECH

A. Face Representation

Our work makes use of the 3D blendshape face model from the FaceWarehouse database [6], which has been utilized successfully in visual 3D face tracking tasks [23], [22]. An arbitrary, fully transformed facial shape S can be composed as:

S = R\left(B_0 + \sum_{i=1}^{N} (B_i - B_0)\, e_i\right), \qquad (1)

where (R, e) are the rotation and expression blending parameters, respectively, {B_i | i = 1..N} are the personalized expression blendshape bases of a particular person, and the {e_i} are constrained to [0, 1]. B_0 is the neutral-pose blendshape.

Similar to [21], our deep model also generates (R, e), where R is represented by the three free parameters of a quaternion, and e is a vector of length N = 46. We use the 3D face tracker of [23] to extract these parameters from the training videos.
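For concreteness, the following NumPy sketch shows one way to evaluate Eq. (1) from a set of blendshape bases, expression weights, and the three rotation parameters. The quaternion parameterization (the three parameters taken as the vector part of a unit quaternion with a non-negative scalar part) is our assumption, since the paper does not spell it out.

```python
import numpy as np

def quat_from_free_params(q_xyz):
    """Recover a unit quaternion (w, x, y, z) from its three free parameters.

    Assumption: the three parameters are the vector part of a unit
    quaternion with a non-negative scalar part.
    """
    q_xyz = np.asarray(q_xyz, dtype=np.float64)
    w = np.sqrt(max(0.0, 1.0 - np.dot(q_xyz, q_xyz)))
    return np.concatenate(([w], q_xyz))

def quat_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def compose_shape(B, e, q_xyz):
    """Eq. (1): S = R * (B0 + sum_i (Bi - B0) * e_i).

    B     : (N+1, V, 3) blendshape bases, B[0] is the neutral pose.
    e     : (N,) expression weights in [0, 1] (N = 46 here).
    q_xyz : three free rotation parameters (see assumption above).
    Returns a (V, 3) array of rotated 3D vertices.
    """
    B0, deltas = B[0], B[1:] - B[0]                 # (V, 3), (N, V, 3)
    shape = B0 + np.tensordot(e, deltas, axes=1)    # blend the expressions
    R = quat_to_matrix(quat_from_free_params(q_xyz))
    return shape @ R.T                              # apply head rotation
```

In our setting, e corresponds to the 46 expression weights produced by the network and q_xyz to its three rotation outputs.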

Fig. 1: A few samples from the RAVDESS database, where a 3D facial blendshape (right) is aligned to the face of the actor (left) in the corresponding frame. Red dots indicate 3D landmarks of the model projected onto the image plane.

B. Framework Architecture

Our end-to-end deep neural network is illustrated in Figure 2. The input to our model consists of raw time-frequency spectrograms of the audio signal. Specifically, each spectrogram contains 128 frequency power bands across 32 time frames, in the form of a 2D (frequency-time) array suitable for a CNN. We apply convolutions on frequency and time separately, similar to [16], [27], as this practice has been empirically shown to reduce overfitting; furthermore, using smaller filters requires less computation, which in turn speeds up training and inference. The network architecture is detailed in Table I. Specifically, the input spectrogram is first convolved and pooled along the frequency axis with a downsampling factor of two, until the frequency dimension is reduced to one. Then convolution and pooling are applied along the time axis. A dense layer is placed on top of the CNN, which feeds into a unidirectional recurrent layer. In this work, we report the model performance when the recurrent layer uses either LSTM [14] or gated recurrent unit (GRU) [8] cells. The output from the RNN is passed to another dense layer, whose purpose is to reduce the oversmoothing tendency of the RNN and to allow more spontaneous changes in facial unit activations. Every convolutional layer is followed by a batch normalization layer, except the last one (Conv8 in Table I).


Fig. 2: The proposed end-to-end speech-driven 3D facial animation framework. The input spectrogram is first convolved over the frequency axis (F-convolution), then over time (T-convolution). The detailed network architecture is described in Table I.

TABLE I: Configuration of our hidden neural layers.

Name     Filter Size   Stride   Hidden Layer Size   Activation
Input    -             -        128 × 32            -
Conv1    (3,1)         (2,1)    64 × 64 × 32        ReLU
Pool1    (2,1)         (2,1)    64 × 32 × 32        -
Conv2    (3,1)         (2,1)    96 × 16 × 32        ReLU
Pool2    (2,1)         (2,1)    96 × 8 × 32         -
Conv3    (3,1)         (2,1)    128 × 4 × 32        ReLU
Conv4    (3,1)         (2,1)    160 × 2 × 32        ReLU
Conv5    (2,1)         (2,1)    256 × 1 × 32        ReLU
Pool5    (1,2)         (1,2)    256 × 1 × 16        -
Conv6    (1,3)         (1,2)    256 × 1 × 8         ReLU
Conv7    (1,3)         (1,2)    256 × 1 × 4         ReLU
Conv8    (1,4)         (1,4)    256 × 1 × 1         ReLU
Dense1   -             -        256                 tanh
RNN      -             -        256                 -
Dense2   -             -        256                 tanh
Output   -             -        49                  -
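To make Table I concrete, below is an illustrative PyTorch sketch of the architecture. It is an approximation rather than the authors' implementation (the paper uses the CNTK toolkit); the padding values are our assumptions, chosen so that the layer output sizes match the table, and the two output heads follow the rotation/expression split described under Training Details.

```python
import torch
import torch.nn as nn

class SpeechToFace(nn.Module):
    """Illustrative approximation of the Table I architecture (assumed paddings)."""

    def __init__(self, rnn_type="GRU"):
        super().__init__()

        def conv(cin, cout, k, s, p, bn=True):
            layers = [nn.Conv2d(cin, cout, k, s, p)]
            if bn:
                layers.append(nn.BatchNorm2d(cout))   # batch norm after every conv except the last
            layers.append(nn.ReLU(inplace=True))
            return layers

        self.cnn = nn.Sequential(
            # frequency convolutions: 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1
            *conv(1, 64, (3, 1), (2, 1), (1, 0)), nn.MaxPool2d((2, 1)),
            *conv(64, 96, (3, 1), (2, 1), (1, 0)), nn.MaxPool2d((2, 1)),
            *conv(96, 128, (3, 1), (2, 1), (1, 0)),
            *conv(128, 160, (3, 1), (2, 1), (1, 0)),
            *conv(160, 256, (2, 1), (2, 1), (0, 0)), nn.MaxPool2d((1, 2)),
            # time convolutions: 32 -> 16 -> 8 -> 4 -> 1
            *conv(256, 256, (1, 3), (1, 2), (0, 1)),
            *conv(256, 256, (1, 3), (1, 2), (0, 1)),
            *conv(256, 256, (1, 4), (1, 4), (0, 0), bn=False),
        )
        self.dense1 = nn.Sequential(nn.Linear(256, 256), nn.Tanh())
        rnn_cls = nn.GRU if rnn_type == "GRU" else nn.LSTM
        self.rnn = rnn_cls(256, 256, batch_first=True)
        self.dense2 = nn.Sequential(nn.Linear(256, 256), nn.Tanh())
        self.head_rot = nn.Linear(256, 3)    # rotation parameters, tanh range
        self.head_exp = nn.Linear(256, 46)   # expression weights, sigmoid range

    def forward(self, x):
        # x: (batch, seq_len, 128, 32) -- one spectrogram per video frame
        b, t, f, w = x.shape
        h = self.cnn(x.reshape(b * t, 1, f, w)).reshape(b, t, 256)
        h = self.dense1(h)
        h, _ = self.rnn(h)
        h = self.dense2(h)
        return torch.tanh(self.head_rot(h)), torch.sigmoid(self.head_exp(h))
```

Swapping rnn_type between "GRU" and "LSTM" gives the two configurations compared in Section IV, and dropping the recurrent layer altogether corresponds to the CNN-static baseline.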

C. Training Details

Audio processing. For each video frame t in the corpus, we extract a 96ms audio frame sampled at 44.1kHz, containing the acoustic data of the current frame and the preceding ones. With the intended application being real-time animation, we do not allow any delay to gather future data, i.e. audio samples of frames t + 1 onward, as they are unknown in a live streaming scenario. Instead, temporal transitions are modeled by the recurrent layer. We apply the FFT with a window size of 256 and a hop length of 128 to recover a power spectrogram of 128 frequency bands and 32 time frames.
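A minimal NumPy sketch of this spectrogram computation is shown below. The window size, hop length, and output dimensions come from the paragraph above; the Hann window and the choice of dropping the Nyquist bin to obtain exactly 128 bands are our assumptions.

```python
import numpy as np

def frame_spectrogram(audio_window, n_fft=256, hop=128, n_bands=128):
    """Power spectrogram for one ~96 ms audio frame sampled at 44.1 kHz."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio_window) - n_fft) // hop
    frames = np.stack([
        audio_window[i * hop: i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, 129) power bins
    return spec[:, :n_bands].T                        # keep 128 bands -> (128, n_frames)

# 96 ms at 44.1 kHz (~4233 samples) yields 32 time frames of 128 bands
x = np.random.randn(int(0.096 * 44100)).astype(np.float32)
S = frame_spectrogram(x)
print(S.shape)  # (128, 32)
```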

Loss function. Our framework maps an input sequence of spectrograms x_t, t = 1..T, to an output sequence of shape parameter vectors y_t, where T is the number of video frames. Thus, at any given time t, the deep model estimates y_t = (R_t, e_t) from an input spectrogram x_t. Similar to [21], we split the output into two separate layers, Y_R for rotation and Y_e for expression weights. Y_R uses a tanh activation, whereas Y_e uses a sigmoid activation, to constrain the range of the output values:

y_{R,t} = \tau\left(W_{hy_R} h_t + b_{y_R}\right), \quad y_{e,t} = \sigma\left(W_{hy_e} h_t + b_{y_e}\right), \quad y_t = \left(y_{R,t}, y_{e,t}\right). \qquad (2)

We train the model by minimizing the squared error:

E = \sum_t \left\lVert y_t - \bar{y}_t \right\rVert^2, \qquad (3)

where \bar{y}_t is the expected output, which we extract from the training videos. We use the CNTK deep learning toolkit to implement our neural network models. The training hyperparameters are chosen as follows: the minibatch size is 300, the epoch size is 150,000, and the learning rate is 0.0001. The network parameters are learned with the Adam optimizer [17] over 300 epochs.
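A minimal training sketch, again in PyTorch rather than the CNTK setup actually used, is given below. It reuses the SpeechToFace sketch from Section III-B; the random tensors stand in for real (spectrogram, rotation, expression) batches, and the batch and sequence sizes are illustrative only.

```python
import torch
import torch.nn.functional as F

model = SpeechToFace(rnn_type="GRU")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

batch, seq_len = 4, 30                            # illustrative sizes only
spec   = torch.randn(batch, seq_len, 128, 32)     # input spectrograms
rot_gt = torch.rand(batch, seq_len, 3) * 2 - 1    # rotation targets in [-1, 1]
exp_gt = torch.rand(batch, seq_len, 46)           # expression targets in [0, 1]

for step in range(10):                            # the paper trains for 300 epochs
    rot_pred, exp_pred = model(spec)
    # Eq. (3): squared error summed over both output groups
    loss = F.mse_loss(rot_pred, rot_gt, reduction="sum") + \
           F.mse_loss(exp_pred, exp_gt, reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```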

IV. EXPERIMENTS

A. Dataset

We use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [19] for training and evaluation. The database consists of 24 professional actors speaking and singing with various emotions: neutral, calm, happy, sad, angry, fearful, disgust and surprised. We use the video sequences of the first 20 actors for training, with around 250,000 frames in total, which translates to about two hours of audio, and evaluate the model on the data of the four remaining actors.
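The actor-based split can be reproduced with a few lines of Python; the sketch below assumes the standard RAVDESS release layout, where each actor's clips live under a folder named Actor_01 through Actor_24.

```python
from pathlib import Path

def split_ravdess_by_actor(root, n_train_actors=20):
    """Partition RAVDESS clips into train/test lists by actor identity.

    Train on the first 20 actors and test on the remaining 4, matching
    the split described above. The Actor_XX directory naming is an
    assumption about the dataset layout.
    """
    train, test = [], []
    for actor_dir in sorted(Path(root).glob("Actor_*")):
        actor_id = int(actor_dir.name.split("_")[1])
        bucket = train if actor_id <= n_train_actors else test
        bucket.extend(sorted(actor_dir.glob("*")))
    return train, test
```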

B. Experimental Settings

We train our deep neural network in two configurations, CNN+LSTM and CNN+GRU, in which the recurrent layer uses LSTM and GRU cells, respectively. As a baseline model, we drop the recurrent layer from our proposed neural network and denote the result CNN-static. This model cannot handle temporal transitions smoothly; it estimates facial parameters on a frame-by-frame basis. We also compare our proposed models with the method described in [21], which uses engineered features as input.

We compare these models on two metrics: the RMSE of the 3D landmark errors, and the mean squared error of the facial parameters, specifically the expression blending weights, with respect to the ground truth recovered by the visual tracker [23]. Landmark errors are calculated as the distances from the inner landmarks (shown in Figure 1) on the reconstructed 3D face shape to those of the ground truth 3D shape. We ignore error metrics for head rotation, since it is rather difficult to infer head pose correctly from speech alone; we include head pose estimation primarily to generate plausible rigid motions that augment the realism of the talking avatar. However, based on our observations, these error metrics do not fully reflect the performance of our deep generative model.
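The two metrics can be computed as in the NumPy sketch below; the exact aggregation over frames and landmarks is our assumption, since the paper only names the metrics.

```python
import numpy as np

def landmark_rmse(pred_landmarks, gt_landmarks):
    """RMSE over per-landmark 3D distances (in the model's units, e.g. mm).

    pred_landmarks, gt_landmarks: (T, L, 3) arrays of the inner 3D
    landmarks for T frames and L landmarks.
    """
    d = np.linalg.norm(pred_landmarks - gt_landmarks, axis=-1)  # (T, L)
    return float(np.sqrt(np.mean(d ** 2)))

def expression_mse(pred_weights, gt_weights):
    """Mean squared error of the expression blending weights, shape (T, 46)."""
    return float(np.mean((pred_weights - gt_weights) ** 2))
```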


TABLE II: RMSE of the 3D landmarks in millimeters, categorized by the emotion of the test sequences and by actor.

            Mean   Neutral  Calm   Happy  Sad    Angry  Fearful  Disgu.  Surpri.  Actor21  Actor22  Actor23  Actor24
LSTM [21]   1.039  1.059    1.039  1.082  1.010  1.033  1.016    1.049   1.038    1.007    0.941    1.056    1.139
CNN-static  0.741  0.725    0.715  0.760  0.728  0.746  0.746    0.781   0.746    0.723    0.697    0.719    0.817
CNN+LSTM    1.042  1.077    1.029  1.092  1.029  1.013  1.031    1.035   1.054    1.026    0.946    1.021    1.162
CNN+GRU     1.022  1.034    0.995  1.081  0.999  1.012  1.008    1.023   1.045    0.998    0.952    0.985    1.139

TABLE III: Mean squared error of the expression blending weights, categorized by the emotion of the test sequences and by actor.

            Mean   Neutral  Calm   Happy  Sad    Angry  Fearful  Disgu.  Surpri.  Actor21  Actor22  Actor23  Actor24
LSTM [21]   0.065  0.067    0.066  0.069  0.068  0.058  0.062    0.071   0.063    0.040    0.061    0.070    0.088
CNN-static  0.018  0.016    0.016  0.018  0.019  0.016  0.018    0.022   0.019    0.012    0.016    0.019    0.022
CNN+LSTM    0.074  0.072    0.074  0.083  0.074  0.063  0.073    0.074   0.079    0.052    0.061    0.075    0.106
CNN+GRU     0.067  0.065    0.065  0.075  0.069  0.059  0.065    0.073   0.070    0.041    0.061    0.066    0.100

TABLE IV: Training errors of the four models after 300 epochs.

LSTM [21]  CNN-static  CNN+LSTM  CNN+GRU
0.59       0.94        0.54      0.52


C. Evaluation

Tables II and III show the aforementioned error metrics for the four models, organized into categories corresponding to the different emotions and test actors. In terms of landmark errors, the proposed model with GRU cells slightly outperforms [21] as well as the proposed model with LSTM cells. On the other hand, CNN+GRU has blendshape coefficient errors similar to [21], whereas CNN+LSTM has the highest errors.

Interestingly, on both metrics the CNN-static baseline outperforms all models with recurrent layers. As shown in Table II, the RMSE of CNN-static is consistently about 0.7mm across the different categories and test actors, roughly 30% lower than the errors of the other models. CNN-static also achieves lower parameter estimation errors. These results suggest that the baseline generalizes well to the test data in terms of facial action estimation, although visualization of the 3D face reconstruction shows that the baseline model inherently generates non-smooth sequences of facial actions, as shown in the supplementary video. From these test results, combined with the training errors listed in Table IV, we hypothesize that our proposed models, CNN+LSTM and CNN+GRU, overfit the training data, and that CNN+GRU, being simpler, generalizes somewhat better than CNN+LSTM. The baseline model CNN-static, being the simplest of the four, i.e. having the fewest parameters, generalizes well and achieves the best performance on the test set both in terms of the error metrics and in terms of emotional facial reconstruction from speech, as demonstrated in Figure 3. These results again demonstrate the robustness, generalization, and adaptability of the CNN architecture and suggest that using a CNN to model facial actions from raw waveforms is the right direction, but there remain limitations and deficiencies in our current approach that need to be addressed.

Fig. 3: A few reconstruction samples. On the left are the true face appearances. The second to last columns show reconstruction results by Pham et al. [21], CNN-static, CNN+LSTM and CNN+GRU, respectively. We use a generic 3D face model animated with the parameters generated by each model. The reconstructions by CNN-static depict the emotions of the speakers reasonably well; however, it cannot generate smooth transitions between frames.


V. CONCLUSION AND FUTURE WORK

This paper introduces a deep learning framework for speech-driven 3D facial animation from raw waveforms. Our proposed deep neural network learns a mapping from the audio signal to the temporally varying context of the speech, as well as the emotional states of the speaker, represented implicitly by the blending weights of a 3D face model. Experiments demonstrate that our approach can estimate the form of lip movements and the emotional intensity of the speaker reasonably well. However, there are certain limitations in our network architecture that prevent the model from reflecting the emotion in the speech perfectly. In the future, we will improve the generalization of our deep neural network and explore other generative models to increase the facial reconstruction quality.

REFERENCES

[1] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 22(10), October 2014.

[2] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.

[3] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In Eurographics, pages 641–650, 2003.

[4] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, pages 187–194, 1999.

[5] C. Bregler, M. Covell, and M. Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH, pages 353–360, 1997.

[6] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: a 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, March 2014.

[7] Y. Cao, W. C. Tien, P. Faloutsos, and F. Pighin. Expressive speech-driven facial animation. ACM Transactions on Graphics, 24(4):1283–1302, 2005.

[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop, 2014.

[9] E. Cosatto, J. Ostermann, H. P. Graf, and J. Schroeter. Lifelike talking faces for interactive services. Proceedings of the IEEE, 91(9):1406–1429, 2003.

[10] L. Deng, O. Abdel-Hamid, and D. Yu. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.

[11] C. Ding, L. Xie, and P. Zhu. Head motion synthesis from speech using deep neural networks. Multimedia Tools and Applications, 74:9871–9888, 2015.

[12] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In SIGGRAPH, pages 388–397, 2002.

[13] B. Fan, L. Xie, S. Yang, L. Wang, and F. K. Soong. A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications, 75:5287–5309, 2016.

[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[15] Y. Hoshen, R. J. Weiss, and K. W. Wilson. Speech acoustic modeling from raw multichannel waveforms. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.

[16] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. In SIGGRAPH, 2017.

[17] D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.

[18] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. Pages 255–258, 1998.

[19] S. R. Livingstone, K. Peck, and F. A. Russo. RAVDESS: the Ryerson Audio-Visual Database of Emotional Speech and Song. In 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS), 2012.

[20] D. Palaz, R. Collobert, and M. Magimai-Doss. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, 2013.

[21] H. X. Pham, S. Cheung, and V. Pavlovic. Speech-driven 3D facial animation with implicit emotional awareness: a deep learning approach. In The 1st DALCOM Workshop, CVPR, 2017.

[22] H. X. Pham and V. Pavlovic. Robust real-time 3D face tracking from RGBD videos under extreme pose, depth, and expression variations. In 3DV, 2016.

[23] H. X. Pham, V. Pavlovic, J. Cai, and T.-J. Cham. Robust real-time performance-driven 3D face tracking. In ICPR, 2016.

[24] Y. Qian, Y. Fan, and F. K. Soong. On the training aspects of deep neural network (DNN) for parametric TTS synthesis. In ICASSP, pages 3829–3833, 2014.

[25] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A.-R. Mohamed, G. Dahl, and B. Ramabhadran. Deep convolutional neural networks for large-scale speech tasks. Neural Networks, 64:39–48, 2015.

[26] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.

[27] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, 2015.

[28] S. Sako, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. HMM-based text-to-audio-visual speech synthesis. In ICSLP, pages 25–28, 2000.

[29] G. Salvi, J. Beskow, S. Moubayed, and B. Granstrom. SynFace: speech-driven facial animation for virtual speech-reading support. EURASIP Journal on Audio, Speech, and Music Processing, 2009.

[30] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. In SIGGRAPH, 2017.

[31] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews. A deep learning approach for generalized speech animation. In SIGGRAPH, 2017.

[32] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Interspeech, 2016.

[33] A. Wang, M. Emmi, and P. Faloutsos. Assembling an expressive facial animation system. In ACM SIGGRAPH Video Game Symposium (Sandbox), pages 21–26, 2007.

[34] L. Wang, X. Qian, W. Han, and F. K. Soong. Synthesizing photo-real talking head via trajectory-guided sample selection. In Interspeech, pages 446–449, 2010.

[35] L. Wang, X. Qian, F. K. Soong, and Q. Huo. Text driven 3D photo-realistic talking head. In Interspeech, pages 3307–3310, 2011.

[36] Z. Wu, S. Zhang, L. Cai, and H. Meng. Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar. In Interspeech, pages 1802–1805, 2006.

[37] L. Xie and Z. Liu. Realistic mouth-synching for speech-driven talking face using articulatory modeling. IEEE Transactions on Multimedia, 9(23):500–510, 2007.

[38] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In ICASSP, pages 7962–7966, 2013.

[39] X. Zhang, L. Wang, G. Li, F. Seide, and F. K. Soong. A new language independent, photo-realistic talking head driven by voice only. In Interspeech, 2013.