
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 260–269, Florence, Italy, August 2, 2019. ©2019 Association for Computational Linguistics


Modality-based Factorization for Multimodal Fusion

Elham J. Barezi, Pascale Fung
Center for Artificial Intelligence Research (CAiRE)
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
[email protected], [email protected]

Abstract

We propose a novel method, Modality-based Redundancy Reduction Fusion (MRRF), for understanding and modulating the relative contribution of each modality in multimodal inference tasks. This is achieved by obtaining an (M + 1)-way tensor to consider the high-order relationships between M modalities and the output layer of a neural network model. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, removes information present in a modality that can be compensated by other modalities, with respect to model outputs. This helps to understand the relative utility of information in each modality. In addition, it leads to a less complicated model with fewer parameters, and therefore can be applied as a regularizer that helps avoid overfitting. We have applied this method to three different multimodal datasets in sentiment analysis, personality trait recognition, and emotion recognition. We are able to recognize relationships and the relative importance of different modalities in these tasks, and achieve a 1% to 4% improvement on several evaluation measures compared to the state-of-the-art for all three tasks.

1 Introduction

Multimodal data fusion is a desirable method for many machine learning tasks where information is available from multiple source modalities, typically achieving better predictions through the integration of information from different modalities. Multimodal integration can handle missing data from one or more modalities. Since some modalities can include noise, it can also lead to more robust prediction. Moreover, since some information may not be visible in some modalities, or a single modality may not be powerful enough for a specific task, considering multiple modalities often improves performance (Potamianos et al., 2003; Soleymani et al., 2012; Kampman et al., 2018).

For example, humans assign personality traits to each other, as well as to virtual characters, by inferring personality from diverse cues, both behavioral and verbal, suggesting that a model to predict personality should take into account multiple modalities such as language, speech, and visual cues.

Our method, Modality-based Redundancy Reduction multimodal Fusion (MRRF), builds on recent work in multimodal fusion, utilizing first an outer product tensor of input modalities to better capture inter-modality dependencies (Zadeh et al., 2017) and then a recent approach to reduce the number of elements in the resulting tensor through low-rank factorization (Liu et al., 2018). Whereas the factorization used in Liu et al. (2018) applies a single compression rate across all modalities, we instead use Tucker's tensor decomposition (see the Methodology section), which allows different compression rates for each modality. This allows the model to adapt to variations in the amount of useful information between modalities. Modality-specific factors are chosen by maximizing performance on a validation set.

Applying a modality-based factorization method removes redundant information duplicated across modalities and leads to fewer parameters with minimal information loss. Through maximizing performance on a validation set, our method can work as a regularizer, leading to a less complicated model and reducing overfitting. In addition, our modality-based factorization approach helps to understand the differences in useful information between modalities for the task at hand.

We evaluate the performance of our approach using sentiment analysis, personality detection, and emotion recognition from audio, text, and video frames. The method reduces the number of parameters, which requires fewer training samples, provides efficient training for smaller datasets, and accelerates both training and prediction. Our experimental results demonstrate that the proposed approach yields notable improvements in terms of accuracy, mean absolute error (MAE), correlation, and F1 score, especially for applications with more complicated inter-modality relations.

We further study the effect of different compression rates for different modalities. Our results on the importance of each modality for each task support previous findings on the usefulness of each modality for personality recognition, emotion recognition, and sentiment analysis.

In the sequel, we first describe related work. We elaborate on the details of our proposed method in the Methodology section. In the following section we describe our experimental setup. In the Results section, we compare the performance of MRRF and state-of-the-art baselines on three different datasets and discuss the effect of the compression rate on each modality. Finally, we provide a brief conclusion of the approach and the results. Supplementary materials describe the methodology in greater detail.

Notation: The operator ⊗ is the outer product operator, where z1 ⊗ … ⊗ zM for zi ∈ R^(di) leads to an M-way tensor in R^(d1×…×dM). The operator ×k, for a given k, is the k-mode product of a tensor R ∈ R^(r1×r2×…×rM) and a matrix W ∈ R^(dk×rk), written W ×k R, which results in a tensor in R^(r1×…×r(k−1)×dk×r(k+1)×…×rM).
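For concreteness, both operators can be sketched in a few lines of NumPy (our illustration, not code from the paper; shapes are arbitrary):

```python
import numpy as np

# Outer product: z1 (x) z2 (x) z3 for vectors of sizes d1, d2, d3
# yields a 3-way tensor of shape (d1, d2, d3).
z1, z2, z3 = np.random.randn(4), np.random.randn(5), np.random.randn(6)
D = np.einsum('i,j,k->ijk', z1, z2, z3)
print(D.shape)  # (4, 5, 6)

def mode_product(R, W, k):
    """k-mode product: contract mode k of tensor R (size r_k) with
    matrix W of shape (d_k, r_k), replacing r_k by d_k."""
    Rk = np.moveaxis(R, k, 0)                   # bring mode k to the front
    out = np.tensordot(W, Rk, axes=([1], [0]))  # contract over r_k
    return np.moveaxis(out, 0, k)               # move the new mode back

R = np.random.randn(2, 3, 2)
W = np.random.randn(7, 3)                       # d_2 = 7, r_2 = 3
print(mode_product(R, W, 1).shape)              # (2, 7, 2)
```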

2 Related Work

Multimodal Fusion: Multimodal fusion (Ngiam et al., 2011) has a very broad range of applications, including audio-visual speech recognition (Potamianos et al., 2003), classification of images and their captions (Srivastava and Salakhutdinov, 2012), multimodal emotion recognition (Soleymani et al., 2012), medical image analysis (James and Dasarathy, 2014), multimedia event detection (Lan et al., 2014), personality trait detection (Kampman et al., 2018), and sentiment analysis (Zadeh et al., 2017).

According to recent work by Baltrusaitis et al. (2018), techniques for multimodal fusion can be divided into early, late, and hybrid approaches. Early approaches combine the multimodal features immediately by simply concatenating them (D'mello and Kory, 2015). Late fusion combines the decision for each modality (either classification or regression) by voting (Morvant et al., 2014), averaging (Shutova et al., 2016), or a weighted sum of the outputs of the learned models (Glodek et al., 2011; Shutova et al., 2016). The hybrid approach combines the predictions of early fusion and unimodal predictions.

It has been observed that early fusion (feature-level fusion) concentrates on inter-modality information rather than intra-modality information (Zadeh et al., 2017), because inter-modality information can be more complicated at the feature level and dominates the learning process. On the other hand, these fusion approaches are not powerful enough to extract the inter-modality integration model, as they are limited to simple combination methods (Zadeh et al., 2017).

Zadeh et al. (2017) proposed combining n modalities by computing an n-way tensor as the tensor product of the n different modality representations, followed by a flattening operation, in order to include first-order to n-th-order inter-modality relations. This is then fed to a neural network model to make predictions. The authors show that their proposed method improves accuracy by considering both inter-modality and intra-modality relations. However, the generated representation has a very large dimension, which leads to a very large hidden layer and therefore a huge number of parameters.

The authors of (Poria et al., 2017a,b; Zadeh et al., 2018a,b) introduce attention mechanisms utilizing the contextual information available from the utterances of each speaker. They require additional information, such as the identity of the speaker and the sequence of utterance sentiments, while integrating the multimodal data. Since these methods, unlike our proposed method, need additional information that might not be available in the general scenario, we do not include them in our experiments.

Low Rank Factorization: Recently, Liu et al. (2018) proposed a factorization approach to obtain a factorized version of the weight tensor, which leads to fewer parameters while maintaining model accuracy. They use a CANDECOMP/PARAFAC decomposition (Carroll and Chang, 1970; Harshman, 1970), which follows Eq. 1 to decompose a tensor W ∈ R^(d1×…×dM) into several one-dimensional vectors w_m^i ∈ R^(dm):

    W = Σ_{i=1}^{r} λ_i w_1^i ⊗ w_2^i ⊗ … ⊗ w_M^i = Σ_{i=1}^{r} λ_i ⊗_{m=1}^{M} w_m^i    (1)

where ⊗ is the outer product operator and the λ_i are scalar weights that combine the rank-1 decompositions. This approach uses the same compression rate for all modalities, i.e. r is shared across all the modalities, and it cannot allow for varying compression rates between modalities. Previous studies have found that some modalities are more informative than others (De Silva et al., 1997; Kampman et al., 2018), suggesting that allowing different compression rates for different modalities should improve performance.
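For reference, Eq. 1 for a 3-way tensor can be written directly as an einsum; a small NumPy sketch with arbitrary sizes (not the paper's code):

```python
import numpy as np

d1, d2, d3, r = 4, 5, 6, 3
lam = np.random.randn(r)       # scalar weights, one per rank-1 term
w1 = np.random.randn(r, d1)    # rows are the factor vectors w_1^i
w2 = np.random.randn(r, d2)
w3 = np.random.randn(r, d3)

# W = sum_i lam_i * (w_1^i (x) w_2^i (x) w_3^i): a rank-r CP reconstruction.
W = np.einsum('i,ij,ik,il->jkl', lam, w1, w2, w3)
print(W.shape)  # (4, 5, 6)
```

Note that the i-th rank-1 term touches every mode with the same index i; this shared rank r is exactly the single compression rate discussed above.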

3 Methodology

3.1 Tucker Factorization for Multimodal Learning

Modality-based Redundancy Reduction Fusion (MRRF): We use Tucker's tensor decomposition (Tucker, 1966; Hitchcock, 1927), which decomposes an M-way tensor W ∈ R^(d1×d2×…×dM) into a core tensor R ∈ R^(r1×r2×…×rM) and M matrices Wi ∈ R^(di×ri), with ri ≤ di, as shown in Eq. 2:

    W = R ×1 W1 ×2 W2 ×3 … ×M WM,
    W ∈ R^(d1×d2×…×dM), R ∈ R^(r1×r2×…×rM), Wi ∈ R^(di×ri)    (2)

The operator ×k is the k-mode product of a tensor R ∈ R^(r1×r2×…×rM) and a matrix Wk ∈ R^(dk×rk), written R ×k Wk, which results in a tensor in R^(r1×…×r(k−1)×dk×r(k+1)×…×rM).
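A self-contained NumPy sketch of a Tucker reconstruction for a 3-way tensor (sizes are illustrative assumptions):

```python
import numpy as np

d1, d2, d3 = 8, 10, 12    # original dimensions
r1, r2, r3 = 3, 4, 5      # core dimensions, r_i <= d_i

R = np.random.randn(r1, r2, r3)   # non-diagonal core: keeps cross-mode terms
W1 = np.random.randn(d1, r1)      # factor matrix for mode 1
W2 = np.random.randn(d2, r2)
W3 = np.random.randn(d3, r3)

# W = R x1 W1 x2 W2 x3 W3: each core mode is expanded by its factor matrix.
W = np.einsum('abc,ia,jb,kc->ijk', R, W1, W2, W3)
print(W.shape)  # (8, 10, 12)

# Factored parameters vs. the full tensor:
print(r1*r2*r3 + d1*r1 + d2*r2 + d3*r3, 'vs', d1*d2*d3)  # 184 vs 960
```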

For M modalities with representations D1, D2, … , DM of sizes d1, d2, … , dM, the M-modal tensor fusion approach proposed by Zadeh et al. (2017) leads to a tensor D = D1 ⊗ D2 ⊗ … ⊗ DM ∈ R^(d1×d2×…×dM). The authors proposed flattening this tensor layer in the deep network, which results in a loss of the information included in the tensor structure. In this paper, we propose to avoid the flattening and instead follow Eq. 3 with a weight tensor W ∈ R^(h×d1×d2×…×dM), which leads to an output layer H of size h:

    H = WD    (3)

The above equation suffers from a large number of parameters (O(∏_{i=1}^{M} di · h)), which requires a large number of training samples and a huge amount of time and space, and easily overfits. In order to reduce the number of parameters, we propose to use Tucker's tensor decomposition (Tucker, 1966; Hitchcock, 1927) as shown in Eq. 4, which works as a low-rank regularizer (Fazel, 2002).

    W = R ×1 W1 ×2 W2 ×3 … ×(M+1) W(M+1),
    W ∈ R^(h×d1×d2×…×dM), R ∈ R^(r1×r2×…×rM×r(M+1)),
    Wi ∈ R^(ri×di), i = 1, … , M, W(M+1) ∈ R^(r(M+1)×h)    (4)

The non-diagonal core tensor R maintains inter-modality information after compression, unlike the factorization proposed by Liu et al. (2018), which loses part of the inter-modality information.

3.2 Proposed MRRF Framework

We propose Modality-based Redundancy Reduction Fusion (MRRF), a tensor fusion and factorization method allowing for modality-specific compression rates, combining the power of tensor fusion methods with a reduced parameter complexity. Without loss of generality, we will consider the number of modalities to be 3 in this discussion.

Our method first forms an outer product tensor D from the input modalities, then projects it via a tensor W to a feature vector H, passed as input to a neural network which performs the desired inference task.

    H = WD    (5)

The trainable projection tensor W represents a large number of parameters, and in order to reduce this number, we propose to use Tucker's tensor decomposition (Tucker, 1966; Hitchcock, 1927), which works as a low-rank regularizer (Fazel, 2002). This results in a decomposition of W into a core tensor R of reduced dimensionality and three modality-specific matrices Wi.

    W = R ×1 W1 ×2 W2 ×3 W3    (6)

where ×k is the k-mode product of a tensor and a matrix. Equation 5 can then be re-written as

    Z = W1 ×1 W2 ×2 W3 ×3 D
    H = ZR    (7)


See Figure 1 for an overview of this process for the case of three separate channels for audio, text, and video. In practice we flatten the tensors Z and R to reduce this last operation to a matrix multiplication. Further details of the decomposition strategy can be found in the supplementary materials.

Note that a simple outer product of the input features leads only to the high-order trimodal dependencies. In order to also obtain the unimodal and bimodal dependencies, the input feature vectors for each modality are padded by 1. This also provides a constant element whose corresponding factors in W act as a bias vector.
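A quick two-modality illustration of why padding with 1 recovers the lower-order terms (values are arbitrary):

```python
import numpy as np

a = np.concatenate(([1.0], np.random.randn(3)))  # audio features, padded with 1
t = np.concatenate(([1.0], np.random.randn(4)))  # text features, padded with 1

D = np.outer(a, t)
# D[0, 0] = 1        -> constant (bias) term
# D[0, 1:] = t[1:]   -> unimodal text terms
# D[1:, 0] = a[1:]   -> unimodal audio terms
# D[1:, 1:]          -> bimodal audio-text terms
print(np.allclose(D[0, 1:], t[1:]), np.allclose(D[1:, 0], a[1:]))  # True True
```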

Algorithm 1 shows the whole MRRF process.

Algorithm 1: Tensor Factorization Layer
Input: n input modalities D1, D2, … , Dn of sizes d1, d2, … , dn, correspondingly.
Initialization: factorization size for each modality, r1, r2, … , rn.
1: Compute the tensor D = D1 ⊗ D2 ⊗ … ⊗ Dn.
2: Generate the layers for out = WD, where W = R ×1 W1 ×2 … ×n Wn, in order to transform the high-dimensional tensor D to the output of size h.
3: Use the Adam optimizer on the differentiable tensor factorization layer to find the unknown parameters W1, W2, … , Wn, R.
Output: factors of the weight tensor W: W1, W2, … , Wn, R.
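A minimal PyTorch sketch of this layer for three modalities, exploiting the identity that (W1 D1) ⊗ (W2 D2) ⊗ (W3 D3) equals the fused tensor D compressed along each mode, so the full tensor D is never materialized. Module and variable names, initialization, and sizes are our assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

class MRRFLayer(nn.Module):
    """Sketch of a modality-based factorized fusion layer (cf. Eq. 7)."""

    def __init__(self, d1, d2, d3, r1, r2, r3, h):
        super().__init__()
        # Modality-specific compression matrices W_i of shape (r_i, d_i).
        self.W1 = nn.Parameter(0.1 * torch.randn(r1, d1))
        self.W2 = nn.Parameter(0.1 * torch.randn(r2, d2))
        self.W3 = nn.Parameter(0.1 * torch.randn(r3, d3))
        # Non-diagonal core tensor R, stored flattened as (r1*r2*r3, h).
        self.R = nn.Parameter(0.1 * torch.randn(r1 * r2 * r3, h))

    def forward(self, D1, D2, D3):
        # Compress each (padded) modality vector first; their outer product
        # equals the fused tensor D compressed along every mode.
        z1 = D1 @ self.W1.t()  # (batch, r1)
        z2 = D2 @ self.W2.t()  # (batch, r2)
        z3 = D3 @ self.W3.t()  # (batch, r3)
        Z = torch.einsum('bi,bj,bk->bijk', z1, z2, z3)
        return Z.flatten(1) @ self.R  # H: (batch, h)

# Usage: inputs are the per-modality representations, each padded with a 1.
layer = MRRFLayer(d1=33, d2=129, d3=33, r1=4, r2=8, r3=4, h=64)
H = layer(torch.randn(2, 33), torch.randn(2, 129), torch.randn(2, 33))
print(H.shape)  # torch.Size([2, 64])
```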

The original tensor fusion approach proposed in (Zadeh et al., 2017) flattened the tensor D, which results in a loss of the information included in the tensor structure; this is avoided in our approach. Liu et al. (2018) developed a similar approach to ours using a diagonal core tensor R, losing much inter-modality information. Our non-diagonal core tensor maintains key inter-modality information after compression.

Note that the factorization step is task dependent, included in the deep network structure, and learned during network training. Thus, for follow-up learning tasks, we would learn a new factorization specific to the task at hand, typically also estimating optimal compression ratios as described in the discussion section. In this process, any shared, helpful information is retained, as demonstrated by our results.

Analysis of parameter complexity: Following our proposed approach, we have decomposed the trainable tensor W into four substantially smaller trainable matrices (W1, W2, W3, R), leading to O(Σ_{i=1}^{M} di · ri + ∏_{i=1}^{M} ri · h) parameters. Concat fusion (CF) leads to a layer of size O(Σ_{i=1}^{M} di) and O(Σ_{i=1}^{M} di · h) parameters. The tensor fusion approach (TF) leads to a layer of size O(∏_{i=1}^{M} di) and O(∏_{i=1}^{M} di · h) parameters. The LMF approach (Liu et al., 2018) requires training O(Σ_{i=1}^{M} r · h · di) parameters, where r is the rank used for all the modalities. It can be seen that the number of parameters in the proposed approach is substantially smaller than for the simple tensor fusion (TF) approach and comparable to the LMF approach.
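To make these orders concrete, a quick count under example dimensions (the sizes below are illustrative assumptions, not the paper's actual layer sizes):

```python
# M = 3 modalities with (padded) sizes d, output size h,
# MRRF ranks r, and a shared LMF rank r_lmf.
d, h, r, r_lmf = (33, 301, 75), 64, (4, 8, 4), 8

tf = d[0] * d[1] * d[2] * h                                    # tensor fusion
mrrf = sum(di * ri for di, ri in zip(d, r)) + r[0] * r[1] * r[2] * h
lmf = sum(r_lmf * h * di for di in d)                          # shared rank

print(f'TF: {tf:,}  MRRF: {mrrf:,}  LMF: {lmf:,}')
# TF: 47,678,400  MRRF: 11,032  LMF: 209,408
```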

4 Experimental Setup

4.1 Datasets

We perform our experiments on the following multimodal datasets: CMU-MOSI (Zadeh et al., 2016), POM (Park et al., 2014), and IEMOCAP (Busso et al., 2008) for sentiment analysis, speaker trait recognition, and emotion recognition, respectively. These tasks can be performed by integrating both the verbal and nonverbal behaviors of the speakers.

The CMU-MOSI dataset is annotated on a seven-step scale (highly negative, negative, weakly negative, neutral, weakly positive, positive, highly positive), which can be treated as a 7-class classification problem with labels in the range [−3, +3]. The dataset consists of 2199 annotated opinion utterances from 93 distinct YouTube movie reviews, each containing several opinion segments. Segments average 4.2 seconds in length.

The POM dataset is composed of 903 movie review videos. Each video is annotated with the following speaker traits: confident, passionate, voice pleasant, dominant, credible, vivid, expertise, entertaining, reserved, trusting, relaxed, outgoing, thorough, nervous, persuasive, and humorous.

The IEMOCAP dataset is a collection of 151 videos of recorded dialogues, with 2 speakers per session, for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disgust, and neutral).

Each dataset consists of three modalities, namely language, visual, and acoustic. The visual and acoustic features are calculated by taking the average of their feature values over the word time interval (Chen et al., 2017). In order to perform time alignment across modalities, the three modalities are aligned at the word level using P2FA (Yuan and Liberman, 2008).

[Figure 1: Diagram of Modality-based Redundancy Reduction Multimodal Fusion (MRRF): input representations are fused into a tensor D (d1 × d2 × d3), compressed by the modality-specific matrices W1, W2, W3 into Z (r1 × r2 × r3), flattened, and multiplied by the core R (r1r2r3 × h) to produce the hidden layer and output of size h.]

Pre-trained 300-dimensional GloVe word embeddings (Chen et al., 2017) were used to extract the language feature representations, encoding a sequence of transcribed words into a sequence of word vectors.

Visual features for each frame (sampled at 30 Hz) are extracted using the Facet library¹, which includes 20 facial action units, 68 facial landmarks, head pose, gaze tracking, and HOG features (Zhu et al., 2006).

The COVAREP acoustic analysis framework (Degottex et al., 2014) is used to extract low-level acoustic features, including 12 Mel-frequency cepstral coefficients (MFCCs), pitch, voiced/unvoiced segmentation, glottal source, peak slope, and maxima dispersion quotient features.

To evaluate model generalization, all datasets are split into training, validation, and test sets such that the splits are speaker independent, i.e., no speakers from the training set are present in the test sets. Table 1 shows the data splits for all the datasets in detail.

¹ goo.gl/1rh1JN

Dataset   CMU-MOSI   IEMOCAP   POM
Level     Segment    Segment   Video
Train     1284       6373      600
Valid     229        1775      100
Test      686        1807      203

Table 1: The speaker-independent data splits for the training, validation, and test sets.

4.2 Model Architecture

Similarly to Liu et al. (2018), we use a simple model architecture for extracting the representations for each modality. We use three unimodal sub-embedding networks to extract representations za, zv, and zl for each modality, respectively. For the acoustic and visual modalities, the sub-embedding network is a simple 2-layer feed-forward neural network, and for language, we use a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997).
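A sketch of these unimodal sub-embedding networks (layer sizes are illustrative assumptions, not the paper's tuned values):

```python
import torch
import torch.nn as nn

class FFSubNet(nn.Module):
    """2-layer feed-forward sub-embedding for acoustic/visual features."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out), nn.ReLU())

    def forward(self, x):            # x: (batch, d_in)
        return self.net(x)

class LSTMSubNet(nn.Module):
    """LSTM sub-embedding for the word sequence; returns the final state."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lstm = nn.LSTM(d_in, d_out, batch_first=True)

    def forward(self, x):            # x: (batch, seq_len, d_in)
        _, (h_n, _) = self.lstm(x)
        return h_n.squeeze(0)        # (batch, d_out)

za = FFSubNet(74, 32, 32)(torch.randn(2, 74))       # acoustic features
zl = LSTMSubNet(300, 128)(torch.randn(2, 20, 300))  # GloVe word vectors
print(za.shape, zl.shape)  # torch.Size([2, 32]) torch.Size([2, 128])
```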

We tuned the layer sizes, learning rates, and compression rates by grid search, checking the mean absolute error on the validation set. We trained our model using the Adam optimizer (Kingma and Ba, 2014). All models were implemented with PyTorch (Paszke et al., 2017).


Dataset        CMU-MOSI                              POM                  IEMOCAP
Metric    MAE    Corr   Acc-2  F1     Acc-7    MAE    Corr   Acc    F1-Happy  F1-Sad  F1-Angry  F1-Neutral

CF        1.140  0.52   72.3   72.1   26.5     0.865  0.142  34.1   81.1      81.2    65.1      44.1
TFN       0.970  0.633  73.9   73.4   32.1     0.886  0.093  31.6   83.6      82.8    84.2      65.4
LMF       0.912  0.668  76.4   75.7   32.8     0.796  0.396  42.8   85.8      85.9    89.0      71.7
MRRF      0.912  0.772  77.46  76.73  33.02    0.69   0.44   43.02  87.71     85.9    90.02     73.7

Table 2: Results for sentiment analysis on CMU-MOSI, emotion recognition on IEMOCAP, and personality trait recognition on POM. (CF, TFN, and LMF stand for concat, tensor, and low-rank fusion, respectively.)

5 Experimental Results and Comparison with the State-of-the-art

We compared our proposed method with three baseline methods. Concat fusion (CF) (Baltrusaitis et al., 2018) is a simple concatenation of the different modalities followed by a linear combination. The tensor fusion approach (TFN) (Zadeh et al., 2017) computes a tensor including uni-modal, bi-modal, and tri-modal combination information. LMF (Liu et al., 2018) is a tensor fusion method that performs tensor factorization using the same rank for all the modalities in order to reduce the number of parameters. Our proposed method uses different factors for each modality.

In Table 2, we present the mean absolute error (MAE), the correlation between predicted and true scores, binary accuracy (Acc-2), multi-class accuracy (Acc-7), and F1 measure. The proposed approach outperforms the baseline approaches on nearly all metrics, with marked improvements in Happy and Neutral recognition. The reason is that the inter-modality information for these emotions is more complicated than for the other emotions and requires a non-diagonal core tensor to extract it. It is worth noting that in the equivalent setting, with equal ranks for all the modalities, the result of the proposed method is always marginally better than that of the LMF method.

5.1 Investigating the Effect of Compression Rate on Each Modality

In this section, we aim to investigate the amount of redundant information in each modality. To do this, after obtaining a tensor which includes the combinations of all modalities with equivalent sizes, we factorize a single dimension of the tensor while keeping the sizes for the other modalities fixed. By observing how the performance changes with the compression rate, one can find how much redundant information is contained in the corresponding modality relative to the other modalities.
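The sweep can be expressed as a simple loop; in this sketch, `train_and_eval` is a hypothetical stand-in for training the full model with the given ranks and returning validation accuracy:

```python
def train_and_eval(ranks):
    # Hypothetical stub: train MRRF with these modality ranks and
    # return validation accuracy. Replace with a real training run.
    return 0.0

base_ranks = {'audio': 32, 'video': 32, 'language': 32}  # illustrative sizes

# Compress one modality at a time, keeping the other two fixed.
for modality in ('audio', 'video', 'language'):
    for r in (1, 2, 4, 8, 16, 24, 32):
        ranks = dict(base_ranks, **{modality: r})
        acc = train_and_eval(ranks)
        print(modality, r, acc)
```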

The results can be seen in Figures 2, 3, and 4. The horizontal axis is the compressed size and the vertical axis shows the accuracy for each modality. Note that, due to the padding of each Di with 1, we have used ri + 1 as the new embedding size.

The first point that can be perceived clearly from the different modality diagrams is that each of the modalities changes in a different way when compressed, which means they each have a different amount of information that cannot be compensated by the non-compressed modalities. In other words, a high accuracy when a modality is highly compressed means that there is a lot of redundant information in this modality: the information loss resulting from factorization can be compensated by the other modalities, so performance is not reduced.

[Figure 2: CMU-MOSI sentiment analysis dataset: effect of different compression rates on accuracy for single modalities.]

Fig. 2 shows results for the CMU-MOSI sentiment analysis dataset. For this dataset, a notable decrease in accuracy can be seen when compressing the video modality, while the audio and text modalities are not notably sensitive to compression. This shows that for sentiment analysis on the CMU-MOSI dataset, the information in the video modality cannot be compensated by the other modalities, while most information in the audio and language modalities is covered by the video modality. In other words, the video contains essential information for this task, whereas information from audio and language can be recovered from video.

[Figure 3: POM personality recognition dataset: effect of different compression rates on accuracy for single modalities.]

Fig. 3 shows the average accuracy over the 16 personality traits for the POM personality trait recognition dataset. For this dataset also, each of the modalities behaves differently under different compression rates. We can see that the audio modality includes more non-redundant information for personality recognition, as accuracy is highly affected by audio compression. In addition, there is a notable accuracy reduction when the language modality is highly compressed, indicating that it too carries some non-redundant information for this task. Note that the POM data does not contain sufficient information for an effective analysis of the 16 personality subtypes individually.

[Figure 4: IEMOCAP emotion recognition dataset: effect of different compression rates on accuracy for single modalities, with separate panels for the sad, angry, happy, and neutral categories.]

Fig. 4 shows the results for the IEMOCAP emotion recognition dataset for each of the four emotional categories: happy, angry, sad, and neutral. Looking at the sad category, we see a notable accuracy reduction for small sizes (high compression) for all the modalities, showing that each contains at least some non-redundant information. However, high compression of the audio and especially the language modality results in a strong accuracy reduction, whereas video compression results in a relatively minor reduction. It can be concluded that for this emotion, the language modality has the most non-redundant information and the video modality very little: its information can be compensated by the other two modalities. Moving on to the angry emotion, small sizes (high compression) result in accuracy reduction for the audio and language modalities, showing that they contain some non-redundant information, with the audio modality containing more. Again, the information in video can be almost completely compensated by the other two modalities.

By comparing the highest accuracy values across the emotion categories, we observe that neutral is hard to predict in comparison to the other categories. Again, the audio and language modalities both include non-redundant information, leading to a severe accuracy reduction under high compression of these modalities, with video containing almost no information not compensated by audio and language.

The happy category is the easiest emotion to predict; accuracy suffers only slightly for very small sizes of the audio, video, and language modalities, indicating a small amount of non-redundant information in all modalities.

6 Conclusion

We proposed a tensor fusion method for multimodal media analysis, obtaining an (M + 1)-way tensor to consider the high-order relationships between M input modalities and the output layer. Our modality-based factorization method removes the redundant information in this high-order dependency structure and leads to fewer parameters with minimal loss of information. In addition, the modality-based factorization approach helps to understand the relative quantities of non-redundant information in each modality through investigating sensitivity to modality-specific compression rates. As the proposed compression method leads to a less complicated model, it can be applied as a regularizer that avoids overfitting.

We have provided experimental results for combining the acoustic, text, and visual modalities for three different tasks: sentiment analysis, personality trait recognition, and emotion recognition. We have seen that the modality-based tensor compression approach improves the results in comparison to the simple concatenation method, the tensor fusion method, and tensor fusion using the same factorization rank for all modalities, as proposed in the LMF method. In other words, the proposed method enjoys the same benefits as the tensor fusion method while avoiding its large number of parameters, which leads to a more complex model that needs many training samples and is more prone to overfitting. We have investigated the effect of the compression rate on single modalities while fixing the other modalities, helping to understand the amount of useful non-redundant information in each modality. Moreover, we have evaluated our method by comparing the results with state-of-the-art methods, achieving a 1% to 4% improvement across multiple measures for the different tasks.

In future work, we will investigate the relation between dataset size and compression rate by applying our method to larger datasets. This will help to understand the trade-off between model size and available training data, allowing more efficient training and avoiding under- and overfitting.

As the availability of data with more and more modalities increases, both finding a trade-off between cost and performance and effectively and efficiently utilizing the available modalities will be vital. Exploring compression methods promises to help identify and remove highly redundant modalities.

Acknowledgments

This work was partially funded by grants #16214415 and #16248016 of the Hong Kong Research Grants Council, ITS/319/16FP of the Innovation Technology Commission, and RDC 1718050-0 of EMOS.AI. Thanks to Dr. Ian D. Wood and Peyman Momeni for their helpful comments.

References

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.

J. Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283–319.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrusaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171. ACM.

Liyanage C. De Silva, Tsutomu Miyasato, and Ryohei Nakatsu. 1997. Facial emotion recognition using multi-modal information. In Proceedings of the 1997 International Conference on Information, Communications and Signal Processing (ICICS), volume 1, pages 397–401. IEEE.

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP: A collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.

Sidney K. D'mello and Jacqueline Kory. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR), 47(3):43.

Maryam Fazel. 2002. Matrix rank minimization with applications. Ph.D. thesis, Stanford University.

Michael Glodek, Stephan Tschechne, Georg Layher, Martin Schels, Tobias Brosch, Stefan Scherer, Markus Kachele, Miriam Schmidt, Heiko Neumann, Gunther Palm, et al. 2011. Multiple classifier systems for the classification of audio-visual emotional states. In Affective Computing and Intelligent Interaction, pages 359–368. Springer.

R. A. Harshman. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84.

Frank L. Hitchcock. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics, 6(1-4):164–189.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Alex Pappachen James and Belur V. Dasarathy. 2014. Medical image fusion: A survey of the state of the art. Information Fusion, 19:4–19.

Onno Kampman, Elham J. Barezi, Dario Bertero, and Pascale Fung. 2018. Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 606–611.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Zhen-Zhong Lan, Lei Bao, Shoou-I Yu, Wei Liu, and Alexander G. Hauptmann. 2014. Multimedia classification and event detection using double fusion. Multimedia Tools and Applications, 71(1):333–347.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.

Emilie Morvant, Amaury Habrard, and Stephane Ayache. 2014. Majority vote of diverse classifiers for late fusion. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 153–162. Springer.

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, New York, NY, USA. ACM.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57. ACM.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017a. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017b. Multi-level multiple attentions for contextual multimodal sentiment analysis. In 2017 IEEE International Conference on Data Mining (ICDM), pages 1033–1038. IEEE.

Gerasimos Potamianos, Chalapathy Neti, Guillaume Gravier, Ashutosh Garg, and Andrew W. Senior. 2003. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306–1326.

Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 160–170.

Mohammad Soleymani, Maja Pantic, and Thierry Pun. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing, 3(2):211–223.

Nitish Srivastava and Ruslan R. Salakhutdinov. 2012. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems 25, pages 2222–2230. Curran Associates, Inc.

Ledyard R. Tucker. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.

Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. In Thirty-Second AAAI Conference on Artificial Intelligence.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.

Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng, and Shai Avidan. 2006. Fast human detection using a cascade of histograms of oriented gradients. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1491–1498. IEEE.