Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 861–877, November 16–20, 2020. ©2020 Association for Computational Linguistics


Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!

Jack Hessel
Allen Institute for AI

[email protected]

Lillian Lee
Cornell University

[email protected]

Abstract

Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.

1 Introduction

Given the presumed importance of reasoning across modalities in multimodal machine learning tasks, we should evaluate a model’s ability to leverage cross-modal interactions. But such evaluation is not straightforward; for example, an early Visual Question-Answering (VQA) challenge was later “broken” by a high-performing method that ignored the image entirely (Jabri et al., 2016).

One response is to create multimodal-reasoning datasets that are specifically and cleverly balanced to resist language-only or visual-only models; examples are VQA 2.0 (Goyal et al., 2017), NLVR2 (Suhr et al., 2019), and GQA (Hudson and Manning, 2019).

[Figure 1 schematic: linear models and image+text ensembles are multimodally-additive models; more expressive models (kernel SVMs, neural nets, LXMERT) are mapped into this family via empirical multimodally-additive projection.]

Figure 1: We introduce EMAP, a diagnostic for classifiers that take in textual and visual inputs. Given a (black-box) trained model, EMAP computes the predictions of an image/text ensemble that best approximates the full model predictions via empirical function projection. Although the projected predictions lose visual-textual interaction signals exploited by the original model, they often perform surprisingly well.

However, a balancing approach is not always desirable. For example, if image+text data is collected from an online social network (such as for popularity prediction or sentiment analysis), post-hoc rebalancing may obscure trends in the original data-generating process. So, what alternative diagnostic tools are available for better understanding what models learn?

The main tool utilized by prior work is model comparison. In addition to comparing against text-only and image-only baselines, often, two multimodal models with differing representational capacity (e.g., a cross-modal attentional neural network vs. a linear model) are trained and their performance compared. The argument commonly made is that if model A, with greater expressive capacity, outperforms model B, then the performance differences can be at least partially attributed to that increased expressivity.

But is that a reliable argument?


Model performance comparisons are an opaque tool for analysis, especially for deep neural networks: performance differences versus baselines, frequently small in magnitude, can often be attributed to hyperparameter search schemes, random seeds, the number of models compared, etc. (Yogatama and Smith, 2015; Dodge et al., 2019). Thus, while model comparisons are an acceptable starting point for demonstrating whether or not a model is learning an interesting set of (or any!) cross-modal factors, they provide rather indirect evidence.

We propose Empirical Multimodally-Additive function Projection (EMAP)1 as an additional diagnostic for analyzing multimodal classification models. Instead of comparing two different models, a single multimodal classifier’s predictions are projected onto a less-expressive space: the result is equivalent to a set of predictions made by the closest possible ensemble of text-only and visual-only classifiers. The projection process is computationally efficient, has no hyperparameters to tune, can be implemented in a few lines of code, is provably unique and optimal, and works on any multimodal classifier: we apply it to models ranging from polynomial kernel SVMs to deep, pretrained, Transformer-based self-attention models.

We first verify that EMAPs do degrade performance for synthetic cases and for visual question answering cases where datasets have been specifically designed to require cross-modal reasoning. But we then examine a test suite of several recently-proposed multimodal prediction tasks that have not been specifically balanced in this way. We first achieve state-of-the-art performance for all of the datasets using a linear model. Next, we examine more expressive interactive models, e.g., pretrained Transformers, capable of cross-modal attention. While these models sometimes outperform the linear baseline, EMAP reveals that performance gains are (usually) not due to multimodal interactions being leveraged.

Takeaways: For future work on multimodal classification tasks, we recommend authors report the performance of: 1) unimodal baselines; 2) any multimodal models they consider; and, critically, 3) the empirical multimodally-additive projection (EMAP) of their best performing multimodal model (see §6 for our full recommendations).

1In §3, we more formally introduce additivity.

2 Related Work

Constructed multimodal classification tasks. In addition to image question answering/reasoning datasets already mentioned in §1, other multimodal tasks have been constructed, e.g., video QA (Lei et al., 2018; Zellers et al., 2019), visual entailment (Xie et al., 2018), hateful multimodal meme detection (Kiela et al., 2020), and tasks related to visual dialog (de Vries et al., 2017). In these cases, unimodal baselines are shown to achieve lower performance relative to their expressive multimodal counterparts.

Collected multimodal corpora. Recent computational work has examined diverse multimodal corpora collected from in-vivo social processes, e.g., visual/textual advertisements (Hussain et al., 2017; Ye and Kovashka, 2018; Zhang et al., 2018), images with non-literal captions in news articles (Weiland et al., 2018), and image/text instructions in cooking how-to documents (Alikhani et al., 2019). In these cases, multimodal classification tasks are often proposed over these corpora as a means of testing different theories from semiotics (Barthes, 1988; O’Toole, 1994; Lemke, 1998; O’Halloran, 2004, inter alia); unlike many VQA-style datasets, they are generally not specifically balanced to force models to learn cross-modal interactions.

Without rebalancing, should we expect cross-modal interactions to be useful for these multimodal communication corpora? Some semioticians posit: yes! Meaning multiplication (Barthes, 1988) between images and text suggests, as summarized by Bateman (2014):

under the right conditions, the value of a combination of different modes of meaning can be worth more than the information (whatever that might be) that we get from the modes when used alone. In other words, text ‘multiplied by’ images is more than text simply occurring with or alongside images.

Jones et al. (1979) provide experimental evidence of conditional, compositional interactions between image and text in a humor setting, concluding that “it is the dynamic interplay between picture and caption that describes the multiplicative relationship” between modalities. Taxonomies of the specific types of compositional relationships image-text pairs can exhibit have been proposed (Marsh and Domas White, 2003; Martinec and Salway, 2005).


Model Interpretability. In contrast to methods that design more interpretable algorithms for prediction (Lakkaraju et al., 2016; Ustun and Rudin, 2016), several researchers aim to explain the behavior of complex, black-box predictors on individual instances (Strumbelj and Kononenko, 2010; Ribeiro et al., 2016). The most related of these methods to the present work is Ribeiro et al. (2018), who search for small sets of “anchor” features that, when fixed, largely determine a model’s output prediction on the input points. While similar in spirit to ours and other methods that “control for” a fixed subset of features (e.g., Breiman (2001)), their work 1) focuses only on high-precision, local explanations on single instances; 2) doesn’t consider multimodal models (wherein feature interactions are combinatorially more challenging in comparison to unimodal models); and 3) wouldn’t guarantee consideration of multimodal “anchors.”

3 EMAP

Background. We consider models f that assign scores to textual-visual pairs (t, v), where t is a piece of text (e.g., a sentence), and v is an image.2 In multi-class classification settings, values f(t, v) ∈ Rd typically serve as per-class scores. In ranking settings, f(t1, v1) may be compared to f(t2, v2) via a ranking objective.

We are particularly interested in the types of compositionality that f uses over its visual and textual inputs to produce scores. Specifically, we distinguish between additive models (Hastie and Tibshirani, 1987) and interactive models (Friedman, 2001; Friedman and Popescu, 2008).3 A function f of a representation z of n input features is additive in I if it decomposes as

f(z) = fI(zI) + f\I(z\I),

where, writing the full feature index set as {1, . . . , n}, I indexes a subset of the features, \I denotes the complementary subset, and, for any subset S of indices, zS is the restriction of z to only those features whose indices appear in S.

In our case, because features are the union of Textual and Visual predictors, we say that a model

2The methods introduced here can be easily extended beyond just text/image pairs (e.g., to videos, audio, etc.), and to more than two modalities.

3The multimedia community commonly makes a related distinction between early fusion (joint multimodal processing) and late fusion (ensembles) (Snoek et al., 2005).

is multimodally additive if it decomposes as

f(t, v) = fT (t) + fV (v) (1)

Additive multimodal models are simply ensembles of unimodal classifiers and, as such, may be considered underwhelming by multiple communities. A recent ACL paper, for example, refers to such ensembles as “naive.” On the semiotics side, the conditionality implied by meaning multiplication (Barthes, 1988) — that the joint semantics of an image/caption depends non-linearly on its accompaniment — cannot be modeled additively: multimodally additive models posit that each image, independent of its text pairing, contributes a fixed score to per-class logits (and vice versa).

In contrast, multimodally interactive models are the set of functions that cannot be decomposed as in Equation 1 — that is, f’s output conditionally depends on its inputs in a non-additive fashion.

Machine learning models. One canonical multimodally additive model is a linear model trained over a concatenation of textual and visual features [t; v], i.e.,

f(t, v) = wᵀ[t; v] + b = wtᵀ t + wvᵀ v + b,    (2)

where fT (t) = wtᵀ t and fV (v) = wvᵀ v + b.
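To see the decomposition concretely, here is a tiny NumPy check (illustrative, not from the paper) that a linear scorer over concatenated features splits exactly into a text-only plus an image-only score; the feature dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
t, v = rng.standard_normal(768), rng.standard_normal(1792)   # text / image features
w, b = rng.standard_normal(768 + 1792), 0.3                  # linear model parameters

full_score = w @ np.concatenate([t, v]) + b                  # f(t, v)
f_T = w[:768] @ t                                            # text-only contribution
f_V = w[768:] @ v + b                                        # image-only contribution (+ bias)
assert np.isclose(full_score, f_T + f_V)                     # additive: f = f_T + f_V
```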

We later detail several multimodally interactive models, including multi-layer neural networks, polynomial kernel SVMs, pretrained Transformer attention-based models, etc. However, even though interactive models are theoretically capable of modeling a more expressive set of relationships, it’s not clear that they will learn to exploit this additional expressivity, particularly when simpler patterns suffice.

Empirical multimodally-additive projections: EMAP. Given a fixed, trained model f (usually one theoretically capable of modeling visual-textual interactions) and a set of N labelled datapoints {(vi, ti, yi)}, i = 1, . . . , N, we aim to answer: does f utilize cross-modal feature interactions to produce more accurate predictions for these datapoints?

Hooker (2004) gives a general method for computing the projection of a function f onto a set of functions with a specified ANOVA structure (see also Sobol (2001); Liu and Owen (2006)): our algorithmic contributions are to extend the method to multimodal models with d > 1 dimensional outputs, and to prove that the multimodal empirical approximation is optimal.


Algorithm 1 Empirical Multimodally-Additive Projection (EMAP); worked example in Appendix G.

Input: a trained model f that outputs logits; a set of text/visual pairs D = {(ti, vi)}, i = 1, . . . , N.
Output: the predictions of f̃, the empirical projection of f onto the set of multimodally-additive functions, on D.

  fcache = 0_{N×N×d}; preds_f̃ = 0_{N×d}
  for (i, j) ∈ {1, 2, . . . , N} × {1, 2, . . . , N} do
      fcache(i, j) = f(ti, vj)
  end for
  µ ∈ Rd = (1/N²) Σ_{i,j} fcache(i, j)
  for i ∈ {1, 2, . . . , N} do
      proj_t = (1/N) Σ_{j=1}^{N} fcache(i, j)
      proj_v = (1/N) Σ_{j=1}^{N} fcache(j, i)
      preds_f̃[i] = proj_t + proj_v − µ
  end for
  return preds_f̃

Specifically, we are interested in f̃, the following projection of f onto the set of multimodally-additive functions:

f̃(t, v) = Ev[f(t, v)] + Et[f(t, v)] − Et,v[f(t, v)],

where the first term plays the role of fT (t), the second of fV (v), and the third is the constant µ. Here Ev[f(t, v)], a function of t, is the partial dependence of f on t, i.e., fT (t) = Ev[f(t, v)] = ∫ f(t, v) p(v) dv,4 and similarly for the other expectations. An empirical approximation of the partial dependence function for a given ti can be computed by looping over all observations:

f̂T (ti) = (1/N) Σ_{j=1}^{N} f(ti, vj).

We similarly arrive at f̂V (vi) and µ̂, yielding

f̃(ti, vi) = f̂T (ti) + f̂V (vi) − µ̂,    (3)

which is what EMAP, Algorithm 1, computes for each (ti, vi). Note that Algorithm 1 involves evaluating the original model f on all N² ⟨vi, tj⟩ pairs — even mismatched image/text pairs that do not occur in the observed data. In practice, we recommend only computing this projection over the evaluation set.5 Once the predictions of f and f̃ are computed over the evaluation points, their performance can be compared according to standard evaluation metrics, e.g., accuracy or AUC.
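To make the procedure concrete, here is a minimal NumPy sketch of Algorithm 1 (ours, not the authors’ released code); it assumes model_fn is any callable mapping a single text/image feature pair to a logit vector, and it materializes the full N×N×d cache, so it is only practical for modest evaluation sets:

```python
import numpy as np

def emap_predictions(model_fn, texts, images):
    """Empirical multimodally-additive projection (Algorithm 1 sketch).

    model_fn(t, v) -> logits of shape (d,) for a single text/image pair.
    texts, images: aligned sequences of length N.
    Returns an (N, d) array of projected logits for the observed pairs.
    """
    N = len(texts)
    # Evaluate the model on all N^2 text/image combinations,
    # including mismatched pairs that never co-occur in the data.
    cache = np.stack(
        [np.stack([model_fn(texts[i], images[j]) for j in range(N)]) for i in range(N)]
    )  # shape (N, N, d)

    mu = cache.mean(axis=(0, 1))     # overall mean logit, shape (d,)
    proj_t = cache.mean(axis=1)      # average over images: f̂_T(t_i), shape (N, d)
    proj_v = cache.mean(axis=0)      # average over texts:  f̂_V(v_j), shape (N, d)
    return proj_t + proj_v - mu      # projected logits for the paired (t_i, v_i)
```

Comparing standard metrics (accuracy, AUC) computed from these projected logits against the metrics from the original logits quantifies how much of the model’s accuracy depends on cross-modal interactions.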

4Both Friedman (2001) and Hooker (2004) argue that this expectation should be taken over the marginal p(v) rather than the conditional p(v|t).

5When even restricting to the evaluation set would be too expensive, as in the case of the R-POP data we experiment with later, one can repeatedly run EMAP on randomly-drawn 500-instance (say) samples from the test set.
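The subsampling strategy from footnote 5 can be expressed as a short loop over the sketch above; the sample size, number of repeats, and metric here are illustrative, not the authors’ exact protocol:

```python
import numpy as np

def subsampled_emap_accuracy(model_fn, texts, images, labels,
                             sample_size=500, n_repeats=15, seed=0):
    """Average accuracy of EMAP-projected predictions over random subsamples."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        idx = rng.choice(len(texts), size=sample_size, replace=False)
        sub_t = [texts[i] for i in idx]
        sub_v = [images[i] for i in idx]
        projected = emap_predictions(model_fn, sub_t, sub_v)   # (sample_size, d)
        preds = projected.argmax(axis=1)
        accs.append((preds == np.asarray(labels)[idx]).mean())
    return float(np.mean(accs))
```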

In Appendix A, we prove that the (mean-centered) sum of empirical partial dependence functions is optimal with respect to squared error, that is:

Claim. Subject to the constraint that f̃ is multimodally additive, Algorithm 1 produces a unique and optimal solution to

argmin over f̃ values of  Σ_{i,j} ‖f(ti, vj) − f̃(ti, vj)‖₂².    (4)

4 Sanity Check: EMAP Hurts in Tasks that Require Interaction

In §5.1, we will see that EMAP provides a very strong baseline for “unbalanced” multimodal classification tasks. But first, we seek to verify that EMAP degrades model performance in cases that are designed to require cross-modal interactions.

Synthetic data

We generate a set of “visual”/“textual”/label data (v, t, y) according to the following process:6

1. Sample random projections V ∈ R^{d1×d} and T ∈ R^{d2×d} from U(−.5, .5).
2. Sample v, t ∈ R^d ∼ N(0, 1); normalize to unit length.
3. If |v · t| > δ, proceed; else, return to the previous step.
4. If v · t > 0, then y = 1; else y = 0.
5. Return the data point (V v, T t, y).

This function challenges models to learn whether or not the dot product of two randomly sampled vectors in d dimensions is positive or negative — a task that, by construction, requires modeling the multiplicative interaction of the features in these vectors. To complicate the task, the vectors are randomly projected to two “modalities” of different dimensions, d1 and d2, respectively.

We trained a linear model, a polynomial kernel SVM, and a feed-forward neural network on this data; the results are in Table 1. The linear model is additive and thus incapable of learning any meaningful pattern on this data. In contrast, the kernel SVM and feed-forward NN, interactive models, are able to fit the test data almost perfectly. However, when we apply EMAP to the interactive models, as expected, their performance drops to random.

6We sample 5K points in an 80/10/10 train/val/test split with ⟨d, d1, d2, δ⟩ = ⟨100, 2000, 1000, .25⟩, though similar results were obtained with different parameter settings.
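A minimal sketch of this generative process, using the parameter settings from footnote 6 (the function and variable names are ours, not the authors’):

```python
import numpy as np

def make_synthetic_pair(rng, V, T, delta=0.25):
    """Sample one (visual, textual, label) triple from the synthetic process."""
    d = V.shape[1]
    while True:
        v = rng.standard_normal(d); v /= np.linalg.norm(v)
        t = rng.standard_normal(d); t /= np.linalg.norm(t)
        if abs(v @ t) > delta:               # reject near-orthogonal pairs
            y = int(v @ t > 0)               # label = sign of the dot product
            return V @ v, T @ t, y           # project into the two "modalities"

rng = np.random.default_rng(0)
d, d1, d2 = 100, 2000, 1000
V = rng.uniform(-0.5, 0.5, size=(d1, d))
T = rng.uniform(-0.5, 0.5, size=(d2, d))
data = [make_synthetic_pair(rng, V, T) for _ in range(5000)]
```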


            Linear (A)   Poly (I)   NN (I)
Test Acc      52.8%       99.6%     99.0%
 + EMAP       52.8%       49.4%     53.8%

Table 1: Prediction accuracy on the synthetic dataset using additive (A) models, interactive (I) models, and their EMAP projections. Random guessing achieves 50% accuracy. Under EMAP, the interactive models degrade to (close to) random, as desired. See §5 for training details.

Balanced VQA Tasks

Our next sanity check is to verify that EMAP hurts the performance of interactive models on two real multimodal classification tasks that are specifically balanced to require modeling cross-modal feature interactions: VQA 2.0 (Goyal et al., 2017) and GQA (Hudson and Manning, 2019).

First, we fine-tuned LXMERT (Tan and Bansal, 2019), a multimodally-interactive, pretrained, 14-layer Transformer model (see §5 for a full description) that achieves SOTA on both datasets. The LXMERT authors frame question-answering as a multi-class image/text-pair classification problem — 3.1K candidate answers for VQA2, and 1.8K for GQA. In Table 2, we compare, in cross-validation, the means of: 1) accuracy of LXMERT, 2) accuracy of the EMAP of LXMERT, and 3) accuracy of simply predicting the most common answer for all questions. As expected, EMAP decreases accuracy on VQA2/GQA by 30/19 absolute accuracy points, respectively: this suggests LXMERT is utilizing feature interactions to produce more accurate predictions on these datasets. On the other hand, performance of LXMERT’s EMAP remains substantially better than constant prediction, suggesting that LXMERT’s logits do nonetheless leverage some unimodal signals in this data.7

5 “Unbalanced” Datasets + Tasks

We now return to our original setting: multimodal classification tasks that have not been specifically formulated to force cross-modal interactions. Our goal is to explore what additional insights EMAP can add on top of standard model comparisons.

We consider a suite of 7 tasks, summarized in Table 3.

7EMAPed LXMERT’s performance is comparable to LSTM-based, text-only models, which achieve 44.3/41.1 accuracy on the full VQA2/GQA test set, respectively.

        LXMERT   →EMAP   Const Pred
VQA2     70.3     40.5      23.4
GQA      60.3     41.0      18.1

Table 2: As expected, for VQA2 and GQA, the mean accuracy of LXMERT is substantially higher than its empirical multimodally additive projection (EMAP). Shown are averages over k = 15 random subsamples of 500 dev-set instances.

These tasks span a wide variety of goals, sizes, and structures: some aim to classify semiotic properties of image+text posts, e.g., examining the extent of literal image/text overlap (I-SEM, I-CTX, T-VIS); others are post-hoc annotated according to taxonomies of social interest (I-INT, T-ST1, T-ST2); and one aims to directly predict community response to content (R-POP).8 In some cases, the original authors emphasize the potentially complex interplay between image and text: Hessel et al. (2017) wonder if “visual and the linguistic interact, sometimes reinforcing and sometimes counteracting each other’s individual influence;” Kruk et al. (2019) discuss meaning multiplication, emphasizing that “the text+image integration requires inference that creates a new meaning;” and Vempala and Preotiuc-Pietro (2019) attribute performance differences between their “naive” additive baseline and their interactive neural model to the notion that “both types of information and their interaction are important to this task.”

Our goal is not to downplay the importance of these datasets and tasks; it may well be the case that conditional, compositional interactions occur between images and text, but the models we consider do not yet take full advantage of them (we return to this point in item 6). Our goal, rather, is to provide diagnostic tools that can offer additional clarity on the remaining shortcomings of current models and reporting schemes.

Additive and Unimodal Models

The additive model we consider is the linear model from Equation 2, trained over the concatenation of image and text representations.9

8In Appendix B, we include descriptions of our reproduction efforts for each dataset/task, but please see the original papers for fuller descriptions of the data/tasks.

9For all linear models in this work, we select optimal hyperparameters according to grid search, optimizing validation model performance for each cross-validation split separately. We optimize: regularization type (L1, L2, vs. L1/L2), regularization strength (10**(-7,-6,-5,-4,-3,-2,-1,0,1.0)), and loss type (logistic vs. squared hinge). We train models using lightning (Blondel and Pedregosa, 2016).


Original paper                        Task (structure)                           Abbrv.   # image+text pairs we recovered
Kruk et al. (2019)                    Instagram intent (7-way classification)    I-INT    1.3K
                                      Instagram semiotic (3-way clf)             I-SEM    1.3K
                                      Instagram contextual (3-way clf)           I-CTX    1.3K
Vempala and Preotiuc-Pietro (2019)    Twitter visual-ness (4-way clf)            T-VIS    3.9K
Hessel et al. (2017)                  Reddit popularity (pairwise ranking)       R-POP    87.2K
Borth et al. (2013)                   Twitter sentiment (binary clf)             T-ST1    .6K
Niu et al. (2016)                     Twitter sentiment (binary clf)             T-ST2    4.0K

Table 3: The tasks we consider are not specifically balanced to force the learning of cross-modal interactions.

To represent images, we extract a feature vector from a pretrained EfficientNet B4 model10 (Tan and Le, 2019). To represent text, we extract RoBERTa (Liu et al., 2019) token features, and mean pool.11 Our single-modal baselines are linear models fit over EfficientNet/RoBERTa features directly.
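As an illustration of this kind of feature extraction (a sketch, not the authors’ pipeline — it assumes the timm and Hugging Face transformers packages and the efficientnet_b4 / roberta-base checkpoints):

```python
import timm
import torch
from transformers import AutoModel, AutoTokenizer

image_encoder = timm.create_model("efficientnet_b4", pretrained=True, num_classes=0)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_encoder = AutoModel.from_pretrained("roberta-base")

@torch.no_grad()
def extract_features(image_tensor, sentence):
    """image_tensor: (1, 3, H, W) preprocessed image; sentence: str."""
    v = image_encoder(image_tensor).squeeze(0)                 # pooled CNN features
    tokens = tokenizer(sentence, return_tensors="pt", truncation=True)
    hidden = text_encoder(**tokens).last_hidden_state          # (1, L, 768)
    t = hidden.mean(dim=1).squeeze(0)                          # mean-pool token features
    return t, v
```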

Interactive Models

Kernel SVM. We train an SVM with a polynomial kernel using RoBERTa text features and EfficientNet-B4 image features as input. A polynomial kernel endows the model with capacity to model multiplicative interactions between features.12
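A sketch of this baseline with scikit-learn, over precomputed features; the search grid mirrors footnote 12, but the exact training setup (cross-validation folds, scaling, etc.) is our guess:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_poly_svm(text_feats, image_feats, labels):
    """Polynomial-kernel SVM over concatenated unimodal features (hypothetical helper)."""
    X = np.hstack([text_feats, image_feats])
    param_grid = {
        "degree": [2, 3],                              # kernel degree
        "C": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],      # regularization strength
        "gamma": [1, 10, 100],
    }
    search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=3)
    search.fit(X, labels)
    return search.best_estimator_
```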

Neural Network. We train a feed-forward neural network using the RoBERTa/EfficientNet-B4 features as input. Following Chen et al. (2017), we first project text and image features via an affine transform layer to representations t and v, respectively, of the same dimension. Then, we extract new features, feeding the concatenated feature vector [t; v; v − t; v ⊙ t] to a multi-layer, feed-forward network.13
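A minimal PyTorch sketch of this fusion architecture (layer sizes and names here are illustrative; the authors tune depth, width, activation, and batch norm by grid search, per footnote 13):

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, text_dim, image_dim, hidden=256, n_classes=2):
        super().__init__()
        self.proj_t = nn.Linear(text_dim, hidden)    # affine projection of text features
        self.proj_v = nn.Linear(image_dim, hidden)   # affine projection of image features
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text_feats, image_feats):
        t = self.proj_t(text_feats)
        v = self.proj_v(image_feats)
        fused = torch.cat([t, v, v - t, v * t], dim=-1)   # [t; v; v - t; v ⊙ t]
        return self.mlp(fused)
```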

Fine-tuning a Pretrained Transformer. We fine-tuned LXMERT (Tan and Bansal, 2019) for our tasks. LXMERT represents images using 36 predicted bounding boxes.


10For reproducibility, we used ResNet-18 features for the Kruk et al. (2019) datasets; more detail in Appendix E.

11Feature extraction approaches are known to produce competitive results relative to full fine-tuning (Devlin et al., 2019, §5.3); in some cases, mean pooling has been found to be similarly competitive relative to LSTM pooling (Hessel and Lee, 2019).

12We again use grid search to optimize: polynomial kernel degree (2 vs. 3), regularization strength (10**(-5,-4,-3,-2,-1,0)), and gamma (1, 10, 100).

13Parameters are optimized with the Adam optimizer (Kingma and Ba, 2015). We decay the learning rate when validation loss plateaus. The hyperparameters optimized in grid search are: number of layers (2, 3, 4), initial learning rate (.01, .001, .0001), activation function (relu vs. gelu), hidden dimension (128, 256), and batch norm (use vs. don’t).

Each bounding box is assigned a feature vector by a Faster-RCNN model (Ren et al., 2015; Anderson et al., 2018). This model uses ResNet-101 (He et al., 2016) as a backbone and is pretrained on Visual Genome (Krishna et al., 2017). Bounding box features are fed through LXMERT’s 5-layer Transformer (Vaswani et al., 2017) model. Text is processed by a 9-layer Transformer. Finally, a 5-layer, cross-modal transformer processes the outputs of these unimodal encoders jointly, allowing for feature interactions to be learned. LXMERT’s parameters are pre-trained on several cross-modal reasoning tasks, e.g., masked image region/language token prediction, cross-modal matching, and visual question answering. LXMERT achieves high performance on balanced tasks like VQA 2.0: thus, we know this model can learn compositional, cross-modal interactions in some settings.14

LXMERT + logits: We also trained versions of LXMERT where fixed logits from a pretrained linear model (described above) are fed in to the final classifier layer. In this case, LXMERT is only tasked with learning the residual between the strong additive model and the labels. Our intuition was that this augmentation might enable the model to focus more closely on learning interactive, rather than additive, structure in the fine-tuning process.

5.1 Results

Our main prediction results are summarized in Table 4. For all tasks, the performance of our baseline additive linear model is strong, but we are usually able to find an interactive model that outperforms this linear baseline, e.g., in the case of T-ST2, a polynomial kernel SVM outperforms the linear model by 4 accuracy points.

14We follow the original authors’ fine-tuning recommendations, but also optimize the learning rate according to validation set performance for each cross-validation split/task separately between 1e-6, 5e-6, 1e-5, 5e-5, and 1e-4.


                          I-INT    I-SEM    I-CTX    T-VIS         R-POP     T-ST1    T-ST2
Metric                    AUC      AUC      AUC      Weighted F1   ACC       AUC      ACC
Cross-val Setup           5-fold   5-fold   5-fold   10-fold       15-fold   5-fold   5-fold
Constant Pred.            50.0     50.0     50.0     17.2          50.0      50.0     66.2
Prev. SOTA                85.3     69.1     78.8     44            62.7      N/A      70.5

Our image-only            73.6     56.5     61.0     47.2          59.1      73.3     67.2
Our text-only             89.9     71.8     81.2     37.6          61.1      69.0     73.1

Neural Network (I)        90.4     69.2     78.5     51.1          63.5      71.1     79.9
Polykernel SVM (I)        91.3     74.4     81.5     50.8          –         72.1     80.9
FT LXMERT (I)             83.0     68.5     76.3     53.0          63.0      66.4     78.6
 + Linear Logits (I)      89.9     73.0     80.7     53.4          64.1      75.5     80.3

Linear Model (A)          90.4     72.8     80.9     51.3          63.7      75.6     76.1
Our Best Interactive (I)  91.3     74.4     81.5     53.4          64.2      75.5     80.9
 + EMAP (A)               91.1     74.2     81.3     51.0          64.1      75.9     80.7

Table 4: Prediction results for 7 multimodal classification tasks. First block: the evaluation metric, setup, constant prediction performance, and previous state-of-the-art results (we outperform these baselines mostly because we use RoBERTa). Second block: the performance of our image-only/text-only linear models. Third block: the predictive performance of our (I)nteractive models. Fourth block: comparison of the performance of the best (I)nteractive model to the (A)dditive linear baseline. Crucially, we also report the EMAP of the best interactive model, which reveals whether or not the performance gains of the (I)nteractive model are due to modeling cross-modal interactions. Italics = computed using 15-fold cross-validation over each cross-validation split (see footnote 5). Bolded values are within half a point of the best model.

This observation alone seems to provide evidence that models are taking advantage of some cross-modal interactions for performance gains. Previous analyses might conclude here, arguing that cross-modal interactions are being utilized by the model meaningfully. But is this necessarily the case?

We utilize EMAP as an additional model diagnostic by projecting the predictions of our best-performing interactive models. Surprisingly, for I-INT, I-SEM, I-CTX, T-ST1, T-ST2, and R-POP, EMAP results in essentially no performance degradation. Thus, even (current) expressive interactive models are usually unable to leverage cross-modal feature interactions to improve performance. This observation would be obfuscated without the use of additive projections, even though we compared to a strong linear baseline that achieves state-of-the-art performance for each dataset. This emphasizes the importance of not only comparing to additive/linear baselines, but also to the EMAP of the best performing model.

In total, for these experiments, we observe a single case, LXMERT + Linear Logits trained on T-VIS, wherein modeling cross-modal feature interactions appears to result in noticeable performance increases — here, the EMAPed model is 2.4 F1 points worse.

How much does EMAP change a model’s predictions? For these datasets, a model and its EMAP usually make very similar predictions. For all datasets except T-VIS (where EMAP degrades performance), the best model and its EMAP agree on the most likely label in more than 95% of cases on average. For comparison, retraining the full models with different random seeds results in a 96% agreement on average. (Full results are in Appendix C.)
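The agreement statistic is just argmax agreement between the two sets of logits; a one-liner, shown here for completeness (assuming the emap_predictions sketch above produced projected_logits):

```python
import numpy as np

def emap_agreement(original_logits, projected_logits):
    """Fraction of evaluation points where the model and its EMAP pick the same label."""
    return float(
        (np.argmax(original_logits, axis=1) == np.argmax(projected_logits, axis=1)).mean()
    )
```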

6 Implication FAQs

Q1: What is your recommended experimental workflow for future work examining multimodal classification tasks?
A: With respect to automated classifiers, we recommend reporting the performance of:

1. A constant and/or random predictor,
   to provide perspective on the interplay between the label distribution and the evaluation metrics.
2. As-strong-as-possible single-modal models f(t, v) = g(t) and f(t, v) = h(v),
   to understand how well the task can be addressed unimodally with present techniques.
3. An as-strong-as-possible multimodally additive model f(t, v) = fT (t) + fV (v),
   to understand multimodal model performance without access to sophisticated cross-modal reasoning capacity.
4. An as-strong-as-possible multimodally interactive model, e.g., LXMERT,
   to push predictive performance as far as possible.
5. The EMAP of the strongest multimodally interactive model,
   to determine whether or not the best interactive model is truly using cross-modal interactions to improve predictive performance.

We hope this workflow can be extended with additional, newly developed model diagnostics going forward.

Q2: Should papers proposing new tasks be rejected if image-text interactions aren’t shown to be useful?
A: Not necessarily. The value of a newly proposed task should not depend solely on how well current models perform on it. Other valid ways to demonstrate dataset/task efficacy: human experiments, careful dataset design inspired by prior work, or real-world use-cases.

Q3: Can EMAP tell us anything fundamental about the type of reasoning required to address different tasks themselves?
A: Unfortunately, no more than model comparisons can (at least for real datasets). EMAP is a tool that, like model comparison, provides insights about how specific, fixed models perform on specific, fixed datasets. In FAQ 5, we attempt to bridge this gap in a toy setting where we are able to fully enumerate the sample space.

Q4: Can EMAP tell us anything about individual instances?
A: Yes, but with caveats. While a model’s behavior on individual cases is difficult to draw conclusions from, EMAP can be used to identify single instances within evaluation datasets for which the model is performing enough non-additive computation to change its ultimate prediction; i.e., for a given (t, v), it’s easy to compare f(t, v) to EMAP(f(t, v)): these correspond to the inputs and outputs, respectively, of Algorithm 1. An example instance-level qualitative evaluation of T-VIS is given in Appendix F.

Q5: Do the strong EMAP results imply that most functions of interest are actually multimodally additive?
Short A: No.
Long A: A skeptical reader might argue that, while multimodally-additive models cannot account for cross-modal feature interactions, such feature interactions may not be required for cross-modal reasoning. While we authors are not aware of an agreed-upon definition,15 we will assume that “cross-modal reasoning tasks” are those that challenge models to compute (potentially arbitrary) logical functions of multimodal inputs. Under this assumption, we show that additive models cannot fit (nor well-approximate) most non-trivial logical functions.

Consider the case of multimodal boolean target functions f(t, v) ∈ {0, 1}, and assume our image/text input feature vectors each consist of n binary features, i.e., t, v = ⟨t1, . . . , tn⟩, ⟨v1, . . . , vn⟩ where ti, vi ∈ {0, 1}. Our goal will be to measure how well multimodally additive models can fit arbitrary logical functions f. To simplify our analysis in this idealized case, we assume 1) access to a training set consisting of all 2^{2n} input vector pairs, and only measure training accuracy (vs. the harder task of generalization), and 2) “perfect” unimodal representations, in the sense that t, v contain all of the information required to compute f(t, v) (this is not the case for, e.g., CNN/RoBERTa features for actual tasks).

For very small cross-modal reasoning tasks, additive models can suffice. At n = 1, there are 16 possible binary functions f(t1, v1), and 14/16 can be perfectly fit by a function of the form fT (t1) + fV (v1) (the exceptions being XOR and XNOR). For n = 2, non-trivial functions are still often multimodally additively representable; an arguably surprising example is (t2 ∧ ¬v2) ∨ (t1 ∧ t2 ∧ v1) ∨ (¬t1 ∧ ¬v1 ∧ ¬v2).
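The n = 1 claim is easy to check by brute force; this sketch (our own, not from the paper) tests whether each of the 16 truth tables can be reproduced by thresholding an additive score fT(t1) + fV(v1), which for binary inputs is equivalent to a linear threshold function:

```python
from itertools import product

def additively_representable(truth_table):
    """truth_table: dict mapping (t1, v1) in {0,1}^2 to a label in {0,1}.
    Checks whether sign(w_t*t1 + w_v*v1 + bias) reproduces the table,
    searching a small integer grid (sufficient for two boolean inputs)."""
    grid = range(-2, 3)
    for w_t, w_v, bias in product(grid, grid, grid):
        if all((w_t * t + w_v * v + bias > 0) == bool(y)
               for (t, v), y in truth_table.items()):
            return True
    return False

inputs = list(product([0, 1], repeat=2))
tables = [dict(zip(inputs, labels)) for labels in product([0, 1], repeat=4)]
print(sum(additively_representable(tb) for tb in tables))  # prints 14: all but XOR/XNOR
```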

15Suhr et al. (2019), for example, do not define “visual reasoning with natural language,” but do argue that some tasks offer a promising avenue to study “reasoning that requires composition” via visual-textual grounding.

16As a lower bound on the number of variables required in a more realistic case, consider a cross-modal reasoning task where questions are posed regarding the counts of, say, 1000 object types in images. It’s likely that a separate image variable representing the count of each possible type of object would be required. While an oversimplification, for most cross-modal reasoning tasks, 2n = 6 variables is still relatively small.


But for cross-modal reasoning tasks with more variables,16 even in this toy setting, multimodally-additive models ultimately fall short: for the n = 3 case, we sampled over 5 million random logical circuits and none were perfectly fit by the additive models. To better understand how well additive models can approximate various logical functions, we fit two types of them for the n > 2 case: 1) the EMAP of the input function f directly, and 2) AdaBoost (Freund and Schapire, 1995) with the weak learners restricted to unimodal models.17 For a reference interactive model, we employ AdaBoost without the additivity restriction.

Figure 2: Performance (AUC) of additive models on fitting random logical circuits, as a function of the number of variables; curves shown for Ada-Full, Ada-Add, EMAP, and a random baseline.

Figure 2 plots the AUC of the 3 models on the training set; we sample 10K random logical circuits for each problem size. As the number of variables increases, the performance of the additive models quickly decreases (though full AdaBoost gets 100% accuracy in all cases). While these experiments are only for a toy case, they show that EMAPs, and additive models generally, have very limited capacity to compute or approximate logical functions of multimodal variables.

7 Conclusion and Future Work

The last question on our FAQ list in §6 leaves us with the following conundrum: 1) additive models are incapable of most cross-modal reasoning; but 2) for most of the unbalanced tasks we consider, EMAP finds an additive approximation that makes nearly identical predictions to the full, interactive model. We postulate the following potential explanations, pointing towards future work:

• Hypothesis 1: These unbalanced tasks don’t require complex cross-modal reasoning. This purported conclusion cannot account for gaps between human and machine performance: if an additive model underperforms relative to human judgment, the gap could plausibly be explained by cross-modal feature interactions. But even in cases where an additive model matches or exceeds human performance on a fixed dataset, additive models may still be insufficient.

17AdaBoost is chosen because it has strong theoretical guarantees to fit to training data; see Appendix D.

The mere fact that unimodal and additive models can often be disarmed by adding valid (but carefully selected) instances post hoc (as in, e.g., Kiela et al. (2020)) suggests that their inductive bias can simultaneously be sufficient for train/test generalization, but also fail to capture the spirit of the task. Future work would be well-suited to explore 1) methods for better understanding which datasets (and individual instances) can be rebalanced and which cannot; and 2) the non-trivial task of estimating additive human baselines to compare against.

• Hypothesis 2: Modeling feature interactions can be data-hungry. Jayakumar et al. (2020) show that feed-forward neural networks can require a very high number of training examples to learn feature interactions. So, we may need models with different inductive biases and/or much more training data. Notably, the feature interactions learned even in balanced cases are often not interpretable (Subramanian et al., 2019).

• Hypothesis 3: Paradoxically, unimodal models may be too weak. Without expressive enough single-modal processing methods, opportunities for learning cross-modal interaction patterns may not be present during training. So, improvements in unimodal modeling could feasibly improve feature interaction learning.

Concluding thoughts. Our hope is that future work on multimodal classification tasks reports not only the predictive performance of their best model + baselines, but also the EMAP of that model. EMAP (and related algorithms) has practical implications beyond image+text classification: there are straightforward extensions to non-visual/non-textual modalities, to classifiers using more than 2 modalities as input, and to single-modal cases where one wants to check for feature interactions between two groups of features, e.g., premise/hypothesis in NLI.

Acknowledgments

In addition to the anonymous reviewers and meta-reviewer, the authors would like to thank Ajay Divakaran, Giles Hooker, Jon Kleinberg, Julia Kruk, Tianze Shi, Ana Smith, Alane Suhr, and Gregory Yauney for the helpful discussions and feedback. We thank the NVidia Corporation for providing some of the GPUs used in this study. JH performed this work while at Cornell University. Partial support for this work was kindly provided by


a grant from The Zillow Corporation, and a Google Focused Research Award. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.

References

Malihe Alikhani, Sreyasi Nag Chowdhury, Gerard de Melo, and Matthew Stone. 2019. CITE: A corpus of image–text discourse relations. In NAACL.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Roland Barthes. 1988. Image-music-text. Macmillan.
John Bateman. 2014. Text and image: A critical introduction to the visual/verbal divide. Routledge.
Mathieu Blondel and Fabian Pedregosa. 2016. Lightning: large-scale linear classification, regression and ranking in Python. https://doi.org/10.5281/zenodo.200504.
Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM MM.
Leo Breiman. 2001. Random forests. Machine Learning, 45:5–32.
Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In EMNLP.
Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory. Springer.
Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5).
Jerome H. Friedman and Bogdan E. Popescu. 2008. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR.
Trevor Hastie and Robert Tibshirani. 1987. Generalized additive models: some applications. Journal of the American Statistical Association, 82(398):371–386.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
Jack Hessel and Lillian Lee. 2019. Something’s brewing! Early prediction of controversy-causing posts from discussion features. In NAACL.
Jack Hessel, Lillian Lee, and David Mimno. 2017. Cats and captions vs. creators and the clock: Comparing multimodal content to context in predicting relative popularity. In The Web Conference.
Giles Hooker. 2004. Discovering additive structure in black box functions. In KDD.
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR.
Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. 2017. Automatic understanding of image and video advertisements. In CVPR.
Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. 2016. Revisiting visual question answering baselines. In ECCV. Springer.
Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. 2020. Multiplicative interactions and where to find them. In ICLR.
James M. Jones, Gary Alan Fine, and Robert G. Brust. 1979. Interaction effects of picture and caption on humor ratings of cartoons. The Journal of Social Psychology, 108(2):193–198.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73.
Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019. Integrating text and image: Determining multimodal document intent in Instagram posts. In EMNLP.
Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. 2016. Interpretable decision sets: A joint framework for description and prediction. In KDD.
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2018. TVQA: Localized, compositional video question answering. In EMNLP.
Jay Lemke. 1998. Multiplying meaning. Reading science: Critical and functional perspectives on discourses of science, pages 87–113.
Ruixue Liu and Art B. Owen. 2006. Estimating mean dimensionality of analysis of variance decompositions. Journal of the American Statistical Association, 101(474):712–721.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Emily E. Marsh and Marilyn Domas White. 2003. A taxonomy of relationships between images and text. Journal of Documentation, 59(6):647–672.
Radan Martinec and Andrew Salway. 2005. A system for image–text relations in new (and old) media. Visual Communication, 4(3):337–371.
Teng Niu, Shiai Zhu, Lei Pang, and Abdulmotaleb El-Saddik. 2016. Sentiment analysis on multi-view social data. In MultiMedia Modeling, pages 15–27.
Kay O’Halloran. 2004. Multimodal discourse analysis: Systemic functional perspectives. A&C Black.
Michael O’Toole. 1994. The language of displayed art. Fairleigh Dickinson Univ Press.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. In Human Interpretability in Machine Learning Workshop at ICML.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI.
Cees G.M. Snoek, Marcel Worring, and Arnold W.M. Smeulders. 2005. Early versus late fusion in semantic video analysis. In ACM Multimedia.
Ilya M. Sobol. 2001. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55(1-3):271–280.
Erik Strumbelj and Igor Kononenko. 2010. An efficient explanation of individual classifications using game theory. JMLR.
Sanjay Subramanian, Sameer Singh, and Matt Gardner. 2019. Analyzing compositionality of visual question answering. In NeurIPS Workshop on Visually Grounded Interaction and Language (ViGIL).
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In ACL.
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.
Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
Berk Ustun and Cynthia Rudin. 2016. Supersparse linear integer models for optimized medical scoring systems. Machine Learning.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
Alakananda Vempala and Daniel Preotiuc-Pietro. 2019. Categorizing and inferring the relationship between the text and image of Twitter posts. In ACL.
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In CVPR.
Lydia Weiland, Ioana Hulpus, Simone Paolo Ponzetto, Wolfgang Effelsberg, and Laura Dietz. 2018. Knowledge-rich image gist understanding beyond literal meaning. Data & Knowledge Engineering, 117:114–132.
Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2018. Visual entailment task for visually-grounded language learning. arXiv preprint arXiv:1811.10582.
Keren Ye and Adriana Kovashka. 2018. Advise: Symbolism and external knowledge for decoding advertisements. In ECCV.
Dani Yogatama and Noah A. Smith. 2015. Bayesian optimization of text representations. In EMNLP.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR.
Mingda Zhang, Rebecca Hwa, and Adriana Kovashka. 2018. Equal but not the same: Understanding the implicit relationship between persuasive images and text. In The British Machine Vision Conference (BMVC).


A Theoretical Analysis of Algorithm 1

We assume we are given a (usually evaluation) dataset D = {⟨ti, vi⟩}, i = 1, . . . , n, and a trained model f that maps ⟨ti, vi⟩ pairs to a d-dimensional vector of scores. We seek a multimodally-additive function f̃ that matches the values of f on any ⟨ti, vj⟩ for which there exist v′, t′ such that ⟨ti, v′⟩ ∈ D and ⟨t′, vj⟩ ∈ D; that is, ⟨ti, vj⟩ represents any text-image pair we could construct if we decoupled the existing pairs in D.18

For simplicity, first assume that d = 1 (we handle the d > 1 case later). Since f̃ is multimodally-additive by assumption, there exist ft, fv such that f̃(ti, vj) = ft(ti) + fv(vj). Our goal is to find the “best” 2n values ft(ti), fv(vj), or — writing fij for f(ti, vj), τi for ft(ti), and φj for fv(vj) for notational convenience — to find τi, φj minimizing

L = (1/2) Σ_i Σ_j (fij − τi − φj)².    (5)

Claim 1. L is convex.

Proof. The first-order partial derivatives are:

∂L/∂τi = n·τi + Σ_j (φj − fij)
∂L/∂φj = n·φj + Σ_i (τi − fij)

and the Hessian H is

H = [ nI  1 ]
    [ 1   nI ],    I, 1 ∈ R^{n×n},

where 1 denotes the n × n all-ones matrix.

It suffices to show that H is positive semi-definite, i.e., for any z ∈ R^{2n}, zᵀHz ≥ 0. Indeed,

18Note that multimodally-additive models do not rely on particular ti, vj couplings, as this family of functions decomposes as f(ti, vj) = ft(ti) + fv(vj); thus, a multimodally-additive f should be able to fit any image-text pair we could construct from D, not just the image-text pairs we observe.

zᵀHz = Σ_{i=1}^{n} zi ( n·zi + Σ_{j=n+1}^{2n} zj ) + Σ_{j=n+1}^{2n} zj ( n·zj + Σ_{i=1}^{n} zi )
     = Σ_{i=1}^{n} ( n·zi² + Σ_{j=n+1}^{2n} zi·zj ) + Σ_{j=n+1}^{2n} ( n·zj² + Σ_{i=1}^{n} zj·zi )
     = Σ_{i=1}^{n} Σ_{j=n+1}^{2n} ( zi² + 2·zi·zj + zj² )
     = Σ_{i=1}^{n} Σ_{j=n+1}^{2n} (zi + zj)² ≥ 0.

Now, for the optimal solutions to our minimization problem, we can set the first-order partial derivatives to 0 and solve for our 2n parameters τi, φj. These solutions will correspond to global minima due to the convexity result we established above. It turns out to be equivalent to find solutions to:

H (τ1, . . . , τn, φ1, . . . , φn)ᵀ = ( Σ_{k=1}^{n} f1k, . . . , Σ_{k=1}^{n} fnk, Σ_{k=1}^{n} fk1, . . . , Σ_{k=1}^{n} fkn )ᵀ.    (6)

We can do this by finding one solution s to the above, and then analyzing the nullspace of H,


which will turn out to be the 1-dimensional subspace spanned by

r = ⟨1, 1, . . . , 1, −1, −1, . . . , −1⟩,

with n ones followed by n negative ones. That Hr = 0 can be verified by direct calculation. Then, all solutions will have the form s + cr for any c ∈ R.19

Claim 2. Algorithm 1 computes a solution to Equation 6 as a byproduct.

Proof. Algorithm 1 computes s = ⟨τi, φj⟩ as:

τi = (1/n) Σ_k fik − (1/n²) Σ_i Σ_j fij    (7)
φj = (1/n) Σ_k fkj    (8)

s is a solution to Equation 6, as can be verified by direct substitution.

Claim 3. The rank of H is 2n − 1, which impliesthat its nullspace is 1-dimensional.

Proof. Solutions to $Hx = \lambda x$ occur when:
$$\lambda = n: \quad \sum_{i=1}^{n} x_i = 0 \;\text{ and }\; \sum_{j=n+1}^{2n} x_j = 0$$
$$\lambda = 2n: \quad x = \mathbf{1}$$
$$\lambda = 0: \quad x = r$$
So, zero is an eigenvalue of $H$ with multiplicity 1, which shows that $H$'s rank is $2n - 1$.²⁰
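
These spectral claims can also be sanity-checked numerically. The following sketch is illustrative only (the choice $n = 5$ and the NumPy-based check are not part of the paper); it constructs $H$ and confirms that its eigenvalues are $0$ (once), $n$ (with multiplicity $2n - 2$), and $2n$ (once), so $H$ is positive semi-definite with rank $2n - 1$:

```python
import numpy as np

n = 5
I, J = np.eye(n), np.ones((n, n))
H = np.block([[n * I, J], [J, n * I]])         # the Hessian analyzed above

eigvals = np.sort(np.linalg.eigvalsh(H))
print(np.round(eigvals, 8))                    # 0 once, n (=5) with multiplicity 2n-2, 2n (=10) once

r = np.concatenate([np.ones(n), -np.ones(n)])  # the nullspace direction r
print(np.allclose(H @ r, 0))                   # -> True
print(np.linalg.matrix_rank(H))                # -> 2n - 1 = 9
```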

Because the nullspace of $H$ is 1-dimensional, all solutions to Equation 6 are given by $s + cr$ for any $c \in \mathbb{R}$. Returning to the notation of the original problem, we see that all optimal solutions are given by:

$$\tau_i = \frac{1}{n} \sum_k f_{ik} - \frac{1}{n^2} \sum_i \sum_j f_{ij} + c \qquad (9)$$
$$\phi_j = \frac{1}{n} \sum_k f_{kj} - c \qquad (10)$$

¹⁹ Proof: Assume that $s'$ is a solution of Equation 6. Then $s' - s$ will be in the nullspace of $H$. Clearly, $s' = s + (s' - s)$, so $s'$ can be written as $s + x$ for $x$ in the nullspace of $H$.

²⁰ We can make explicit the eigenbasis for the $\lambda = n$ solutions. Let $M \in \mathbb{R}^{n \times (n-1)}$ be $I_{n-1}$ with an additional row of $-1$ concatenated as the $n$th row. The eigenbasis is given by the columns of $\begin{bmatrix} M & 0 \\ 0 & M \end{bmatrix}$.

Claim 4. Algorithm 1 produces a unique solution for the values of $\hat{f}$.

Proof. We have shown that Algorithm 1 produces an optimal solution, and have derived the parametric form of all optimal solutions in Equation 9 and Equation 10. Note that Algorithm 1 outputs $\tau_i + \phi_j$ (rather than $\tau_i$ and $\phi_j$ individually). This cancels out the free choice of $c$. Thus, any algorithm that outputs optimal $\tau_i + \phi_j$ will have the same output as Algorithm 1.

Extension to multiple dimensions. So far, we have shown that Algorithm 1 produces an optimal and unique solution for Equation 4, but only in cases where $f_{ij}, \tau_i, \phi_j \in \mathbb{R}$. In general, we need to show the algorithm works for multi-dimensional outputs, too. The full loss function includes a summation over dimensions:

$$L = \frac{1}{2} \sum_i \sum_j \sum_d \left( f_{ijd} - \tau_{id} - \phi_{jd} \right)^2 \qquad (11)$$

This objective decouples entirely over the dimension $d$, i.e., the loss can be rewritten as:

$$\frac{1}{2} \Bigg( \underbrace{\sum_{ij} (f_{ij1} - \tau_{i1} - \phi_{j1})^2}_{L_1} + \underbrace{\sum_{ij} (f_{ij2} - \tau_{i2} - \phi_{j2})^2}_{L_2} + \dots + \underbrace{\sum_{ij} (f_{ijd} - \tau_{id} - \phi_{jd})^2}_{L_d} \Bigg)$$

Furthermore, notice that the parameters in $L_i$ are disjoint from the parameters in $L_j$ if $i \neq j$. Thus, to minimize the multidimensional objective in Equation 11, it suffices to minimize the objective for each $L_i$ independently, and recombine the solutions. This is precisely what Algorithm 1 does.
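
To make the per-dimension decoupling concrete, here is a minimal NumPy sketch of the projection defined by Equations 7 and 8 applied independently to each output dimension. It assumes the full model has already been evaluated on every decoupled pair, with the scores stored in an array of shape (n, n, d); it is an illustration of the math above, not the released implementation of Algorithm 1.

```python
import numpy as np

def emap_from_full_grid(F):
    """F[i, j, :] holds the full model's d-dimensional scores for text t_i paired
    with image v_j. Returns the projected scores tau_i + phi_i for the observed
    pairs (t_i, v_i), one row per instance."""
    mu = F.mean(axis=(0, 1))       # overall mean score, one value per dimension
    tau = F.mean(axis=1) - mu      # Equation 7, applied to every dimension; shape (n, d)
    phi = F.mean(axis=0)           # Equation 8, applied to every dimension; shape (n, d)
    return tau + phi               # multimodally-additive predictions, shape (n, d)
```

Because each dimension is handled by the same row-wise/column-wise averaging, the loop over dimensions collapses into array broadcasting.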

B Dataset Details and Reproducibility Efforts

B.1 I-INT, I-SEM, I-CTX

This data is available from https://github.com/karansikka1/documentIntent_emnlp19. We use the same 5 random splits provided by the authors for evaluation. The authors provide ResNet18 features, which we use for our non-LXMERT experiments instead of EfficientNet-B4 features. After we contacted the authors, they extracted bottom-up-top-down FasterRCNN features for us, so we were able to compare to LXMERT. State-of-the-art performance numbers are derived from the above github repo; these differ slightly from the values reported in the original paper because the github versions are computed without image data augmentation.

B.2 T-VIS

This data is available from https://github.com/danielpreotiuc/text-image-relationship/. The raw images are not available, so we queried the Twitter API for them. The corpus initially has 4472 tweets, but we were only able to re-collect 3905 tweets (87%) when we re-queried the API. Tweets can be missing for a variety of reasons, e.g., the tweet being permanently deleted, or the account's owner making their account private at the time of the API request. A handful of tweets were available, but were missing images when we tried to re-collect them. This can happen when the image is a link to an external page, and the image is deleted from the external page.

B.3 R-POP

This data is available from http://www.cs.cornell.edu/~jhessel/cats/cats.html. We use just the pics subreddit data. We attempted to re-scrape the pics images from the imgur URLs, and were able to re-collect 87215/88686 of the images (98%). Images can be missing if they have been, e.g., deleted from imgur. We removed any pairs with missing images from the ranking task; we trained on 42864/44343 (97%) of the original pairs. The data is distributed with training/test splits. From the training set for each split, we reserve 3K pairs for validation. The state-of-the-art performance numbers are taken from the original releasing work.

B.4 T-ST1

This data is available from http://www.ee.columbia.edu/ln/dvmm/vso/download/twitter_dataset.html and consists of 603 tweets (470 positive, 133 negative). The authors distribute data with 5 folds pre-specified for cross-validation performance reporting. However, we note that the original paper's best model achieves 72% accuracy in this setting, but a constant prediction baseline achieves higher performance: 470/(470+133) = 78%. Note that the constant prediction baseline likely performs worse according to metrics other than accuracy, but only accuracy is reported. We attempted to contact the authors of this study but did not receive a reply. We also searched for additional baselines for this dataset, but were unable to find additional work that uses this dataset in the same fashion. Thus, given the small size of the dataset, the lack of a reliable measure of SOTA performance, and the label imbalance, we decided to report ROC AUC prediction performance.

B.5 T-ST2

This data is available from https://www.mcrlab.net/research/mvsa-sentiment-analysis-on-multi-view-social-data/. We use the MVSA-Single dataset because human annotators examine both the text and image simultaneously; we chose not to use MVSA-Multiple because its human annotators do not see the tweet's image and text at the same time. However, the dataset download link only comes with 4870 labels, instead of the 5129 described in the original paper. We contacted the authors of the original work about the missing data, but did not receive a reply.

We follow the preprocessing steps detailed in Xu and Mao (2017) to derive a training dataset. After preprocessing, we are left with 4041 data points, whereas prior work compares with 4511 points after preprocessing. The preprocessing consists of removing points whose label pairs are (positive, negative), (negative, positive), or (neutral, neutral), which we believe matches the description of the preprocessing in that work. We contacted the authors for details, but did not receive a reply. The state-of-the-art performance number for this dataset is from Xu et al. (2018).
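
As a small illustration, the filter might look like the sketch below; the (text label, image label) tuple ordering and the helper name are only our reading of the description above, not the original preprocessing script:

```python
# Drop the (positive, negative), (negative, positive), and (neutral, neutral)
# label combinations described above, keeping everything else.
DROPPED = {("positive", "negative"), ("negative", "positive"), ("neutral", "neutral")}

def keep_pair(text_label, image_label):
    return (text_label, image_label) not in DROPPED

print(keep_pair("positive", "neutral"))   # -> True
print(keep_pair("neutral", "neutral"))    # -> False
```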

C Are EMAPs just regularizers?

One reason why EMAPs may often offer strong performance is by acting as a regularizer: projecting to a less expressive hypothesis space may reduce variance/overfitting. But in many cases, the original model and its EMAP achieve similar predictive accuracy. This suggests two explanations: either the EMAPed version makes significantly different predictions with respect to the original model (e.g., because it is better regularized), but it happens that those differing predictions "cancel out" in terms of final prediction accuracy; or, the original, unprojected functions are quite close to additive anyway, and the EMAP doesn't change the predictions all that much.

                         I-INT    I-SEM    I-CTX    T-VIS        R-POP    T-ST1                      T-ST2
Metric                   AUC      AUC      AUC      Weighted F1  ACC      AUC                        ACC
Num Classes              7        3        3        4            2        2                          2
Setup                    5-fold   5-fold   5-fold   10-fold      15-fold  5-fold                     5-fold
Best Interactive         Poly     Poly     Poly     LXMERT       LXMERT   LXMERT                     Poly

Original Perf.           91.3     74.4     81.5     53.4         64.2     75.5                       80.9
Original EMAP            91.1     74.2     81.3     51.0         64.1     75.9                       80.7
DiffSeed Perf.           91.3     74.5     81.4     53.2         64.1     75.3                       81.3

Match Orig + EMAP        95.6     95.9     97.4     85.5         96.3     98.0                       96.7
Match Orig + DiffSeed    99.9     99.1     100.0    75.5         87.6     92.4                       97.9
% Inst. Orig. Better     51.2     52.0     51.5     55.2         51.2     5/12 cases (≈ 6/12 = 50%)  53.0

Table 5: Consistency results. The first block provides details about the task and the model that performed best on it. The second block gives the performance (italicized results represent cross-validation EMAP computation results; see footnote 5). The third block gives the percent of the time the original model's prediction is the same as for EMAP, and, for comparison, the percent of the time the original model's predictions match the identical model trained with a different random seed: in all cases except for T-VIS, the original model and the EMAP make the same prediction in more than 95% of cases. The final row gives the percent of instances (among instances for which the original model and the EMAP disagree) on which the original model is correct. Except for T-VIS, when the EMAP and the original model disagree, each is right around half the time.

We differentiate between these two hypotheses by measuring the percent of instances for which the EMAP makes a different classification prediction than the full model. Table 5 gives the results: in all cases except T-VIS, the match between EMAP and the original model is above 95%. For reference, we retrained the best-performing models with different random seeds, and measured the performance difference under this change.

When EMAP changes the predictions of the original model, does the projection generally change to a more accurate label, or a less accurate one? We isolate the instances where a label swap occurs, and quantify this using (num orig better) / (num orig better + num proj better). In most cases, projection improves and degrades performance roughly equally, at least according to this metric. For T-VIS, however, the original model is better in over 55% of cases; this is also reflected in the corresponding F-scores.
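
For concreteness, both consistency statistics in Table 5 can be computed as in the sketch below; the array names and the helper function are illustrative rather than taken from our experimental code, and aligned per-instance arrays of predicted and gold class labels are assumed:

```python
import numpy as np

def consistency_stats(orig_preds, emap_preds, labels):
    """Agreement rate between a model and its EMAP ("Match Orig + EMAP"), and the
    fraction of disagreements on which the original model is the correct one
    ("% Inst. Orig. Better")."""
    orig_preds, emap_preds, labels = map(np.asarray, (orig_preds, emap_preds, labels))
    match_rate = np.mean(orig_preds == emap_preds)
    swapped = orig_preds != emap_preds                      # label-swap instances
    num_orig_better = np.sum(swapped & (orig_preds == labels))
    num_proj_better = np.sum(swapped & (emap_preds == labels))
    frac_orig_better = num_orig_better / max(num_orig_better + num_proj_better, 1)
    return match_rate, frac_orig_better
```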

D Logic Experiment Details

In §6, we describe experiments using AdaBoost. We chose AdaBoost (Freund and Schapire, 1995) to fit the training set because of its strong convergence guarantees. In short: if AdaBoost can find a weak learner at each iteration (that is, if it can find a candidate classifier with above-random performance), it will be able to fit the training set. A more formal statement of AdaBoost's properties can be found in the original work.

The AdaBoost classifiers we consider use decision trees with a max depth of 15 as base estimators. We choose a relatively large depth because we are not concerned with overfitting: we are just measuring training fit. The additive version of AdaBoost we consider is identical to the full AdaBoost classifier, except that, at each iteration, either the image features or the text features individually are considered.
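
As a rough illustration of that restriction, the sketch below runs a discrete-AdaBoost-style loop in which each round's depth-15 tree sees only one modality. The rule for choosing the modality at each round (here, whichever single-modality tree has lower weighted error) and the binary {-1, +1} label encoding are simplifying assumptions made only for this sketch, not details taken from the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_additive_boosting(X_text, X_image, y, n_rounds=200, max_depth=15):
    """Boosted ensemble in which every weak learner uses a single modality,
    so the final score is a sum of a text-only and an image-only function."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # per-example boosting weights
    learners = []                                 # (modality, tree, alpha) triples
    for _ in range(n_rounds):
        candidates = []
        for modality, X in (("text", X_text), ("image", X_image)):
            tree = DecisionTreeClassifier(max_depth=max_depth)
            tree.fit(X, y, sample_weight=w)
            err = np.sum(w * (tree.predict(X) != y)) / np.sum(w)
            candidates.append((err, modality, tree))
        err, modality, tree = min(candidates, key=lambda c: c[0])
        if err >= 0.5:                            # no usable weak learner this round
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        preds = tree.predict(X_text if modality == "text" else X_image)
        w *= np.exp(-alpha * y * preds)           # classic discrete-AdaBoost reweighting
        w /= w.sum()
        learners.append((modality, tree, alpha))
    return learners

def predict_additive_boosting(learners, X_text, X_image):
    score = np.zeros(len(X_text))
    for modality, tree, alpha in learners:
        score += alpha * tree.predict(X_text if modality == "text" else X_image)
    return np.sign(score)
```

Because each round's tree is a function of one modality only, the resulting score decomposes as a text-only term plus an image-only term, i.e., it is multimodally additive.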

E Additional Reproducibility Info

Computing Hardware. Linear models and feed-forward neural networks were trained using consumer-level CPU/RAM configurations. LXMERT was fine-tuned on single, consumer GPUs with 12GB of vRAM.

Runtime. The slowest algorithm we used was LXMERT (Tan and Bansal, 2019), and the biggest dataset we fine-tuned on was R-POP. LXMERT was fine-tuned on the order of 1500 times. Depending on the dataset, the fine-tuning process took between 10 minutes and an hour. Overall, we estimate that we spent on the order of 50-100 GPU days doing this work.

Number of Parameters. The RoBERTa features we used were 2048-dimensional, and the EfficientNet-B4 features were 1792-dimensional. The linear models and feed-forward neural networks operated on those. We cross-validated the number of layers and the width, but the maximal model, for these cases, has on the order of millions of parameters. The biggest model we used was LXMERT, which has a comparable memory footprint to the original BERT Base model.

Figure 3: Examples of cases for which EMAP degrades the performance of LXMERT + Logits. All cases are labelled as "image does not add meaning" in the original corpus. Text of tweets may be gently modified for privacy reasons. The four tweets shown are "Throwback to my favorite dog w/ eyebrows" and "Someone brought a bear into the doggie day care" (original model: text represented, correct; EMAP: text not represented, incorrect), and "Heartfelt and teary eyed thanks and respect to all warriors.." and "Intro to the composition I'm working on! Really loving the horn solo" (original model: text not represented, correct; EMAP: text represented, incorrect).

F Qualitative Analysis of T-VIS

To demonstrate the potential utility of EMAP in qualitative examinations, we identified the individual instances in T-VIS for which EMAP changes the test-time predictions of the LXMERT + Linear Logits model. Recall that in this dataset, EMAP hurts performance.

In introducing this task, Vempala and Preotiuc-Pietro (2019) propose categorizing image+text tweets into four categories: "Some or all of the content words in the text are represented in the image" (or not) × "Image has additional content that represents the meaning of the text and the image" (or not).

As highlighted in Table 5, when EMAP changes the prediction of the full model (14.5% of cases), the prediction made is incorrect more often than not: among label-swapping cases in which either the EMAP or the original model is correct, the original prediction is the correct one in 55% of cases.

The most common label swaps of this form are between the classes "image does not add" × {"text is represented", "text is not represented"}; as shorthand for these two classes, we will write IDTR ("image doesn't add, text represented") and IDTN ("image doesn't add, text not represented"). Across the 10 cross-validation splits, EMAP incorrectly maps the original model's correct prediction of IDTR → IDTN 255 times. For reference, there are 165 cases where EMAP maps an incorrect IDTR prediction of the original model to the correct IDTN label. So, when EMAP makes the change IDTR → IDTN, the full model is correct in 60% of cases. Similarly, EMAP incorrectly maps the original model's correct prediction of IDTN → IDTR 77 times (versus 48 correct mappings, with the original model correct in 62% of cases).

Figure 3 gives some instances from T-VIS where images do not add meaning and the EMAP projection causes a swap from a correct to an incorrect prediction. While it's difficult to draw conclusions from single instances, some preliminary patterns emerge. There is a cluster of animal images coupled with captions that may be somewhat difficult to map. In the case of the dog with eyebrows, the image isn't centered on the animal, and it might be difficult to identify the eyebrows without the prompt of the caption (hence, interactions might be needed). Similarly, the bear-looking-dog case is difficult: the caption doesn't explicitly mention a dog, and the image itself depicts an animal that isn't canonically canine-esque; thus, modeling interactions between the image and the caption might be required to fully disambiguate the meaning.

Figure 3 also depicts two cases where the original model predicts that the text is not represented, but EMAP predicts that it is. They are drawn from a cluster of similar cases where the captions seem indirectly connected to the image. Consider the fourth example, about a music composition: just looking at the text, one could envision a more literal manifestation, i.e., a person playing a horn. Similarly, looking just at the screenshot of the music production software, one could envision a more literal caption, e.g., "producing a new song with this software." But the real connection is less direct, and might require additional cross-modal inferences. Other common cases of IDTR → IDTN are "happy birthday" messages coupled with images of their intended recipients, and selfies taken at events (e.g., sports games) with descriptions (but not direct visual depictions).

G Worked Example of EMAP

We adopt the notation of Appendix A and give a concrete worked example of EMAP. Consider the following $f$ output values, which are computed on three image+text pairs for a binary classification task. We assume that $f$ outputs an un-normalized logit that can be passed to a scaling function like the logistic function $\sigma$ for a probability estimate over the binary outcome, e.g., $f_{11} = -1.3$ and $\sigma(f_{11}) = P(y = 1) \approx .21$.

$$f_{11} = -1.3 \qquad f_{12} = .3 \qquad f_{13} = -.2$$
$$f_{21} = .8 \qquad\;\; f_{22} = 3 \qquad f_{23} = 1.1$$
$$f_{31} = 1.1 \qquad f_{32} = -.1 \qquad f_{33} = .7$$

We can write this equivalently in matrix form:
$$\begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} = \begin{bmatrix} -1.3 & 0.3 & -0.2 \\ 0.8 & 3.0 & 1.1 \\ 1.1 & -0.1 & 0.7 \end{bmatrix}$$

Note that the mean logit predicted by $f$ over all nine decoupled evaluations is $.6$. We use Equation 8 to compute the $\phi_j$s; this is equivalent to taking a column-wise mean of this matrix, which yields (approximately) $[.2, 1.07, .53]$. Similarly, we can use Equation 7, which is equivalent to taking a row-wise mean of this matrix, yielding (approximately) $[-.4, 1.63, .57]$, and then subtracting the overall mean $.6$ to obtain $[-1, 1.03, -.03]$. Finally, we can sum these two results to compute $[\hat{f}_{11}, \hat{f}_{22}, \hat{f}_{33}] = [-.8, 2.1, .5]$. These predictions are the closest approximations to the full evaluations $[f_{11}, f_{22}, f_{33}] = [-1.3, 3, .7]$ for which the generating function obeys the additivity constraint over the three input pairs.
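
The same numbers can be reproduced in a few lines of NumPy; this is a minimal sketch of the projection step applied to the matrix above (assuming the full model has already been evaluated on all nine decoupled pairs), not the released code:

```python
import numpy as np

# Full-model logits f_ij evaluated on all decoupled text/image pairs.
F = np.array([[-1.3, 0.3, -0.2],
              [ 0.8, 3.0,  1.1],
              [ 1.1, -0.1, 0.7]])

mu = F.mean()                 # overall mean logit: 0.6
tau = F.mean(axis=1) - mu     # Equation 7 (row-wise means, centered): approx. [-1.0, 1.03, -0.03]
phi = F.mean(axis=0)          # Equation 8 (column-wise means): approx. [0.2, 1.07, 0.53]

emap_diag = tau + phi         # projected predictions for the observed pairs
print(np.round(emap_diag, 2)) # approximately [-0.8, 2.1, 0.5]
```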

References

Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory. Springer.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.

Alakananda Vempala and Daniel Preotiuc-Pietro. 2019. Categorizing and inferring the relationship between the text and image of Twitter posts. In ACL.

Nan Xu and Wenji Mao. 2017. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In CIKM.

Nan Xu, Wenji Mao, and Guandan Chen. 2018. A co-memory network for multimodal sentiment analysis. In SIGIR.