
Quantum-inspired Multimodal Representation

Qiuchi Li
[email protected]
Department of Information Engineering
University of Padova
Padova, Italy

Massimo Melucci
[email protected]
Department of Information Engineering
University of Padova
Padova, Italy

ABSTRACT
We introduce our work in progress, which aims to build multimodal representations under quantum inspiration. The challenge for multimodal representation lies in a fusion strategy that captures the interaction between different modalities of data. Neural networks, the most successful approaches to date, lack a mechanism for explicitly showing how different modalities are related to each other. We address this issue by seeking inspiration from Quantum Theory (QT), which has been shown to be advantageous in explicitly capturing the correlations between textual features. In this paper, we give an overview of the related works and present the proposed methodology.

KEYWORDS
multimodal data fusion, quantum physics, neural networks

1 INTRODUCTION
In human communication, messages are often conveyed through a combination of different modalities, such as visual, audio and linguistic modalities. To understand such multimodal messages automatically, one needs to fuse the information from the different modalities into a joint multimodal representation. The challenge lies in how to characterise the interactions between different modalities, which can be complicated in some scenarios. In the example shown in Fig. 1, the visual-linguistic query cannot be understood from either the text or the image alone; the relation between the image and the text must be correctly recognized. This is where both the significance and the challenge of multimodal representation learning lie.

Most existing research focuses on neural network-based fusion strategies for constructing multimodal representations, ranging from earlier Hidden Markov Model (HMM)-based models [7] to different RNN variants [1] and, more recently, tensor-based approaches [12] and sequence-to-sequence structures [8]. Despite their strong accuracy, the interplay between different data modalities is encoded implicitly by the neural network components, making it difficult for humans to understand the contribution of each modality to a particular task.

Quantum Theory (QT) provides a well-established theoretical and mathematical formalism for describing the physical world at the microscopic level. Beyond physics, QT frameworks have been applied to many research areas [2, 9, 10, 15], among which successful results have been observed in text-based language understanding [10, 15]. In this context, quantum-inspired frameworks have the natural advantage of explicitly capturing the correlations between features through the concepts of quantum superposition and quantum entanglement.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IIR 2019, September 16–18, 2019, Padova, Italy.


Figure 1: Example of a Multi-modal query.


This inspires us to adopt quantum-inspired frameworks for constructing multimodal representations and to propose effective and interpretable multimodal fusion methods. In particular, we propose to capture the interactions within a single modality with quantum superposition, and to model the cross-modal interactions by means of quantum entanglement. In this way, we establish a pipeline that extracts the interactions within multimodal data in a way that is understandable from a quantum perspective. We expect to obtain performance comparable to state-of-the-art systems on concrete multimodal tasks.

2 PROPOSED METHODOLOGY
Here we introduce our proposed approach for building multimodal data representations inspired by quantum theory. Essentially, we represent multimodal data as many-body systems whose subsystems are the individual data modalities. The interaction between different modalities is then inherently captured by the notion of entanglement between the subsystems. We propose to build a complex-valued learning network to implement the quantum theoretical framework, which facilitates learning the cross-modal interactions in a data-driven fashion.

2.1 Complex-valued Unimodal Representation
Complex values are essential to the mathematical formalism of quantum physics. However, most existing quantum-inspired models for text representation are based on a real vector space, ignoring the complex-valued nature of quantum notions. Recently, our prior works [3, 4, 11] leveraged quantum superposition to model correlations between textual features, where the complex-valued representation leads to improved performance and enhanced interpretability. We attempt to employ the concept of superposition for modeling intra-modal interactions, and to investigate complex-valued embedding approaches for capturing the interactions within other modalities.


Figure 2: Our proposed multimodal representation framework.

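As an illustration of this idea, the following Python/NumPy sketch shows one minimal way to encode unimodal features as complex-valued pure states and to combine them by superposition. It is only a schematic reading of the approach, not the actual models of [3, 4, 11]; the dimensionality, amplitudes, phases and mixing weights are all hypothetical.

import numpy as np

def pure_state(amplitudes):
    # L2-normalize a complex vector so it represents a valid pure state |w>.
    v = np.asarray(amplitudes, dtype=complex)
    return v / np.linalg.norm(v)

def superpose(states, weights):
    # Weighted sum of pure states, renormalized to unit length: a superposed state.
    combined = sum(w * s for w, s in zip(weights, states))
    return combined / np.linalg.norm(combined)

def density_matrix(state):
    # Outer product |psi><psi|: Hermitian, positive semi-definite, unit trace.
    return np.outer(state, state.conj())

# Toy 4-dimensional "semantic space"; amplitudes and phases are made up.
word_a = pure_state(np.array([0.9, 0.2, 0.3, 0.1]) * np.exp(1j * np.array([0.0, 0.3, 0.7, 1.1])))
word_b = pure_state(np.array([0.4, 0.8, 0.1, 0.4]) * np.exp(1j * np.array([0.5, 0.1, 0.9, 0.2])))

sentence = superpose([word_a, word_b], weights=[0.6, 0.4])
rho = density_matrix(sentence)
print(round(np.trace(rho).real, 6))  # 1.0, as required for a density matrix

The relative phases carried by the complex amplitudes are what allow such a representation to encode interference-like correlations between features, which a real-valued vector cannot express.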

2.2 Tensor-based Approaches for Capturing Inter-modal Interactions

We represent multimodal data as an entangled many-body quantum system. The mathematical formulation is a complex-valued tensor constructed from unimodal complex-valued vectors by tensor-based approaches. Tensor-based models have been applied to classification [5] and matching [14] tasks, but they avoid directly computing the tensor by decomposing it and learning the decomposed weights via neural networks. The explicit tensor combination of different data modalities into a holistic multimodal representation remains unexplored. [6] proposed a framework to investigate the entanglement between user and document for relevance feedback, but how to apply it to multimodal data remains a challenge.
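To make the tensor construction concrete, the sketch below forms the tensor (Kronecker) product of three hypothetical unimodal states; the modality dimensionalities and values are invented for illustration. A single product state like this is separable, i.e. it carries no entanglement, so the entangled representation we target would have to be a learned joint tensor rather than one such product.

import numpy as np
from functools import reduce

def normalize(v):
    # Unit-normalize a complex vector so it represents a pure state.
    v = np.asarray(v, dtype=complex)
    return v / np.linalg.norm(v)

def product_state(unimodal_states):
    # Kronecker (tensor) product of the per-modality pure states.
    return reduce(np.kron, unimodal_states)

# Hypothetical dimensionalities: text 4-dim, visual 3-dim, acoustic 2-dim.
text     = normalize([0.8, 0.4 + 0.2j, 0.4, 0.2])
visual   = normalize([0.6, 0.6, 0.5 - 0.1j])
acoustic = normalize([0.9, 0.44j])

psi = product_state([text, visual, acoustic])
print(psi.shape)                      # (24,) = 4 * 3 * 2
print(round(np.linalg.norm(psi), 6))  # 1.0: the product of unit vectors stays unit-norm

The multiplicative growth of the joint dimension (here 4 x 3 x 2) is also why the prior works [5, 14] decompose the tensor and learn only the decomposed weights instead of materializing it directly.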

2.3 Quantum-inspired Framework for Multimodal Sentiment Analysis

We focus on the multimodal sentiment analysis task and work with the benchmark datasets CMU-MOSI [13] and CMU-MOSEI [1]. The task is to classify the sentiment of a video into 2, 5 or 7 classes based on textual, visual and acoustic features. As shown in Fig. 2, our framework represents unimodal data as a set of pure states through complex-valued embedding, and constructs the many-body state of a video utterance through tensor-based approaches. Finally, quantum-like measurement operators are implemented for sentiment classification. The whole process is implemented as a complex-valued neural network, and the parameters in the pipeline can be learned from labeled data in an end-to-end manner.
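The classification step can be pictured as a quantum-like measurement. The sketch below applies the Born rule with one measurement state per sentiment class; the state, dimensions and class count are hypothetical, and in the actual framework such measurement operators would be trainable parameters of the complex-valued network rather than fixed random vectors.

import numpy as np

def born_probability(rho, m):
    # Born rule: probability of projecting the state rho onto |m>, i.e. Tr(rho |m><m|).
    m = m / np.linalg.norm(m)
    return float(np.real(np.trace(rho @ np.outer(m, m.conj()))))

def classify(rho, class_states):
    # One measurement state per sentiment class; normalize the scores
    # into a probability distribution over classes.
    scores = np.array([born_probability(rho, m) for m in class_states])
    return scores / scores.sum()

# Hypothetical 4-dimensional utterance state and 3 sentiment classes.
psi = np.array([0.7, 0.1 + 0.2j, 0.5, 0.3j]); psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())
rng = np.random.default_rng(0)
class_states = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))
print(classify(rho, class_states))  # e.g. [p_negative, p_neutral, p_positive]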

The network has an inherent advantage in interpretability over classical neural networks, in that the role of each component is made explicit prior to the training phase. We are working on deploying the network on CMU-MOSI and CMU-MOSEI, and we expect its effectiveness to be comparable to that of state-of-the-art models.

ACKNOWLEDGMENTS
This PhD project is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.

REFERENCES
[1] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[2] Jerome R. Busemeyer and Peter D. Bruza. Quantum Models of Cognition and Decision. Cambridge University Press, New York, NY, USA, 1st edition, 2012.
[3] Qiuchi Li, Sagar Uprety, Benyou Wang, and Dawei Song. Quantum-Inspired Complex Word Embedding. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 50–57, 2018.
[4] Qiuchi Li, Benyou Wang, and Massimo Melucci. CNM: An Interpretable Complex-valued Network for Matching. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4139–4148, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[5] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[6] Massimo Melucci. Towards Modeling Implicit Feedback with Quantum Entanglement. In Quantum Interaction 2008, 2008.
[7] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI '11, pages 169–176, New York, NY, USA, 2011. ACM.
[8] Hai Pham, Thomas Manzini, Paul Pu Liang, and Barnabas Poczos. Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis. arXiv:1807.03915 [cs, stat], July 2018.
[9] C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA, 2004.
[10] Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. Modeling Term Dependencies with Quantum Language Models for IR. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 653–662, New York, NY, USA, 2013. ACM.
[11] Benyou Wang, Qiuchi Li, Massimo Melucci, and Dawei Song. Semantic Hilbert Space for Text Representation Learning. In The World Wide Web Conference, WWW '19, pages 3293–3299, New York, NY, USA, 2019. ACM.
[12] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv:1707.07250 [cs], July 2017.
[13] Amir Zadeh, Rowan Zellers, and Eli Pincus. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. 2016.
[14] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A Quantum Many-body Wave Function Inspired Language Modeling Approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 1303–1312, New York, NY, USA, 2018. ACM.
[15] Yazhou Zhang, Dawei Song, Peng Zhang, Panpan Wang, Jingfei Li, Xiang Li, and Benyou Wang. A Quantum-inspired Multimodal Sentiment Analysis Framework. Theoretical Computer Science, April 2018.
