ignacio iacobacci et al (acl 2016). - uni stuttgart · 2017-09-15 · maximilian köper and kim-anh...

1
Optimizing Visual Representations in Semantic Multi-Modal Models with Dimensionality Reduction, Denoising and Contextual Information Maximilian Köper and Kim-Anh Nguyen and Sabine Schulte im Walde Institut für Maschinelle Sprachverarbeitung Universität Stuttgart, Germany {maximilian.koeper,kim-anh.nguyen,schulte}@ims.uni-stuttgart.de Abstract We improve visual representations for multi-modal semantic models by Applying standard dimensionality re- duction and denoising techniques Proposing a novel technique Con- textVision that takes corpus-based textual information into account when enhancing visual embeddings We explore our contribution in a visual and a multi-modal setup and evaluate on benchmark word similarity and related- ness tasks Overview FC6 FC7 5 1 Input Images (from Bing,Google,Flickr) Convolutional Neural Network (GoogLeNet,AlexNet,VGGNet) Contribution Corpus occurrences Textual representation Evaluation (SimLex-999 & MEN) Die Elefanten bilden eine Familie der Rüsseltiere. Diese Familie um- fasst alle heute noch lebenden Vertreter der Rüsseltiere. Elefanten sind die größten noch lebenden Landtiere. Schon bei der Geburt wiegt ein Kalb bis zu 100 Kilogramm. Der Elefant ist ein Tier der Superlative: Bis zu vier Meter kann er hoch werden, und mit bis zu 7,5 Tonnen Gewicht ist er das schwerste noch lebende Landsäugetier. Elefanten sind einfach gigantisch! count or predict combine evaluate evaluate Elefant = 2.57 8.71 0.67 4.32 2.71 5.50 4.31 8.01 9.59 2.31 ... extract features elefant 3.jpg 1.23 3.04 4.10 7.82 9.45 1.12 3.31 8.08 9.29 3.48 ... elefant 1.jpg elefant 2.jpg elefant 3.jpg ... elefant n.jpg centroid Optimize Visual Repre- sentation NMF, SVD, DEN, CV W1 W2 Rating man child 4.13 bread cheese 1.95 god priest 4.50 monster demon 6.95 ... ... ... Mid- Fusion Text+Vision Motivation & Method Computational models across tasks potentially prot from combining corpus-based textual information with perceptional information Word meanings are grounded in the external environment. Sensorimotor experience cannot be learned only based on linguistic symbols, cf. the grounding problem Harnad (1990) Recent advances in computer vision (deep learning) led to the development of better visual representations. Features are extracted from convolutional neural networks (CNNs) Dimension reduction techniques & denoising improve performance when applied to word representations Bullinaria and Levy (2012), Nguyen et al. (2016) What about visual representations ? Singular Value Decomposition (SVD): a matrix algebra opera- tion that can be used to reduce matrix dimensionality yielding a new high-dimensional space Non-negative matrix factoriza- tion (NMF) is a a matrix fac- torisation approach where the reduced matrix contains only non-negative real numbers denoising methods (DEN) use a non-linear, parameterized, feedforward neural network as a lter on word embeddings to reduce noise Our novel idea ContextVision (CV) strengthens visual vector representations by performing negative sampling using visual representation and corpus con- texts. Results (only visual) AN GLN VGGN SimLex MEN SimLex MEN SimLex MEN D .324 .560 .314 .513 .312 .545 SVD .324 .557 .316 .513 .314 .544 NMF .329 .610* .341* .612* .330 .631* DEN .356* .582* .342* .564* .343* .599* CV .364* .583* .358* .582* .357* .603* D .271 .434 .244 .366 .262 .422 SVD .270 .424 .245 .364 .264 .418 NMF .284 .560* .280* .556* .288 .581* DEN .276 .566* .273* .526* .280 .570* CV .310* .573* .287* .589* .312* .540* D .354 .526 .358 .517 .346 .535 SVD .355 .527 .359 .518 .348 .536 NMF .353 .596* .367 .608* .366 .609* DEN .343 .559* .361 .555* .356 .560* CV .352 .561* .362 .573* .374 .556* Average gain/loss: SL MEN B SVD 0.11 -0.20 -0.05 NMF 1.71 10.49 6.10 DEN 1.63 7.34 4.48 CV 3.23 8.29 5.76 Average gain/loss: SL MEN B SVD 0.11 -0.20 -0.05 NMF 1.71 10.49 6.10 DEN 1.63 7.34 4.48 CV 3.23 8.29 5.76 Conclusion We successfully applied dimension- ality reduction as well as denoising techniques. Except for SVD, all inves- tigated methods showed signicant improvements in single- and multi- modal setups on the task of predicting similarity and relatedness. Multi-Modal Setup Varying a weight threshold (α). Similarity is computed as follows: sim(x, y )= α · ling (x, y ) + (1 - α) · vis(x, y ) 0.0 0.2 0.4 0.6 0.8 1.0 0.30 0.35 0.40 0.45 α Default SVD NMF DEN CV Only-Text SL: +AN 0.0 0.2 0.4 0.6 0.8 1.0 0.70 0.72 0.74 0.76 0.78 α Default SVD NMF DEN CV Only-Text MEN: +AN

Upload: others

Post on 31-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ignacio Iacobacci et al (ACL 2016). - Uni Stuttgart · 2017-09-15 · Maximilian Köper and Kim-Anh Nguyen and Sabine Schulte im Walde Institut für Maschinelle Sprachverarbeitung

Optimizing Visual Representations in SemanticMulti-Modal Models with DimensionalityReduction, Denoising and Contextual

InformationMaximilian Köper and Kim-Anh Nguyen and Sabine Schulte im Walde

Institut für Maschinelle Sprachverarbeitung

Universität Stuttgart, Germany

{maximilian.koeper,kim-anh.nguyen,schulte}@ims.uni-stuttgart.de

Abstract•We improve visual representations for

multi-modal semantic models by

→Applying standard dimensionality re-duction and denoising techniques

→Proposing a novel technique Con-textVision that takes corpus-based

textual information into account when

enhancing visual embeddings

•We explore our contribution in a visualand a multi-modal setup and evaluate on

benchmark word similarity and related-ness tasks

Overview

FC6 FC7

5

1

Input Images(from Bing,Google,Flickr)

Convolutional Neural Network(GoogLeNet,AlexNet,VGGNet)

Contribution

Corpus occurrences Textual representation Evaluation (SimLex-999 & MEN)

Die Elefanten bilden eine Familie der Rüsseltiere. Diese Familie um-fasst alle heute noch lebenden Vertreter der Rüsseltiere. Elefanten sinddie größten noch lebenden Landtiere. Schon bei der Geburt wiegt einKalb bis zu 100 Kilogramm. Der Elefant ist ein Tier der Superlative:Bis zu vier Meter kann er hoch werden, und mit bis zu 7,5 TonnenGewicht ist er das schwerste noch lebende Landsäugetier. Elefantensind einfach gigantisch!

1

count orpredict

combine evaluate

evaluate

~Elefant =

2.578.710.674.322.715.504.318.019.592.31...

extract features

elefant 3.jpg

1.233.044.107.829.451.123.318.089.293.48...

~elefant 1.jpg~elefant 2.jpg~elefant 3.jpg. . .~elefant n.jpg

→ centroid

OptimizeVisualRepre-

sentationNMF, SVD,DEN, CV

W1 W2 Ratingman child 4.13bread cheese 1.95god priest 4.50

monster demon 6.95... ... ...

1

Mid-FusionText+Vision

Motivation & Method•Computational models across tasks potentially pro�t from combining corpus-based textual

information with perceptional information

→Word meanings are grounded in the external environment. Sensorimotor experience cannot

be learned only based on linguistic symbols, cf. the grounding problem Ignacio Iacobacci et al (ACL 2016).Harnad (1990)

•Recent advances in computer vision (deep learning) led to the development of better visual

representations. Features are extracted from convolutional neural networks (CNNs)

•Dimension reduction techniques & denoising improve performance when applied to word

representations Ignacio Iacobacci et al (ACL 2016).Bullinaria and Levy (2012), Nguyen et al. (2016)

→What about visual representations ?

• Singular Value Decomposition

(SVD): a matrix algebra opera-

tion that can be used to reduce

matrix dimensionality yielding

a new high-dimensional space

•Non-negative matrix factoriza-

tion (NMF) is a a matrix fac-

torisation approach where the

reduced matrix contains only

non-negative real numbers

• denoising methods (DEN) use

a non-linear, parameterized,

feedforward neural network as

a �lter on word embeddings to

reduce noise

•Our novel idea ContextVision(CV) strengthens visual vector

representations by performing

negative sampling using visual

representation and corpus con-

texts.

Results (only visual)AlexNet GoogLeNet VGGNet

SimLex MEN SimLex MEN SimLex MEN

bing

Default .324 .560 .314 .513 .312 .545

SVD .324 .557 .316 .513 .314 .544

NMF .329 .610* .341* .612* .330 .631*DEN .356* .582* .342* .564* .343* .599*

CV .364* .583* .358* .582* .357* .603*

flickr

Default .271 .434 .244 .366 .262 .422

SVD .270 .424 .245 .364 .264 .418

NMF .284 .560* .280* .556* .288 .581*DEN .276 .566* .273* .526* .280 .570*

CV .310* .573* .287* .589* .312* .540*

google

Default .354 .526 .358 .517 .346 .535

SVD .355 .527 .359 .518 .348 .536

NMF .353 .596* .367 .608* .366 .609*DEN .343 .559* .361 .555* .356 .560*

CV .352 .561* .362 .573* .374 .556*

Average gain/loss:

SimLexMEN Both

SVD0.11 -0.20 -0.05

NMF 1.71 10.49 6.10

DEN1.63 7.34 4.48

CV3.23 8.29 5.76

Average gain/loss:

SimLexMEN Both

SVD0.11 -0.20 -0.05

NMF 1.71 10.49 6.10

DEN1.63 7.34 4.48

CV3.23 8.29 5.76

ConclusionWe successfully applied dimension-

ality reduction as well as denoising

techniques. Except for SVD, all inves-

tigated methods showed signi�cantimprovements in single- and multi-

modal setups on the task of predicting

similarity and relatedness.

Multi-Modal Setup

•Varying a weight threshold (α). Similarity is computed as

follows:

sim(x, y) = α · ling(x, y) + (1− α) · vis(x, y)

0.0 0.2 0.4 0.6 0.8 1.00.30

0.35

0.40

0.45

α

Default SVD NMFDEN CV Only-Text

•SimLex: bing+AlexNet

0.0 0.2 0.4 0.6 0.8 1.00.70

0.72

0.74

0.76

0.78

α

Default SVD NMFDEN CV Only-Text

•MEN: flickr+AlexNet